joellubi commented on issue #1997: URL: https://github.com/apache/arrow-adbc/issues/1997#issuecomment-2237386995
I've been able to reproduce the issue, both via the script provided (thanks @Zan-L) and in a self-contained unit test. Switching to the BufferedWriter does appear to be the culprit. The current approach counts the bytes written to the output buffer when deciding whether the writer should be closed and the file sent across the network. With the buffered writer, no bytes reach the output buffer at all until the writer's internal buffer is flushed, so the initial set of concurrent writers just keeps writing until they hit EOF on the incoming record stream.

A better approach is to compare `targetSize` to `pqWriter.RowGroupTotalBytesWritten()` instead of the size of the output buffer. Unfortunately this has issues as well: it does stop the writer, but potentially only after it has written far more than `targetSize`. The reason is that the row group byte count only updates after the constituent column writers flush their pages. The default data page size is 1MB, so with 100 columns this wouldn't happen until roughly 100MB had been written to the file. I have successfully gotten the writer to flush much closer to `targetSize` by setting the max page size to a much lower value, so either exposing this setting to the end user or coming up with a reasonable heuristic for page size could be a possible solution.

Another option, perhaps simpler and one that would further reduce resource usage, is to go back to the unbuffered writer and manually account for records with zero rows (the original issue that led us to use the buffered writer). I'm working this out and should have a PR with one of the approaches soon.
