joellubi commented on issue #1997:
URL: https://github.com/apache/arrow-adbc/issues/1997#issuecomment-2237386995

   I've been able to reproduce the issue, both via the script provided (thanks 
@Zan-L) and in a self-contained unit test. It does appear that switching to 
the BufferedWriter is in fact the culprit. The current approach
counts the bytes written to the output buffer when deciding whether the writer 
should be closed and the file sent across the network. In the case of the 
buffered writer, no bytes are written to the output buffer at all until the 
writer's internal buffer is flushed. For this reason the initial set of 
concurrent writers will just keep writing until they hit EOF on the incoming 
record stream.
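   
   To make the failure mode concrete, here's a rough sketch of that check (not the actual driver code; `writeUntilTarget`, `rdr`, `buf`, and `targetSize` are illustrative names, and the arrow import paths may differ depending on the major version in use):
   
```go
package sketch

import (
	"bytes"

	"github.com/apache/arrow/go/v17/arrow/array"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

// writeUntilTarget stops once the *output buffer* reaches targetSize. With
// WriteBuffered, nothing reaches buf until the parquet writer's internal
// buffer is flushed, so buf.Len() stays at 0 and the loop only exits at EOF
// on the incoming record stream.
func writeUntilTarget(rdr array.RecordReader, w *pqarrow.FileWriter, buf *bytes.Buffer, targetSize int64) error {
	for rdr.Next() {
		if err := w.WriteBuffered(rdr.Record()); err != nil {
			return err
		}
		if int64(buf.Len()) >= targetSize { // never true before an internal flush
			return nil
		}
	}
	return rdr.Err()
}
```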
   
   A better approach is to compare `targetSize` to 
`pqWriter.RowGroupTotalBytesWritten()` instead of the size of the output 
buffer. This unfortunately has some issues as well. It will stop the writer, 
but potentially after it's written much more than the `targetSize`. The reason 
for this is that the row group bytes only get updated after the constituent 
column writers flush their pages. The default data page size is 1MB, so with 
100 columns this wouldn't occur until after 100MB were written to the file. I 
have successfully gotten this to flush closer to the `targetSize` by setting 
the max page size to a much lower value, so either exposing this setting to the 
end-user or coming up with a reasonable heuristic for page size could be a 
viable solution.
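   
   For reference, a minimal sketch of that variant, reusing the hypothetical names from the snippet above (`parquet` here is the `github.com/apache/arrow/go/v17/parquet` package; the 512 KiB page size is purely illustrative):
   
```go
// Smaller max data page size makes the column writers flush pages more often,
// which keeps RowGroupTotalBytesWritten tracking the real file size more
// closely. 512 KiB is just an example value, not a recommendation.
var props = parquet.NewWriterProperties(
	parquet.WithDataPageSize(512 * 1024),
)

// writeUntilRowGroupTarget stops based on the parquet writer's own accounting
// instead of the output buffer. RowGroupTotalBytesWritten only advances when
// the column writers flush their data pages, so with the 1MB default page
// size and ~100 columns the file can overshoot targetSize by ~100MB.
func writeUntilRowGroupTarget(rdr array.RecordReader, w *pqarrow.FileWriter, targetSize int64) error {
	for rdr.Next() {
		if err := w.WriteBuffered(rdr.Record()); err != nil {
			return err
		}
		if w.RowGroupTotalBytesWritten() >= targetSize {
			return nil
		}
	}
	return rdr.Err()
}
```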
   
   Another option that is perhaps simpler and will help further reduce resource 
usage would be to go back to the unbuffered writer and manually account for 
records with zero rows (the original issue that led us to use the buffered 
writer). I'm working this out and should have a PR with one of the approaches 
soon.
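   
   Something along these lines, again with illustrative names and imports as in the first sketch:
   
```go
// writeUnbuffered goes back to plain Write, so the output-buffer size check
// is meaningful again (as it was before the switch to the buffered writer).
// Zero-row records are skipped explicitly, since they were the original
// reason for adopting the buffered writer.
func writeUnbuffered(rdr array.RecordReader, w *pqarrow.FileWriter, buf *bytes.Buffer, targetSize int64) error {
	for rdr.Next() {
		rec := rdr.Record()
		if rec.NumRows() == 0 {
			continue // nothing to write for an empty record
		}
		if err := w.Write(rec); err != nil {
			return err
		}
		if int64(buf.Len()) >= targetSize {
			return nil
		}
	}
	return rdr.Err()
}
```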

