zeroshade commented on issue #37976:
URL: https://github.com/apache/arrow/issues/37976#issuecomment-1750938270

   Hey @frbvianna,
   
   By default there is no compression enabled, so the bytes in the buffers of
the columns are a reasonably accurate representation of the size of the
resulting IPC messages (a record batch message is just a very small flatbuffer
header plus the raw buffer bytes for all of the columns). You can certainly
enable compression if desired via the `WithZstd()` or `WithLZ4()` options when
you create the IPC writer. Compression is applied at the buffer level, and if
compressing a buffer doesn't actually shrink it, that buffer is sent
uncompressed. As you said, that makes estimating the size pretty difficult,
but you'd still be guaranteed that any estimate calculated from the buffer
sizes is an upper bound: the final message will never be larger than the
flatbuffer header plus the estimated buffers.
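
   For concreteness, here's a minimal sketch of enabling buffer-level
compression on the Go IPC writer. The schema and record contents are just
placeholders, and the `v14` module path is an assumption (adjust to whatever
version you're on):

```go
package main

import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v14/arrow"
    "github.com/apache/arrow/go/v14/arrow/array"
    "github.com/apache/arrow/go/v14/arrow/ipc"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

func main() {
    mem := memory.NewGoAllocator()
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "vals", Type: arrow.PrimitiveTypes.Int64},
    }, nil)

    // build a small placeholder record to write
    bldr := array.NewRecordBuilder(mem, schema)
    defer bldr.Release()
    bldr.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3, 4}, nil)
    rec := bldr.NewRecord()
    defer rec.Release()

    var buf bytes.Buffer
    // enable buffer-level Zstd compression; ipc.WithLZ4() works the same way.
    // buffers that don't shrink under compression are written uncompressed.
    w := ipc.NewWriter(&buf, ipc.WithSchema(schema), ipc.WithAllocator(mem), ipc.WithZstd())
    if err := w.Write(rec); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }
    fmt.Println("stream size in bytes:", buf.Len())
}
```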
   
   All that being said, I would probably not attempt to calculate a *per row*
estimate like was suggested here, but instead estimate in batches of some size
(probably a power of two, for easy computation); see the sketch below. The
easier/better solution would be for whatever is producing the Arrow records in
the first place to chunk the data appropriately before it even reaches your
writer, if that is feasible.
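
   If it helps, here's a rough sketch of that batch-level estimate.
`estimateRecordSize` is a hypothetical helper, not a library function, and it
only walks top-level buffers, so nested types would need a recursive pass over
the child `ArrayData`:

```go
package main

import (
    "fmt"

    "github.com/apache/arrow/go/v14/arrow"
    "github.com/apache/arrow/go/v14/arrow/array"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

// estimateRecordSize is a hypothetical helper: it sums the lengths of the
// top-level buffers backing each column, upper-bounding the uncompressed IPC
// body size. Nested types would need a recursive walk of child ArrayData.
func estimateRecordSize(rec arrow.Record) int64 {
    var total int64
    for i := 0; i < int(rec.NumCols()); i++ {
        for _, buf := range rec.Column(i).Data().Buffers() {
            if buf != nil {
                total += int64(buf.Len())
            }
        }
    }
    return total
}

func main() {
    mem := memory.NewGoAllocator()
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "vals", Type: arrow.PrimitiveTypes.Int64},
    }, nil)

    bldr := array.NewRecordBuilder(mem, schema)
    defer bldr.Release()
    bldr.Field(0).(*array.Int64Builder).AppendValues(make([]int64, 1024), nil)
    rec := bldr.NewRecord()
    defer rec.Release()

    // ~8 KiB data buffer plus the validity bitmap for 1024 int64 values
    fmt.Println("estimated body size:", estimateRecordSize(rec))
}
```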
   
   > Each data chunk would end up in multiple Records in the receiver side, 
which does not seem ideal.
   
   That actually depends on what you're *doing* with the data on the receiver
side. You can use an `arrow.Table` to treat a slice of records as a single
large table for most cases and operations. The fact that the data is chunked
into multiple records isn't actually an issue unless there's some reason you
truly need the *entire* dataset as a single contiguous record batch, which is
rare. Arrow as a format is intended to be easily chunked and streamed.
Offhand, I honestly can't think of a scenario where you really need everything
in a single contiguous record other than to reduce a little bit of complexity
(iterating records instead of operating on a single large one). In most cases,
operating on a group of smaller records is more performant than operating on a
single large one, since each record is an easily parallelizable unit of work
(see the sketch below).
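
   As a minimal sketch of that (assuming the records share a schema),
`array.NewTableFromRecords` stitches a slice of records into one logical,
chunked table:

```go
package main

import (
    "fmt"

    "github.com/apache/arrow/go/v14/arrow"
    "github.com/apache/arrow/go/v14/arrow/array"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

func main() {
    mem := memory.NewGoAllocator()
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "vals", Type: arrow.PrimitiveTypes.Int64},
    }, nil)

    // two hand-built records standing in for chunks received off the wire
    var recs []arrow.Record
    for _, vals := range [][]int64{{1, 2, 3}, {4, 5}} {
        bldr := array.NewRecordBuilder(mem, schema)
        bldr.Field(0).(*array.Int64Builder).AppendValues(vals, nil)
        recs = append(recs, bldr.NewRecord())
        bldr.Release()
    }

    // view all of the chunked records as one logical table
    tbl := array.NewTableFromRecords(schema, recs)
    defer tbl.Release()
    for _, r := range recs {
        r.Release()
    }

    fmt.Println("rows:", tbl.NumRows())                        // 5
    fmt.Println("chunks:", len(tbl.Column(0).Data().Chunks())) // 2
}
```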
   
   I hope this explanation helps. If you have further questions, feel free to
either file a new issue or join the conversation on Zulip or the Gophers Slack
(there's an #apache-arrow-parquet channel there which I am part of).

