zeroshade commented on issue #37976: URL: https://github.com/apache/arrow/issues/37976#issuecomment-1750938270
Hey @frbvianna,

By default there is no compression enabled, so the bytes in the buffers of the columns are a fairly accurate representation of the size of the resulting IPC messages (a record batch message is just a very small flatbuffer header plus the raw buffer bytes for all of the columns). You can certainly enable compression via the `WithZstd()` or `WithLZ4()` options when you create the IPC writer. Compression is applied at the buffer level, and if it doesn't actually shrink a buffer, that buffer is sent uncompressed. As you said, that makes estimating the size precisely pretty difficult, but any estimate calculated from the buffer sizes is still guaranteed to be an upper bound: the final message will never be larger than the flatbuffer header plus the estimated buffers.

All that being said, I would probably not attempt a *per row* estimate as was suggested here, but instead work in batches of some size (probably a power of two for easy computation). The easier/better solution would be for whatever produces the Arrow records in the first place to chunk the data better before it even reaches your writer, if that is feasible. (See the first sketch at the end of this comment for one way to estimate a record's size from its buffers.)

> Each data chunk would end up in multiple Records in the receiver side, which does not seem ideal.

That actually depends on what you're *doing* with the data on the receiver side. You can use an `arrow.Table` to treat a slice of records as a single large table for most cases and operations. The fact that the data is chunked into multiple records isn't actually an issue unless there's some reason you truly need the *entire* dataset as a single contiguous record batch, which is rare. Arrow as a format is intended to be easily chunked and streamed. Offhand, I honestly can't think of a scenario where you really need everything in a single contiguous record other than to reduce a little bit of complexity (iterating records instead of operating on a single large one), and in most cases operating on a group of smaller records is more performant than operating on a single large one, since each record is an easily parallelizable unit of work. (The second sketch below shows assembling a table from received records.)

I hope this explanation helps. If you have further questions, feel free to file a new issue or join the conversation on Zulip or the Gopher Slack (there's an #apache-arrow-parquet channel there which I am part of).
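Here is a minimal sketch of the buffer-based upper-bound estimate and of enabling Zstd on the IPC writer. It assumes the Go Arrow module (adjust the import version to whatever you actually use); the helpers `bufferBytes` and `estimateRecordSize` are illustrative names, not library functions:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // adjust the module version to your own
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/ipc"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

// bufferBytes sums the lengths of all buffers in an array, including nested
// children. With no compression this approximates the record batch body; the
// flatbuffer header adds only a small amount on top.
func bufferBytes(data arrow.ArrayData) int64 {
	var total int64
	for _, buf := range data.Buffers() {
		if buf != nil { // some slots (e.g. a missing validity bitmap) are nil
			total += int64(buf.Len())
		}
	}
	for _, child := range data.Children() {
		total += bufferBytes(child)
	}
	return total
}

// estimateRecordSize sums buffer bytes across all columns of a record.
func estimateRecordSize(rec arrow.Record) int64 {
	var total int64
	for _, col := range rec.Columns() {
		total += bufferBytes(col.Data())
	}
	return total
}

func main() {
	mem := memory.NewGoAllocator()
	schema := arrow.NewSchema([]arrow.Field{{Name: "vals", Type: arrow.PrimitiveTypes.Int64}}, nil)

	b := array.NewRecordBuilder(mem, schema)
	defer b.Release()
	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3, 4}, nil)
	rec := b.NewRecord()
	defer rec.Release()

	fmt.Println("estimated body bytes:", estimateRecordSize(rec))

	// Write with buffer-level Zstd compression; if compression doesn't shrink a
	// buffer, the writer falls back to the uncompressed bytes, so the estimate
	// above remains an upper bound on the record batch body.
	var sink bytes.Buffer
	w := ipc.NewWriter(&sink, ipc.WithSchema(schema), ipc.WithZstd())
	if err := w.Write(rec); err != nil {
		panic(err)
	}
	_ = w.Close()
	fmt.Println("total stream bytes (schema message + record batch):", sink.Len())
}
```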

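And a rough sketch of the receiver side, treating the chunked records as one logical table via `array.NewTableFromRecords`. The file name `stream.arrows` is hypothetical; the Retain/Release calls matter because the IPC reader releases each record on the next call to `Next()`:

```go
package main

import (
	"fmt"
	"os"

	"github.com/apache/arrow/go/v14/arrow" // adjust the module version to your own
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/ipc"
)

func main() {
	f, err := os.Open("stream.arrows") // hypothetical IPC stream produced by the sender
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rdr, err := ipc.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer rdr.Release()

	// Collect the chunked records as they arrive, retaining each one so it
	// outlives the reader's own reference.
	var recs []arrow.Record
	for rdr.Next() {
		rec := rdr.Record()
		rec.Retain()
		recs = append(recs, rec)
	}

	// Treat the chunked records as a single table without copying the data;
	// the table holds its own references, so the records can be released.
	tbl := array.NewTableFromRecords(rdr.Schema(), recs)
	defer tbl.Release()
	for _, rec := range recs {
		rec.Release()
	}

	fmt.Println("rows across all chunks:", tbl.NumRows())
}
```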