zeroshade commented on issue #48883: URL: https://github.com/apache/arrow/issues/48883#issuecomment-3769187569
Many (possibly most) implementations of Arrow will pad the buffers to a particular alignment for vectorization reasons, commonly 32 or 64 bits. They also don't always *fully* truncate the buffers when writing IPC files.

Given the extensive integration tests that we run for IPC compatibility between implementations, I would say that if the current implementations accept option (b), then that is the standard we should allow, and that currently seems to be the case. Looking at the linked polars issue in particular: if PyArrow can read the IPC file without issue, then the C++ implementation allows it, which means that all of the implementations based on it allow it too (R, Ruby, GObject, etc.).

In particular, given that the current IPC integration tests don't fail on the IPC files generated by arrow-go, I would wager that all the major implementations allow for the case where the uncompressed buffer is larger than it strictly needs to be.

While it's likely a good suggestion for me to update the Go implementation to better truncate the validity bitmaps, I would argue that polars should accept the IPC files as generated since, by all accounts, they seem to be considered valid IPC files.

That said, it would be equally valid for polars to only use/reference the necessary bytes. For example, in the case of 5 rows, if the file says the uncompressed size is 4 bytes (because of the padding), it would be perfectly valid for the actual buffer that polars uses to have a length of 1 byte, with polars simply ignoring the extra 3 bytes (which should all be zeroed anyway).

Disclaimer: I'm the primary developer/maintainer of the Go implementation.
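The 5-row case described above can be sketched in a few lines. This is only an illustration in plain Python, not code from polars, arrow-go, or any Arrow library: the helper names `validity_bytes_needed` and `trim_validity_buffer` are hypothetical, and it assumes the validity bitmap stores one bit per row in least-significant-bit order, as the Arrow columnar format describes.

```python
def validity_bytes_needed(num_rows: int) -> int:
    """Smallest number of bytes that can hold one validity bit per row."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    return (num_rows + 7) // 8


def trim_validity_buffer(buf: bytes, num_rows: int) -> memoryview:
    """Return a view of only the bytes a reader needs.

    A buffer that is larger than necessary (padded) is accepted and the
    trailing bytes are ignored; only a buffer that is too small is rejected.
    """
    needed = validity_bytes_needed(num_rows)
    if len(buf) < needed:
        raise ValueError(
            f"validity buffer has {len(buf)} bytes, need at least {needed}"
        )
    return memoryview(buf)[:needed]


# 5 rows with the buffer padded to 4 bytes.
# Bits 0, 1, 2 and 4 are set, so row 3 is null; the 3 padding bytes are zero.
padded = bytes([0b00010111, 0, 0, 0])
trimmed = trim_validity_buffer(padded, 5)

# Read each row's validity bit, least-significant bit first.
is_valid = [bool((trimmed[i // 8] >> (i % 8)) & 1) for i in range(5)]
```

Here `trimmed` is 1 byte long even though the declared buffer is 4 bytes, so a reader written this way accepts both the padded and the fully truncated form of the same bitmap.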
