Hi Micah,

This is definitely unfortunate. I wish we had realized the potential implications of having the Flatbuffer message start on a 4-byte (rather than 8-byte) boundary. The cost of making such a change now would be pretty high, since all readers and writers in all languages would have to be changed. That being said, the 0.14.0 -> 1.0.0 version bump is the last opportunity we have to make a change like this, so we might as well discuss it now.

Note that particular implementations could implement compatibility functions to handle the 4-byte to 8-byte change so that old clients can still be understood. We'd probably want to do this in C++, for example, since users would pretty quickly acquire a new pyarrow version in Spark applications while they are stuck on an old version of the Java libraries.

- Wes

On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> While working on trying to fix undefined behavior for unaligned memory
> accesses [1], I ran into an issue with the IPC specification [2] which
> prevents us from ever achieving zero-copy memory mapping and having aligned
> accesses (i.e. clean UBSan runs).
>
> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>
> In the IPC format we align each message to 8-byte boundaries. We then
> write an int32_t integer to denote the size of the flatbuffer metadata,
> followed immediately by the flatbuffer metadata. This means the
> flatbuffer metadata will never be 8-byte aligned.
>
> Do people care? A simple fix would be to use int64_t instead of int32_t
> for length. However, any fix essentially breaks all previous client
> library versions or incurs a memory copy.
>
> [1] https://github.com/apache/arrow/pull/4757
> [2] https://arrow.apache.org/docs/ipc.html
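
For anyone following along, the alignment arithmetic in the quoted message can be sketched in a few lines of Python. This is only an illustration, not code from any Arrow implementation: the helper names (`metadata_offset`, `is_aligned`, `frame`) are hypothetical, and the little-endian length prefix is an assumption made for the example.

```python
import struct

ALIGNMENT = 8  # alignment the Flatbuffer metadata needs, per the quoted message


def metadata_offset(message_start: int, prefix_size: int) -> int:
    """Offset of the first metadata byte (hypothetical helper).

    The message starts at `message_start`, and the length prefix of
    `prefix_size` bytes is written immediately before the metadata.
    """
    return message_start + prefix_size


def is_aligned(offset: int, alignment: int = ALIGNMENT) -> bool:
    """True if `offset` is a multiple of `alignment`."""
    return offset % alignment == 0


def frame(metadata: bytes, prefix_format: str) -> bytes:
    """Hypothetical framing: a length prefix followed by the metadata.

    Assumption for this sketch: the prefix is little-endian, "<i" for
    int32_t and "<q" for int64_t.
    """
    return struct.pack(prefix_format, len(metadata)) + metadata


if __name__ == "__main__":
    # Messages always start on an 8-byte boundary, so try a few such offsets.
    for start in (0, 8, 64, 4096):
        with_int32 = metadata_offset(start, struct.calcsize("<i"))
        with_int64 = metadata_offset(start, struct.calcsize("<q"))
        print(start, with_int32, is_aligned(with_int32),
              with_int64, is_aligned(with_int64))
```

With a 4-byte prefix the metadata offset is always 4 past a multiple of 8, so it can never be 8-byte aligned. With an 8-byte prefix the offset stays a multiple of 8, at the cost of changing the framing that every existing reader expects.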