Are we going to say that Arrow 1.0 is not compatible with any version before? My concern is that Spark 2.4.x might get stuck on Arrow Java 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work. In Spark 3.0.0, though it will be no problem to update both Java and Python to 1.0. Having a compatibility mode so that new readers/writers can work with old readers using a 4-byte prefix would solve the problem, but if we don't want to do this will pyarrow be able to raise an error that clearly the new version does not support the old protocol? For example, would a pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow detected an old protocol and cannot continue, please use a version < 1.0.0"?
On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <wesmck...@gmail.com> wrote: > Hi Francois -- copying the metadata into memory isn't the end of the world > but it's a pretty ugly wart. This affects every IPC protocol message > everywhere. > > We have an opportunity to address the wart now but such a fix post-1.0.0 > will be much more difficult. > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > If the data buffers are still aligned, then I don't think we should > > add a breaking change just for avoiding the copy on the metadata? I'd > > expect said metadata to be small enough that zero-copy doesn't really > > affect performance. > > > > François > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > > While working on trying to fix undefined behavior for unaligned memory > > > accesses [1], I ran into an issue with the IPC specification [2] which > > > prevents us from ever achieving zero-copy memory mapping and having > > aligned > > > accesses (i.e. clean UBSan runs). > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > accesses. > > > > > > In the IPC format we align each message to 8-byte boundaries. We then > > > write a int32_t integer to to denote the size of flat buffer metadata, > > > followed immediately by the flatbuffer metadata. This means the > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > Do people care? A simple fix would be to use int64_t instead of > int32_t > > > for length. However, any fix essentially breaks all previous client > > > library versions or incurs a memory copy. > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > [2] https://arrow.apache.org/docs/ipc.html > > >