On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <anto...@python.org> wrote: > > > Le 02/07/2019 à 00:20, Wes McKinney a écrit : > > Thanks for the references. > > > > If we decided to make a change around this, we could call the first 4 > > bytes a stream continuation marker to make it slightly less ugly > > > > * 0xFFFFFFFF: continue > > * 0x00000000: stop > > Do you mean it would be a separate IPC message?
No, I think this is only about how we could change the message prefix from 4 bytes to 8 bytes https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format Currently a 0x00000000 (0 metadata size) is used as an end-of-stream marker. So what I was saying is that the first 8 bytes could be <4 bytes: stream continuation><int32_t metadata size> > > > > > > On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > >> > >> Hi Wes, > >> I'm not an expert on this either, my inclination mostly comes from some > >> research I've done. I think it is important to distinguish two cases: > >> 1. unaligned access at the processor instruction level > >> 2. undefined behavior > >> > >> From my reading unaligned access is fine on most modern architectures and > >> it seems the performance penalty has mostly been eliminated. > >> > >> Undefined behavior is a compiler/language concept. The problem is the > >> compiler can choose to do anything in UB scenarios, not just the "obvious" > >> translation. Specifically, the compiler is under no obligation to > >> generate the unaligned access instructions, and if it doesn't SEGVs ensue. > >> Two examples, both of which relate to SIMD optimizations are linked below. > >> > >> I tend to be on the conservative side with this type of thing but if we > >> have experts on the the ML that can offer a more informed opinion, I would > >> love to hear it. > >> > >> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html > >> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709 > >> > >> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <wesmck...@gmail.com> wrote: > >>> > >>> The <0xffffffff><int32_t size> solution is downright ugly but I think > >>> it's one of the only ways that achieves > >>> > >>> * backward compatibility (new clients can read old data) > >>> * opt-in forward compatibility (if we want to go to the labor of doing > >>> so, sort of dangerous) > >>> * old clients receiving new data do not blow up (they will see a > >>> metadata length of -1) > >>> > >>> NB 0xFFFFFFFF <length> would look like: > >>> > >>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32) > >>> Out[13]: array([4294967295, 128], dtype=uint32) > >>> > >>> In [14]: np.array([(2 << 32) - 1, 128], > >>> dtype=np.uint32).view(np.int32) > >>> Out[14]: array([ -1, 128], dtype=int32) > >>> > >>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8) > >>> Out[15]: array([255, 255, 255, 255, 128, 0, 0, 0], dtype=uint8) > >>> > >>> Flatbuffers are 32-bit limited so we don't need all 64 bits. > >>> > >>> Do you know in what circumstances unaligned reads from Flatbuffers > >>> might cause an issue? I do not know enough about UB but my > >>> understanding is that it causes issues on some specialized platforms > >>> where for most modern x86-64 processors and compilers it is not really > >>> an issue (though perhaps a performance issue) > >>> > >>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <emkornfi...@gmail.com> > >>> wrote: > >>>> > >>>> At least on the read-side we can make this detectable by using something > >>>> like <0xffffffff><int32_t size> instead of int64_t. On the write side > >>>> we would need some sort of default mode that we could flip on/off if we > >>>> wanted to maintain compatibility. > >>>> > >>>> I should say I think we should fix it. Undefined behavior is unpaid > >>>> debt that might never be collected or might cause things to fail in > >>>> difficult to diagnose ways. And pre-1.0.0 is definitely the time. > >>>> > >>>> -Micah > >>>> > >>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote: > >>>>> > >>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <wesmck...@gmail.com> > >>>>> wrote: > >>>>>> > >>>>>> hi Micah, > >>>>>> > >>>>>> This is definitely unfortunate, I wish we had realized the potential > >>>>>> implications of having the Flatbuffer message start on a 4-byte > >>>>>> (rather than 8-byte) boundary. The cost of making such a change now > >>>>>> would be pretty high since all readers and writers in all languages > >>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0 version > >>>>>> bump is the last opportunity we have to make a change like this, so we > >>>>>> might as well discuss it now. Note that particular implementations > >>>>>> could implement compatibility functions to handle the 4 to 8 byte > >>>>>> change so that old clients can still be understood. We'd probably want > >>>>>> to do this in C++, for example, since users would pretty quickly > >>>>>> acquire a new pyarrow version in Spark applications while they are > >>>>>> stuck on an old version of the Java libraries. > >>>>> > >>>>> NB such a backwards compatibility fix would not be forward-compatible, > >>>>> so the PySpark users would need to use a pinned version of pyarrow > >>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK > >>>>> > >>>>>> > >>>>>> - Wes > >>>>>> > >>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield > >>>>>> <emkornfi...@gmail.com> wrote: > >>>>>>> > >>>>>>> While working on trying to fix undefined behavior for unaligned memory > >>>>>>> accesses [1], I ran into an issue with the IPC specification [2] which > >>>>>>> prevents us from ever achieving zero-copy memory mapping and having > >>>>>>> aligned > >>>>>>> accesses (i.e. clean UBSan runs). > >>>>>>> > >>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned > >>>>>>> accesses. > >>>>>>> > >>>>>>> In the IPC format we align each message to 8-byte boundaries. We then > >>>>>>> write a int32_t integer to to denote the size of flat buffer metadata, > >>>>>>> followed immediately by the flatbuffer metadata. This means the > >>>>>>> flatbuffer metadata will never be 8 byte aligned. > >>>>>>> > >>>>>>> Do people care? A simple fix would be to use int64_t instead of > >>>>>>> int32_t > >>>>>>> for length. However, any fix essentially breaks all previous client > >>>>>>> library versions or incurs a memory copy. > >>>>>>> > >>>>>>> [1] https://github.com/apache/arrow/pull/4757 > >>>>>>> [2] https://arrow.apache.org/docs/ipc.html