Thanks Micah. I'll check in the test file that has the V6 metadata and open a PR later today
On Mon, Jul 13, 2020 at 5:53 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > > To clarify on UBSAN and enums. My understanding is: > > enum A { a = 1, b =2, c = 3}; > class enum B : int16_t { a = 1, b = 2, c = 3}; > > A a = static_cast<A>(4); // UB > B b = static_cast<B>(4); // Not UB. Declaring the holding type makes this > allowable. > > > On Mon, Jul 13, 2020 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com> wrote: >> >> Please see [1]. I ran this arrow-ipc-read-write-test with UBSAN enabled and >> it passed (this isn't my normal dev environment so please double check). >> >> >> https://github.com/emkornfield/arrow/commit/7fbd0fb95f7ea164284720428c7974b87b4b2443 >> >> On Mon, Jul 13, 2020 at 3:12 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >>> >>> I think this might be more complicated, let me see if i can write a test >>> that demonstrates what I'm talking about. >>> >>> On Mon, Jul 13, 2020 at 3:10 PM Wes McKinney <wesmck...@gmail.com> wrote: >>>> >>>> Here's a patch that does the check >>>> >>>> https://github.com/wesm/arrow/commit/5bfdb4255a66a4ec62b1c36ba07682fad47df9a7 >>>> >>>> Here is a serialized schema that uses a V6 version >>>> >>>> https://drive.google.com/file/d/1GiWh5yKXdMaLRWU5K4cnGW2ilybF0LF_/view?usp=sharing >>>> >>>> See in action https://gist.github.com/wesm/f9621a626d56491b0bd6c8a131acf518 >>>> >>>> This seems hacky to me, but maybe it's OK? >>>> >>>> On Mon, Jul 13, 2020 at 4:53 PM Wes McKinney <wesmck...@gmail.com> wrote: >>>> > >>>> > On Mon, Jul 13, 2020 at 4:43 PM Micah Kornfield <emkornfi...@gmail.com> >>>> > wrote: >>>> > >> >>>> > >> We don't have any test cases that have a future metadata version. I >>>> > >> made a branch where I added V6 and wrote an IPC message, then found >>>> > >> that I was unable to determine that it was out of bounds (presumably >>>> > >> UBSAN would error, though, but we need a runtime error outside of >>>> > >> ASAN/UBSAN). >>>> > > >>>> > > To clarify I don't think UBSAN will error on the existing generated >>>> > > code on future versions. I believe we had issues with parquet because >>>> > > the enums did not have an explicit type (compare [1] to [2]) . The >>>> > > version check needs to be done in our code (comparing against MAX [3]). >>>> > > >>>> > > Does that align with your expectations? So we don't get this for >>>> > > free, but I'm not sure I understand why this is difficult? >>>> > >>>> > If the metadata version comes through as the int16_t value 5 >>>> > (currently 4 == V5), how do you get to a runtime error? The generated >>>> > Flatbuffers code is doing a static_cast of 5 to the enum which is UB. >>>> > Maybe I just don't know what I'm doing. It does not appear to be >>>> > possible to obtain the raw int16_t value without doing some kind of >>>> > hacking (e.g. reinterpret_cast of Message* to flatbuffers::Table* and >>>> > using GetField<int16_t>(VT_VERSION, 0)) >>>> > >>>> > I can make a binary file that uses the currently non-existent V6 so >>>> > you can try to detect it and >>>> > raise an error >>>> > >>>> > >>>> > >>>> > > [1] >>>> > > https://github.com/apache/arrow/blob/master/cpp/src/generated/parquet_types.h#L26 >>>> > > [2] >>>> > > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L91 >>>> > > [3] >>>> > > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109 >>>> > > >>>> > > On Mon, Jul 13, 2020 at 2:34 PM Wes McKinney <wesmck...@gmail.com> >>>> > > wrote: >>>> > >> >>>> > >> On Mon, Jul 13, 2020 at 4:31 PM Micah Kornfield >>>> > >> <emkornfi...@gmail.com> wrote: >>>> > >> > >>>> > >> > > >>>> > >> > > >>>> > >> > > That static cast is currently undefined behavior. >>>> > >> > >>>> > >> > Is ubsan reporting this? When looking into the feature enum I >>>> > >> > tried to >>>> > >> > understand if that was valid. At the time I read the C++ spec* if >>>> > >> > the enum >>>> > >> > has an explicitly declared type, all values in that types range are >>>> > >> > supported. >>>> > >> >>>> > >> We don't have any test cases that have a future metadata version. I >>>> > >> made a branch where I added V6 and wrote an IPC message, then found >>>> > >> that I was unable to determine that it was out of bounds (presumably >>>> > >> UBSAN would error, though, but we need a runtime error outside of >>>> > >> ASAN/UBSAN). >>>> > >> >>>> > >> > The generated enums provide a "max" [1] value that should be >>>> > >> > comparable >>>> > >> > against. >>>> > >> > <https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109> >>>> > >> > >>>> > >> > >>>> > >> > * I am not a C++ lawyer >>>> > >> > >>>> > >> > [1] >>>> > >> > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109 >>>> > >> > >>>> > >> > On Mon, Jul 13, 2020 at 2:19 PM Wes McKinney <wesmck...@gmail.com> >>>> > >> > wrote: >>>> > >> > >>>> > >> > > I've discovered while working on ARROW-9399 that it is very >>>> > >> > > difficult >>>> > >> > > with the Flatbuffers API in C++ to detect a MetadataVersion [1] >>>> > >> > > that >>>> > >> > > is higher than the current version. >>>> > >> > > >>>> > >> > > For example, suppose that 3 or 4 years from now we move from >>>> > >> > > version >>>> > >> > > V5 to version V6. The generated Flatbuffers code looks like this >>>> > >> > > >>>> > >> > > org::apache::arrow::flatbuf::MetadataVersion version() const { >>>> > >> > > return >>>> > >> > > static_cast<org::apache::arrow::flatbuf::MetadataVersion>(GetField<int16_t>(VT_VERSION, >>>> > >> > > 0)); >>>> > >> > > } >>>> > >> > > >>>> > >> > > That static cast is currently undefined behavior. >>>> > >> > > >>>> > >> > > One way to deal with this would be to add placeholder future >>>> > >> > > versions >>>> > >> > > (e.g. V6 and V7). >>>> > >> > > >>>> > >> > > Another backward-and-forward-compatible option would be to return >>>> > >> > > the >>>> > >> > > version as a short (int16_t) rather than the enum value, which is >>>> > >> > > subject to this brittleness. >>>> > >> > > >>>> > >> > > Either way unfortunately I think adding forward compatibility >>>> > >> > > checks >>>> > >> > > is out of scope for 1.0.0 and the risk is low since we don't >>>> > >> > > anticipate bumping the version anytime soon. >>>> > >> > > >>>> > >> > > Thanks, >>>> > >> > > Wes >>>> > >> > > >>>> > >> > > [1]: >>>> > >> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 >>>> > >> > >