Here's a patch that does the check

https://github.com/wesm/arrow/commit/5bfdb4255a66a4ec62b1c36ba07682fad47df9a7

Here is a serialized schema that uses a V6 version

https://drive.google.com/file/d/1GiWh5yKXdMaLRWU5K4cnGW2ilybF0LF_/view?usp=sharing

See in action https://gist.github.com/wesm/f9621a626d56491b0bd6c8a131acf518

This seems hacky to me, but maybe it's OK?

On Mon, Jul 13, 2020 at 4:53 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:43 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>
> >> We don't have any test cases that have a future metadata version. I
> >> made a branch where I added V6 and wrote an IPC message, then found
> >> that I was unable to determine that it was out of bounds (presumably
> >> UBSAN would error, though, but we need a runtime error outside of
> >> ASAN/UBSAN).
> >
> > To clarify I don't think UBSAN will error on the existing generated code on 
> > future versions. I believe we had issues with parquet because the enums did 
> > not have an explicit type (compare [1] to [2]) .  The version check needs 
> > to be done in our code (comparing against MAX [3]).
> >
> > Does that align with your expectations?  So we don't get this for free, but 
> > I'm not sure I understand why this is difficult?
>
> If the metadata version comes through as the int16_t value 5
> (currently 4 == V5), how do you get to a runtime error? The generated
> Flatbuffers code is doing a static_cast of 5 to the enum which is UB.
> Maybe I just don't know what I'm doing. It does not appear to be
> possible to obtain the raw int16_t value without doing some kind of
> hacking (e.g. reinterpret_cast of Message* to flatbuffers::Table* and
> using GetField<int16_t>(VT_VERSION, 0))
>
> I can make a binary file that uses the currently non-existent V6 so
> you can try to detect it and
> raise an error
>
>
>
> > [1] 
> > https://github.com/apache/arrow/blob/master/cpp/src/generated/parquet_types.h#L26
> > [2] 
> > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L91
> > [3] 
> > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109
> >
> > On Mon, Jul 13, 2020 at 2:34 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> On Mon, Jul 13, 2020 at 4:31 PM Micah Kornfield <emkornfi...@gmail.com> 
> >> wrote:
> >> >
> >> > >
> >> > >
> >> > > That static cast is currently undefined behavior.
> >> >
> >> > Is ubsan reporting this?  When looking into the feature enum I tried to
> >> > understand if that was valid. At the time I read the C++ spec* if the 
> >> > enum
> >> > has an explicitly declared type, all values in that types range are
> >> > supported.
> >>
> >> We don't have any test cases that have a future metadata version. I
> >> made a branch where I added V6 and wrote an IPC message, then found
> >> that I was unable to determine that it was out of bounds (presumably
> >> UBSAN would error, though, but we need a runtime error outside of
> >> ASAN/UBSAN).
> >>
> >> > The generated enums provide  a "max" [1] value that should be comparable
> >> > against.
> >> > <https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109>
> >> >
> >> >
> >> > * I am not a C++ lawyer
> >> >
> >> > [1]
> >> > https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109
> >> >
> >> > On Mon, Jul 13, 2020 at 2:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> > > I've discovered while working on ARROW-9399 that it is very difficult
> >> > > with the Flatbuffers API in C++ to detect a MetadataVersion [1] that
> >> > > is higher than the current version.
> >> > >
> >> > > For example, suppose that 3 or 4 years from now we move from version
> >> > > V5 to version V6. The generated Flatbuffers code looks like this
> >> > >
> >> > > org::apache::arrow::flatbuf::MetadataVersion version() const {
> >> > >   return
> >> > > static_cast<org::apache::arrow::flatbuf::MetadataVersion>(GetField<int16_t>(VT_VERSION,
> >> > > 0));
> >> > > }
> >> > >
> >> > > That static cast is currently undefined behavior.
> >> > >
> >> > > One way to deal with this would be to add placeholder future versions
> >> > > (e.g. V6 and V7).
> >> > >
> >> > > Another backward-and-forward-compatible option would be to return the
> >> > > version as a short (int16_t) rather than the enum value, which is
> >> > > subject to this brittleness.
> >> > >
> >> > > Either way unfortunately I think adding forward compatibility checks
> >> > > is out of scope for 1.0.0 and the risk is low since we don't
> >> > > anticipate bumping the version anytime soon.
> >> > >
> >> > > Thanks,
> >> > > Wes
> >> > >
> >> > > [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
> >> > >

Reply via email to