Please see [1].  I ran this arrow-ipc-read-write-test with UBSAN enabled
and it passed (this isn't my normal dev environment so please double check).


https://github.com/emkornfield/arrow/commit/7fbd0fb95f7ea164284720428c7974b87b4b2443

On Mon, Jul 13, 2020 at 3:12 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I think this might be more complicated, let me see if i can write a test
> that demonstrates what I'm talking about.
>
> On Mon, Jul 13, 2020 at 3:10 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Here's a patch that does the check
>>
>>
>> https://github.com/wesm/arrow/commit/5bfdb4255a66a4ec62b1c36ba07682fad47df9a7
>>
>> Here is a serialized schema that uses a V6 version
>>
>>
>> https://drive.google.com/file/d/1GiWh5yKXdMaLRWU5K4cnGW2ilybF0LF_/view?usp=sharing
>>
>> See in action
>> https://gist.github.com/wesm/f9621a626d56491b0bd6c8a131acf518
>>
>> This seems hacky to me, but maybe it's OK?
>>
>> On Mon, Jul 13, 2020 at 4:53 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > On Mon, Jul 13, 2020 at 4:43 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> > >>
>> > >> We don't have any test cases that have a future metadata version. I
>> > >> made a branch where I added V6 and wrote an IPC message, then found
>> > >> that I was unable to determine that it was out of bounds (presumably
>> > >> UBSAN would error, though, but we need a runtime error outside of
>> > >> ASAN/UBSAN).
>> > >
>> > > To clarify I don't think UBSAN will error on the existing generated
>> code on future versions. I believe we had issues with parquet because the
>> enums did not have an explicit type (compare [1] to [2]) .  The version
>> check needs to be done in our code (comparing against MAX [3]).
>> > >
>> > > Does that align with your expectations?  So we don't get this for
>> free, but I'm not sure I understand why this is difficult?
>> >
>> > If the metadata version comes through as the int16_t value 5
>> > (currently 4 == V5), how do you get to a runtime error? The generated
>> > Flatbuffers code is doing a static_cast of 5 to the enum which is UB.
>> > Maybe I just don't know what I'm doing. It does not appear to be
>> > possible to obtain the raw int16_t value without doing some kind of
>> > hacking (e.g. reinterpret_cast of Message* to flatbuffers::Table* and
>> > using GetField<int16_t>(VT_VERSION, 0))
>> >
>> > I can make a binary file that uses the currently non-existent V6 so
>> > you can try to detect it and
>> > raise an error
>> >
>> >
>> >
>> > > [1]
>> https://github.com/apache/arrow/blob/master/cpp/src/generated/parquet_types.h#L26
>> > > [2]
>> https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L91
>> > > [3]
>> https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109
>> > >
>> > > On Mon, Jul 13, 2020 at 2:34 PM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > >>
>> > >> On Mon, Jul 13, 2020 at 4:31 PM Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> > >> >
>> > >> > >
>> > >> > >
>> > >> > > That static cast is currently undefined behavior.
>> > >> >
>> > >> > Is ubsan reporting this?  When looking into the feature enum I
>> tried to
>> > >> > understand if that was valid. At the time I read the C++ spec* if
>> the enum
>> > >> > has an explicitly declared type, all values in that types range are
>> > >> > supported.
>> > >>
>> > >> We don't have any test cases that have a future metadata version. I
>> > >> made a branch where I added V6 and wrote an IPC message, then found
>> > >> that I was unable to determine that it was out of bounds (presumably
>> > >> UBSAN would error, though, but we need a runtime error outside of
>> > >> ASAN/UBSAN).
>> > >>
>> > >> > The generated enums provide  a "max" [1] value that should be
>> comparable
>> > >> > against.
>> > >> > <
>> https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109
>> >
>> > >> >
>> > >> >
>> > >> > * I am not a C++ lawyer
>> > >> >
>> > >> > [1]
>> > >> >
>> https://github.com/apache/arrow/blob/master/cpp/src/generated/Schema_generated.h#L109
>> > >> >
>> > >> > On Mon, Jul 13, 2020 at 2:19 PM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > >> >
>> > >> > > I've discovered while working on ARROW-9399 that it is very
>> difficult
>> > >> > > with the Flatbuffers API in C++ to detect a MetadataVersion [1]
>> that
>> > >> > > is higher than the current version.
>> > >> > >
>> > >> > > For example, suppose that 3 or 4 years from now we move from
>> version
>> > >> > > V5 to version V6. The generated Flatbuffers code looks like this
>> > >> > >
>> > >> > > org::apache::arrow::flatbuf::MetadataVersion version() const {
>> > >> > >   return
>> > >> > >
>> static_cast<org::apache::arrow::flatbuf::MetadataVersion>(GetField<int16_t>(VT_VERSION,
>> > >> > > 0));
>> > >> > > }
>> > >> > >
>> > >> > > That static cast is currently undefined behavior.
>> > >> > >
>> > >> > > One way to deal with this would be to add placeholder future
>> versions
>> > >> > > (e.g. V6 and V7).
>> > >> > >
>> > >> > > Another backward-and-forward-compatible option would be to
>> return the
>> > >> > > version as a short (int16_t) rather than the enum value, which is
>> > >> > > subject to this brittleness.
>> > >> > >
>> > >> > > Either way unfortunately I think adding forward compatibility
>> checks
>> > >> > > is out of scope for 1.0.0 and the risk is low since we don't
>> > >> > > anticipate bumping the version anytime soon.
>> > >> > >
>> > >> > > Thanks,
>> > >> > > Wes
>> > >> > >
>> > >> > > [1]:
>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
>> > >> > >
>>
>

Reply via email to