On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> ...
>
> Wrt to binary, imo the challenge is:
> * we state that backward incompatible changes to the c data interface
>   require a new spec [1]
>

Note that this discussion wouldn't change anything about the C Data Interface spec itself. The discussion is only about the *value* that is put in one of the key-value metadata fields. The C Data Interface spec defines how the metadata needs to be stored, but doesn't specify anything about the actual value of one of the key-value metadata fields.

> * we state that the metadata is a binary string [2]
> * a valid string is a subset of all valid byte arrays and thus removing
>   "*string*" from the spec is backward incompatible
>
> If we write invalid utf8 to it and a reader assumes utf8 when reading it,
> we trigger undefined behavior.
>
> I was a bit surprised by ARROW-15613 - my understanding is that the c++
> implementation is not following the spec, and if we at arrow2 were not be
> checking for utf8, we would be exposing a vulnerability (at least according
> to Rust's standards). We just checked it out of luck (it is O(1), so why
> not).
>

Yes, the C++ implementation is indeed not following the spec. See the "[DISCUSS] Binary Values in Key value pairs" thread (https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr). Let's maybe keep this part of the discussion there?
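
For illustration only, here is a minimal sketch of the reader-side check discussed above: validating a metadata value as UTF-8 before treating it as a string. The helper name `decode_metadata_value` and the sample values are hypothetical and are not part of any Arrow implementation or of the C Data Interface.

```python
from typing import Union


def decode_metadata_value(raw: bytes) -> Union[str, bytes]:
    """Return `raw` as str when it is valid UTF-8, else the bytes unchanged.

    Hypothetical helper for illustration; not an Arrow API.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid string: keep it as opaque binary instead of assuming text.
        return raw


# Made-up sample values: one valid UTF-8, one arbitrary binary.
text_value = decode_metadata_value(b"some text value")
binary_value = decode_metadata_value(b"\xff\xfe\x00binary")
```

Every valid UTF-8 string passes through as text, while arbitrary bytes stay binary. A reader that skips the check and assumes text is the case that leads to undefined behavior in Rust.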