Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations validating UTF8-ness would still be unable to read anything with binary metadata (which is the case they're already in), but implementations just validating for null-termination would continue to work. And again, I believe this is in-spec for Flatbuffers since string lengths don't include the null terminator.
[1]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670 [2]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637 Though it seems we could emulate C++ here by calling addByte(0) ourselves before createByteVector. -David On Wed, Aug 18, 2021, at 13:20, David Li wrote: > This isn't too thought out yet but: > 1. Any file which stuffs binary data into the value is already unreadable > for anyone directly using Flatbuffers. So we can specify that the field must > be valid UTF-8, but implementations can permit relaxed validation/reading as > binary data instead in order to read old files for the current metadata > version. > 2. For the new version, we can add a [byte] field to KeyValueMetadata > alongside the string field. Since strings in Flatbuffers don't include the > null terminator in their length, implementations can reuse the offset to > allow values which are valid UTF-8 to be placed in both fields at a cost of > only 4 bytes, so old consumers can still see UTF-8 metadata values. Actual > binary values would only exist in the new field, and old consumers would see > a null value (since we didn't declare it required); if both fields have a > value, the binary value takes precedence. I think this at least leaves our > read APIs alone since they already use bytestrings. > A current extension type that uses binary metadata would still be a problem, > though… > > -David > > On Tue, Aug 17, 2021, at 00:08, Micah Kornfield wrote: > > I agree with you any thoughts on a way forward for at least hardening the > > spec (or should this be done at the same time as adding the new field)? > > > > On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > I've been poking around the project, and I'm growing concerned that > > > our use of the KeyValue field has already been non-compliant in many > > > cases since we do not validate UTF8-ness. Since we also use KeyValue > > > to handle opaque data serialization for extension types [1], the fact > > > that the specification does not clarify that binary data (such as the > > > output of ExtensionType::Serialize) must be base64-encoded or similar > > > makes this a bit of a minefield at the moment. > > > > > > It seems that there are no particularly excellent solutions, and > > > maintaining the status quo (having now identified these > > > inconsistencies / vagueries in at least the C++ implementation) is > > > probably not a good idea either. In Parquet, where we store a > > > serialized binary Arrow schema, we have to base64-encode [2] > > > > > > [1]: > > > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > > > [2]: > > > https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423 > > > > > > On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > > > Ah, that's definitely a no-go then (I believe we verify messages > > > > unconditionally in C++). That's unfortunate (and I feel responsible > > > > for missing this, but I suppose we had a lot of opportunities to fix > > > > it prior to the 1.0.0 format version) — so to have actual binary > > > > values (which was the intention in the first place for the metadata) > > > > we would need to add a new metadata field. > > > > > > > > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <emkornfi...@gmail.com> > > > wrote: > > > > > > > > > > One issue with changing it to byte is it would effectively break any > > > reader that is validating flatbuffer data, because flatbuffers verifies > > > null termination [1]. While this might comply with forward compatibility > > > guarantees it seems like a pretty large blast radius. > > > > > > > > > > [1] > > > https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457 > > > > > > > > > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > >> > > > > >> pyarrow at least treats the KeyValue values as binary and not UTF-8. > > > > >> > > > > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield < > > > emkornfi...@gmail.com> wrote: > > > > >> > > > > > >> > I think other languages (e.g. java, python) might make more of > > > distinction between utf-8 compatible strings and raw bytes. For python it > > > might be less of a concern if the c++ wrapper already makes the value > > > field > > > look like a bytes field > > > > >> > > > > > >> > On Sunday, July 11, 2021, Wes McKinney <wesmck...@gmail.com> wrote: > > > > >> >> > > > > >> >> We could certainly "upgrade" KeyValue to have a binary value field > > > > >> >> everywhere KeyValue is used, but there is some risk of code in the > > > > >> >> wild expecting there to be a null terminator after the string > > > > >> >> data. > > > > >> >> The Flatbuffers-generated accessor APIs do not depend on the > > > existence > > > > >> >> of the null terminator, though. Not ideal, but I would not be > > > thrilled > > > > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently > > > have > > > > >> >> [ KeyValue ]. > > > > >> >> > > > > >> >> That said, I doubt that we have any endogenous forward > > > compatibility > > > > >> >> problems related to this in Apache Arrow-maintained libraries, the > > > > >> >> risk would come from users who are interacting with the > > > > >> >> Flatbuffers > > > > >> >> data manually / without using one of our libraries. We could > > > implement > > > > >> >> the changes and run a set of forward compatibility integration > > > tests > > > > >> >> to see if anyone of our released libraries have an issue. > > > > >> >> > > > > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > >> >> > > > > > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 > > > > >> >> > bytes. > > > > >> >> > > > > > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield < > > > emkornfi...@gmail.com> wrote: > > > > >> >> > > > > > > >> >> > > Retitling and forking the discussion to talk about key value > > > pairs. > > > > >> >> > > > > > > >> >> > > What is the byte cost of an empty list? Another option would > > > be to > > > > >> >> > > introduce a new BinaryKeyValue table and add binary metadata. > > > > >> >> > > > > > > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind < > > > > >> >> > > natebauernfe...@deephaven.io> wrote: > > > > >> >> > > > > > > >> >> > > > Deephaven and I are very supportive of "upgrading" the value > > > half of the kv > > > > >> >> > > > pair to a byte vector. What is the best way to find out if > > > there is > > > > >> >> > > > sufficient interest? > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > I've been stewing on the ideas here around schema evolution, > > > and I realize > > > > >> >> > > > the specific feature I am missing is the ability to encode > > > that a field > > > > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the > > > RecordBatch) is > > > > >> >> > > > empty/has-no-data in O(0) cost (yes; for free). > > > > >> >> > > > > > > > >> >> > > > Might there be interest in adding a "field_id" to the > > > FieldNode (which is > > > > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple > > > forward-compatible > > > > >> >> > > > upgrade (by either keying off of 0, or explicitly set the > > > field default to > > > > >> >> > > > -1) which would allow the sender to "skip" fields that have > > > 1) FieldNode > > > > >> >> > > > length of zero, and 2) all Buffer's associated at that level > > > (and further > > > > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero). > > > > >> >> > > > > > > > >> >> > > > I understand this concept slightly interferes with > > > RecordBatch's `length` > > > > >> >> > > > field, and that many implementations use that length to > > > resize the > > > > >> >> > > > root-level FieldNodes. The use-case I have in mind has > > > different logical > > > > >> >> > > > lengths per field node; current implementations require > > > sending a > > > > >> >> > > > RecordBatch length of the max length across all root level > > > field nodes. I > > > > >> >> > > > believe this requires a copy of data whenever a field node > > > is too short; I > > > > >> >> > > > don't know if there is a decent solution to this slight > > > inefficiency. I am > > > > >> >> > > > bringing it up because if "skipping a field node when it is > > > empty" is a > > > > >> >> > > > feature, then we may not want to allocate space for those > > > nodes given that > > > > >> >> > > > the record batch length will likely be greater than zero. > > > > >> >> > > > > > > > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney < > > > wesmck...@gmail.com> wrote: > > > > >> >> > > > > > > > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li < > > > apa...@lidavidm.me> wrote: > > > > >> >> > > > > > > > > > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they > > > are the same: > > > > >> >> > > > > "Strings are simply a vector of bytes, and are always > > > null-terminated." > > > > >> >> > > > > > > > > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that > > > changing > > > > >> >> > > > > this field from string to [byte] would be > > > backward-compatible and > > > > >> >> > > > > forward-compatible except with code that expects a null > > > terminator. > > > > >> >> > > > > This is something we could discuss separately if there > > > were enough > > > > >> >> > > > > interest. > > > > >> >> > > > > > > > > >> >> > > > > > [1]: > > > https://google.github.io/flatbuffers/flatbuffers_internals.html > > > > >> >> > > > > > > > > > >> >> > > > > > -David > > > > >> >> > > > > > > > > > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote: > > > > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield < > > > > >> >> > > > emkornfi...@gmail.com> > > > > >> >> > > > > wrote: > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on > > > Flight as I think > > > > >> >> > > > > schema > > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the > > > latter) is a > > > > >> >> > > > > property of the > > > > >> >> > > > > > > > > transport and not the stream format itself. If we > > > are leaning > > > > >> >> > > > > towards just > > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to > > > discuss it for the > > > > >> >> > > > > IPC stream > > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be > > > interested in what > > > > >> >> > > > > others think. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > I tend to agree, I think stream multiplexing is > > > likely a transport > > > > >> >> > > > > level > > > > >> >> > > > > > > > issue. IMO I think schema evolution should be > > > consistent with the > > > > >> >> > > > > IPC > > > > >> >> > > > > > > > stream format and flight. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate > > > discussion about more > > > > >> >> > > > > general > > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why > > > key-value > > > > >> >> > > > > metadata was > > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the > > > > >> >> > > > > > > > > past. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > I think this was an unfortunate design of the key > > > value metadata > > > > >> >> > > > in > > > > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this > > > decision was > > > > >> >> > > > > made. > > > > >> >> > > > > > > > > > > >> >> > > > > > > I agree that it's unfortunate that we did not use [ > > > byte ] instead of > > > > >> >> > > > > > > string for the value in the KeyValue metadata — I > > > think this was more > > > > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was > > > not our intent > > > > >> >> > > > > > > to require binary data to be base64-encoded — this is > > > something that > > > > >> >> > > > > > > we have to do when encoding binary data in Thrift > > > KeyValue metadata > > > > >> >> > > > > > > for Parquet, for example). Is the binary > > > representation of [byte] > > > > >> >> > > > > > > different from string? > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a > > > series of the > > > > >> >> > > > flight > > > > >> >> > > > > > > > > protobufs? > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > In addition to what David said, protobufs can't be > > > read directly > > > > >> >> > > > > from a > > > > >> >> > > > > > > > memory-mapped file (they need decoding). This was > > > one of the > > > > >> >> > > > design > > > > >> >> > > > > > > > considerations of using flatbuffers and IPC > > > Stream/File format. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > I was thinking Micah's comment is more that whatever > > > we do, it > > > > >> >> > > > > should be > > > > >> >> > > > > > > > > clearly specified and edge cases should be > > > considered, especially > > > > >> >> > > > > if we > > > > >> >> > > > > > > > > might want to 'backport' this into the stream > > > format later. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to > > > define > > > > >> >> > > > semantics > > > > >> >> > > > > and > > > > >> >> > > > > > > > ensure implementations are validating them with > > > regards to > > > > >> >> > > > > dictionaries. > > > > >> >> > > > > > > > There likely isn't any need to change current > > > implementations > > > > >> >> > > > though. > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li < > > > lidav...@apache.org> > > > > >> >> > > > > wrote: > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on > > > Flight as I think > > > > >> >> > > > > schema > > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the > > > latter) is a > > > > >> >> > > > > property of the > > > > >> >> > > > > > > > > transport and not the stream format itself. If we > > > are leaning > > > > >> >> > > > > towards just > > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to > > > discuss it for the > > > > >> >> > > > > IPC stream > > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be > > > interested in what > > > > >> >> > > > > others think. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > Especially if we are looking at multiplexing > > > streams - I would > > > > >> >> > > > > wonder if > > > > >> >> > > > > > > > > that's actually better served by making it easier > > > to implement > > > > >> >> > > > > using the > > > > >> >> > > > > > > > > Flight implementation as it stands (by managing > > > concurrent RPC > > > > >> >> > > > > calls and/or > > > > >> >> > > > > > > > > performing the union-of-structs encoding trick for > > > you), instead > > > > >> >> > > > > of having > > > > >> >> > > > > > > > > to change the protocol. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate > > > discussion about more > > > > >> >> > > > > general > > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why > > > key-value > > > > >> >> > > > > metadata was > > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the > > > past. Off the top > > > > >> >> > > > of > > > > >> >> > > > > my head > > > > >> >> > > > > > > > > if it's for on-disk storage and fully > > > application-defined it may > > > > >> >> > > > > make sense > > > > >> >> > > > > > > > > to store as a separate file alongside the Arrow > > > file (indexed by > > > > >> >> > > > > record > > > > >> >> > > > > > > > > batch index) where you can take advantage of > > > whatever format is > > > > >> >> > > > > most > > > > >> >> > > > > > > > > suitable. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > -David > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan > > > wrote: > > > > >> >> > > > > > > > > > Hi guys, > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial > > > suggestion was > > > > >> >> > > > to > > > > >> >> > > > > add this > > > > >> >> > > > > > > > > > feature starting from the IPC(I moved initial > > > write up steps to > > > > >> >> > > > > the > > > > >> >> > > > > > > > > bottom > > > > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing > > > on Flight and > > > > >> >> > > > > that's how > > > > >> >> > > > > > > > > > we ended up with the protobufs change in the > > > proposal. This > > > > >> >> > > > > being said I > > > > >> >> > > > > > > > > do > > > > >> >> > > > > > > > > > think that the place where this should be > > > impemented is a good > > > > >> >> > > > > question > > > > >> >> > > > > > > > > on > > > > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind > > > of a feature in > > > > >> >> > > > > IPC and > > > > >> >> > > > > > > > > > somehow use it in Flight, maybe not. > > > > >> >> > > > > > > > > > 2. The point about dictionaries deserves a > > > dedicated section in > > > > >> >> > > > > the > > > > >> >> > > > > > > > > > proposal. Nate and David brought it up and > > > shared some > > > > >> >> > > > insights. > > > > >> >> > > > > I'll try > > > > >> >> > > > > > > > > > to aggregate them and we can continue the > > > discussion form > > > > >> >> > > > there. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Cheers, > > > > >> >> > > > > > > > > > Gosh > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, < > > > > >> >> > > > > > > > > natebauernfe...@deephaven.io> > > > > >> >> > > > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema > > > evolution back > > > > >> >> > > > > into the > > > > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live > > > only in flight) > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer > > > structures not the > > > > >> >> > > > > > > > > protobufs. > > > > >> >> > > > > > > > > > > Can > > > > >> >> > > > > > > > > > > > > you help me understand how difficult it > > > would be to bring > > > > >> >> > > > > the > > > > >> >> > > > > > > > > > > `schema_id` > > > > >> >> > > > > > > > > > > > > approach to the IPC stream format? > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > I thought we were talking solely about the > > > Flight Protobuf > > > > >> >> > > > > > > > > definitions - > > > > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at > > > least only talks > > > > >> >> > > > > about the > > > > >> >> > > > > > > > > > > > Protobufs). > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > I somehow missed that schema_id is being added > > > to protobuf in > > > > >> >> > > > > the > > > > >> >> > > > > > > > > document. > > > > >> >> > > > > > > > > > > It feels to me that the schema_id is a > > > property that would > > > > >> >> > > > > ideally only > > > > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand > > > Micah's > > > > >> >> > > > > dictionary > > > > >> >> > > > > > > > > concerns, > > > > >> >> > > > > > > > > > > now, too. > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream > > > format a series of > > > > >> >> > > > > the flight > > > > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is > > > no standard > > > > >> >> > > > way > > > > >> >> > > > > to > > > > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata. > > > (Obviously > > > > >> >> > > > > ignoring the > > > > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping > > > flatbuffers.) > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > The IPC format was defined long before > > > Flight, and Flight's > > > > >> >> > > > > > > > > app_metadata > > > > >> >> > > > > > > > > > > > was added after Flight's initial definition. > > > Note an IPC > > > > >> >> > > > > message does > > > > >> >> > > > > > > > > > > have > > > > >> >> > > > > > > > > > > > a provision for key-value metadata, though I > > > think APIs for > > > > >> >> > > > > that are > > > > >> >> > > > > > > > > not > > > > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940: > > > > >> >> > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and > > > > >> >> > > > > despite my > > > > >> >> > > > > > > > > comments > > > > >> >> > > > > > > > > > > > there perhaps we need to unify or at least > > > consider how > > > > >> >> > > > > Flight's > > > > >> >> > > > > > > > > > > > app_metadata relates to the IPC message > > > custom_metadata. > > > > >> >> > > > Also > > > > >> >> > > > > > > > > perhaps see > > > > >> >> > > > > > > > > > > > ARROW-1059.) > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In > > > flatbuffer > > > > >> >> > > > > strings are > > > > >> >> > > > > > > > > only > > > > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the > > > other hand is > > > > >> >> > > > > opaque > > > > >> >> > > > > > > > > bytes. > > > > >> >> > > > > > > > > > > The latter is a bit more useful. > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > -- > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > -- > > > > >> >> > > > > > > > > >