This isn't too thought out yet but: 1. Any file which stuffs binary data into the value is already unreadable for anyone directly using Flatbuffers. So we can specify that the field must be valid UTF-8, but implementations can permit relaxed validation/reading as binary data instead in order to read old files for the current metadata version. 2. For the new version, we can add a [byte] field to KeyValueMetadata alongside the string field. Since strings in Flatbuffers don't include the null terminator in their length, implementations can reuse the offset to allow values which are valid UTF-8 to be placed in both fields at a cost of only 4 bytes, so old consumers can still see UTF-8 metadata values. Actual binary values would only exist in the new field, and old consumers would see a null value (since we didn't declare it required); if both fields have a value, the binary value takes precedence. I think this at least leaves our read APIs alone since they already use bytestrings. A current extension type that uses binary metadata would still be a problem, though…
-David On Tue, Aug 17, 2021, at 00:08, Micah Kornfield wrote: > I agree with you any thoughts on a way forward for at least hardening the > spec (or should this be done at the same time as adding the new field)? > > On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > I've been poking around the project, and I'm growing concerned that > > our use of the KeyValue field has already been non-compliant in many > > cases since we do not validate UTF8-ness. Since we also use KeyValue > > to handle opaque data serialization for extension types [1], the fact > > that the specification does not clarify that binary data (such as the > > output of ExtensionType::Serialize) must be base64-encoded or similar > > makes this a bit of a minefield at the moment. > > > > It seems that there are no particularly excellent solutions, and > > maintaining the status quo (having now identified these > > inconsistencies / vagueries in at least the C++ implementation) is > > probably not a good idea either. In Parquet, where we store a > > serialized binary Arrow schema, we have to base64-encode [2] > > > > [1]: > > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > > [2]: > > https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423 > > > > On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > Ah, that's definitely a no-go then (I believe we verify messages > > > unconditionally in C++). That's unfortunate (and I feel responsible > > > for missing this, but I suppose we had a lot of opportunities to fix > > > it prior to the 1.0.0 format version) — so to have actual binary > > > values (which was the intention in the first place for the metadata) > > > we would need to add a new metadata field. > > > > > > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > > > > One issue with changing it to byte is it would effectively break any > > reader that is validating flatbuffer data, because flatbuffers verifies > > null termination [1]. While this might comply with forward compatibility > > guarantees it seems like a pretty large blast radius. > > > > > > > > [1] > > https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457 > > > > > > > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > >> > > > >> pyarrow at least treats the KeyValue values as binary and not UTF-8. > > > >> > > > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield < > > emkornfi...@gmail.com> wrote: > > > >> > > > > >> > I think other languages (e.g. java, python) might make more of > > distinction between utf-8 compatible strings and raw bytes. For python it > > might be less of a concern if the c++ wrapper already makes the value field > > look like a bytes field > > > >> > > > > >> > On Sunday, July 11, 2021, Wes McKinney <wesmck...@gmail.com> wrote: > > > >> >> > > > >> >> We could certainly "upgrade" KeyValue to have a binary value field > > > >> >> everywhere KeyValue is used, but there is some risk of code in the > > > >> >> wild expecting there to be a null terminator after the string data. > > > >> >> The Flatbuffers-generated accessor APIs do not depend on the > > existence > > > >> >> of the null terminator, though. Not ideal, but I would not be > > thrilled > > > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently > > have > > > >> >> [ KeyValue ]. > > > >> >> > > > >> >> That said, I doubt that we have any endogenous forward > > compatibility > > > >> >> problems related to this in Apache Arrow-maintained libraries, the > > > >> >> risk would come from users who are interacting with the Flatbuffers > > > >> >> data manually / without using one of our libraries. We could > > implement > > > >> >> the changes and run a set of forward compatibility integration > > tests > > > >> >> to see if anyone of our released libraries have an issue. > > > >> >> > > > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > >> >> > > > > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes. > > > >> >> > > > > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield < > > emkornfi...@gmail.com> wrote: > > > >> >> > > > > > >> >> > > Retitling and forking the discussion to talk about key value > > pairs. > > > >> >> > > > > > >> >> > > What is the byte cost of an empty list? Another option would > > be to > > > >> >> > > introduce a new BinaryKeyValue table and add binary metadata. > > > >> >> > > > > > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind < > > > >> >> > > natebauernfe...@deephaven.io> wrote: > > > >> >> > > > > > >> >> > > > Deephaven and I are very supportive of "upgrading" the value > > half of the kv > > > >> >> > > > pair to a byte vector. What is the best way to find out if > > there is > > > >> >> > > > sufficient interest? > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > I've been stewing on the ideas here around schema evolution, > > and I realize > > > >> >> > > > the specific feature I am missing is the ability to encode > > that a field > > > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the > > RecordBatch) is > > > >> >> > > > empty/has-no-data in O(0) cost (yes; for free). > > > >> >> > > > > > > >> >> > > > Might there be interest in adding a "field_id" to the > > FieldNode (which is > > > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple > > forward-compatible > > > >> >> > > > upgrade (by either keying off of 0, or explicitly set the > > field default to > > > >> >> > > > -1) which would allow the sender to "skip" fields that have > > 1) FieldNode > > > >> >> > > > length of zero, and 2) all Buffer's associated at that level > > (and further > > > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero). > > > >> >> > > > > > > >> >> > > > I understand this concept slightly interferes with > > RecordBatch's `length` > > > >> >> > > > field, and that many implementations use that length to > > resize the > > > >> >> > > > root-level FieldNodes. The use-case I have in mind has > > different logical > > > >> >> > > > lengths per field node; current implementations require > > sending a > > > >> >> > > > RecordBatch length of the max length across all root level > > field nodes. I > > > >> >> > > > believe this requires a copy of data whenever a field node > > is too short; I > > > >> >> > > > don't know if there is a decent solution to this slight > > inefficiency. I am > > > >> >> > > > bringing it up because if "skipping a field node when it is > > empty" is a > > > >> >> > > > feature, then we may not want to allocate space for those > > nodes given that > > > >> >> > > > the record batch length will likely be greater than zero. > > > >> >> > > > > > > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney < > > wesmck...@gmail.com> wrote: > > > >> >> > > > > > > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li < > > apa...@lidavidm.me> wrote: > > > >> >> > > > > > > > > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they > > are the same: > > > >> >> > > > > "Strings are simply a vector of bytes, and are always > > null-terminated." > > > >> >> > > > > > > > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that > > changing > > > >> >> > > > > this field from string to [byte] would be > > backward-compatible and > > > >> >> > > > > forward-compatible except with code that expects a null > > terminator. > > > >> >> > > > > This is something we could discuss separately if there > > were enough > > > >> >> > > > > interest. > > > >> >> > > > > > > > >> >> > > > > > [1]: > > https://google.github.io/flatbuffers/flatbuffers_internals.html > > > >> >> > > > > > > > > >> >> > > > > > -David > > > >> >> > > > > > > > > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote: > > > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield < > > > >> >> > > > emkornfi...@gmail.com> > > > >> >> > > > > wrote: > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on > > Flight as I think > > > >> >> > > > > schema > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the > > latter) is a > > > >> >> > > > > property of the > > > >> >> > > > > > > > > transport and not the stream format itself. If we > > are leaning > > > >> >> > > > > towards just > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to > > discuss it for the > > > >> >> > > > > IPC stream > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be > > interested in what > > > >> >> > > > > others think. > > > >> >> > > > > > > > > > > >> >> > > > > > > > I tend to agree, I think stream multiplexing is > > likely a transport > > > >> >> > > > > level > > > >> >> > > > > > > > issue. IMO I think schema evolution should be > > consistent with the > > > >> >> > > > > IPC > > > >> >> > > > > > > > stream format and flight. > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate > > discussion about more > > > >> >> > > > > general > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why > > key-value > > > >> >> > > > > metadata was > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the past. > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > I think this was an unfortunate design of the key > > value metadata > > > >> >> > > > in > > > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this > > decision was > > > >> >> > > > > made. > > > >> >> > > > > > > > > > >> >> > > > > > > I agree that it's unfortunate that we did not use [ > > byte ] instead of > > > >> >> > > > > > > string for the value in the KeyValue metadata — I > > think this was more > > > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was > > not our intent > > > >> >> > > > > > > to require binary data to be base64-encoded — this is > > something that > > > >> >> > > > > > > we have to do when encoding binary data in Thrift > > KeyValue metadata > > > >> >> > > > > > > for Parquet, for example). Is the binary > > representation of [byte] > > > >> >> > > > > > > different from string? > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a > > series of the > > > >> >> > > > flight > > > >> >> > > > > > > > > protobufs? > > > >> >> > > > > > > > > > > >> >> > > > > > > > In addition to what David said, protobufs can't be > > read directly > > > >> >> > > > > from a > > > >> >> > > > > > > > memory-mapped file (they need decoding). This was > > one of the > > > >> >> > > > design > > > >> >> > > > > > > > considerations of using flatbuffers and IPC > > Stream/File format. > > > >> >> > > > > > > > > > > >> >> > > > > > > > I was thinking Micah's comment is more that whatever > > we do, it > > > >> >> > > > > should be > > > >> >> > > > > > > > > clearly specified and edge cases should be > > considered, especially > > > >> >> > > > > if we > > > >> >> > > > > > > > > might want to 'backport' this into the stream > > format later. > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to > > define > > > >> >> > > > semantics > > > >> >> > > > > and > > > >> >> > > > > > > > ensure implementations are validating them with > > regards to > > > >> >> > > > > dictionaries. > > > >> >> > > > > > > > There likely isn't any need to change current > > implementations > > > >> >> > > > though. > > > >> >> > > > > > > > > > > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li < > > lidav...@apache.org> > > > >> >> > > > > wrote: > > > >> >> > > > > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on > > Flight as I think > > > >> >> > > > > schema > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the > > latter) is a > > > >> >> > > > > property of the > > > >> >> > > > > > > > > transport and not the stream format itself. If we > > are leaning > > > >> >> > > > > towards just > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to > > discuss it for the > > > >> >> > > > > IPC stream > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be > > interested in what > > > >> >> > > > > others think. > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Especially if we are looking at multiplexing > > streams - I would > > > >> >> > > > > wonder if > > > >> >> > > > > > > > > that's actually better served by making it easier > > to implement > > > >> >> > > > > using the > > > >> >> > > > > > > > > Flight implementation as it stands (by managing > > concurrent RPC > > > >> >> > > > > calls and/or > > > >> >> > > > > > > > > performing the union-of-structs encoding trick for > > you), instead > > > >> >> > > > > of having > > > >> >> > > > > > > > > to change the protocol. > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate > > discussion about more > > > >> >> > > > > general > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why > > key-value > > > >> >> > > > > metadata was > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the > > past. Off the top > > > >> >> > > > of > > > >> >> > > > > my head > > > >> >> > > > > > > > > if it's for on-disk storage and fully > > application-defined it may > > > >> >> > > > > make sense > > > >> >> > > > > > > > > to store as a separate file alongside the Arrow > > file (indexed by > > > >> >> > > > > record > > > >> >> > > > > > > > > batch index) where you can take advantage of > > whatever format is > > > >> >> > > > > most > > > >> >> > > > > > > > > suitable. > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > -David > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan > > wrote: > > > >> >> > > > > > > > > > Hi guys, > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial > > suggestion was > > > >> >> > > > to > > > >> >> > > > > add this > > > >> >> > > > > > > > > > feature starting from the IPC(I moved initial > > write up steps to > > > >> >> > > > > the > > > >> >> > > > > > > > > bottom > > > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing > > on Flight and > > > >> >> > > > > that's how > > > >> >> > > > > > > > > > we ended up with the protobufs change in the > > proposal. This > > > >> >> > > > > being said I > > > >> >> > > > > > > > > do > > > >> >> > > > > > > > > > think that the place where this should be > > impemented is a good > > > >> >> > > > > question > > > >> >> > > > > > > > > on > > > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind > > of a feature in > > > >> >> > > > > IPC and > > > >> >> > > > > > > > > > somehow use it in Flight, maybe not. > > > >> >> > > > > > > > > > 2. The point about dictionaries deserves a > > dedicated section in > > > >> >> > > > > the > > > >> >> > > > > > > > > > proposal. Nate and David brought it up and > > shared some > > > >> >> > > > insights. > > > >> >> > > > > I'll try > > > >> >> > > > > > > > > > to aggregate them and we can continue the > > discussion form > > > >> >> > > > there. > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > Cheers, > > > >> >> > > > > > > > > > Gosh > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, < > > > >> >> > > > > > > > > natebauernfe...@deephaven.io> > > > >> >> > > > > > > > > > wrote: > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema > > evolution back > > > >> >> > > > > into the > > > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live > > only in flight) > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer > > structures not the > > > >> >> > > > > > > > > protobufs. > > > >> >> > > > > > > > > > > Can > > > >> >> > > > > > > > > > > > > you help me understand how difficult it > > would be to bring > > > >> >> > > > > the > > > >> >> > > > > > > > > > > `schema_id` > > > >> >> > > > > > > > > > > > > approach to the IPC stream format? > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > I thought we were talking solely about the > > Flight Protobuf > > > >> >> > > > > > > > > definitions - > > > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at > > least only talks > > > >> >> > > > > about the > > > >> >> > > > > > > > > > > > Protobufs). > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > I somehow missed that schema_id is being added > > to protobuf in > > > >> >> > > > > the > > > >> >> > > > > > > > > document. > > > >> >> > > > > > > > > > > It feels to me that the schema_id is a > > property that would > > > >> >> > > > > ideally only > > > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand > > Micah's > > > >> >> > > > > dictionary > > > >> >> > > > > > > > > concerns, > > > >> >> > > > > > > > > > > now, too. > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream > > format a series of > > > >> >> > > > > the flight > > > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is > > no standard > > > >> >> > > > way > > > >> >> > > > > to > > > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata. > > (Obviously > > > >> >> > > > > ignoring the > > > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping > > flatbuffers.) > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > The IPC format was defined long before > > Flight, and Flight's > > > >> >> > > > > > > > > app_metadata > > > >> >> > > > > > > > > > > > was added after Flight's initial definition. > > Note an IPC > > > >> >> > > > > message does > > > >> >> > > > > > > > > > > have > > > >> >> > > > > > > > > > > > a provision for key-value metadata, though I > > think APIs for > > > >> >> > > > > that are > > > >> >> > > > > > > > > not > > > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940: > > > >> >> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and > > > >> >> > > > > despite my > > > >> >> > > > > > > > > comments > > > >> >> > > > > > > > > > > > there perhaps we need to unify or at least > > consider how > > > >> >> > > > > Flight's > > > >> >> > > > > > > > > > > > app_metadata relates to the IPC message > > custom_metadata. > > > >> >> > > > Also > > > >> >> > > > > > > > > perhaps see > > > >> >> > > > > > > > > > > > ARROW-1059.) > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In > > flatbuffer > > > >> >> > > > > strings are > > > >> >> > > > > > > > > only > > > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the > > other hand is > > > >> >> > > > > opaque > > > >> >> > > > > > > > > bytes. > > > >> >> > > > > > > > > > > The latter is a bit more useful. > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > -- > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > -- > > > >> >> > > > > > >