Ah, that's definitely a no-go then (I believe we verify messages unconditionally in C++). That's unfortunate (and I feel responsible for missing this, but I suppose we had a lot of opportunities to fix it prior to the 1.0.0 format version) — so to have actual binary values (which was the intention in the first place for the metadata) we would need to add a new metadata field.
On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > > One issue with changing it to byte is it would effectively break any reader > that is validating flatbuffer data, because flatbuffers verifies null > termination [1]. While this might comply with forward compatibility > guarantees it seems like a pretty large blast radius. > > [1] > https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457 > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <wesmck...@gmail.com> wrote: >> >> pyarrow at least treats the KeyValue values as binary and not UTF-8. >> >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> > >> > I think other languages (e.g. java, python) might make more of distinction >> > between utf-8 compatible strings and raw bytes. For python it might be >> > less of a concern if the c++ wrapper already makes the value field look >> > like a bytes field >> > >> > On Sunday, July 11, 2021, Wes McKinney <wesmck...@gmail.com> wrote: >> >> >> >> We could certainly "upgrade" KeyValue to have a binary value field >> >> everywhere KeyValue is used, but there is some risk of code in the >> >> wild expecting there to be a null terminator after the string data. >> >> The Flatbuffers-generated accessor APIs do not depend on the existence >> >> of the null terminator, though. Not ideal, but I would not be thrilled >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently have >> >> [ KeyValue ]. >> >> >> >> That said, I doubt that we have any endogenous forward compatibility >> >> problems related to this in Apache Arrow-maintained libraries, the >> >> risk would come from users who are interacting with the Flatbuffers >> >> data manually / without using one of our libraries. We could implement >> >> the changes and run a set of forward compatibility integration tests >> >> to see if anyone of our released libraries have an issue. >> >> >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <wesmck...@gmail.com> wrote: >> >> > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes. >> >> > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <emkornfi...@gmail.com> >> >> > wrote: >> >> > > >> >> > > Retitling and forking the discussion to talk about key value pairs. >> >> > > >> >> > > What is the byte cost of an empty list? Another option would be to >> >> > > introduce a new BinaryKeyValue table and add binary metadata. >> >> > > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind < >> >> > > natebauernfe...@deephaven.io> wrote: >> >> > > >> >> > > > Deephaven and I are very supportive of "upgrading" the value half >> >> > > > of the kv >> >> > > > pair to a byte vector. What is the best way to find out if there is >> >> > > > sufficient interest? >> >> > > > >> >> > > > >> >> > > > I've been stewing on the ideas here around schema evolution, and I >> >> > > > realize >> >> > > > the specific feature I am missing is the ability to encode that a >> >> > > > field >> >> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is >> >> > > > empty/has-no-data in O(0) cost (yes; for free). >> >> > > > >> >> > > > Might there be interest in adding a "field_id" to the FieldNode >> >> > > > (which is >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple >> >> > > > forward-compatible >> >> > > > upgrade (by either keying off of 0, or explicitly set the field >> >> > > > default to >> >> > > > -1) which would allow the sender to "skip" fields that have 1) >> >> > > > FieldNode >> >> > > > length of zero, and 2) all Buffer's associated at that level (and >> >> > > > further >> >> > > > nested) are also equally empty (i.e. Buffer length is zero). >> >> > > > >> >> > > > I understand this concept slightly interferes with RecordBatch's >> >> > > > `length` >> >> > > > field, and that many implementations use that length to resize the >> >> > > > root-level FieldNodes. The use-case I have in mind has different >> >> > > > logical >> >> > > > lengths per field node; current implementations require sending a >> >> > > > RecordBatch length of the max length across all root level field >> >> > > > nodes. I >> >> > > > believe this requires a copy of data whenever a field node is too >> >> > > > short; I >> >> > > > don't know if there is a decent solution to this slight >> >> > > > inefficiency. I am >> >> > > > bringing it up because if "skipping a field node when it is empty" >> >> > > > is a >> >> > > > feature, then we may not want to allocate space for those nodes >> >> > > > given that >> >> > > > the record batch length will likely be greater than zero. >> >> > > > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <wesmck...@gmail.com> >> >> > > > wrote: >> >> > > > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <apa...@lidavidm.me> >> >> > > > > wrote: >> >> > > > > > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they are the >> >> > > > > > same: >> >> > > > > "Strings are simply a vector of bytes, and are always >> >> > > > > null-terminated." >> >> > > > > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that >> >> > > > > changing >> >> > > > > this field from string to [byte] would be backward-compatible and >> >> > > > > forward-compatible except with code that expects a null >> >> > > > > terminator. >> >> > > > > This is something we could discuss separately if there were enough >> >> > > > > interest. >> >> > > > > >> >> > > > > > [1]: >> >> > > > > > https://google.github.io/flatbuffers/flatbuffers_internals.html >> >> > > > > > >> >> > > > > > -David >> >> > > > > > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote: >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield < >> >> > > > emkornfi...@gmail.com> >> >> > > > > wrote: >> >> > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as >> >> > > > > > > > > I think >> >> > > > > schema >> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is >> >> > > > > > > > > a >> >> > > > > property of the >> >> > > > > > > > > transport and not the stream format itself. If we are >> >> > > > > > > > > leaning >> >> > > > > towards just >> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it >> >> > > > > > > > > for the >> >> > > > > IPC stream >> >> > > > > > > > > format and leverage that in Flight. I'd be interested in >> >> > > > > > > > > what >> >> > > > > others think. >> >> > > > > > > > >> >> > > > > > > > I tend to agree, I think stream multiplexing is likely a >> >> > > > > > > > transport >> >> > > > > level >> >> > > > > > > > issue. IMO I think schema evolution should be consistent >> >> > > > > > > > with the >> >> > > > > IPC >> >> > > > > > > > stream format and flight. >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate discussion >> >> > > > > > > > > about more >> >> > > > > general >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why >> >> > > > > > > > > key-value >> >> > > > > metadata was >> >> > > > > > > > > chosen/if opaque bytes were considered in the past. >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > I think this was an unfortunate design of the key value >> >> > > > > > > > metadata >> >> > > > in >> >> > > > > > > > Schema.fbs, but I don't think I was around when this >> >> > > > > > > > decision was >> >> > > > > made. >> >> > > > > > > >> >> > > > > > > I agree that it's unfortunate that we did not use [ byte ] >> >> > > > > > > instead of >> >> > > > > > > string for the value in the KeyValue metadata — I think this >> >> > > > > > > was more >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was not our >> >> > > > > > > intent >> >> > > > > > > to require binary data to be base64-encoded — this is >> >> > > > > > > something that >> >> > > > > > > we have to do when encoding binary data in Thrift KeyValue >> >> > > > > > > metadata >> >> > > > > > > for Parquet, for example). Is the binary representation of >> >> > > > > > > [byte] >> >> > > > > > > different from string? >> >> > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a series of >> >> > > > > > > > the >> >> > > > flight >> >> > > > > > > > > protobufs? >> >> > > > > > > > >> >> > > > > > > > In addition to what David said, protobufs can't be read >> >> > > > > > > > directly >> >> > > > > from a >> >> > > > > > > > memory-mapped file (they need decoding). This was one of >> >> > > > > > > > the >> >> > > > design >> >> > > > > > > > considerations of using flatbuffers and IPC Stream/File >> >> > > > > > > > format. >> >> > > > > > > > >> >> > > > > > > > I was thinking Micah's comment is more that whatever we do, >> >> > > > > > > > it >> >> > > > > should be >> >> > > > > > > > > clearly specified and edge cases should be considered, >> >> > > > > > > > > especially >> >> > > > > if we >> >> > > > > > > > > might want to 'backport' this into the stream format >> >> > > > > > > > > later. >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to define >> >> > > > semantics >> >> > > > > and >> >> > > > > > > > ensure implementations are validating them with regards to >> >> > > > > dictionaries. >> >> > > > > > > > There likely isn't any need to change current >> >> > > > > > > > implementations >> >> > > > though. >> >> > > > > > > > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li >> >> > > > > > > > <lidav...@apache.org> >> >> > > > > wrote: >> >> > > > > > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as >> >> > > > > > > > > I think >> >> > > > > schema >> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is >> >> > > > > > > > > a >> >> > > > > property of the >> >> > > > > > > > > transport and not the stream format itself. If we are >> >> > > > > > > > > leaning >> >> > > > > towards just >> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it >> >> > > > > > > > > for the >> >> > > > > IPC stream >> >> > > > > > > > > format and leverage that in Flight. I'd be interested in >> >> > > > > > > > > what >> >> > > > > others think. >> >> > > > > > > > > >> >> > > > > > > > > Especially if we are looking at multiplexing streams - I >> >> > > > > > > > > would >> >> > > > > wonder if >> >> > > > > > > > > that's actually better served by making it easier to >> >> > > > > > > > > implement >> >> > > > > using the >> >> > > > > > > > > Flight implementation as it stands (by managing >> >> > > > > > > > > concurrent RPC >> >> > > > > calls and/or >> >> > > > > > > > > performing the union-of-structs encoding trick for you), >> >> > > > > > > > > instead >> >> > > > > of having >> >> > > > > > > > > to change the protocol. >> >> > > > > > > > > >> >> > > > > > > > > Nate: it may be worth starting a separate discussion >> >> > > > > > > > > about more >> >> > > > > general >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why >> >> > > > > > > > > key-value >> >> > > > > metadata was >> >> > > > > > > > > chosen/if opaque bytes were considered in the past. Off >> >> > > > > > > > > the top >> >> > > > of >> >> > > > > my head >> >> > > > > > > > > if it's for on-disk storage and fully application-defined >> >> > > > > > > > > it may >> >> > > > > make sense >> >> > > > > > > > > to store as a separate file alongside the Arrow file >> >> > > > > > > > > (indexed by >> >> > > > > record >> >> > > > > > > > > batch index) where you can take advantage of whatever >> >> > > > > > > > > format is >> >> > > > > most >> >> > > > > > > > > suitable. >> >> > > > > > > > > >> >> > > > > > > > > -David >> >> > > > > > > > > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote: >> >> > > > > > > > > > Hi guys, >> >> > > > > > > > > > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial >> >> > > > > > > > > > suggestion was >> >> > > > to >> >> > > > > add this >> >> > > > > > > > > > feature starting from the IPC(I moved initial write up >> >> > > > > > > > > > steps to >> >> > > > > the >> >> > > > > > > > > bottom >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing on >> >> > > > > > > > > > Flight and >> >> > > > > that's how >> >> > > > > > > > > > we ended up with the protobufs change in the proposal. >> >> > > > > > > > > > This >> >> > > > > being said I >> >> > > > > > > > > do >> >> > > > > > > > > > think that the place where this should be impemented is >> >> > > > > > > > > > a good >> >> > > > > question >> >> > > > > > > > > on >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind of a >> >> > > > > > > > > > feature in >> >> > > > > IPC and >> >> > > > > > > > > > somehow use it in Flight, maybe not. >> >> > > > > > > > > > 2. The point about dictionaries deserves a dedicated >> >> > > > > > > > > > section in >> >> > > > > the >> >> > > > > > > > > > proposal. Nate and David brought it up and shared some >> >> > > > insights. >> >> > > > > I'll try >> >> > > > > > > > > > to aggregate them and we can continue the discussion >> >> > > > > > > > > > form >> >> > > > there. >> >> > > > > > > > > > >> >> > > > > > > > > > Cheers, >> >> > > > > > > > > > Gosh >> >> > > > > > > > > > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, < >> >> > > > > > > > > natebauernfe...@deephaven.io> >> >> > > > > > > > > > wrote: >> >> > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema >> >> > > > > > > > > > > > > > evolution back >> >> > > > > into the >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in >> >> > > > > > > > > > > > > > flight) >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures >> >> > > > > > > > > > > > > not the >> >> > > > > > > > > protobufs. >> >> > > > > > > > > > > Can >> >> > > > > > > > > > > > > you help me understand how difficult it would be >> >> > > > > > > > > > > > > to bring >> >> > > > > the >> >> > > > > > > > > > > `schema_id` >> >> > > > > > > > > > > > > approach to the IPC stream format? >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > I thought we were talking solely about the Flight >> >> > > > > > > > > > > > Protobuf >> >> > > > > > > > > definitions - >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least >> >> > > > > > > > > > > > only talks >> >> > > > > about the >> >> > > > > > > > > > > > Protobufs). >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > I somehow missed that schema_id is being added to >> >> > > > > > > > > > > protobuf in >> >> > > > > the >> >> > > > > > > > > document. >> >> > > > > > > > > > > It feels to me that the schema_id is a property that >> >> > > > > > > > > > > would >> >> > > > > ideally only >> >> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's >> >> > > > > dictionary >> >> > > > > > > > > concerns, >> >> > > > > > > > > > > now, too. >> >> > > > > > > > > > > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a >> >> > > > > > > > > > > > series of >> >> > > > > the flight >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is no >> >> > > > > > > > > > > > > standard >> >> > > > way >> >> > > > > to >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata. >> >> > > > > > > > > > > > > (Obviously >> >> > > > > ignoring the >> >> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.) >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > The IPC format was defined long before Flight, and >> >> > > > > > > > > > > > Flight's >> >> > > > > > > > > app_metadata >> >> > > > > > > > > > > > was added after Flight's initial definition. Note >> >> > > > > > > > > > > > an IPC >> >> > > > > message does >> >> > > > > > > > > > > have >> >> > > > > > > > > > > > a provision for key-value metadata, though I think >> >> > > > > > > > > > > > APIs for >> >> > > > > that are >> >> > > > > > > > > not >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940: >> >> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and >> >> > > > > despite my >> >> > > > > > > > > comments >> >> > > > > > > > > > > > there perhaps we need to unify or at least consider >> >> > > > > > > > > > > > how >> >> > > > > Flight's >> >> > > > > > > > > > > > app_metadata relates to the IPC message >> >> > > > > > > > > > > > custom_metadata. >> >> > > > Also >> >> > > > > > > > > perhaps see >> >> > > > > > > > > > > > ARROW-1059.) >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In >> >> > > > > > > > > > > flatbuffer >> >> > > > > strings are >> >> > > > > > > > > only >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other >> >> > > > > > > > > > > hand is >> >> > > > > opaque >> >> > > > > > > > > bytes. >> >> > > > > > > > > > > The latter is a bit more useful. >> >> > > > > > > > > > > >> >> > > > > > > > > > > -- >> >> > > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > >> >> > > > > >> >> > > > >> >> > > > >> >> > > > -- >> >> > > >