Right, I had wanted to focus the discussion on Flight as I think schema 
evolution or multiplexing streams (more so the latter) is a property of the 
transport and not the stream format itself. If we are leaning towards just 
schema evolution then maybe it makes sense to discuss it for the IPC stream 
format and leverage that in Flight. I'd be interested in what others think.

Especially if we are looking at multiplexing streams, I wonder whether that's 
actually better served by making it easier to do on top of the Flight 
implementation as it stands (by managing concurrent RPC calls and/or performing 
the union-of-structs encoding trick for you), instead of having to change the 
protocol.

Nate: it may be worth starting a separate discussion about more general 
metadata in the IPC message. I'm not aware of why key-value metadata was 
chosen, or whether opaque bytes were considered in the past. Off the top of my 
head, if the metadata is for on-disk storage and fully application-defined, it 
may make sense to store it as a separate file alongside the Arrow file (indexed 
by record batch index), where you can take advantage of whatever format is most 
suitable.
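
To make the sidecar idea concrete, here is a minimal sketch in plain Python. 
The layout (a count, a fixed-size offset/length index, then the blobs) and the 
names write_sidecar and read_sidecar_entry are made up for illustration; none 
of this is an Arrow API or an existing format.

```python
import io
import struct

# Hypothetical sidecar layout (an assumption, not part of Arrow):
#   [u32 count][count x (u64 offset, u64 length)][blob bytes...]
# Entry i holds the opaque application metadata for record batch i.
HEADER = struct.Struct("<I")
INDEX_ENTRY = struct.Struct("<QQ")


def write_sidecar(out, blobs):
    """Write one opaque blob per record batch, in batch order."""
    out.write(HEADER.pack(len(blobs)))
    offset = 0
    for blob in blobs:
        out.write(INDEX_ENTRY.pack(offset, len(blob)))
        offset += len(blob)
    for blob in blobs:
        out.write(blob)


def read_sidecar_entry(src, batch_index):
    """Return the opaque bytes stored for the given record batch index."""
    src.seek(0)
    (count,) = HEADER.unpack(src.read(HEADER.size))
    if not 0 <= batch_index < count:
        raise IndexError(batch_index)
    # Jump straight to this batch's index entry, then to its blob.
    src.seek(HEADER.size + INDEX_ENTRY.size * batch_index)
    offset, length = INDEX_ENTRY.unpack(src.read(INDEX_ENTRY.size))
    src.seek(HEADER.size + INDEX_ENTRY.size * count + offset)
    return src.read(length)


# In-memory stand-in for a file written next to the Arrow file.
buf = io.BytesIO()
write_sidecar(buf, [b"\x00\x01", b"", b"\xffapp"])
```

The fixed-size index means the metadata for batch N can be read without 
scanning earlier entries, and the blobs stay opaque bytes, so the string-only 
limitation of KeyValue does not apply.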

-David

On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> Hi guys,
> 
> 1. Regarding IPC vs Flight: in fact my initial suggestion was to add this
> feature starting from the IPC (I moved the initial write-up steps to the
> bottom of the doc). Afterwards David suggested focusing on Flight and that's
> how we ended up with the protobufs change in the proposal. That being said, I
> do think that the place where this should be implemented is a good question on
> its own. Maybe it makes sense to have this kind of a feature in IPC and
> somehow use it in Flight, maybe not.
> 2. The point about dictionaries deserves a dedicated section in the
> proposal. Nate and David brought it up and shared some insights. I'll try
> to aggregate them and we can continue the discussion from there.
> 
> Cheers,
> Gosh
> 
> On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <natebauernfe...@deephaven.io>
> wrote:
> 
> > >
> > > > > makes it more difficult to bring schema evolution back into the
> > > > > IPC Stream format (i.e. it would live only in flight)
> > > >
> > > > Gosh's proposal extends the flatbuffer structures not the protobufs.
> > > > Can you help me understand how difficult it would be to bring the
> > > > `schema_id` approach to the IPC stream format?
> > >
> > > I thought we were talking solely about the Flight Protobuf definitions -
> > > not the Flatbuffers (and the Google doc at least only talks about the
> > > Protobufs).
> > >
> >
> > I somehow missed that schema_id is being added to protobuf in the document.
> > It feels to me that the schema_id is a property that would ideally only
> > apply to the RecordBatch. I better understand Micah's dictionary concerns,
> > now, too.
> >
> > > > Side Question: Why isn't the IPC stream format a series of the flight
> > > > protobufs? It's a real shame that there is no standard way to
> > > > capture/replay a stream with app_metadata. (Obviously ignoring the
> > > > annoyances around protobuf wrapping flatbuffers.)
> > >
> > > The IPC format was defined long before Flight, and Flight's app_metadata
> > > was added after Flight's initial definition. Note an IPC message does
> > > have a provision for key-value metadata, though I think APIs for that
> > > are not
> > > fully exposed. (See ARROW-6940:
> > > https://issues.apache.org/jira/browse/ARROW-6940 and despite my comments
> > > there perhaps we need to unify or at least consider how Flight's
> > > app_metadata relates to the IPC message custom_metadata. Also perhaps see
> > > ARROW-1059.)
> > >
> >
> > KeyValue unfortunately is string to string. In Flatbuffers, strings are
> > only UTF-8 or 7-bit ASCII. The app_metadata, on the other hand, is opaque
> > bytes. The latter is a bit more useful.
> >
> > --
> >
> 
