Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Nate Bauernfeind Tue, 13 Apr 2021 09:55:12 -0700

> possibly in coordination with the Deephaven/Barrage team, if they're also
> still interested


Good opportunity for me to chime in =). I think we still have interest in
this feature. On the other thread, it took a little cajoling, but I've come
around to agreeing with the conclusion: take a RecordBatch and split it up
(a set of RecordBatches for added rows followed by a set of RecordBatches
for modifications). In this case I think it's best not to evolve the schema
between added-row RecordBatches and modified-row RecordBatches (sending
empty buffer nodes and field nodes will be significantly cheaper). However,
schema evolution would be very useful when the RPC client changes the set
of columns it is subscribed to (which is relatively rare compared to how
often the subscribed table itself ticks).
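
A minimal sketch of that split, in plain Python rather than Arrow, to show
the shape of the idea only. The schema, the `Batch` class and `split_update`
are hypothetical names made up for illustration; they are not Barrage or
Arrow APIs, and the empty lists merely stand in for empty field nodes and
buffers (a standard Arrow RecordBatch would require equal-length columns).

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical fixed schema for the subscribed table (illustration only).
SCHEMA = ["sym", "price", "size"]


@dataclass
class Batch:
    kind: str                      # "added" or "modified"
    columns: Dict[str, List[Any]]  # every schema column is always present


def split_update(added: Dict[str, List[Any]],
                 modified: Dict[str, List[Any]]) -> List[Batch]:
    """Split one table update into an added batch followed by a modified batch.

    Both batches carry the full SCHEMA. Columns with no values are sent as
    empty lists, so the schema never changes between the two batches.
    """
    unknown = (set(added) | set(modified)) - set(SCHEMA)
    if unknown:
        raise ValueError(f"columns not in schema: {sorted(unknown)}")
    added_batch = Batch("added", {c: list(added.get(c, [])) for c in SCHEMA})
    modified_batch = Batch("modified", {c: list(modified.get(c, [])) for c in SCHEMA})
    return [added_batch, modified_batch]


# One tick: a new row was added, and only the "price" column was modified.
batches = split_update(
    added={"sym": ["AAPL"], "price": [130.0], "size": [10]},
    modified={"price": [131.5, 99.0]},
)
```

Here the modified batch still lists `sym` and `size`, just with no values,
which is the "don't evolve the schema, send empty columns" trade-off.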

That said, schema evolution is not yet particularly high in our queue.

On Tue, Apr 13, 2021 at 9:12 AM David Li <lidav...@apache.org> wrote:

> Thanks for the details. I'll note a few things, but adding schema
> evolution to Flight is reasonable, if you'd like to put together a
> proposal for discussion (possibly in coordination with the
> Deephaven/Barrage team, if they're also still interested).
>
> >    3. Assume that there is a strong reason to query A1,..,AK together.
>
> While I don't know the details here, at least with Flight/gRPC, it's
> not necessarily expensive to make several requests to the same server,
> as gRPC will consolidate them into the same underlying network
> connection. You could issue one GetFlightInfo request for all streams
> at once, and get back a list of endpoints for each individual
> subquery, which you could then issue separate DoGet requests for.
>
> There's a slight mismatch there in that GetFlightInfo returns a
> FlightInfo, which assumes all endpoints have the same schema. But for
> a specific application, you could ignore that field (nothing in Flight
> checks that schema against the actual data).
>
> Of course, if said strong reason is that all the data is really
> retrieved together despite being distinct datasets, then this would
> complicate the server side implementation quite a bit. But it's one
> option.
>
> > A potential way to address this(with the existing tools) could be having a
> > union schema of all fields across all entities(potentially prefixed with
> > the field name just like in sql joins) and setting the values to NA which
> > do not belong to an entity.
>
> I had a similar use case in the past, and it was suggested to use
> Arrow's Union type which handles this directly. A Union of Struct
> types essentially lets you have multiple distinct schemas all encoded
> in the same overall table, with explicit information about which
> schema is currently in use. But as you point out this isn't helpful if
> you don't know all the schemas up front.
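
To make the Union-of-Structs idea concrete, here is a small plain-Python
model of a dense union layout: a per-row type id says which schema a row
uses, and a per-row offset points into that schema's child array. This is a
sketch of the layout only; `DenseUnion` and `build_dense_union` are
hypothetical names, not Arrow library calls, and the example fields are
invented.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class DenseUnion:
    type_ids: List[int]        # per row: which child (schema) the row uses
    offsets: List[int]         # per row: index of the row within that child
    children: List[List[Row]]  # one list of struct rows per schema

    def row(self, i: int) -> Row:
        """Look up logical row i through its type id and offset."""
        return self.children[self.type_ids[i]][self.offsets[i]]


def build_dense_union(rows: List[Tuple[int, Row]], num_schemas: int) -> DenseUnion:
    """Build the union from (type_id, struct) pairs; schemas are fixed up front."""
    union = DenseUnion([], [], [[] for _ in range(num_schemas)])
    for type_id, struct in rows:
        child = union.children[type_id]
        union.type_ids.append(type_id)
        union.offsets.append(len(child))  # position the struct will occupy
        child.append(struct)
    return union


u = build_dense_union(
    [(0, {"price": 1.5}), (1, {"name": "x", "qty": 3}), (0, {"price": 2.5})],
    num_schemas=2,
)
```

Note that `num_schemas` has to be known before the first row is written,
which is the same limitation mentioned above: the set of child schemas is
part of the type and cannot grow mid-stream.
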
>
> Best,
> David
>
> On 2021/04/13 11:21:20, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > Hi David,
> >
> > Thanks for sharing the link!
> >
> > Here is how a potential use case might look like:
> >
> >    1. Assume that we have a service S which accepts expressions in some
> >    language X.
> >    2. Assume that a typical query to this service requests entities A_1,
> >    A_2,..,A_K. Each of those entities generates a stream of record batches.
> >    Record batches for a single A_I share the same schema, yet there is no
> >    guarantee that schemas are equal across all streams.
> >    3. Assume that there is a strong reason to query A1,..,AK together.
> >    4. Service generates record batches(concurrently), tags those(e.g. with
> >    schema level metadata) and sends them over.
> >
> > > A potential way to address this(with the existing tools) could be having a
> > union schema of all fields across all entities(potentially prefixed with
> > the field name just like in sql joins) and setting the values to NA which
> > do not belong to an entity. However this solution might not work in cases
> > where we are not able to construct the unified schema before opening the
> > stream(e.g. in case of changes in the schema for a specific entity upon
> > realtime input feeding or an unpredictable generator expression).
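
The union-schema workaround described here can be sketched in plain Python
as follows. Assumptions for illustration only: columns are prefixed with the
entity name (`A1.x`), `None` plays the role of NA, and `unify` is a
hypothetical helper, not part of any Arrow API.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def unify(entities: Dict[str, List[Row]]) -> Tuple[List[str], List[Row]]:
    """Merge per-entity rows into one wide table with entity-prefixed columns."""
    # First pass: collect every prefixed column, in first-seen order.
    columns: List[str] = []
    for name, rows in entities.items():
        for row in rows:
            for field in row:
                col = f"{name}.{field}"
                if col not in columns:
                    columns.append(col)

    # Second pass: emit each row with None for columns of other entities.
    merged: List[Row] = []
    for name, rows in entities.items():
        for row in rows:
            wide = {col: None for col in columns}
            for field, value in row.items():
                wide[f"{name}.{field}"] = value
            merged.append(wide)
    return columns, merged


columns, rows = unify({
    "A1": [{"x": 1}],
    "A2": [{"x": "s", "y": 2.0}],
})
```

The first pass needs every entity's fields before any row can be emitted,
which is why this approach breaks down when the unified schema cannot be
constructed before the stream is opened.
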
> >
> > Cheers,
> > Gosh
> >
> >
> > On Mon., 12 Apr. 2021, 13:45 David Li, <lidav...@apache.org> wrote:
> >
> > > Hi Gosh,
> > >
> > > There was indeed a discussion where schema evolution was proposed as a
> > > solution for another use case:
> > >
> > >
> > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > >
> > > I am curious though, what is your use case here?
> > >
> > > Best,
> > > David
> > >
> > > On 2021/04/12 10:49:00, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > Hi guys, hope you are well!
> > > >
> > > > Judging from the Flight API
> > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > and
> > > > from the documentation/examples out there, it seems like data schema is
> > > > supposed to be fixed per stream in ArrowFlight(which is also aligned with
> > > > corresponding IPC stream writers/readers).
> > > > Wondering if the community has evaluated the necessity/possibility of
> > > > supporting schema changes within a single stream(I do recall seeing a
> > > > discussion on this somewhere but can't find it)?
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > >
> >
>

