Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Gosh Arzumanyan Fri, 16 Apr 2021 07:59:43 -0700

Hi guys!

Thanks for the feedback/info.
Let me try to put a proposal together. Though I guess I'll need some
assistance on crafting it both in terms of the structure of a proposal
expected in the Arrow community as well as technical guidance.


Will share a doc with some ideas shortly so that we can start to iterate
over it.

Cheers,
Gosh

On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> > possibly in coordination with the Deephaven/Barrage team, if they're also
> > still interested
>
> Good opportunity for me to chime in =). I think we still have interest in
> this feature. On the other thread, it took a little cajoling, but I've come
> around to agree with the conclusions of taking a RecordBatch and splitting
> it up (a set of RecordBatches for added rows followed by a set of
> RecordBatches for modifications). In this case I think it's best not to
> evolve the schema between added row RecordBatches and modified row
> RecordBatches (sending empty buffer nodes and field nodes will be
> significantly cheaper). However, the schema evolution would be very useful
> for when the rpc client changes the set of columns that they are subscribed
> to (which is relatively rare compared to when the subscribed table itself
> ticks).
>
> That said, schema evolution is not yet particularly high in our queue.
>
> On Tue, Apr 13, 2021 at 9:12 AM David Li <lidav...@apache.org> wrote:
>
> > Thanks for the details. I'll note a few things, but adding schema
> > evolution to Flight is reasonable, if you'd like to put together a
> > proposal for discussion (possibly in coordination with the
> > Deephaven/Barrage team, if they're also still interested).
> >
> > >    3. Assume that there is a strong reason to query A1,..,AK together.
> >
> > While I don't know the details here, at least with Flight/gRPC, it's
> > not necessarily expensive to make several requests to the same server,
> > as gRPC will consolidate them into the same underlying network
> > connection. You could issue one GetFlightInfo request for all streams
> > at once, and get back a list of endpoints for each individual
> > subquery, which you could then issue separate DoGet requests for.
> >
> > There's a slight mismatch there in that GetFlightInfo returns a
> > FlightInfo, which assumes all endpoints have the same schema. But for
> > a specific application, you could ignore that field (nothing in Flight
> > checks that schema against the actual data).
> >
> > Of course, if said strong reason is that all the data is really
> > retrieved together despite being distinct datasets, then this would
> > complicate the server side implementation quite a bit. But it's one
> > option.
> >
> > > A potential way to address this(with the existing tools) could be having
> > > a union schema of all fields across all entities(potentially prefixed
> > > with the field name just like in sql joins) and setting the values to NA
> > > which do not belong to an entity.
> >
> > I had a similar use case in the past, and it was suggested to use
> > Arrow's Union type which handles this directly. A Union of Struct
> > types essentially lets you have multiple distinct schemas all encoded
> > in the same overall table, with explicit information about which
> > schema is currently in use. But as you point out this isn't helpful if
> > you don't know all the schemas up front.
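
To make the tagged-union idea above concrete, here is a minimal plain-Python
sketch of how a dense union keeps several row shapes in one column: a list of
type ids says which schema each slot uses, and a list of offsets points into
that schema's child list. This only mimics the layout; it does not call Arrow.
The helper names and the "trade"/"quote" schemas are hypothetical.

```python
def build_dense_union(records, type_codes):
    """Pack (schema_name, row) pairs into type ids, offsets, and child lists."""
    children = {name: [] for name in type_codes}
    type_ids, offsets = [], []
    for name, row in records:
        type_ids.append(type_codes[name])    # which schema this slot uses
        offsets.append(len(children[name]))  # position inside that child
        children[name].append(row)
    return type_ids, offsets, children


def read_slot(i, type_ids, offsets, children, type_codes):
    """Return (schema_name, row) for slot i of the union."""
    names = {code: name for name, code in type_codes.items()}
    name = names[type_ids[i]]
    return name, children[name][offsets[i]]


# Hypothetical schemas; all of them must be known before packing starts.
type_codes = {"trade": 0, "quote": 1}
records = [
    ("trade", {"price": 1.5}),
    ("quote", {"bid": 1.4, "ask": 1.6}),
    ("trade", {"price": 1.7}),
]
type_ids, offsets, children = build_dense_union(records, type_codes)
```

Note that type_codes has to be fixed before the first record is packed, which
is the limitation mentioned above: a schema that only shows up mid-stream has
no type id to use.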
> >
> > Best,
> > David
> >
> > On 2021/04/13 11:21:20, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > Hi David,
> > >
> > > Thanks for sharing the link!
> > >
> > > Here is what a potential use case might look like:
> > >
> > >    1. Assume that we have a service S which accepts expressions in some
> > >    language X.
> > >    2. Assume that a typical query to this service requests entities A_1,
> > >    A_2,..,A_K. Each of those entities generates a stream of record
> > >    batches. Record batches for a single A_I share the same schema, yet
> > >    there is no guarantee that schemas are equal across all streams.
> > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > >    4. Service generates record batches(concurrently), tags those(e.g.
> > >    with schema level metadata) and sends them over.
> > >
> > > A potential way to address this(with the existing tools) could be having
> > > a union schema of all fields across all entities(potentially prefixed
> > > with the field name just like in sql joins) and setting the values to NA
> > > which do not belong to an entity. However this solution might not work in
> > > cases where we are not able to construct the unified schema before opening
> > > the stream(e.g. in case of changes in the schema for a specific entity
> > > upon realtime input feeding or an unpredictable generator expression).
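
As a rough illustration of the union-schema workaround described above, the
plain-Python sketch below builds one wide column list with entity-prefixed
names and projects each entity's row onto it, filling the columns that belong
to other entities with None (standing in for NA). The function names and the
example entities and fields are hypothetical, and no Arrow API is used.

```python
def union_schema(schemas):
    """Return entity-prefixed column names across all entities, in order."""
    columns = []
    for entity, fields in schemas.items():
        for field in fields:
            columns.append(f"{entity}.{field}")
    return columns


def widen(entity, row, columns):
    """Project one entity's row onto the union schema, padding with None."""
    prefix = entity + "."
    out = {}
    for col in columns:
        if col.startswith(prefix):
            out[col] = row.get(col[len(prefix):])
        else:
            out[col] = None  # column belongs to another entity
    return out


# Hypothetical per-entity schemas, all known before the stream is opened.
schemas = {"A_1": ["ts", "price"], "A_2": ["ts", "name"]}
columns = union_schema(schemas)
wide_row = widen("A_2", {"ts": 5, "name": "x"}, columns)
```

The sketch also shows where the approach stops working: union_schema needs
every entity's fields before the first row is sent, so a field that appears
later cannot be represented without reopening the stream.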
> > >
> > > Cheers,
> > > Gosh
> > >
> > >
> > > On Mon., 12 Apr. 2021, 13:45 David Li, <lidav...@apache.org> wrote:
> > >
> > > > Hi Gosh,
> > > >
> > > > There was indeed a discussion where schema evolution was proposed as a
> > > > solution for another use case:
> > > >
> > > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > > >
> > > > I am curious though, what is your use case here?
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 2021/04/12 10:49:00, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > > Hi guys, hope you are well!
> > > > >
> > > > > Judging from the Flight API
> > > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > > and from the documentation/examples out there, it seems like data schema
> > > > > is supposed to be fixed per stream in ArrowFlight(which is also aligned
> > > > > with corresponding IPC stream writers/readers).
> > > > > Wondering if the community has evaluated the necessity/possibility of
> > > > > supporting schema changes within a single stream(I do recall seeing a
> > > > > discussion on this somewhere but can't find it)?
> > > > >
> > > > > Cheers,
> > > > > Gosh
> > > > >
> > > >
> > >
> >
>
>
> --
>
