Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

David Li Fri, 18 Jun 2021 09:11:58 -0700

Following up here - Gosh, did you get a chance to put something together? Do 
you need/want help on this? This would also potentially be useful for 
FlightSQL. (See the discussion on GitHub: 
https://github.com/apache/arrow/pull/9368#discussion_r572941765)


Best,
David

On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> Hi guys!
> 
> Thanks for the feedback/info.
> Let me try to put a proposal together. Though I guess I'll need some
> assistance on crafting it both in terms of the structure of a proposal
> expected in the Arrow community as well as technical guidance.
> 
> Will share a doc with some ideas shortly so that we can start to iterate
> over it.
> 
> Cheers,
> Gosh
> 
> On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind wrote:
> 
> > > possibly in coordination with the Deephaven/Barrage team, if they're also
> > > still interested
> >
> > Good opportunity for me to chime in =). I think we still have interest in
> > this feature. On the other thread, it took a little cajoling, but I've come
> > around to agree with the conclusions of taking a RecordBatch and splitting
> > it up (a set of RecordBatches for added rows followed by a set of
> > RecordBatches for modifications). In this case I think it's best not to
> > evolve the schema between added row RecordBatches and modified row
> > RecordBatches (sending empty buffer nodes and field nodes will be
> > significantly cheaper). However, the schema evolution would be very useful
> > for when the rpc client changes the set of columns that they are subscribed
> > to (which is relatively rare compared to when the subscribed table itself
> > ticks).
> >
> > That said, schema evolution is not yet particularly high in our queue.
> >
> > On Tue, Apr 13, 2021 at 9:12 AM David Li wrote:
> >
> > > Thanks for the details. I'll note a few things, but adding schema
> > > evolution to Flight is reasonable, if you'd like to put together a
> > > proposal for discussion (possibly in coordination with the
> > > Deephaven/Barrage team, if they're also still interested).
> > >
> > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > >
> > > While I don't know the details here, at least with Flight/gRPC, it's
> > > not necessarily expensive to make several requests to the same server,
> > > as gRPC will consolidate them into the same underlying network
> > > connection. You could issue one GetFlightInfo request for all streams
> > > at once, and get back a list of endpoints for each individual
> > > subquery, which you could then issue separate DoGet requests for.
> > >
> > > There's a slight mismatch there in that GetFlightInfo returns a
> > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > a specific application, you could ignore that field (nothing in Flight
> > > checks that schema against the actual data).
> > >
> > > Of course, if said strong reason is that all the data is really
> > > retrieved together despite being distinct datasets, then this would
> > > complicate the server side implementation quite a bit. But it's one
> > > option.
> > >
> > > > > A potential way to address this(with the existing tools) could be
> > > > > having a union schema of all fields across all entities(potentially
> > > > > prefixed with the field name just like in sql joins) and setting the
> > > > > values to NA which do not belong to an entity.
> > >
> > > I had a similar use case in the past, and it was suggested to use
> > > Arrow's Union type which handles this directly. A Union of Struct
> > > types essentially lets you have multiple distinct schemas all encoded
> > > in the same overall table, with explicit information about which
> > > schema is currently in use. But as you point out this isn't helpful if
> > > you don't know all the schemas up front.
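
As a rough illustration of the Union-of-Structs idea, here is a
library-independent Python sketch of a dense union layout: one child table per
schema, a per-row type id saying which schema is in use, and a per-row offset
into that child. The class name, method names, and sample rows are
hypothetical; this is not the Arrow or pyarrow API, only a model of the layout.

```python
from dataclasses import dataclass, field


@dataclass
class DenseUnionSketch:
    """Toy model of a dense union of structs (hypothetical, not the Arrow API).

    children[t] holds the rows of schema t; type_ids[i] says which schema
    row i uses, and offsets[i] is that row's position inside its child.
    """
    children: dict = field(default_factory=dict)
    type_ids: list = field(default_factory=list)
    offsets: list = field(default_factory=list)

    def append(self, type_id, row):
        # Record which schema the row belongs to and where it lands in the child.
        child = self.children.setdefault(type_id, [])
        self.type_ids.append(type_id)
        self.offsets.append(len(child))
        child.append(row)

    def __len__(self):
        return len(self.type_ids)

    def __getitem__(self, i):
        # Resolve row i through its type id and offset.
        type_id = self.type_ids[i]
        return type_id, self.children[type_id][self.offsets[i]]


union = DenseUnionSketch()
union.append(0, {"id": 1, "price": 9.5})   # row with schema 0
union.append(1, {"name": "x"})             # row with schema 1
union.append(0, {"id": 2, "price": 3.0})   # schema 0 again
```

Every schema must be registered as a child of the union type, which is why
this approach assumes the full set of schemas is known up front.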
> > >
> > > Best,
> > > David
> > >
> > > > On 2021/04/13 11:21:20, Gosh Arzumanyan wrote:
> > > > Hi David,
> > > >
> > > > Thanks for sharing the link!
> > > >
> > > > > Here is what a potential use case might look like:
> > > >
> > > >    1. Assume that we have a service S which accepts expressions in some
> > > >    language X.
> > > > >    2. Assume that a typical query to this service requests entities
> > > > >    A_1, A_2,..,A_K. Each of those entities generates a stream of record
> > > > >    batches. Record batches for a single A_I share the same schema, yet
> > > > >    there is no guarantee that schemas are equal across all streams.
> > > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > > > >    4. Service generates record batches(concurrently), tags those(e.g.
> > > > >    with schema level metadata) and sends them over.
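
The receiving side of step 4 could be sketched roughly as follows, assuming
each batch arrives paired with an entity tag. The (tag, batch) pair stands in
for schema-level metadata on a real record batch, and the function name and
sample data are hypothetical.

```python
from collections import defaultdict


def demultiplex(tagged_batches):
    """Group an interleaved stream of (entity_tag, batch) pairs by tag.

    Batches keep their arrival order within each entity. The pair
    structure is an assumption standing in for schema-level metadata.
    """
    streams = defaultdict(list)
    for tag, batch in tagged_batches:
        streams[tag].append(batch)
    return dict(streams)


# Batches from two entities with different schemas, interleaved on one stream.
interleaved = [
    ("A_1", [{"x": 1}]),
    ("A_2", [{"name": "a"}]),
    ("A_1", [{"x": 2}]),
]
per_entity = demultiplex(interleaved)
```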
> > > >
> > > > > A potential way to address this(with the existing tools) could be
> > > > > having a union schema of all fields across all entities(potentially
> > > > > prefixed with the field name just like in sql joins) and setting the
> > > > > values to NA which do not belong to an entity. However this solution
> > > > > might not work in cases where we are not able to construct the unified
> > > > > schema before opening the stream(e.g. in case of changes in the schema
> > > > > for a specific entity upon realtime input feeding or an unpredictable
> > > > > generator expression).
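
A minimal sketch of the union-schema workaround, in plain Python rather than
Arrow: columns are prefixed with the entity name (an assumption about the
intended prefix), and a row from one entity gets None in every column owned by
another entity. The function names and sample schemas are hypothetical.

```python
def unified_fields(schemas):
    """Build the union schema as (entity, field) pairs, in entity order."""
    return [(entity, name) for entity, names in schemas.items() for name in names]


def pad_row(entity, row, fields):
    """Project one entity's row onto the union schema, None where not applicable."""
    return {
        f"{owner}.{name}": row.get(name) if owner == entity else None
        for owner, name in fields
    }


# Assumed example: two entities whose schemas overlap only in "id".
schemas = {"A_1": ["id", "price"], "A_2": ["id", "name"]}
fields = unified_fields(schemas)
padded = pad_row("A_2", {"id": 7, "name": "x"}, fields)
```

Note that `unified_fields` needs every entity's schema before the first row is
sent, which is exactly the limitation described above.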
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > > >
> > > > > On Mon., 12 Apr. 2021, 13:45 David Li wrote:
> > > >
> > > > > Hi Gosh,
> > > > >
> > > > > > There was indeed a discussion where schema evolution was proposed as a
> > > > > > solution for another use case:
> > > > > >
> > > > > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > > > >
> > > > > I am curious though, what is your use case here?
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > > On 2021/04/12 10:49:00, Gosh Arzumanyan wrote:
> > > > > > Hi guys, hope you are well!
> > > > > >
> > > > > > > Judging from the Flight API
> > > > > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > > > > and from the documentation/examples out there, it seems like data
> > > > > > > schema is supposed to be fixed per stream in ArrowFlight(which is also
> > > > > > > aligned with corresponding IPC stream writers/readers).
> > > > > > > Wondering if the community has evaluated the necessity/possibility of
> > > > > > > supporting schema changes within a single stream(I do recall seeing a
> > > > > > > discussion on this somewhere but can't find it)?
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> >
>
