Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Gosh Arzumanyan Fri, 18 Jun 2021 11:38:59 -0700

Hi David,

Thanks for poking me on this. I have been thinking it over but have not got
around to crafting a doc. Let me put together a rough proposal this weekend.
Afterwards I'll need your help bringing it to a reviewable state.


Cheers,
Gosh

On Fri., 18 Jun. 2021, 18:11 David Li, <[email protected]> wrote:

> Following up here - Gosh, did you get a chance to put something together?
> Do you need/want help on this? This would also potentially be useful for
> FlightSQL. (See the discussion on GitHub:
> https://github.com/apache/arrow/pull/9368#discussion_r572941765)
>
> Best,
> David
>
> On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> > Hi guys!
> >
> > Thanks for the feedback/info.
> > Let me try to put a proposal together. Though I guess I'll need some
> > assistance on crafting it both in terms of the structure of a proposal
> > expected in the Arrow community as well as technical guidance.
> >
> > Will share a doc with some ideas shortly so that we can start to iterate
> > over it.
> >
> > Cheers,
> > Gosh
> >
> > On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <[email protected]
> > <mailto:natebauernfeind%40deephaven.io>> wrote:
> >
> > > > possibly in coordination with the Deephaven/Barrage team, if they're
> > > > also still interested
> > >
> > > Good opportunity for me to chime in =). I think we still have interest in
> > > this feature. On the other thread, it took a little cajoling, but I've come
> > > around to agree with the conclusions of taking a RecordBatch and splitting
> > > it up (a set of RecordBatches for added rows followed by a set of
> > > RecordBatches for modifications). In this case I think it's best not to
> > > evolve the schema between added row RecordBatches and modified row
> > > RecordBatches (sending empty buffer nodes and field nodes will be
> > > significantly cheaper). However, the schema evolution would be very useful
> > > for when the rpc client changes the set of columns that they are subscribed
> > > to (which is relatively rare compared to when the subscribed table itself
> > > ticks).
> > >
> > > That said, schema evolution is not yet particularly high in our queue.
> > >
> > > On Tue, Apr 13, 2021 at 9:12 AM David Li <[email protected]
> > > <mailto:lidavidm%40apache.org>> wrote:
> > >
> > > > Thanks for the details. I'll note a few things, but adding schema
> > > > evolution to Flight is reasonable, if you'd like to put together a
> > > > proposal for discussion (possibly in coordination with the
> > > > Deephaven/Barrage team, if they're also still interested).
> > > >
> > > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > > >
> > > > While I don't know the details here, at least with Flight/gRPC, it's
> > > > not necessarily expensive to make several requests to the same server,
> > > > as gRPC will consolidate them into the same underlying network
> > > > connection. You could issue one GetFlightInfo request for all streams
> > > > at once, and get back a list of endpoints for each individual
> > > > subquery, which you could then issue separate DoGet requests for.
> > > >
> > > > There's a slight mismatch there in that GetFlightInfo returns a
> > > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > > a specific application, you could ignore that field (nothing in Flight
> > > > checks that schema against the actual data).
> > > >
> > > > Of course, if said strong reason is that all the data is really
> > > > retrieved together despite being distinct datasets, then this would
> > > > complicate the server side implementation quite a bit. But it's one
> > > > option.
> > > >
> > > > > A potential way to address this (with the existing tools) could be having a
> > > > > union schema of all fields across all entities (potentially prefixed with
> > > > > the field name just like in sql joins) and setting the values to NA which
> > > > > do not belong to an entity.
> > > >
> > > > I had a similar use case in the past, and it was suggested to use
> > > > Arrow's Union type which handles this directly. A Union of Struct
> > > > types essentially lets you have multiple distinct schemas all encoded
> > > > in the same overall table, with explicit information about which
> > > > schema is currently in use. But as you point out this isn't helpful if
> > > > you don't know all the schemas up front.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 2021/04/13 11:21:20, Gosh Arzumanyan <[email protected]
> > > > <mailto:gosharz%40gmail.com>> wrote:
> > > > > Hi David,
> > > > >
> > > > > Thanks for sharing the link!
> > > > >
> > > > > Here is what a potential use case might look like:
> > > > >
> > > > >    1. Assume that we have a service S which accepts expressions in some
> > > > >    language X.
> > > > >    2. Assume that a typical query to this service requests entities A_1,
> > > > >    A_2,..,A_K. Each of those entities generates a stream of record batches.
> > > > >    Record batches for a single A_I share the same schema, yet there is no
> > > > >    guarantee that schemas are equal across all streams.
> > > > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > > > >    4. Service generates record batches (concurrently), tags those (e.g. with
> > > > >    schema level metadata) and sends them over.
> > > > >
> > > > > A potential way to address this (with the existing tools) could be having a
> > > > > union schema of all fields across all entities (potentially prefixed with
> > > > > the field name just like in sql joins) and setting the values to NA which
> > > > > do not belong to an entity. However this solution might not work in cases
> > > > > where we are not able to construct the unified schema before opening the
> > > > > stream (e.g. in case of changes in the schema for a specific entity upon
> > > > > realtime input feeding or an unpredictable generator expression).
> > > > >
> > > > > Cheers,
> > > > > Gosh
> > > > >
> > > > >
> > > > > On Mon., 12 Apr. 2021, 13:45 David Li, <[email protected]
> > > > > <mailto:lidavidm%40apache.org>> wrote:
> > > > >
> > > > > > Hi Gosh,
> > > > > >
> > > > > > There was indeed a discussion where schema evolution was proposed as a
> > > > > > solution for another use case:
> > > > > >
> > > > > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > > > > >
> > > > > > I am curious though, what is your use case here?
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On 2021/04/12 10:49:00, Gosh Arzumanyan <[email protected]
> > > > > > <mailto:gosharz%40gmail.com>> wrote:
> > > > > > > Hi guys, hope you are well!
> > > > > > >
> > > > > > > Judging from the Flight API
> > > > > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > > > > and from the documentation/examples out there, it seems like data schema is
> > > > > > > supposed to be fixed per stream in ArrowFlight (which is also aligned with
> > > > > > > corresponding IPC stream writers/readers).
> > > > > > > Wondering if the community has evaluated the necessity/possibility of
> > > > > > > supporting schema changes within a single stream (I do recall seeing a
> > > > > > > discussion on this somewhere but can't find it)?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gosh
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> >
>
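
For reference, the union-schema workaround discussed in the thread (one flat
schema covering all entities, with values that do not belong to an entity set
to NA) could look roughly like the sketch below. It is plain Python and does
not use the Arrow or Flight APIs. The entity names, the field names, the
"entity" tag column, the "entity.field" prefixing, and both helper functions
are assumptions made up for illustration, not anything proposed in the thread.

```python
from typing import Any

Row = dict[str, Any]


def union_schema(schemas: dict[str, list[str]]) -> list[str]:
    """Build one flat column list, prefixing each field with its entity name."""
    columns = ["entity"]  # hypothetical tag column: which entity a row belongs to
    for entity, fields in schemas.items():
        columns.extend(f"{entity}.{field}" for field in fields)
    return columns


def to_union_row(entity: str, row: Row, columns: list[str]) -> Row:
    """Widen one entity's row to the union schema; foreign columns stay None."""
    wide: Row = {column: None for column in columns}
    wide["entity"] = entity
    for field, value in row.items():
        key = f"{entity}.{field}"
        if key not in wide:
            # The union schema is fixed up front, so an unknown field cannot be sent.
            raise KeyError(f"{key} is not part of the union schema")
        wide[key] = value
    return wide


# Hypothetical entities with different schemas.
schemas = {"A_1": ["id", "price"], "A_2": ["id", "label"]}
columns = union_schema(schemas)
rows = [
    to_union_row("A_1", {"id": 1, "price": 9.5}, columns),
    to_union_row("A_2", {"id": 7, "label": "x"}, columns),
]
```

The KeyError branch is the limitation raised in the thread: if an entity's
schema changes after the stream has been opened, the new field has no column
in the unified schema.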
