Following up here - Gosh, did you get a chance to put something together? Do you need/want help on this? This would also potentially be useful for FlightSQL. (See the discussion on GitHub: https://github.com/apache/arrow/pull/9368#discussion_r572941765)
Best,
David

On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> Hi guys!
>
> Thanks for the feedback/info.
> Let me try to put a proposal together. Though I guess I'll need some
> assistance on crafting it both in terms of the structure of a proposal
> expected in the Arrow community as well as technical guidance.
>
> Will share a doc with some ideas shortly so that we can start to iterate
> over it.
>
> Cheers,
> Gosh
>
> On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:
>
> > > possibly in coordination with the Deephaven/Barrage team, if they're also still interested
> >
> > Good opportunity for me to chime in =). I think we still have interest in
> > this feature. On the other thread, it took a little cajoling, but I've come
> > around to agree with the conclusions of taking a RecordBatch and splitting
> > it up (a set of RecordBatches for added rows followed by a set of
> > RecordBatches for modifications). In this case I think it's best not to
> > evolve the schema between added row RecordBatches and modified row
> > RecordBatches (sending empty buffer nodes and field nodes will be
> > significantly cheaper). However, the schema evolution would be very useful
> > for when the RPC client changes the set of columns that they are subscribed
> > to (which is relatively rare compared to when the subscribed table itself
> > ticks).
> >
> > That said, schema evolution is not yet particularly high in our queue.
> >
> > On Tue, Apr 13, 2021 at 9:12 AM David Li <lidav...@apache.org> wrote:
> >
> > > Thanks for the details. I'll note a few things, but adding schema
> > > evolution to Flight is reasonable, if you'd like to put together a
> > > proposal for discussion (possibly in coordination with the
> > > Deephaven/Barrage team, if they're also still interested).
> > >
> > > > 3. Assume that there is a strong reason to query A1,..,AK together.
> > >
> > > While I don't know the details here, at least with Flight/gRPC, it's
> > > not necessarily expensive to make several requests to the same server,
> > > as gRPC will consolidate them into the same underlying network
> > > connection. You could issue one GetFlightInfo request for all streams
> > > at once, and get back a list of endpoints for each individual
> > > subquery, which you could then issue separate DoGet requests for.
> > >
> > > There's a slight mismatch there in that GetFlightInfo returns a
> > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > a specific application, you could ignore that field (nothing in Flight
> > > checks that schema against the actual data).
> > >
> > > Of course, if said strong reason is that all the data is really
> > > retrieved together despite being distinct datasets, then this would
> > > complicate the server side implementation quite a bit. But it's one
> > > option.
> > >
> > > > A potential way to address this (with the existing tools) could be having a
> > > > union schema of all fields across all entities (potentially prefixed with
> > > > the field name just like in SQL joins) and setting the values to NA which
> > > > do not belong to an entity.
> > >
> > > I had a similar use case in the past, and it was suggested to use
> > > Arrow's Union type which handles this directly. A Union of Struct
> > > types essentially lets you have multiple distinct schemas all encoded
> > > in the same overall table, with explicit information about which
> > > schema is currently in use. But as you point out this isn't helpful if
> > > you don't know all the schemas up front.
> > >
> > > Best,
> > > David
> > >
> > > On 2021/04/13 11:21:20, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > Hi David,
> > > >
> > > > Thanks for sharing the link!
> > > >
> > > > Here is what a potential use case might look like:
> > > >
> > > > 1.
Assume that we have a service S which accepts expressions in some
> > > > language X.
> > > > 2. Assume that a typical query to this service requests entities A_1,
> > > > A_2,..,A_K. Each of those entities generates a stream of record batches.
> > > > Record batches for a single A_I share the same schema, yet there is no
> > > > guarantee that schemas are equal across all streams.
> > > > 3. Assume that there is a strong reason to query A1,..,AK together.
> > > > 4. The service generates record batches (concurrently), tags them (e.g. with
> > > > schema-level metadata), and sends them over.
> > > >
> > > > A potential way to address this (with the existing tools) could be having a
> > > > union schema of all fields across all entities (potentially prefixed with
> > > > the field name just like in SQL joins) and setting the values to NA which
> > > > do not belong to an entity. However, this solution might not work in cases
> > > > where we are not able to construct the unified schema before opening the
> > > > stream (e.g. in case of changes in the schema for a specific entity upon
> > > > realtime input feeding or an unpredictable generator expression).
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > > >
> > > > On Mon., 12 Apr. 2021, 13:45 David Li, <lidav...@apache.org> wrote:
> > > >
> > > > > Hi Gosh,
> > > > >
> > > > > There was indeed a discussion where schema evolution was proposed as a
> > > > > solution for another use case:
> > > > >
> > > > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > > > >
> > > > > I am curious though, what is your use case here?
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On 2021/04/12 10:49:00, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > > > Hi guys, hope you are well!
> > > > > >
> > > > > > Judging from the Flight API
> > > > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > > > and from the documentation/examples out there, it seems like the data schema is
> > > > > > supposed to be fixed per stream in Arrow Flight (which is also aligned with
> > > > > > corresponding IPC stream writers/readers).
> > > > > > Wondering if the community has evaluated the necessity/possibility of
> > > > > > supporting schema changes within a single stream (I do recall seeing a
> > > > > > discussion on this somewhere but can't find it)?
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
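
For readers following the thread, the union-schema workaround that Gosh describes (one wide schema covering every entity, with NA for the columns that do not belong to the current entity) might look roughly like the sketch below. It is plain Python with no Arrow calls, and every name in it (`union_schema`, `widen`, the `entity` tag column, the `A1`/`A2` field lists) is hypothetical and chosen only for illustration. It also assumes the prefix is the entity name, which is one reading of the "prefixed just like in SQL joins" remark.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def union_schema(entity_fields: Dict[str, List[str]]) -> List[str]:
    """Flatten per-entity field lists into one column list.

    Each column is prefixed with its entity name (an assumption made
    for this sketch), similar to qualified column names in a SQL join.
    """
    return [
        f"{entity}.{field}"
        for entity, fields in entity_fields.items()
        for field in fields
    ]


def widen(entity: str, row: Row, columns: List[str]) -> Row:
    """Place one entity's row into the union schema.

    Columns that belong to other entities are set to None (NA). A field
    that is missing from the union schema raises KeyError, which mirrors
    the limitation raised in the thread: the full schema has to be known
    before the stream is opened.
    """
    wide: Row = {"entity": entity}  # hypothetical tag column
    wide.update({col: None for col in columns})
    for field, value in row.items():
        key = f"{entity}.{field}"
        if key not in wide:
            raise KeyError(f"{key} is not part of the union schema")
        wide[key] = value
    return wide


# Two hypothetical entities with different schemas.
fields = {"A1": ["id", "price"], "A2": ["id", "name"]}
columns = union_schema(fields)
wide_row = widen("A2", {"id": 7, "name": "x"}, columns)
```

Here `wide_row` carries `None` for `A1.id` and `A1.price`. A row with a field that was not declared up front, such as `widen("A1", {"qty": 1}, columns)`, fails, which is the case where in-stream schema evolution (or a Union of Struct types with all schemas known in advance) would be needed instead.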