Hi guys! Thanks for the feedback/info. Let me try to put a proposal together, though I guess I'll need some assistance crafting it, both in terms of the structure expected of a proposal in the Arrow community and technical guidance.
Will share a doc with some ideas shortly so that we can start to iterate over it.

Cheers,
Gosh

On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:

> > possibly in coordination with the Deephaven/Barrage team, if they're also
> > still interested
>
> Good opportunity for me to chime in =). I think we still have interest in
> this feature. On the other thread, it took a little cajoling, but I've come
> around to agree with the conclusions of taking a RecordBatch and splitting
> it up (a set of RecordBatches for added rows followed by a set of
> RecordBatches for modifications). In this case I think it's best not to
> evolve the schema between added row RecordBatches and modified row
> RecordBatches (sending empty buffer nodes and field nodes will be
> significantly cheaper). However, the schema evolution would be very useful
> for when the rpc client changes the set of columns that they are subscribed
> to (which is relatively rare compared to when the subscribed table itself
> ticks).
>
> That said, schema evolution is not yet particularly high in our queue.
>
> On Tue, Apr 13, 2021 at 9:12 AM David Li <lidav...@apache.org> wrote:
>
> > Thanks for the details. I'll note a few things, but adding schema
> > evolution to Flight is reasonable, if you'd like to put together a
> > proposal for discussion (possibly in coordination with the
> > Deephaven/Barrage team, if they're also still interested).
> >
> > > 3. Assume that there is a strong reason to query A1,..,AK together.
> >
> > While I don't know the details here, at least with Flight/gRPC, it's
> > not necessarily expensive to make several requests to the same server,
> > as gRPC will consolidate them into the same underlying network
> > connection. You could issue one GetFlightInfo request for all streams
> > at once, and get back a list of endpoints for each individual
> > subquery, which you could then issue separate DoGet requests for.
> >
> > There's a slight mismatch there in that GetFlightInfo returns a
> > FlightInfo, which assumes all endpoints have the same schema. But for
> > a specific application, you could ignore that field (nothing in Flight
> > checks that schema against the actual data).
> >
> > Of course, if said strong reason is that all the data is really
> > retrieved together despite being distinct datasets, then this would
> > complicate the server side implementation quite a bit. But it's one
> > option.
> >
> > > A potential way to address this (with the existing tools) could be
> > > having a union schema of all fields across all entities (potentially
> > > prefixed with the field name just like in sql joins) and setting the
> > > values to NA which do not belong to an entity.
> >
> > I had a similar use case in the past, and it was suggested to use
> > Arrow's Union type which handles this directly. A Union of Struct
> > types essentially lets you have multiple distinct schemas all encoded
> > in the same overall table, with explicit information about which
> > schema is currently in use. But as you point out this isn't helpful if
> > you don't know all the schemas up front.
> >
> > Best,
> > David
> >
> > On 2021/04/13 11:21:20, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > Hi David,
> > >
> > > Thanks for sharing the link!
> > >
> > > Here is what a potential use case might look like:
> > >
> > >    1. Assume that we have a service S which accepts expressions in some
> > >    language X.
> > >    2. Assume that a typical query to this service requests entities A_1,
> > >    A_2,..,A_K. Each of those entities generates a stream of record
> > >    batches. Record batches for a single A_I share the same schema, yet
> > >    there is no guarantee that schemas are equal across all streams.
> > >    3. Assume that there is a strong reason to query A1,..,AK together.
> > >    4. Service generates record batches (concurrently), tags those (e.g.
> > >    with schema level metadata) and sends them over.
> > >
> > > A potential way to address this (with the existing tools) could be
> > > having a union schema of all fields across all entities (potentially
> > > prefixed with the field name just like in sql joins) and setting the
> > > values to NA which do not belong to an entity. However this solution
> > > might not work in cases where we are not able to construct the unified
> > > schema before opening the stream (e.g. in case of changes in the schema
> > > for a specific entity upon realtime input feeding or an unpredictable
> > > generator expression).
> > >
> > > Cheers,
> > > Gosh
> > >
> > > On Mon., 12 Apr. 2021, 13:45 David Li, <lidav...@apache.org> wrote:
> > >
> > > > Hi Gosh,
> > > >
> > > > There was indeed a discussion where schema evolution was proposed as
> > > > a solution for another use case:
> > > >
> > > > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> > > >
> > > > I am curious though, what is your use case here?
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 2021/04/12 10:49:00, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > > Hi guys, hope you are well!
> > > > >
> > > > > Judging from the Flight API
> > > > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > > > and from the documentation/examples out there, it seems like data
> > > > > schema is supposed to be fixed per stream in ArrowFlight (which is
> > > > > also aligned with corresponding IPC stream writers/readers).
> > > > > Wondering if the community has evaluated the necessity/possibility
> > > > > of supporting schema changes within a single stream (I do recall
> > > > > seeing a discussion on this somewhere but can't find it)?
> > > > >
> > > > > Cheers,
> > > > > Gosh
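
For reference, the union-schema workaround described in the thread (one schema covering the fields of every entity, names prefixed per entity, and nulls where a field does not belong to the entity) could look roughly like the sketch below. It is plain Python rather than Arrow, the prefix used here is the entity name, and `union_schema`, `widen`, and the sample entities and fields are hypothetical names chosen only for illustration. As noted in the thread, this only works when every entity's fields are known before the stream is opened.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def union_schema(entity_fields: Dict[str, List[str]]) -> List[str]:
    """Build one flat column list, prefixing each field with its entity name."""
    return [
        f"{entity}.{field}"
        for entity, fields in entity_fields.items()
        for field in fields
    ]


def widen(entity: str, row: Row, schema: List[str]) -> Dict[str, Optional[Any]]:
    """Place a row of one entity into the union schema; other columns are None."""
    prefix = entity + "."
    return {
        col: row.get(col[len(prefix):]) if col.startswith(prefix) else None
        for col in schema
    }


# Hypothetical entities A1 and A2 with different fields.
entity_fields = {"A1": ["ts", "price"], "A2": ["ts", "volume"]}
schema = union_schema(entity_fields)

# A row produced by A2 gets None in every A1 column.
wide_row = widen("A2", {"ts": 5, "volume": 10}, schema)
```

The entity a row came from can be read back from which prefixed columns are non-null, which is the role the explicit type information plays in the Union-of-Struct approach David mentions.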