Some people have already left comments on https://github.com/apache/arrow/pull/12571 More eyes on it would be appreciated. If there aren't more comments, I'll try to start implementing this feature in Flight next week, and hopefully have a vote after it is supported in Java and C++/Python.
Thanks, Micah On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > I put together straw-man proposal in PR [1] for the Flight changes. > Ultimately, it seemed based on the use-cases discussed inlining the data on > the Ticket made the most sense. This might be overly complex (I'm not sure > how I feel about a enum indicating partial vs full results) but welcome > feedback. Once we get consensus on this proposal, I can add changes to > Flight SQL and try to provide reference implementations. > > [1] https://github.com/apache/arrow/pull/12571 > > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Would it make sense to make this part of DoGet since it >>> still would be returning a record batch >> >> I would lean against this. I think in many cases the client doesn't know >> the size of the data that it expects. Leaving the flexibility on the >> server side to send back inlined data when it thinks it makes sense, or a >> bunch of tickets when there is in fact a lot of data seems like the best >> option here. >> >> For cases like previewing data, you usually just want to get a small >>> amount >>> of data quickly. >> >> This is interesting and might be an additional use case. If we did >> decide to extend FlightInfo we might also want a way of annotating inlined >> data with its corresponding ticket. That way even for large results, you >> could still send back a small preview if desired. >> >> After considering it a little bit I think I'm sold that inlined data >> should not replace a ticket. So in my mind the open question is whether >> the client needs to actively opt-in to inlined data. The scenarios I could >> come with where inlined data isn't useful are: >> 1. The client is an old client and isn't aware inline data might be >> returned. In this case the main cost is of extra data on the wire and >> storing it as unknown fields [1]. >> 2. The client is a new client but still doesn't want to get inline data >> (it might want to distribute all consumption to other processes). Same >> cost is paid as option 1. >> >> Are there other scenarios? If servers choose reasonable limits on what >> data to inline, the extra complexity of negotiating with the client in this >> case might not be worth the benefits. >> >> Cheers, >> Micah >> >> >> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns >> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> wrote: >> >>> I think this would be a useful feature and be nice to have in Flight >>> core. >>> For cases like previewing data, you usually just want to get a small >>> amount >>> of data quickly. Would it make sense to make this part of DoGet since it >>> still would be returning a record batch? Perhaps a Ticket could be made >>> to >>> have an optional FlightDescriptor that would serve as an all-in-one shot? >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote: >>> >>> > I agree with something along Antoine's proposal, though: maybe we >>> should >>> > be more structured with the flags (akin to what Micah mentioned with >>> the >>> > Feature enum). >>> > >>> > Also, the flag could be embedded into the Flight SQL messages instead. >>> (So >>> > in effect, Flight would only add the capability to return data with >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide how >>> > they want to take advantage of that.) >>> > >>> > I think having a completely separate method and return type and having >>> to >>> > poll for it beforehand somewhat defeats the purpose of having it/would >>> be >>> > much harder of a transition. >>> > >>> > Also: it should be `repeated FlightInfo inline_data` right? In case we >>> > also need dictionary batches? >>> > >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: >>> > > Can we just add the following field to the FlightDescriptor message: >>> > > >>> > > bool accept_inline_data = 4; >>> > > >>> > > and this one to the FlightInfo message: >>> > > >>> > > FlightData inline_data = 100; >>> > > >>> > > Then new clients can `accept_inline_data` to true (the default being >>> > > false if omitted) to signal servers that they can put the data if >>> > > `inline_data` if deemed small enough. >>> > > >>> > > (the `accept_inline_data` field could also be used to the Criteria >>> > > message) >>> > > >>> > > >>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty >>> > > (FlightDescriptor being used in other contexts where >>> > > `accept_inline_data` makes no sense), we can instead define a new >>> > > method: >>> > > >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {} >>> > > >>> > > with: >>> > > >>> > > message GetFlightInfoRequest { >>> > > FlightDescriptor flight_descriptor = 1; >>> > > bool accept_inline_data = 2; >>> > > } >>> > > >>> > > Regards >>> > > >>> > > Antoine. >>> > > >>> > > >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: >>> > >> This seems reasonable, however we need to account for existing >>> Flight >>> > >> clients that were written before this. >>> > >> >>> > >> It seems like the server will need to still handle the ticket >>> returned >>> > for >>> > >> getStream() for clients that are unaware of the small result >>> > optimization. >>> > >> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org> >>> wrote: >>> > >> >>> > >> > Ah, that makes more sense, that would be a reasonable extension to >>> > Flight >>> > >> > overall. (While we're at it, I think it would help to have an >>> > app_metadata >>> > >> > field in FlightInfo as well.) >>> > >> > >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: >>> > >> > >> >>> > >> > >> But it seems reasonable to add a one-shot query path using >>> DoGet. >>> > >> > > >>> > >> > > >>> > >> > > I was thinking more of adding a bytes field to FlightInfo that >>> > could >>> > >> > store >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC >>> necessary >>> > for >>> > >> > > small results when executing a CMD. The client doesn't >>> necessarily >>> > know >>> > >> > > whether a query will return large or small results. >>> > >> > > >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <lidav...@apache.org> >>> > wrote: >>> > >> > > >>> > >> > >> I think the focus was on large result sets (though I don't >>> recall >>> > this >>> > >> > >> being discussed before) and supporting multi-node setups (hence >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable to >>> add >>> > a >>> > >> > >> one-shot query path using DoGet. >>> > >> > >> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: >>> > >> > >> > I saw the same. A small, stateless query ability would be >>> nice >>> > >> > >> (connection >>> > >> > >> > open, initialization, query in one message, the resultset in >>> > the >>> > >> > response >>> > >> > >> > in one message) >>> > >> > >> > >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < >>> > emkornfi...@gmail.com> >>> > >> > >> wrote: >>> > >> > >> > >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not sure >>> if >>> > I'm >>> > >> > >> missing >>> > >> > >> >> it but is there any optimization for small results? My >>> concern >>> > is >>> > >> > that >>> > >> > >> the >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the query >>> > could >>> > >> > add >>> > >> > >> >> non-trivial latency for smaller results. >>> > >> > >> >> >>> > >> > >> >> Has anybody else thought about this/investigated it? Am I >>> > >> > understanding >>> > >> > >> >> this correctly? >>> > >> > >> >> >>> > >> > >> >> Thanks, >>> > >> > >> >> Micah >>> > >> > >> >> >>> > >> > >> >>> > >> > >>> > >> >>> > >> >>> > >>> >>