I'm sorry I haven't had the time I wanted to spend on implementation, I'd still like to get to it but cannot commit to a timeline for a little bit. if anybody would like to take on the implementation work, I'm happy to review.
On Wed, Apr 27, 2022 at 9:06 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > Yes, next step is implementation which I've been delayed on. I hope to > have a little time this week to work on it and will post an update. > > On Wed, Apr 27, 2022 at 7:48 AM David Li <lidav...@apache.org> wrote: > >> Following up here - what are the next steps? The RFC PR looks fairly >> complete, maybe we can help build out implementations in C++/Java/other >> languages in preparation for a vote? >> >> On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote: >> >> >> >> The operation flow would be like this, or what would it look like? >> >> Client ---> GetFlightInfo (query/update operation in payload) ---> >> Server >> >> ---> Results (non-streamed) >> > >> > >> > This is roughly the flow I was imagining if the server chooses to send >> back >> > inlined data. >> > >> > -Micah >> > >> > On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ray.gavi...@gmail.com> >> wrote: >> > >> >> Thank you for doing this, left a few questions on the GH issue >> >> >> >> I would adopt this proposal as soon as it makes it into nightlies >> >> (or possibly earlier if it's just a matter of regenerating the proto >> >> definitions) >> >> >> >> The operation flow would be like this, or what would it look like? >> >> >> >> Client ---> GetFlightInfo (query/update operation in payload) ---> >> Server >> >> ---> Results (non-streamed) >> >> >> >> >> >> >> >> >> >> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <emkornfi...@gmail.com> >> >> wrote: >> >> >> >>> Some people have already left comments on >> >>> https://github.com/apache/arrow/pull/12571 More eyes on it would be >> >>> appreciated. If there aren't more comments, I'll try to start >> >>> implementing >> >>> this feature in Flight next week, and hopefully have a vote after it >> is >> >>> supported in Java and C++/Python. >> >>> >> >>> >> >>> Thanks, >> >>> Micah >> >>> >> >>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield < >> emkornfi...@gmail.com> >> >>> wrote: >> >>> >> >>> > I put together straw-man proposal in PR [1] for the Flight changes. >> >>> > Ultimately, it seemed based on the use-cases discussed inlining the >> >>> data on >> >>> > the Ticket made the most sense. This might be overly complex (I'm >> not >> >>> sure >> >>> > how I feel about a enum indicating partial vs full results) but >> welcome >> >>> > feedback. Once we get consensus on this proposal, I can add >> changes to >> >>> > Flight SQL and try to provide reference implementations. >> >>> > >> >>> > [1] https://github.com/apache/arrow/pull/12571 >> >>> > >> >>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield < >> emkornfi...@gmail.com> >> >>> > wrote: >> >>> > >> >>> >> Would it make sense to make this part of DoGet since it >> >>> >>> still would be returning a record batch >> >>> >> >> >>> >> I would lean against this. I think in many cases the client doesn't >> >>> know >> >>> >> the size of the data that it expects. Leaving the flexibility on >> the >> >>> >> server side to send back inlined data when it thinks it makes >> sense, >> >>> or a >> >>> >> bunch of tickets when there is in fact a lot of data seems like the >> >>> best >> >>> >> option here. >> >>> >> >> >>> >> For cases like previewing data, you usually just want to get a >> small >> >>> >>> amount >> >>> >>> of data quickly. >> >>> >> >> >>> >> This is interesting and might be an additional use case. If we did >> >>> >> decide to extend FlightInfo we might also want a way of annotating >> >>> inlined >> >>> >> data with its corresponding ticket. That way even for large >> results, >> >>> you >> >>> >> could still send back a small preview if desired. >> >>> >> >> >>> >> After considering it a little bit I think I'm sold that inlined >> data >> >>> >> should not replace a ticket. So in my mind the open question is >> >>> whether >> >>> >> the client needs to actively opt-in to inlined data. The >> scenarios I >> >>> could >> >>> >> come with where inlined data isn't useful are: >> >>> >> 1. The client is an old client and isn't aware inline data might >> be >> >>> >> returned. In this case the main cost is of extra data on the wire >> and >> >>> >> storing it as unknown fields [1]. >> >>> >> 2. The client is a new client but still doesn't want to get inline >> >>> data >> >>> >> (it might want to distribute all consumption to other processes). >> Same >> >>> >> cost is paid as option 1. >> >>> >> >> >>> >> Are there other scenarios? If servers choose reasonable limits on >> what >> >>> >> data to inline, the extra complexity of negotiating with the >> client in >> >>> this >> >>> >> case might not be worth the benefits. >> >>> >> >> >>> >> Cheers, >> >>> >> Micah >> >>> >> >> >>> >> >> >>> >> [1] >> >>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns >> >>> >> >> >>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> >> >>> wrote: >> >>> >> >> >>> >>> I think this would be a useful feature and be nice to have in >> Flight >> >>> >>> core. >> >>> >>> For cases like previewing data, you usually just want to get a >> small >> >>> >>> amount >> >>> >>> of data quickly. Would it make sense to make this part of DoGet >> since >> >>> it >> >>> >>> still would be returning a record batch? Perhaps a Ticket could be >> >>> made >> >>> >>> to >> >>> >>> have an optional FlightDescriptor that would serve as an >> all-in-one >> >>> shot? >> >>> >>> >> >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> >> wrote: >> >>> >>> >> >>> >>> > I agree with something along Antoine's proposal, though: maybe >> we >> >>> >>> should >> >>> >>> > be more structured with the flags (akin to what Micah mentioned >> with >> >>> >>> the >> >>> >>> > Feature enum). >> >>> >>> > >> >>> >>> > Also, the flag could be embedded into the Flight SQL messages >> >>> instead. >> >>> >>> (So >> >>> >>> > in effect, Flight would only add the capability to return data >> with >> >>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to >> decide >> >>> how >> >>> >>> > they want to take advantage of that.) >> >>> >>> > >> >>> >>> > I think having a completely separate method and return type and >> >>> having >> >>> >>> to >> >>> >>> > poll for it beforehand somewhat defeats the purpose of having >> >>> it/would >> >>> >>> be >> >>> >>> > much harder of a transition. >> >>> >>> > >> >>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In >> case >> >>> we >> >>> >>> > also need dictionary batches? >> >>> >>> > >> >>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: >> >>> >>> > > Can we just add the following field to the FlightDescriptor >> >>> message: >> >>> >>> > > >> >>> >>> > > bool accept_inline_data = 4; >> >>> >>> > > >> >>> >>> > > and this one to the FlightInfo message: >> >>> >>> > > >> >>> >>> > > FlightData inline_data = 100; >> >>> >>> > > >> >>> >>> > > Then new clients can `accept_inline_data` to true (the default >> >>> being >> >>> >>> > > false if omitted) to signal servers that they can put the >> data if >> >>> >>> > > `inline_data` if deemed small enough. >> >>> >>> > > >> >>> >>> > > (the `accept_inline_data` field could also be used to the >> Criteria >> >>> >>> > > message) >> >>> >>> > > >> >>> >>> > > >> >>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit >> dirty >> >>> >>> > > (FlightDescriptor being used in other contexts where >> >>> >>> > > `accept_inline_data` makes no sense), we can instead define a >> new >> >>> >>> > > method: >> >>> >>> > > >> >>> >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns >> (FlightInfo) >> >>> {} >> >>> >>> > > >> >>> >>> > > with: >> >>> >>> > > >> >>> >>> > > message GetFlightInfoRequest { >> >>> >>> > > FlightDescriptor flight_descriptor = 1; >> >>> >>> > > bool accept_inline_data = 2; >> >>> >>> > > } >> >>> >>> > > >> >>> >>> > > Regards >> >>> >>> > > >> >>> >>> > > Antoine. >> >>> >>> > > >> >>> >>> > > >> >>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 >> >>> >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: >> >>> >>> > >> This seems reasonable, however we need to account for >> existing >> >>> >>> Flight >> >>> >>> > >> clients that were written before this. >> >>> >>> > >> >> >>> >>> > >> It seems like the server will need to still handle the ticket >> >>> >>> returned >> >>> >>> > for >> >>> >>> > >> getStream() for clients that are unaware of the small result >> >>> >>> > optimization. >> >>> >>> > >> >> >>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li < >> lidav...@apache.org> >> >>> >>> wrote: >> >>> >>> > >> >> >>> >>> > >> > Ah, that makes more sense, that would be a reasonable >> >>> extension to >> >>> >>> > Flight >> >>> >>> > >> > overall. (While we're at it, I think it would help to have >> an >> >>> >>> > app_metadata >> >>> >>> > >> > field in FlightInfo as well.) >> >>> >>> > >> > >> >>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: >> >>> >>> > >> > >> >> >>> >>> > >> > >> But it seems reasonable to add a one-shot query path >> using >> >>> >>> DoGet. >> >>> >>> > >> > > >> >>> >>> > >> > > >> >>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo >> >>> that >> >>> >>> > could >> >>> >>> > >> > store >> >>> >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC >> >>> >>> necessary >> >>> >>> > for >> >>> >>> > >> > > small results when executing a CMD. The client doesn't >> >>> >>> necessarily >> >>> >>> > know >> >>> >>> > >> > > whether a query will return large or small results. >> >>> >>> > >> > > >> >>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li < >> >>> lidav...@apache.org> >> >>> >>> > wrote: >> >>> >>> > >> > > >> >>> >>> > >> > >> I think the focus was on large result sets (though I >> don't >> >>> >>> recall >> >>> >>> > this >> >>> >>> > >> > >> being discussed before) and supporting multi-node setups >> >>> (hence >> >>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems >> reasonable >> >>> to >> >>> >>> add >> >>> >>> > a >> >>> >>> > >> > >> one-shot query path using DoGet. >> >>> >>> > >> > >> >> >>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: >> >>> >>> > >> > >> > I saw the same. A small, stateless query ability >> would be >> >>> >>> nice >> >>> >>> > >> > >> (connection >> >>> >>> > >> > >> > open, initialization, query in one message, the >> resultset >> >>> in >> >>> >>> > the >> >>> >>> > >> > response >> >>> >>> > >> > >> > in one message) >> >>> >>> > >> > >> > >> >>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < >> >>> >>> > emkornfi...@gmail.com> >> >>> >>> > >> > >> wrote: >> >>> >>> > >> > >> > >> >>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm >> not >> >>> sure >> >>> >>> if >> >>> >>> > I'm >> >>> >>> > >> > >> missing >> >>> >>> > >> > >> >> it but is there any optimization for small results? >> My >> >>> >>> concern >> >>> >>> > is >> >>> >>> > >> > that >> >>> >>> > >> > >> the >> >>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing >> the >> >>> query >> >>> >>> > could >> >>> >>> > >> > add >> >>> >>> > >> > >> >> non-trivial latency for smaller results. >> >>> >>> > >> > >> >> >> >>> >>> > >> > >> >> Has anybody else thought about this/investigated >> it? Am >> >>> I >> >>> >>> > >> > understanding >> >>> >>> > >> > >> >> this correctly? >> >>> >>> > >> > >> >> >> >>> >>> > >> > >> >> Thanks, >> >>> >>> > >> > >> >> Micah >> >>> >>> > >> > >> >> >> >>> >>> > >> > >> >> >>> >>> > >> > >> >>> >>> > >> >> >>> >>> > >> >> >>> >>> > >> >>> >>> >> >>> >> >> >>> >> >> >> >