Some people have already left comments on
https://github.com/apache/arrow/pull/12571  More eyes on it would be
appreciated.  If there aren't more comments, I'll try to start implementing
this feature in Flight next week, and hopefully have a vote after it is
supported in Java and C++/Python.


Thanks,
Micah

On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I put together straw-man proposal in PR [1] for the Flight changes.
> Ultimately, it seemed based on the use-cases discussed inlining the data on
> the Ticket made the most sense.  This might be overly complex (I'm not sure
> how I feel about a enum indicating partial vs full results) but welcome
> feedback.  Once we get consensus on this proposal, I can add changes to
> Flight SQL and try to provide reference implementations.
>
> [1] https://github.com/apache/arrow/pull/12571
>
> On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Would it make sense to make this part of DoGet since it
>>> still would be returning a record batch
>>
>> I would lean against this. I think in many cases the client doesn't know
>> the size of the data that it expects.  Leaving the flexibility on the
>> server side to send back inlined data when it thinks it makes sense, or a
>> bunch of tickets when there is in fact a lot of data seems like the best
>> option here.
>>
>> For cases like previewing data, you usually just want to get a small
>>> amount
>>> of data quickly.
>>
>> This is interesting and might be an additional use case.  If we did
>> decide to extend FlightInfo we might also want a way of annotating inlined
>> data with its corresponding ticket.  That way even for large results, you
>> could still send back a small preview if desired.
>>
>> After considering it a little bit I think I'm sold that inlined data
>> should not replace a ticket.  So in my mind the open question is whether
>> the client needs to actively opt-in to inlined data.  The scenarios I could
>> come with where inlined data isn't useful are:
>> 1.  The client is an old client and isn't aware inline data might be
>> returned.  In this case the main cost is of extra data on the wire and
>> storing it as unknown fields [1].
>> 2.  The client is a new client but still doesn't want to get inline data
>> (it might want to distribute all consumption to other processes).  Same
>> cost is paid as option 1.
>>
>> Are there other scenarios?  If servers choose reasonable limits on what
>> data to inline, the extra complexity of negotiating with the client in this
>> case might not be worth the benefits.
>>
>> Cheers,
>> Micah
>>
>>
>> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>>
>> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> I think this would be a useful feature and be nice to have in Flight
>>> core.
>>> For cases like previewing data, you usually just want to get a small
>>> amount
>>> of data quickly. Would it make sense to make this part of DoGet since it
>>> still would be returning a record batch? Perhaps a Ticket could be made
>>> to
>>> have an optional FlightDescriptor that would serve as an all-in-one shot?
>>>
>>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote:
>>>
>>> > I agree with something along Antoine's proposal, though: maybe we
>>> should
>>> > be more structured with the flags (akin to what Micah mentioned with
>>> the
>>> > Feature enum).
>>> >
>>> > Also, the flag could be embedded into the Flight SQL messages instead.
>>> (So
>>> > in effect, Flight would only add the capability to return data with
>>> > FlightInfo, and it's up to applications, like Flight SQL, to decide how
>>> > they want to take advantage of that.)
>>> >
>>> > I think having a completely separate method and return type and having
>>> to
>>> > poll for it beforehand somewhat defeats the purpose of having it/would
>>> be
>>> > much harder of a transition.
>>> >
>>> > Also: it should be `repeated FlightInfo inline_data` right? In case we
>>> > also need dictionary batches?
>>> >
>>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>>> > > Can we just add the following field to the FlightDescriptor message:
>>> > >
>>> > >   bool accept_inline_data = 4;
>>> > >
>>> > > and this one to the FlightInfo message:
>>> > >
>>> > >   FlightData inline_data = 100;
>>> > >
>>> > > Then new clients can `accept_inline_data` to true (the default being
>>> > > false if omitted) to signal servers that they can put the data if
>>> > > `inline_data` if deemed small enough.
>>> > >
>>> > > (the `accept_inline_data` field could also be used to the Criteria
>>> > > message)
>>> > >
>>> > >
>>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty
>>> > > (FlightDescriptor being used in other contexts where
>>> > > `accept_inline_data` makes no sense), we can instead define a new
>>> > > method:
>>> > >
>>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {}
>>> > >
>>> > > with:
>>> > >
>>> > > message GetFlightInfoRequest {
>>> > >   FlightDescriptor flight_descriptor = 1;
>>> > >   bool accept_inline_data = 2;
>>> > > }
>>> > >
>>> > > Regards
>>> > >
>>> > > Antoine.
>>> > >
>>> > >
>>> > > On Mon, 28 Feb 2022 11:29:12 -0800
>>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote:
>>> > >> This seems reasonable, however we need to account for existing
>>> Flight
>>> > >> clients that were written before this.
>>> > >>
>>> > >> It seems like the server will need to still handle the ticket
>>> returned
>>> > for
>>> > >> getStream() for clients that are unaware of the small result
>>> > optimization.
>>> > >>
>>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org>
>>> wrote:
>>> > >>
>>> > >> > Ah, that makes more sense, that would be a reasonable extension to
>>> > Flight
>>> > >> > overall. (While we're at it, I think it would help to have an
>>> > app_metadata
>>> > >> > field in FlightInfo as well.)
>>> > >> >
>>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote:
>>> > >> > >>
>>> > >> > >> But it seems reasonable to add a one-shot query path using
>>> DoGet.
>>> > >> > >
>>> > >> > >
>>> > >> > > I was thinking more of adding a bytes field to FlightInfo that
>>> > could
>>> > >> > store
>>> > >> > > arrow data.  That way GetFlightInfo would be the only RPC
>>> necessary
>>> > for
>>> > >> > > small results when executing a CMD.  The client doesn't
>>> necessarily
>>> > know
>>> > >> > > whether a query will return large or small results.
>>> > >> > >
>>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <lidav...@apache.org>
>>> > wrote:
>>> > >> > >
>>> > >> > >> I think the focus was on large result sets (though I don't
>>> recall
>>> > this
>>> > >> > >> being discussed before) and supporting multi-node setups (hence
>>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable to
>>> add
>>> > a
>>> > >> > >> one-shot query path using DoGet.
>>> > >> > >>
>>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote:
>>> > >> > >> > I saw the same. A small, stateless query ability would be
>>> nice
>>> > >> > >> (connection
>>> > >> > >> > open, initialization, query in one message, the resultset in
>>> > the
>>> > >> > response
>>> > >> > >> > in one message)
>>> > >> > >> >
>>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield <
>>> > emkornfi...@gmail.com>
>>> > >> > >> wrote:
>>> > >> > >> >
>>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not sure
>>> if
>>> > I'm
>>> > >> > >> missing
>>> > >> > >> >> it but is there any optimization for small results?  My
>>> concern
>>> > is
>>> > >> > that
>>> > >> > >> the
>>> > >> > >> >> overhead of the RPCs for the DoGet after executing the query
>>> > could
>>> > >> > add
>>> > >> > >> >> non-trivial latency for smaller results.
>>> > >> > >> >>
>>> > >> > >> >> Has anybody else thought about this/investigated it?  Am I
>>> > >> > understanding
>>> > >> > >> >> this correctly?
>>> > >> > >> >>
>>> > >> > >> >> Thanks,
>>> > >> > >> >> Micah
>>> > >> > >> >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> > >>
>>> >
>>>
>>

Reply via email to