OK, great, that clarifies a lot - I didn't appreciate how flexible the
interpretation can be. Thanks for the doc pointer as well.
I'll power onwards with the experiments. If I encounter areas of Flight
that are still a bit rough around the edges - being in beta - then we
might be able to contribute as well.

-J

On Tue, Jun 23, 2020 at 8:34 PM David Li <li.david...@gmail.com> wrote:

> Hey Joris,
>
> Your plan sounds right for Flight. As for semantics:
>
> The descriptor and ticket format are mostly application defined. For
> instance, I think some places (Dremio?) just put a raw SQL query as
> the "cmd" of a descriptor; putting serialized JSON or Protobuf is also
> certainly fine.
>
> I'd say implementing _every_ endpoint isn't required - we don't use
> ListFlights for instance.
>
> In terms of what you described, I'd map a descriptor to a query, and a
> Flight to its execution; calling GetFlightInfo would return each
> worker in its own FlightEndpoint, and the Ticket would be something
> agreed upon by your coordinator and worker (e.g. the request and time
> range).
>
> For docs, have you seen this?
> https://arrow.apache.org/docs/format/Flight.html
> While it's labeled "Format", it contains an example of a Flight request flow.
>
> Best,
> David
>
> On 6/23/20, Joris Peeters <joris.mg.peet...@gmail.com> wrote:
> > Hello,
> >
> > I'm interested in using Flight for serving large amounts of data in a
> > parallelised manner, and am just building some Python prototypes, based on
> > https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/examples/flight
> >
> > In my use-case, we'd have a bunch of worker servers, serving a number of
> > different datasets (here called "datasetA" and "datasetB"), with some
> > additional parameters to customise a single query (e.g. a date range if the
> > dataset is a time series, but it can be other stuff too, depending on the
> > dataset).
> >
> > The idea is for clients to hit a single coordinator with their entire
> > query (e.g. datasetA + [1970, 2020]), and then get instructed to hit a
> > variety of workers, with slices of this, e.g. {worker1: (datasetA, [1970,
> > 1990)), worker2: (datasetA, [1990, 2020])}. I.e. I want to chunk up the
> > original request into a few smaller ones, to be handled by different
> > workers, which then retrieve the data from a DB and send it back to the
> > client, which aggregates.
> >
> > Although I'm prototyping from Python, this should work from a variety of
> > platforms.
> > Does that sound like something Flight should be able to do well?
> >
> > If so - what are the intended semantics for the descriptor, ticket etc.,
> > based on my previous example? I see idioms for "path" and "cmd" etc., but
> > neither really seems to fit. My query is more like some opaque JSON, e.g.
> > something you'd submit to an HTTP server. Is the idea to send a
> > string-serialisation of e.g.:
> >
> > {
> >   "dataset": "datasetA",
> >   "dateFrom": "1970-01-01",
> >   "dateTo": "2020-06-23"
> > }?
> >
> > In that case, what should listFlights return, given that the queries are
> > dynamic? Something like,
> > ["datasetA", "datasetB", ...] ?
> >
> > I guess I'm mainly struggling to understand what a descriptor, ticket and
> > flight really are, within my context - and can't really find it in the
> > docs.
> > Just a link to some good docs would obviously be great as well! I'm
> > hitting https://arrow.apache.org/docs/python/api/flight.html which is
> > largely empty. It does say "Flight is currently not distributed as part of
> > wheels or in Conda - it is only available when built from source
> > appropriately." which seems a bit pessimistic, as it appears to be present
> > in both the PyPI and conda 0.17.1 packages I checked.
> >
> > Cheers,
> > -Joris.
> >
>
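
As a rough illustration of the mapping David describes above (descriptor =
query, one endpoint per worker, ticket = that worker's slice), the
coordinator's planning step might look like the sketch below. It uses only the
Python standard library. The function names, the worker URIs and the even
split of the date range are assumptions made for illustration; none of them
come from Flight itself. In pyarrow, the command bytes would typically be
wrapped with flight.FlightDescriptor.for_command(...) and each ticket with
flight.Ticket(...) inside a flight.FlightEndpoint.

```python
import json
from datetime import date, timedelta


def split_date_range(start, end, parts):
    """Split [start, end) into `parts` contiguous half-open date ranges."""
    total_days = (end - start).days
    bounds = [start + timedelta(days=total_days * i // parts)
              for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def plan_endpoints(command, workers):
    """Turn a descriptor "cmd" (JSON bytes) into (worker, ticket bytes) pairs.

    Hypothetical helper: a coordinator's GetFlightInfo handler could call
    this and return one endpoint per pair.
    """
    query = json.loads(command)
    start = date.fromisoformat(query["dateFrom"])
    end = date.fromisoformat(query["dateTo"])
    plan = []
    for worker, (lo, hi) in zip(workers, split_date_range(start, end, len(workers))):
        # The ticket format is application defined; here it is simply the
        # original query narrowed to this worker's slice.
        ticket = json.dumps({
            "dataset": query["dataset"],
            "dateFrom": lo.isoformat(),
            "dateTo": hi.isoformat(),
        }).encode("utf-8")
        plan.append((worker, ticket))
    return plan


# Example: the query from the original mail, split over two assumed workers.
command = json.dumps({
    "dataset": "datasetA",
    "dateFrom": "1970-01-01",
    "dateTo": "2020-06-23",
}).encode("utf-8")
plan = plan_endpoints(command, ["grpc://worker1:8815", "grpc://worker2:8815"])
```

Each worker then only needs to parse its own ticket in DoGet, so the client
never has to understand the ticket contents.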
