OK, great, that clarifies a lot - I didn't appreciate how flexible the interpretation can be. Thanks for the doc pointer as well. I'll power onwards with the experiments. If I encounter areas of Flight that are still a bit rough around the edges - being in beta - then we might be able to contribute as well.
-J

On Tue, Jun 23, 2020 at 8:34 PM David Li <li.david...@gmail.com> wrote:

> Hey Joris,
>
> Your plan sounds right for Flight. As for semantics:
>
> The descriptor and ticket format are mostly application defined. For
> instance, I think some places (Dremio?) just put a raw SQL query as
> the "cmd" of a descriptor; putting serialized JSON or Protobuf is also
> certainly fine.
>
> I'd say implementing _every_ endpoint isn't required - we don't use
> ListFlights for instance.
>
> In terms of what you described, I'd map a descriptor to a query, and a
> Flight to its execution; calling GetFlightInfo would return each
> worker in its own FlightEndpoint, and the Ticket would be something
> agreed upon by your coordinator and worker (e.g. the request and time
> range).
>
> For docs, have you seen this?
> https://arrow.apache.org/docs/format/Flight.html While it's labeled
> "Format", it contains an example of a Flight request flow.
>
> Best,
> David
>
> On 6/23/20, Joris Peeters <joris.mg.peet...@gmail.com> wrote:
> > Hello,
> >
> > I'm interested in using Flight for serving large amounts of data in a
> > parallelised manner, and just building some Python prototypes, based on
> > https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/examples/flight
> >
> > In my use-case, we'd have a bunch of worker servers, serving a number of
> > different datasets (here called "datasetA" and "datasetB"), but also some
> > additional parameters to customise a single query (eg a date range if the
> > dataset is a time series, but can be other stuff too - depending on the
> > dataset).
> >
> > The idea is for clients to hit a single coordinator with their entire query
> > (eg datasetA + [1970,2020]), and then getting instructed to hit a variety
> > of workers, with slices of this, e.g. {worker1: (datasetA, [1970, 1990)),
> > worker2: (datasetA, [1990-2020])}. I.e. I want to chunk up the original
> > request in a few smaller ones, to be handled by different workers, which
> > then retrieve the data from a DB and send it back to the client, which
> > aggregates.
> >
> > Although I'm proto-typing from Python, this should work from a variety of
> > platforms.
> > Does that sound like something Flight should be able to do well?
> >
> > If so - what are the intended semantics for the descriptor and ticket etc,
> > based on my previous example? I see idioms for "path" and "cmd" etc, but
> > neither really seems to fit. My query is more like some opaque JSON, e.g.
> > something you'd submit to an HTTP server. Is the idea to send a
> > string-serialisation of e.g:
> >
> > {
> >   "dataset": "datasetA",
> >   "dateFrom": "1970-01-01",
> >   "dateTo": "2020-06-23"
> > }?
> >
> > In that case, what should listFlights return, given that the queries are
> > dynamic? Something like,
> > ["datasetA", "datasetB", ...] ?
> >
> > I guess I'm mainly struggling to understand what a descriptor, ticket and
> > flight really are, within my context - and can't really find it in the
> > docs.
> > Just a link to some good docs would obviously be great as well! I'm hitting
> > https://arrow.apache.org/docs/python/api/flight.html which is largely
> > empty. It does say "Flight is currently not distributed as part of wheels
> > or in Conda - it is only available when built from source appropriately."
> > which seems a bit pessimistic, as it appears present in both the pypi and
> > conda 0.17.1 package I checked.
> >
> > Cheers,
> > -Joris.
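
For anyone following the thread, the mapping discussed above (descriptor = query, one ticket per worker slice) might be sketched roughly as below. This is only an illustration under stated assumptions, not code from the Arrow examples: it uses the standard library alone, the helper names (make_command, split_into_tickets) and the worker addresses are made up, and the even split by days is just one possible policy. With pyarrow, the command bytes would presumably be wrapped with pyarrow.flight.FlightDescriptor.for_command and each ticket with pyarrow.flight.Ticket inside a FlightEndpoint returned from GetFlightInfo.

```python
import json
from datetime import date


def make_command(dataset, date_from, date_to):
    """Serialise a query as the opaque bytes a descriptor "cmd" could carry."""
    query = {
        "dataset": dataset,
        "dateFrom": date_from.isoformat(),
        "dateTo": date_to.isoformat(),
    }
    return json.dumps(query, sort_keys=True).encode("utf-8")


def split_into_tickets(command, workers):
    """Split the query's date range into one contiguous slice per worker.

    Returns a list of (worker_location, ticket_bytes) pairs. Whether the
    end date of a slice is inclusive or exclusive is something the
    coordinator and workers would have to agree on (assumption).
    """
    if not workers:
        raise ValueError("at least one worker is required")
    query = json.loads(command.decode("utf-8"))
    start = date.fromisoformat(query["dateFrom"]).toordinal()
    end = date.fromisoformat(query["dateTo"]).toordinal()
    n = len(workers)
    # Slice boundaries, evenly spaced in days; the last one is exactly `end`.
    bounds = [start + (end - start) * i // n for i in range(n + 1)]
    plan = []
    for worker, lo, hi in zip(workers, bounds, bounds[1:]):
        ticket = dict(
            query,
            dateFrom=date.fromordinal(lo).isoformat(),
            dateTo=date.fromordinal(hi).isoformat(),
        )
        plan.append((worker, json.dumps(ticket, sort_keys=True).encode("utf-8")))
    return plan


# Hypothetical worker locations, for illustration only.
command = make_command("datasetA", date(1970, 1, 1), date(2020, 6, 23))
plan = split_into_tickets(command, ["grpc://worker1:8815", "grpc://worker2:8815"])
for worker, ticket in plan:
    print(worker, ticket.decode("utf-8"))
```

Keeping the ticket as the same JSON shape as the original query means a worker can parse it with the same code path as the coordinator; a real setup might prefer Protobuf or add a signature so clients cannot forge tickets.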