Hi,

I would like to propose adding support for long-running
queries to Apache Arrow Flight. If you have comments on
this proposal, please share them here or on the issue for
this proposal:
  https://github.com/apache/arrow/issues/36155

Implementation in C++/Go/Java:
  https://github.com/apache/arrow/pull/36946
Documentation:
  http://crossbow.voltrondata.com/pr_docs/36946/format/Flight.html#downloading-data-by-running-a-heavy-query


This is one of the proposals in "[DISCUSS] Flight RPC/Flight
SQL/ADBC enhancements":

  https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp

See also the "Flight RPC: Long-Running Queries" section in
the design document for the proposals:

  https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit#


Changes since the original proposal:

* Removed RetryInfo.cancel_descriptor because we can use the
  existing CancelFlightInfo action.
* Renamed RetryInfo.retry_descriptor to
  RetryInfo.flight_descriptor because only one descriptor
  remains after removing RetryInfo.cancel_descriptor.


Background:

Queries generally don't complete instantly (as much as we
would like them to). So where can we put the 'query
evaluation time'?

* In GetFlightInfo: block and wait for the query to complete.
  * Con: this is a long-running blocking call, which may
    fail or time out. Then when the client retries, the
    server has to redo all the work.
  * Con: parts of the result may be ready before others, but
    the client can't do anything until everything is ready.

* In DoGet: return a fixed number of partitions.
  * Con: this makes handling worker failures hard. Systems
    like Trino support fault-tolerant execution by replacing
    workers at runtime. But GetFlightInfo has already
    returned, so we can't notify the client of new workers.
  * Con: we have to know or fix the partitioning up front.

Neither solution is optimal.


Proposal:

Add PollFlightInfo as a retryable version of GetFlightInfo.
Clients can poll the current query status and start reading
the results that are already available before the query
completes.

See the documentation for details:
  http://crossbow.voltrondata.com/pr_docs/36946/format/Flight.html#downloading-data-by-running-a-heavy-query
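
As a rough illustration of the polling flow described above,
here is a minimal, self-contained Python sketch. It does not
use any real Arrow Flight client API. FakeServer, the RetryInfo
dataclass with its endpoints field, and run_query are
hypothetical stand-ins for illustration only; just the
RetryInfo and flight_descriptor names come from this proposal.
The sketch also assumes that each poll returns the cumulative
list of ready endpoints and that a missing flight_descriptor
means the query has completed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetryInfo:
    # Hypothetical stand-in for the proposed message, not the real one.
    endpoints: List[str]               # endpoints ready so far (assumed cumulative)
    flight_descriptor: Optional[str]   # descriptor for the next poll; None when done


class FakeServer:
    """Simulates a query whose partitions become ready one per poll."""

    def __init__(self, partitions: List[str]) -> None:
        self._partitions = list(partitions)
        self._ready = 0
        self.polls = 0

    def poll_flight_info(self, descriptor: str) -> RetryInfo:
        self.polls += 1
        # One more partition becomes ready on every call.
        self._ready = min(self._ready + 1, len(self._partitions))
        done = self._ready == len(self._partitions)
        return RetryInfo(
            endpoints=self._partitions[: self._ready],
            flight_descriptor=None if done else descriptor,
        )


def run_query(server: FakeServer, descriptor: str) -> List[str]:
    """Polls until the query completes, reading each endpoint once."""
    fetched: List[str] = []
    seen = 0
    current: Optional[str] = descriptor
    while current is not None:
        info = server.poll_flight_info(current)
        # Read only the endpoints that are new since the previous poll.
        for endpoint in info.endpoints[seen:]:
            fetched.append(endpoint)  # a real client would call DoGet here
        seen = len(info.endpoints)
        current = info.flight_descriptor
    return fetched


if __name__ == "__main__":
    print(run_query(FakeServer(["p0", "p1", "p2"]), "query-1"))
```

The point of the loop is that the client reads each partition
as soon as a poll reports it, instead of waiting for one
blocking call to finish. A failed poll can be retried with the
same descriptor without restarting the query.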


Implementation:

https://github.com/apache/arrow/pull/36946 is an
implementation of this proposal. The pull request contains
the following:

1. Format changes:
   * format/Flight.proto
     https://github.com/apache/arrow/pull/36946/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba

2. Documentation changes:
   * docs/source/format/Flight.rst
     https://github.com/apache/arrow/pull/36946/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89

3. The C++ implementation and an integration test:
   * cpp/src/arrow/flight/

4. The Go implementation and an integration test:
   * go/arrow/flight/server.go
   * go/arrow/internal/flight_integration/scenario.go

5. The Java implementation and an integration test:
   * java/flight/flight-core/
   * java/flight/flight-integration-tests/


Next:

I'll start a vote on this proposal after we reach a
consensus on it.

This is the standard process for format changes.
See also:
https://arrow.apache.org/docs/dev/format/Changing.html


Thanks,
-- 
kou
