Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

Antoine Pitrou Mon, 20 Nov 2023 06:44:33 -0800

I also agree that an informal spec on "how to efficiently transfer Arrow data over HTTP" makes sense.


Probably with several aspects:
- one-shot GET data
- streaming GET
- one-shot PUT or POST
- streaming POST
- non-Arrow prologue and epilogue (for example JSON-based metadata)
- conventions for well-known headers
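
As an illustration of the simplest case in the list above (one-shot GET), here is a minimal sketch using only the Python standard library. It is not something proposed in this thread. The payload is a placeholder standing in for bytes already serialized in the Arrow IPC stream format, and the `/data` path and the `fetch_once` helper are hypothetical names chosen for the example. The media type `application/vnd.apache.arrow.stream` is the one registered for the Arrow stream format.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: placeholder for bytes already serialized in the Arrow IPC
# stream format by an Arrow library; treated as opaque here.
ARROW_STREAM_BYTES = b"<arrow-ipc-stream-bytes>"
ARROW_STREAM_MEDIA_TYPE = "application/vnd.apache.arrow.stream"


class ArrowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # One-shot GET: the full size is known up front, so send Content-Length.
        self.send_response(200)
        self.send_header("Content-Type", ARROW_STREAM_MEDIA_TYPE)
        self.send_header("Content-Length", str(len(ARROW_STREAM_BYTES)))
        self.end_headers()
        self.wfile.write(ARROW_STREAM_BYTES)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start a throwaway local server, GET the payload, return (type, body)."""
    server = HTTPServer(("127.0.0.1", 0), ArrowHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        conn.request("GET", "/data")  # hypothetical path
        response = conn.getresponse()
        content_type = response.getheader("Content-Type")
        body = response.read()
        conn.close()
        return content_type, body
    finally:
        server.shutdown()
        server.server_close()
```

A real client would hand the returned bytes to an Arrow IPC stream reader; the streaming and upload cases would need different framing than a fixed `Content-Length`.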


On 20/11/2023 at 15:23, David Li wrote:

I'm with Kou: what exactly are we trying to specify?

- The HTTP mapping of Flight RPC?
- A full, locked down RPC framework like Flight RPC, but otherwise unrelated?
- Something else?

I'd also ask: do we need to specify anything in the first place? What is stopping people 
from using Arrow in their REST APIs, and what kind of interoperability are we trying to 
achieve? I would say that Flight RPC effectively has no interoperability at all - each 
project using it has its own bespoke layers on top, and the "standardized" RPC 
methods just hinder the applications that would like more control and flexibility that 
Flight RPC does not provide. The recent additions to the Flight RPC spec speak to that: 
they were meant for Flight SQL, but needed to be implemented at the Flight RPC layer; 
there is not a real abstraction layer that Flight RPC really serves.

It could consist only of a specification for how to implement
support for exchanging Arrow-formatted data in an existing REST API.


I would say that this is the only part that might make sense: once a client has 
acquired an Arrow-aware endpoint, what should be the format of the Arrow data 
it gets (whether this is just the Arrow stream format, or something fancier 
like FlightData in Flight RPC).
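
One way an existing endpoint could offer the plain Arrow stream format alongside its current responses is ordinary HTTP content negotiation. The sketch below is an assumption for illustration, not something agreed in this thread. `choose_format` and `parse_accept` are hypothetical helpers, and the parsing is deliberately simplified: it honours `q` values and `*/*` but ignores partial wildcards such as `application/*` and the specificity rules of the HTTP spec.

```python
ARROW_STREAM = "application/vnd.apache.arrow.stream"
JSON = "application/json"


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so equal weights keep the client's original order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_format(accept_header, offered=(ARROW_STREAM, JSON)):
    """Return the offered media type the client prefers, or None (HTTP 406)."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type == "*/*":
            return offered[0]  # server preference when the client has none
        if media_type in offered:
            return media_type
    return None
```

For example, a client sending `Accept: application/vnd.apache.arrow.stream, application/json;q=0.5` would be served the Arrow stream, while a client that only sends `Accept: application/json` keeps getting JSON.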

Separately, it might make sense to define how GraphQL works with Arrow, or 
other specific, full protocols/APIs. But I'm not sure there's much room for a 
Flight RPC equivalent for HTTP/1, if Flight RPC on its own really ever made 
sense as a full framework/protocol in the first place.

On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:

I know that a number of folks I work with and I would be interested in
this.

gRPC is a bit of a barrier for a lot of services.
Having a spec for doing Arrow over HTTP APIs would be solid.

In my opinion, it doesn't necessarily need to be REST-ful.
Something like JSON-RPC might fit well with the existing model for Arrow
over the wire that's been implemented in things like Flight/FlightSQL.

Something else I've been interested in (I think Matt Topol has done work in
this area) is Arrow over GraphQL, too:
GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
<https://www.youtube.com/watch?v=5N97TzY_tis>

On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <[email protected]> wrote:

Hi Kou,

I think it is too early to make a specific proposal. I hope to use this
discussion to collect more information about existing approaches. If
several viable approaches emerge from this discussion, then I think we
should make a document listing them, like you suggest.

Thank you for the information about Groonga. This type of straightforward
HTTP-based approach would work in the context of a REST API, as I
understand it.

But how is the performance? Have you measured the throughput of this
approach to see if it is comparable to using Flight SQL? Is this approach
able to saturate a fast network connection?

And what about the case in which the server wants to begin sending batches
to the client before the total number of result batches / records is known?
Would this approach work in that case? I think so but I am not sure.
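
For what it is worth, HTTP/1.1 chunked transfer coding does not require the total length up front: each chunk carries its own size, and a zero-length chunk ends the body. The sketch below shows only that framing, using the standard library. `record_batches` is a hypothetical producer whose placeholder byte strings stand in for encoded Arrow messages, and whether chunk boundaries should line up with record batches is an open question, not something this example settles.

```python
def chunked_body(batches):
    """Yield HTTP/1.1 chunked transfer-coding frames for an iterable of bytes."""
    for batch in batches:
        if not batch:
            continue  # a zero-length chunk would terminate the body early
        # Each chunk is: size in hex, CRLF, data, CRLF
        yield b"%x\r\n" % len(batch) + batch + b"\r\n"
    # A zero-length chunk (with no trailers) marks the end of the body
    yield b"0\r\n\r\n"


def record_batches():
    """Hypothetical producer: yields encoded messages as they become ready."""
    yield b"schema-message"
    yield b"batch-1"
    yield b"batch-2"


wire_bytes = b"".join(chunked_body(record_batches()))
```

Because the generator is consumed lazily, a server could write each frame to the socket as soon as the corresponding batch is produced, without knowing how many batches will follow.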

If this HTTP-based type of approach is sufficiently performant and it works
in a sufficient proportion of the envisioned use cases, then perhaps the
proposed spec / protocol could be based on this approach. If so, then we
could refocus this discussion on which best practices to incorporate /
recommend, such as:
- server should not return the result data in the body of a response to a
query request; instead server should return a response body that gives
URI(s) at which clients can GET the result data
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
- support range requests (Accept-Ranges: bytes) to allow clients to request
result ranges (or not?)
- recommendations about compression
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
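
To make the range-request item above concrete, here is a minimal sketch of how a server might resolve a single `Range: bytes=...` header against a stored result. `resolve_range` and `partial_response` are hypothetical helpers, multi-range requests are not handled, and a malformed header is simply treated as unsatisfiable. One caveat that may bear on the "(or not?)": an arbitrary byte range of an Arrow stream is generally not decodable on its own, since a reader needs the schema message and whole record batch messages.

```python
import re

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def resolve_range(range_header, total_size):
    """Resolve a single-range header to (start, end_inclusive), or None."""
    match = _RANGE_RE.match(range_header.strip())
    if not match:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix range: the final N bytes of the representation
        length = int(last)
        if length == 0 or total_size == 0:
            return None
        return max(total_size - length, 0), total_size - 1
    start = int(first)
    if start >= total_size:
        return None
    # An open-ended or overlong range is clamped to the last byte
    end = total_size - 1 if last == "" else min(int(last), total_size - 1)
    if end < start:
        return None
    return start, end


def partial_response(body, range_header):
    """Return (status, headers, payload) for a ranged GET over `body`."""
    resolved = resolve_range(range_header, len(body))
    if resolved is None:
        # 416 Range Not Satisfiable reports the full size to the client
        return 416, {"Content-Range": f"bytes */{len(body)}"}, b""
    start, end = resolved
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{len(body)}",
    }
    return 206, headers, body[start:end + 1]
```

A client could then issue several ranged GETs in parallel over separate connections and concatenate the payloads before decoding, which ties in with the multiple-connections item in the list.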

On the other hand, if the performance and functionality of this HTTP-based
type of approach is not sufficient, then we might consider fundamentally
different approaches.

Ian
