I'm with Kou: what exactly are we trying to specify?

- The HTTP mapping of Flight RPC?
- A full, locked down RPC framework like Flight RPC, but otherwise unrelated?
- Something else?

I'd also ask: do we need to specify anything in the first place? What is 
stopping people from using Arrow in their REST APIs, and what kind of 
interoperability are we trying to achieve? I would say that Flight RPC 
effectively has no interoperability at all - each project using it has its own 
bespoke layers on top, and the "standardized" RPC methods just hinder 
applications that want more control and flexibility than Flight RPC 
provides. The recent additions to the Flight RPC spec speak to that: they 
were meant for Flight SQL, but had to be implemented at the Flight RPC 
layer, so Flight RPC is not really serving as an abstraction layer.

> It could consist only of a specification for how to implement
> support for exchanging Arrow-formatted data in an existing REST API.

I would say that this is the only part that might make sense: once a client has 
acquired an Arrow-aware endpoint, what should be the format of the Arrow data 
it gets (whether this is just the Arrow stream format, or something fancier 
like FlightData in Flight RPC).
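
To make that concrete, here is a minimal sketch (Python standard library only) 
of an existing endpoint that keeps its JSON response but also returns an Arrow 
stream when the client asks for one via the Accept header. The media type 
application/vnd.apache.arrow.stream is the registered one for the Arrow stream 
format; the /hypothetical/results path, the ROWS data, and the placeholder 
payload bytes are assumptions for illustration only. A real service would 
produce the body with an Arrow library's IPC stream writer.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ARROW_STREAM = "application/vnd.apache.arrow.stream"

# Placeholder standing in for Arrow IPC stream bytes (assumption: a real
# service would generate these with an Arrow library, not hard-code them).
FAKE_ARROW_PAYLOAD = b"<arrow ipc stream bytes>"
ROWS = [{"id": 1}, {"id": 2}]  # hypothetical result set


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve Arrow only when the client explicitly asks for it;
        # otherwise fall back to the API's existing JSON representation.
        if ARROW_STREAM in self.headers.get("Accept", ""):
            body, ctype = FAKE_ARROW_PAYLOAD, ARROW_STREAM
        else:
            body, ctype = json.dumps(ROWS).encode(), "application/json"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Suppress per-request logging to keep the example quiet.
        pass


def fetch(port, accept):
    """GET the hypothetical endpoint and return (content type, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/hypothetical/results", headers={"Accept": accept})
        resp = conn.getresponse()
        return resp.getheader("Content-Type"), resp.read()
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_port
        return fetch(port, ARROW_STREAM), fetch(port, "application/json")
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() issues the same GET twice with different Accept headers, which 
is about all the "specification" such an approach would need beyond the 
payload format itself.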

Separately, it might make sense to define how GraphQL works with Arrow, or 
other specific, full protocols/APIs. But I'm not sure there's much room for a 
Flight RPC equivalent for HTTP/1, if Flight RPC on its own really ever made 
sense as a full framework/protocol in the first place.

On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:
> I know that myself and a number of folks I work with would be interested in
> this.
>
> gRPC is a bit of a barrier for a lot of services.
> Having a spec for doing Arrow over HTTP APIs would be solid.
>
> In my opinion, it doesn't necessarily need to be REST-ful.
> Something like JSON-RPC might fit well with the existing model for Arrow
> over the wire that's been implemented in things like Flight/FlightSQL.
>
> Something else I've been interested in (I think Matt Topol has done work in
> this area) is Arrow over GraphQL, too:
> GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
> <https://www.youtube.com/watch?v=5N97TzY_tis>
>
> On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <ianmc...@apache.org> wrote:
>
>> Hi Kou,
>>
>> I think it is too early to make a specific proposal. I hope to use this
>> discussion to collect more information about existing approaches. If
>> several viable approaches emerge from this discussion, then I think we
>> should make a document listing them, like you suggest.
>>
>> Thank you for the information about Groonga. This type of straightforward
>> HTTP-based approach would work in the context of a REST API, as I
>> understand it.
>>
>> But how is the performance? Have you measured the throughput of this
>> approach to see if it is comparable to using Flight SQL? Is this approach
>> able to saturate a fast network connection?
>>
>> And what about the case in which the server wants to begin sending batches
>> to the client before the total number of result batches / records is known?
>> Would this approach work in that case? I think so but I am not sure.
>>
>> If this HTTP-based type of approach is sufficiently performant and it works
>> in a sufficient proportion of the envisioned use cases, then perhaps the
>> proposed spec / protocol could be based on this approach. If so, then we
>> could refocus this discussion on which best practices to incorporate /
>> recommend, such as:
>> - server should not return the result data in the body of a response to a
>> query request; instead server should return a response body that gives
>> URI(s) at which clients can GET the result data
>> - transmit result data in chunks (Transfer-Encoding: chunked), with
>> recommendations about chunk size
>> - support range requests (Accept-Ranges: bytes) to allow clients to request
>> result ranges (or not?)
>> - recommendations about compression
>> - recommendations about TCP receive window size
>> - recommendation to open multiple TCP connections on very fast networks
>> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>>
>> On the other hand, if the performance and functionality of this HTTP-based
>> type of approach is not sufficient, then we might consider fundamentally
>> different approaches.
>>
>> Ian
>>
