I know that I, along with a number of folks I work with, would be
interested in this.

gRPC is a bit of a barrier for a lot of services.
Having a spec for doing Arrow over HTTP APIs would be solid.

In my opinion, it doesn't necessarily need to be RESTful.
Something like JSON-RPC might fit well with the existing model for Arrow
over the wire that's been implemented in things like Flight/FlightSQL.

Something else I've been interested in (I think Matt Topol has done work in
this area) is Arrow over GraphQL, too:
GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
<https://www.youtube.com/watch?v=5N97TzY_tis>

On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <ianmc...@apache.org> wrote:

> Hi Kou,
>
> I think it is too early to make a specific proposal. I hope to use this
> discussion to collect more information about existing approaches. If
> several viable approaches emerge from this discussion, then I think we
> should make a document listing them, like you suggest.
>
> Thank you for the information about Groonga. This type of straightforward
> HTTP-based approach would work in the context of a REST API, as I
> understand it.
>
> But how is the performance? Have you measured the throughput of this
> approach to see if it is comparable to using Flight SQL? Is this approach
> able to saturate a fast network connection?
>
> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?
> Would this approach work in that case? I think so but I am not sure.
>
> If this HTTP-based type of approach is sufficiently performant and it works
> in a sufficient proportion of the envisioned use cases, then perhaps the
> proposed spec / protocol could be based on this approach. If so, then we
> could refocus this discussion on which best practices to incorporate /
> recommend, such as:
> - server should not return the result data in the body of a response to a
> query request; instead server should return a response body that gives
> URI(s) at which clients can GET the result data
> - transmit result data in chunks (Transfer-Encoding: chunked), with
> recommendations about chunk size
> - support range requests (Accept-Ranges: bytes) to allow clients to request
> result ranges (or not?)
> - recommendations about compression
> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> On the other hand, if the performance and functionality of this HTTP-based
> type of approach is not sufficient, then we might consider fundamentally
> different approaches.
>
> Ian
>
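
To make the "Transfer-Encoding: chunked" item in the quoted list a bit more
concrete, here is a minimal sketch of HTTP/1.1 chunk framing using only the
Python standard library. The payloads are placeholder byte strings standing in
for serialized Arrow record batches (no Arrow library is used), and the
function names are hypothetical. The point is that each chunk carries its own
length, so a server can start sending before it knows how many batches the
result will contain.

```python
from typing import Iterable, Iterator, List


def encode_chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each payload as one HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    for part in parts:
        if part:  # a zero-length chunk would terminate the body early, so skip it
            yield b"%x\r\n" % len(part) + part + b"\r\n"
    yield b"0\r\n\r\n"  # terminating chunk, with no trailer fields


def decode_chunked(body: bytes) -> List[bytes]:
    """Split a complete chunked body back into its payloads.

    Chunk extensions are discarded and trailer fields are not parsed.
    """
    parts = []
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";")[0], 16)
        if size == 0:
            return parts
        start = eol + 2
        parts.append(body[start:start + size])
        pos = start + size + 2  # skip the CRLF that follows the chunk data


def fake_batches() -> Iterator[bytes]:
    """Placeholder for a lazily produced result; real code would yield Arrow bytes."""
    for i in range(3):
        yield b"batch-%d" % i


# The generator is consumed lazily, so the total count is never needed up front.
wire = b"".join(encode_chunked(fake_batches()))
received = decode_chunked(wire)
```

In practice the HTTP server or client library usually does this framing itself,
so an application would hand it an iterator of payloads rather than build the
chunks by hand. Chunk size and how chunks map to record batches are exactly the
kind of thing the recommendations above would need to settle.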
