I know that a number of folks I work with and I would be interested in this.
gRPC is a bit of a barrier for a lot of services. Having a spec for doing Arrow over HTTP APIs would be solid.

In my opinion, it doesn't necessarily need to be RESTful. Something like JSON-RPC might fit well with the existing model for Arrow over the wire that's been implemented in things like Flight/Flight SQL.

Something else I've been interested in (I think Matt Topol has done work in this area) is Arrow over GraphQL, too:

GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
<https://www.youtube.com/watch?v=5N97TzY_tis>

On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <ianmc...@apache.org> wrote:

> Hi Kou,
>
> I think it is too early to make a specific proposal. I hope to use this
> discussion to collect more information about existing approaches. If
> several viable approaches emerge from this discussion, then I think we
> should make a document listing them, like you suggest.
>
> Thank you for the information about Groonga. This type of straightforward
> HTTP-based approach would work in the context of a REST API, as I
> understand it.
>
> But how is the performance? Have you measured the throughput of this
> approach to see if it is comparable to using Flight SQL? Is this approach
> able to saturate a fast network connection?
>
> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?
> Would this approach work in that case? I think so but I am not sure.
>
> If this HTTP-based type of approach is sufficiently performant and it works
> in a sufficient proportion of the envisioned use cases, then perhaps the
> proposed spec / protocol could be based on this approach.
> If so, then we could refocus this discussion on which best practices to
> incorporate / recommend, such as:
> - server should not return the result data in the body of a response to a
> query request; instead server should return a response body that gives
> URI(s) at which clients can GET the result data
> - transmit result data in chunks (Transfer-Encoding: chunked), with
> recommendations about chunk size
> - support range requests (Accept-Range: bytes) to allow clients to request
> result ranges (or not?)
> - recommendations about compression
> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> On the other hand, if the performance and functionality of this HTTP-based
> type of approach is not sufficient, then we might consider fundamentally
> different approaches.
>
> Ian
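
To make the first two items in Ian's list a bit more concrete, here is a rough sketch of what that pattern could look like, using only the Python standard library. Everything specific in it is an assumption made for illustration: the `/query` and `/results/1` paths, the `result_uris` JSON field, and the `FAKE_BATCHES` placeholder bytes, which stand in for real Arrow IPC stream data. It is not a proposed spec, just one way the "return URIs, then GET chunked results" idea might be wired up.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical stand-in for serialized Arrow record batches; a real server
# would write Arrow IPC stream bytes here instead.
FAKE_BATCHES = [b"batch-0;", b"batch-1;", b"batch-2;"]


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked transfer coding needs HTTP/1.1

    def log_message(self, *args):  # keep the demo quiet
        pass

    def do_POST(self):
        # Query request: drain the body, then reply with result URI(s)
        # rather than with the result data itself.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = json.dumps({"result_uris": ["/results/1"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path != "/results/1":
            self.send_error(404)
            return
        # Result request: send batches one chunk at a time, without
        # declaring the total length (or batch count) up front.
        self.send_response(200)
        self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for batch in FAKE_BATCHES:
            self.wfile.write(b"%x\r\n%s\r\n" % (len(batch), batch))
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body


def run_demo():
    # Port 0 lets the OS pick a free port for this local demo.
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    try:
        req = urllib.request.Request(
            base + "/query", data=b'{"sql": "SELECT 1"}', method="POST"
        )
        with urllib.request.urlopen(req) as resp:
            uris = json.load(resp)["result_uris"]
        with urllib.request.urlopen(base + uris[0]) as resp:
            encoding = resp.headers.get("Transfer-Encoding")
            payload = resp.read()  # http.client reassembles the chunks
    finally:
        server.shutdown()
        server.server_close()
    return uris, encoding, payload


if __name__ == "__main__":
    print(run_demo())
```

Because the server never sends a Content-Length for the result, it can start writing batches before it knows how many there will be, which is the case Ian asks about. Range requests, compression, and multiple connections are left out of this sketch entirely.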