This is a continuation of the discussion entitled "[DISCUSS] Protocol for exchanging Arrow data over REST APIs". See the previous messages at https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
To inform this discussion, I created some basic Arrow-over-HTTP client and server examples here:
https://github.com/apache/arrow/pull/39081

My intention is to expand and improve this set of examples (with your help) until they reflect a set of conventions that we are comfortable documenting as recommendations.

Please take a look and add comments / suggestions in the PR.

Thanks,
Ian

On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington <de...@voltrondata.com.invalid> wrote:

> I also think a set of best practices for Arrow over HTTP would be a
> valuable resource for the community...even if it never becomes a
> specification of its own, it will be beneficial for API developers and
> consumers of those APIs to have a place to look to understand how
> Arrow can help improve throughput/latency/maybe other things. Possibly
> something like httpbin.org but for requests/responses that use Arrow
> would be helpful as well. Thank you Ian for leading this effort!
>
> It has mostly been covered already, but in the (ubiquitous) situation
> where a response contains some schema/table and some non-schema/table
> information there is some tension between throughput (best served by a
> JSON response plus one or more IPC stream responses) and latency (best
> served by a single HTTP response? JSON? IPC with metadata/header?). In
> addition to Antoine's list, I would add:
>
> - How to serve the same table in multiple requests (e.g., to saturate
>   a network connection, or because separate worker nodes are generating
>   results anyway).
> - How to inline a small schema/table into a single request with other
>   metadata (I have seen this done as base64-encoded IPC in JSON, but
>   perhaps there is a better way)
>
> If anybody is interested in experimenting, I repurposed a previous
> experiment I had as a flask app that can stream IPC to a client:
>
> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
> .
>
> > - recommendations about compression
>
> Just a note that there is also Content-Encoding: gzip (for consumers
> like Arrow JS that don't currently support buffer compression but that
> can leverage the facilities of the browser/http library)
>
> Cheers!
>
> -dewey
>
>
> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> > Hi,
> >
> > > But how is the performance?
> >
> > It's faster than the original JSON based API.
> >
> > I implemented Apache Arrow support for a C# client. So I
> > measured only with Apache Arrow C# but the Apache Arrow
> > based API is faster than JSON based API.
> >
> > > Have you measured the throughput of this approach to see
> > > if it is comparable to using Flight SQL?
> >
> > Sorry. I didn't measure the throughput. In the case, elapsed
> > time of one request/response pair is important than
> > throughput. And it was faster than JSON based API and enough
> > performance.
> >
> > I couldn't compare to a Flight SQL based approach because
> > Groonga doesn't support Flight SQL yet.
> >
> > > Is this approach able to saturate a fast network
> > > connection?
> >
> > I think that we can't measure this with the Groonga case
> > because the Groonga case doesn't send data without
> > stopping. Here is one of request patterns:
> >
> > 1. Groonga has log data partitioned by day
> > 2. Groonga does full text search against one partition (2023-11-01)
> > 3. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 4. Groonga does full text search against the next partition (2023-11-02)
> > 5. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 6. ...
> >
> > In the case, the result data aren't always sending. (search
> > -> send -> search -> send -> ...) So it doesn't saturate a
> > fast network connection.
> >
> > (3. and 4. can be parallel but it's not implemented yet.)
> >
> > If we optimize this approach, this approach may be able to
> > saturate a fast network connection.
> >
> > > And what about the case in which the server wants to begin sending batches
> > > to the client before the total number of result batches / records is known?
> >
> > Ah, sorry. I forgot to explain the case. Groonga uses the
> > above approach for it.
> >
> > > - server should not return the result data in the body of a response to a
> > >   query request; instead server should return a response body that gives
> > >   URI(s) at which clients can GET the result data
> >
> > If we want to do this, the standard "Location" HTTP headers
> > may be suitable.
> >
> > > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > >   recommendations about chunk size
> >
> > Ah, sorry. I forgot to explain this case too. Groonga uses
> > "Transfer-Encoding: chunked". But recommended chunk size may
> > be case-by-case... If a server can produce enough data as
> > fast as possible, larger chunk size may be
> > faster. Otherwise, larger chunk size may be slower.
> >
> > > - support range requests (Accept-Range: bytes) to allow clients to request
> > >   result ranges (or not?)
> >
> > In the Groonga case, it's not supported. Because Groonga
> > drops the result after one request/response pair. Groonga
> > can't return only the specified range result after the
> > response is returned.
> >
> > > - recommendations about compression
> >
> > In the case that network is the bottleneck, LZ4 or Zstandard
> > compression will improve total performance.
> >
> > > - recommendations about TCP receive window size
> > > - recommendation to open multiple TCP connections on very fast networks
> > >   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >
> > HTTP/3 may be better for these cases.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
> > "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on Sat, 18 Nov 2023 13:51:53 -0500,
> > Ian Cook <ianmc...@apache.org> wrote:
> >
> > > Hi Kou,
> > >
> > > I think it is too early to make a specific proposal. I hope to use this
> > > discussion to collect more information about existing approaches. If
> > > several viable approaches emerge from this discussion, then I think we
> > > should make a document listing them, like you suggest.
> > >
> > > Thank you for the information about Groonga. This type of straightforward
> > > HTTP-based approach would work in the context of a REST API, as I
> > > understand it.
> > >
> > > But how is the performance? Have you measured the throughput of this
> > > approach to see if it is comparable to using Flight SQL? Is this approach
> > > able to saturate a fast network connection?
> > >
> > > And what about the case in which the server wants to begin sending batches
> > > to the client before the total number of result batches / records is known?
> > > Would this approach work in that case? I think so but I am not sure.
> > >
> > > If this HTTP-based type of approach is sufficiently performant and it works
> > > in a sufficient proportion of the envisioned use cases, then perhaps the
> > > proposed spec / protocol could be based on this approach.
> > > If so, then we
> > > could refocus this discussion on which best practices to incorporate /
> > > recommend, such as:
> > > - server should not return the result data in the body of a response to a
> > >   query request; instead server should return a response body that gives
> > >   URI(s) at which clients can GET the result data
> > > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > >   recommendations about chunk size
> > > - support range requests (Accept-Range: bytes) to allow clients to request
> > >   result ranges (or not?)
> > > - recommendations about compression
> > > - recommendations about TCP receive window size
> > > - recommendation to open multiple TCP connections on very fast networks
> > >   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> > >
> > > On the other hand, if the performance and functionality of this HTTP-based
> > > type of approach is not sufficient, then we might consider fundamentally
> > > different approaches.
> > >
> > > Ian
>
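
For readers following the "Transfer-Encoding: chunked" point in the thread above, here is a minimal sketch of how HTTP/1.1 chunked framing lets a server send each piece of a result as soon as it is ready, without knowing the total size in advance. It uses only the Python standard library. The payloads are placeholder byte strings standing in for serialized Arrow IPC stream messages, and the names encode_chunked and decode_chunked are hypothetical; they are not taken from the PR or from any Arrow library. The decoder assumes a complete body with no chunk extensions or trailers.

```python
from typing import Iterable, Iterator, List


def encode_chunked(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each payload as one HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    for payload in payloads:
        if not payload:
            # A zero-length chunk would be read as the end of the body.
            continue
        size_line = f"{len(payload):X}".encode("ascii")
        yield size_line + b"\r\n" + payload + b"\r\n"
    # Last-chunk marker followed by an empty trailer section.
    yield b"0\r\n\r\n"


def decode_chunked(body: bytes) -> List[bytes]:
    """Split a complete chunked body back into its payloads.

    Assumption: no chunk extensions and no trailer fields are present.
    """
    payloads: List[bytes] = []
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end], 16)
        if size == 0:
            return payloads
        start = line_end + 2
        payloads.append(body[start:start + size])
        # Skip the chunk data and the CRLF that follows it.
        pos = start + size + 2


# Placeholder bytes; a real server would yield Arrow IPC stream messages here.
example_payloads = [b"schema-message", b"record-batch-1", b"record-batch-2"]
wire_bytes = b"".join(encode_chunked(example_payloads))
round_tripped = decode_chunked(wire_bytes)
```

In practice the HTTP server or client library usually performs this framing itself; the sketch only illustrates why the number of batches does not need to be known before the first one is sent.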