I also think a set of best practices for Arrow over HTTP would be a valuable resource for the community. Even if it never becomes a specification of its own, it would give API developers and consumers of those APIs a place to look to understand how Arrow can help improve throughput, latency, and possibly other things. Something like httpbin.org, but for requests/responses that use Arrow, would be helpful as well. Thank you Ian for leading this effort!
It has mostly been covered already, but in the (ubiquitous) situation where a
response contains some schema/table information and some non-schema/table
information, there is some tension between throughput (best served by a JSON
response plus one or more IPC stream responses) and latency (best served by a
single HTTP response? JSON? IPC with metadata/header?).

In addition to Antoine's list, I would add:
- How to serve the same table in multiple requests (e.g., to saturate a
  network connection, or because separate worker nodes are generating
  results anyway).
- How to inline a small schema/table into a single request with other
  metadata (I have seen this done as base64-encoded IPC in JSON, but perhaps
  there is a better way; a rough sketch follows below).
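To make the second point concrete, here is a rough, untested sketch of the
base64-in-JSON approach using pyarrow (the JSON field names are made up for
illustration, not a proposal):

    import base64
    import json

    import pyarrow as pa
    import pyarrow.ipc

    def table_to_json_payload(table, metadata):
        # Serialize the table to the Arrow IPC streaming format in memory.
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)
        ipc_bytes = sink.getvalue().to_pybytes()
        # Base64-encode the IPC bytes so they can ride inside ordinary JSON
        # next to whatever other metadata the response carries.
        return json.dumps({
            "metadata": metadata,
            "arrow_ipc_base64": base64.b64encode(ipc_bytes).decode("ascii"),
        })

    def table_from_json_payload(payload):
        doc = json.loads(payload)
        ipc_bytes = base64.b64decode(doc["arrow_ipc_base64"])
        table = pa.ipc.open_stream(ipc_bytes).read_all()
        return table, doc["metadata"]

The ~33% size overhead of base64 is probably tolerable for a small
schema/table, but it is one reason a better-standardized alternative would
be nice.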
If anybody is interested in experimenting, I repurposed a previous experiment
of mine as a flask app that can stream IPC to a client:
https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files .

> - recommendations about compression

Just a note that there is also Content-Encoding: gzip (for consumers like
Arrow JS that don't currently support buffer compression but that can
leverage the facilities of the browser/http library).

Cheers!
-dewey

On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
>
> Hi,
>
> > But how is the performance?
>
> It's faster than the original JSON based API.
>
> I implemented Apache Arrow support for a C# client. So I
> measured only with Apache Arrow C#, but the Apache Arrow
> based API is faster than the JSON based API.
>
> > Have you measured the throughput of this approach to see
> > if it is comparable to using Flight SQL?
>
> Sorry, I didn't measure the throughput. In this case, the
> elapsed time of one request/response pair is more important
> than throughput. It was faster than the JSON based API and
> the performance was sufficient.
>
> I couldn't compare to a Flight SQL based approach because
> Groonga doesn't support Flight SQL yet.
>
> > Is this approach able to saturate a fast network
> > connection?
>
> I think that we can't measure this with the Groonga case
> because Groonga doesn't send data continuously. Here is one
> of the request patterns:
>
> 1. Groonga has log data partitioned by day
> 2. Groonga does full text search against one partition (2023-11-01)
> 3. Groonga sends the result to the client as Apache Arrow
>    streaming format record batches
> 4. Groonga does full text search against the next partition (2023-11-02)
> 5. Groonga sends the result to the client as Apache Arrow
>    streaming format record batches
> 6. ...
>
> In this case, the result data aren't being sent the whole time
> (search -> send -> search -> send -> ...), so it doesn't
> saturate a fast network connection.
>
> (3. and 4. could run in parallel, but that's not implemented yet.)
>
> If we optimize this approach, it may be able to saturate a
> fast network connection.
>
> > And what about the case in which the server wants to begin sending batches
> > to the client before the total number of result batches / records is known?
>
> Ah, sorry, I forgot to explain that case. Groonga uses the
> above approach for it.
>
> > - server should not return the result data in the body of a response to a
> > query request; instead server should return a response body that gives
> > URI(s) at which clients can GET the result data
>
> If we want to do this, the standard "Location" HTTP header
> may be suitable.
>
> > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > recommendations about chunk size
>
> Ah, sorry, I forgot to explain this case too. Groonga uses
> "Transfer-Encoding: chunked". But the recommended chunk size
> may be case-by-case... If a server can produce data quickly
> enough, a larger chunk size may be faster. Otherwise, a larger
> chunk size may be slower.
>
> > - support range requests (Accept-Ranges: bytes) to allow clients to request
> > result ranges (or not?)
>
> In the Groonga case, it's not supported, because Groonga
> drops the result after one request/response pair. Groonga
> can't return only the specified range of the result after the
> response has been returned.
>
> > - recommendations about compression
>
> In the case that the network is the bottleneck, LZ4 or
> Zstandard compression will improve total performance.
>
> > - recommendations about TCP receive window size
> > - recommendation to open multiple TCP connections on very fast networks
> > (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> HTTP/3 may be better for these cases.
>
> Thanks,
> --
> kou
>
> In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
>   "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on Sat,
>   18 Nov 2023 13:51:53 -0500,
>   Ian Cook <ianmc...@apache.org> wrote:
>
> > Hi Kou,
> >
> > I think it is too early to make a specific proposal. I hope to use this
> > discussion to collect more information about existing approaches. If
> > several viable approaches emerge from this discussion, then I think we
> > should make a document listing them, like you suggest.
> >
> > Thank you for the information about Groonga. This type of straightforward
> > HTTP-based approach would work in the context of a REST API, as I
> > understand it.
> >
> > But how is the performance? Have you measured the throughput of this
> > approach to see if it is comparable to using Flight SQL? Is this approach
> > able to saturate a fast network connection?
> >
> > And what about the case in which the server wants to begin sending batches
> > to the client before the total number of result batches / records is known?
> > Would this approach work in that case? I think so but I am not sure.
> >
> > If this HTTP-based type of approach is sufficiently performant and it works
> > in a sufficient proportion of the envisioned use cases, then perhaps the
> > proposed spec / protocol could be based on this approach. If so, then we
> > could refocus this discussion on which best practices to incorporate /
> > recommend, such as:
> > - server should not return the result data in the body of a response to a
> > query request; instead server should return a response body that gives
> > URI(s) at which clients can GET the result data
> > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > recommendations about chunk size
> > - support range requests (Accept-Ranges: bytes) to allow clients to request
> > result ranges (or not?)
> > - recommendations about compression
> > - recommendations about TCP receive window size
> > - recommendation to open multiple TCP connections on very fast networks
> > (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >
> > On the other hand, if the performance and functionality of this HTTP-based
> > type of approach is not sufficient, then we might consider fundamentally
> > different approaches.
> >
> > Ian
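One more note on the compression recommendation: for clients that do support
Arrow buffer compression, it can be enabled when writing the IPC stream
itself, independent of any HTTP-level Content-Encoding. A rough, untested
sketch with pyarrow (assumes the build includes LZ4/Zstandard support):

    import pyarrow as pa
    import pyarrow.ipc

    def write_compressed_stream(table, sink, codec="zstd"):
        # Enable buffer-level compression inside the IPC stream itself,
        # independent of any HTTP-level Content-Encoding.
        options = pa.ipc.IpcWriteOptions(compression=codec)  # or "lz4"
        with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
            writer.write_table(table)

Readers that support buffer compression decompress transparently, so nothing
changes on the client side; for consumers that don't (e.g., Arrow JS today,
as noted above), Content-Encoding: gzip remains the fallback.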