[DISCUSS] Conventions for transporting Arrow data over HTTP

Ian Cook Tue, 05 Dec 2023 12:10:44 -0800

This is a continuation of the discussion entitled "[DISCUSS] Protocol for
exchanging Arrow data over REST APIs". See the previous messages at
https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.


To inform this discussion, I created some basic Arrow-over-HTTP client and
server examples here:
https://github.com/apache/arrow/pull/39081

My intention is to expand and improve this set of examples (with your help)
until they reflect a set of conventions that we are comfortable documenting
as recommendations.

Please take a look and add comments / suggestions in the PR.

Thanks,
Ian

On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

> I also think a set of best practices for Arrow over HTTP would be a
> valuable resource for the community...even if it never becomes a
> specification of its own, it will be beneficial for API developers and
> consumers of those APIs to have a place to look to understand how
> Arrow can help improve throughput/latency/maybe other things. Possibly
> something like httpbin.org but for requests/responses that use Arrow
> would be helpful as well. Thank you Ian for leading this effort!
>
> It has mostly been covered already, but in the (ubiquitous) situation
> where a response contains some schema/table and some non-schema/table
> information there is some tension between throughput (best served by a
> JSON response plus one or more IPC stream responses) and latency (best
> served by a single HTTP response? JSON? IPC with metadata/header?). In
> addition to Antoine's list, I would add:
>
> - How to serve the same table in multiple requests (e.g., to saturate
> a network connection, or because separate worker nodes are generating
> results anyway).
> - How to inline a small schema/table into a single request with other
> metadata (I have seen this done as base64-encoded IPC in JSON, but
> perhaps there is a better way)
>
> If anybody is interested in experimenting, I repurposed a previous
> experiment I had as a flask app that can stream IPC to a client:
>
> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
> .
>
> > - recommendations about compression
>
> Just a note that there is also Content-Encoding: gzip (for consumers
> like Arrow JS that don't currently support buffer compression but that
> can leverage the facilities of the browser/http library)
>
> Cheers!
>
> -dewey
>
>
> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> > Hi,
> >
> > > But how is the performance?
> >
> > It's faster than the original JSON based API.
> >
> > I implemented Apache Arrow support for a C# client. So I
> > measured only with Apache Arrow C# but the Apache Arrow
> > based API is faster than JSON based API.
> >
> > > Have you measured the throughput of this approach to see
> > > if it is comparable to using Flight SQL?
> >
> > Sorry. I didn't measure the throughput. In the case, elapsed
> > time of one request/response pair is important than
> > throughput. And it was faster than JSON based API and enough
> > performance.
> >
> > I couldn't compare to a Flight SQL based approach because
> > Groonga doesn't support Flight SQL yet.
> >
> > > Is this approach able to saturate a fast network
> > > connection?
> >
> > I think that we can't measure this with the Groonga case
> > because the Groonga case doesn't send data without
> > stopping. Here is one of request patterns:
> >
> > 1. Groonga has log data partitioned by day
> > 2. Groonga does full text search against one partition (2023-11-01)
> > 3. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 4. Groonga does full text search against the next partition (2023-11-02)
> > 5. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 6. ...
> >
> > In the case, the result data aren't always sending. (search
> > -> send -> search -> send -> ...) So it doesn't saturate a
> > fast network connection.
> >
> > (3. and 4. can be parallel but it's not implemented yet.)
> >
> > If we optimize this approach, this approach may be able to
> > saturate a fast network connection.
> >
> > > And what about the case in which the server wants to begin sending
> batches
> > > to the client before the total number of result batches / records is
> known?
> >
> > Ah, sorry. I forgot to explain the case. Groonga uses the
> > above approach for it.
> >
> > > - server should not return the result data in the body of a response
> to a
> > > query request; instead server should return a response body that gives
> > > URI(s) at which clients can GET the result data
> >
> > If we want to do this, the standard "Location" HTTP headers
> > may be suitable.
> >
> > > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > > recommendations about chunk size
> >
> > Ah, sorry. I forgot to explain this case too. Groonga uses
> > "Transfer-Encoding: chunked". But recommended chunk size may
> > be case-by-case... If a server can produce enough data as
> > fast as possible, larger chunk size may be
> > faster. Otherwise, larger chunk size may be slower.
> >
> > > - support range requests (Accept-Range: bytes) to allow clients to
> request
> > > result ranges (or not?)
> >
> > In the Groonga case, it's not supported. Because Groonga
> > drops the result after one request/response pair. Groonga
> > can't return only the specified range result after the
> > response is returned.
> >
> > > - recommendations about compression
> >
> > In the case that network is the bottleneck, LZ4 or Zstandard
> > compression will improve total performance.
> >
> > > - recommendations about TCP receive window size
> > > - recommendation to open multiple TCP connections on very fast networks
> > > (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >
> > HTTP/3 may be better for these cases.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
> >   "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
> Sat, 18 Nov 2023 13:51:53 -0500,
> >   Ian Cook <ianmc...@apache.org> wrote:
> >
> > > Hi Kou,
> > >
> > > I think it is too early to make a specific proposal. I hope to use this
> > > discussion to collect more information about existing approaches. If
> > > several viable approaches emerge from this discussion, then I think we
> > > should make a document listing them, like you suggest.
> > >
> > > Thank you for the information about Groonga. This type of
> straightforward
> > > HTTP-based approach would work in the context of a REST API, as I
> > > understand it.
> > >
> > > But how is the performance? Have you measured the throughput of this
> > > approach to see if it is comparable to using Flight SQL? Is this
> approach
> > > able to saturate a fast network connection?
> > >
> > > And what about the case in which the server wants to begin sending
> batches
> > > to the client before the total number of result batches / records is
> known?
> > > Would this approach work in that case? I think so but I am not sure.
> > >
> > > If this HTTP-based type of approach is sufficiently performant and it
> works
> > > in a sufficient proportion of the envisioned use cases, then perhaps
> the
> > > proposed spec / protocol could be based on this approach. If so, then
> we
> > > could refocus this discussion on which best practices to incorporate /
> > > recommend, such as:
> > > - server should not return the result data in the body of a response
> to a
> > > query request; instead server should return a response body that gives
> > > URI(s) at which clients can GET the result data
> > > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > > recommendations about chunk size
> > > - support range requests (Accept-Range: bytes) to allow clients to
> request
> > > result ranges (or not?)
> > > - recommendations about compression
> > > - recommendations about TCP receive window size
> > > - recommendation to open multiple TCP connections on very fast networks
> > > (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> > >
> > > On the other hand, if the performance and functionality of this
> HTTP-based
> > > type of approach is not sufficient, then we might consider
> fundamentally
> > > different approaches.
> > >
> > > Ian
>

[DISCUSS] Conventions for transporting Arrow data over HTTP

Reply via email to