Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

Andrew Lamb Mon, 11 Mar 2024 12:18:40 -0700

Update: it turns out a Rust client/server example already existed. It is now
linked from the ticket.


On Mon, Mar 11, 2024 at 3:07 PM Andrew Lamb <al...@influxdata.com> wrote:

> I sadly don't have time to help with this directly; however, I did file a
> ticket requesting help with a Rust prototype [1]. Hopefully we'll get a
> taker.
>
> [1] https://github.com/apache/arrow-rs/issues/5496
>
> On Tue, Mar 5, 2024 at 11:03 PM Ian Cook <ianmc...@apache.org> wrote:
>
>> Update on recent progress in this Arrow-over-HTTP project:
>>
>> I cleaned up the minimal examples of HTTP clients and servers and
>> moved them into a directory in the Arrow Experiments repo:
>> https://github.com/apache/arrow-experiments/tree/main/http
>>
>> So far there are client examples in six languages and server examples
>> in two languages (Python and Go). They all have READMEs describing how
>> to use them.
>>
>> I have an open PR that adds a third server example in Java. Reviews
>> appreciated:
>> https://github.com/apache/arrow-experiments/pull/4
>>
>> I would like to see minimal client and server examples in a few more
>> languages (especially Rust) before we move on to developing richer
>> types of examples. Is anyone interested in contributing additional
>> minimal examples?
>>
>> Thanks,
>> Ian
>>
>> On Wed, Dec 6, 2023 at 2:29 PM Ian Cook <ianmc...@apache.org> wrote:
>> >
>> > I just remembered that there is an unused "Arrow Experiments" repo [1]
>> > which Wes created a few years ago [2]. That seems like a more
>> > appropriate place to open PRs like this one. If there are no
>> > objections, I will start using that repo for these Arrow-over-HTTP
>> > PRs.
>> >
>> > [1] https://github.com/apache/arrow-experiments
>> > [2] https://lists.apache.org/thread/cw14s874pwplzf9ycnvfwtwq0xq17npg
>> >
>> > Ian
>> >
>> > On Wed, Dec 6, 2023 at 1:45 PM Ian Cook <ianmc...@apache.org> wrote:
>> > >
>> > > Antoine,
>> > >
>> > > Thank you for taking a look. I agree—these are basic examples intended
>> > > to prove the concept and answer fundamental questions. Next I intend
>> > > to expand the set of examples to cover more complex cases.
>> > >
>> > > > This might necessitate some kind of framing layer, or a
>> > > > standardized delimiter.
>> > >
>> > > I am interested to hear more perspectives on this. My perspective is
>> > > that we should recommend using HTTP conventions to keep clean
>> > > separation between the Arrow-formatted binary data payloads and the
>> > > various application-specific fields. This can be achieved by encoding
>> > > application-specific fields in URI paths, query parameters, headers,
>> > > or separate parts of multipart/form-data messages.
>> > >
>> > > Ian
>> > >
>> > > > On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > While this looks like a nice start, I would expect more precise
>> > > > recommendations for writing non-trivial services. In particular, one
>> > > > question is how to send both an application-specific POST request and
>> > > > an Arrow stream, or an application-specific GET response and an Arrow
>> > > > stream. This might necessitate some kind of framing layer, or a
>> > > > standardized delimiter.
>> > > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > > >
>> > > >
>> > > >
>> > > > Le 05/12/2023 à 21:10, Ian Cook a écrit :
>> > > > > This is a continuation of the discussion entitled "[DISCUSS] Protocol
>> > > > > for exchanging Arrow data over REST APIs". See the previous messages at
>> > > > > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
>> > > > >
>> > > > > To inform this discussion, I created some basic Arrow-over-HTTP client
>> > > > > and server examples here:
>> > > > > https://github.com/apache/arrow/pull/39081
>> > > > >
>> > > > > My intention is to expand and improve this set of examples (with your
>> > > > > help) until they reflect a set of conventions that we are comfortable
>> > > > > documenting as recommendations.
>> > > > >
>> > > > > Please take a look and add comments / suggestions in the PR.
>> > > > >
>> > > > > Thanks,
>> > > > > Ian
>> > > > >
>> > > > > On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
>> > > > > <de...@voltrondata.com.invalid> wrote:
>> > > > >
>> > > > >> I also think a set of best practices for Arrow over HTTP would be a
>> > > > >> valuable resource for the community...even if it never becomes a
>> > > > >> specification of its own, it will be beneficial for API developers and
>> > > > >> consumers of those APIs to have a place to look to understand how
>> > > > >> Arrow can help improve throughput/latency/maybe other things. Possibly
>> > > > >> something like httpbin.org but for requests/responses that use Arrow
>> > > > >> would be helpful as well. Thank you Ian for leading this effort!
>> > > > >>
>> > > > >> It has mostly been covered already, but in the (ubiquitous) situation
>> > > > >> where a response contains some schema/table and some non-schema/table
>> > > > >> information there is some tension between throughput (best served by a
>> > > > >> JSON response plus one or more IPC stream responses) and latency (best
>> > > > >> served by a single HTTP response? JSON? IPC with metadata/header?). In
>> > > > >> addition to Antoine's list, I would add:
>> > > > >>
>> > > > >> - How to serve the same table in multiple requests (e.g., to saturate
>> > > > >> a network connection, or because separate worker nodes are generating
>> > > > >> results anyway).
>> > > > >> - How to inline a small schema/table into a single request with other
>> > > > >> metadata (I have seen this done as base64-encoded IPC in JSON, but
>> > > > >> perhaps there is a better way)
>> > > > >>
>> > > > >> If anybody is interested in experimenting, I repurposed a previous
>> > > > >> experiment I had as a flask app that can stream IPC to a client:
>> > > > >>
>> > > > >> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
>> > > > >>
>> > > > >>> - recommendations about compression
>> > > > >>
>> > > > >> Just a note that there is also Content-Encoding: gzip (for consumers
>> > > > >> like Arrow JS that don't currently support buffer compression but that
>> > > > >> can leverage the facilities of the browser/http library)
>> > > > >>
>> > > > >> Cheers!
>> > > > >>
>> > > > >> -dewey
>> > > > >>
>> > > > >>
>> > > > >> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
>> > > > >>>
>> > > > >>> Hi,
>> > > > >>>
>> > > > >>>> But how is the performance?
>> > > > >>>
>> > > > >>> It's faster than the original JSON based API.
>> > > > >>>
>> > > > >>> I implemented Apache Arrow support for a C# client, so I
>> > > > >>> measured only with Apache Arrow C#, but the Apache Arrow
>> > > > >>> based API is faster than the JSON based API.
>> > > > >>>
>> > > > >>>> Have you measured the throughput of this approach to see
>> > > > >>>> if it is comparable to using Flight SQL?
>> > > > >>>
>> > > > >>> Sorry. I didn't measure the throughput. In this case, the elapsed
>> > > > >>> time of one request/response pair is more important than
>> > > > >>> throughput. It was faster than the JSON based API and its
>> > > > >>> performance was sufficient.
>> > > > >>>
>> > > > >>> I couldn't compare to a Flight SQL based approach because
>> > > > >>> Groonga doesn't support Flight SQL yet.
>> > > > >>>
>> > > > >>>> Is this approach able to saturate a fast network
>> > > > >>>> connection?
>> > > > >>>
>> > > > >>> I think that we can't measure this with the Groonga case
>> > > > >>> because Groonga doesn't send data continuously in this case.
>> > > > >>> Here is one of the request patterns:
>> > > > >>>
>> > > > >>> 1. Groonga has log data partitioned by day
>> > > > >>> 2. Groonga does full text search against one partition (2023-11-01)
>> > > > >>> 3. Groonga sends the result to client as Apache Arrow
>> > > > >>>     streaming format record batches
>> > > > >>> 4. Groonga does full text search against the next partition (2023-11-02)
>> > > > >>> 5. Groonga sends the result to client as Apache Arrow
>> > > > >>>     streaming format record batches
>> > > > >>> 6. ...
>> > > > >>>
>> > > > >>> In this case, result data isn't being sent continuously (search
>> > > > >>> -> send -> search -> send -> ...), so it doesn't saturate a
>> > > > >>> fast network connection.
>> > > > >>>
>> > > > >>> (Steps 3 and 4 could run in parallel, but that is not implemented yet.)
>> > > > >>>
>> > > > >>> If we optimize this approach, it may be able to saturate a
>> > > > >>> fast network connection.
>> > > > >>>
>> > > > >>>> And what about the case in which the server wants to begin sending
>> > > > >>>> batches to the client before the total number of result batches /
>> > > > >>>> records is known?
>> > > > >>>
>> > > > >>> Ah, sorry. I forgot to explain this case. Groonga uses the
>> > > > >>> above approach for it.
>> > > > >>>
>> > > > >>>> - server should not return the result data in the body of a response
>> > > > >>>> to a query request; instead server should return a response body that
>> > > > >>>> gives URI(s) at which clients can GET the result data
>> > > > >>>
>> > > > >>> If we want to do this, the standard "Location" HTTP header
>> > > > >>> may be suitable.
>> > > > >>>
>> > > > >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
>> > > > >>>> recommendations about chunk size
>> > > > >>>
>> > > > >>> Ah, sorry. I forgot to explain this case too. Groonga uses
>> > > > >>> "Transfer-Encoding: chunked". But the recommended chunk size may
>> > > > >>> depend on the case... If a server can produce data fast enough,
>> > > > >>> a larger chunk size may be faster. Otherwise, a larger chunk
>> > > > >>> size may be slower.
>> > > > >>>
>> > > > >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
>> > > > >>>> request result ranges (or not?)
>> > > > >>>
>> > > > >>> In the Groonga case, it's not supported because Groonga
>> > > > >>> drops the result after one request/response pair. Groonga
>> > > > >>> can't return only the specified range of the result after the
>> > > > >>> response is returned.
>> > > > >>>
>> > > > >>>> - recommendations about compression
>> > > > >>>
>> > > > >>> When the network is the bottleneck, LZ4 or Zstandard
>> > > > >>> compression will improve total performance.
>> > > > >>>
>> > > > >>>> - recommendations about TCP receive window size
>> > > > >>>> - recommendation to open multiple TCP connections on very fast
>> > > > >>>> networks (e.g. >25 Gbps) where a CPU thread could be the throughput
>> > > > >>>> bottleneck
>> > > > >>>
>> > > > >>> HTTP/3 may be better for these cases.
>> > > > >>>
>> > > > >>>
>> > > > >>> Thanks,
>> > > > >>> --
>> > > > >>> kou
>> > > > >>>
>> > > > >>> In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
>> > > > >>>    "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
>> > > > >>>    Sat, 18 Nov 2023 13:51:53 -0500,
>> > > > >>>    Ian Cook <ianmc...@apache.org> wrote:
>> > > > >>>
>> > > > >>>> Hi Kou,
>> > > > >>>>
>> > > > >>>> I think it is too early to make a specific proposal. I hope to use this
>> > > > >>>> discussion to collect more information about existing approaches. If
>> > > > >>>> several viable approaches emerge from this discussion, then I think we
>> > > > >>>> should make a document listing them, like you suggest.
>> > > > >>>>
>> > > > >>>> Thank you for the information about Groonga. This type of
>> > > > >>>> straightforward HTTP-based approach would work in the context of a
>> > > > >>>> REST API, as I understand it.
>> > > > >>>>
>> > > > >>>> But how is the performance? Have you measured the throughput of this
>> > > > >>>> approach to see if it is comparable to using Flight SQL? Is this
>> > > > >>>> approach able to saturate a fast network connection?
>> > > > >>>>
>> > > > >>>> And what about the case in which the server wants to begin sending
>> > > > >>>> batches to the client before the total number of result batches /
>> > > > >>>> records is known? Would this approach work in that case? I think so
>> > > > >>>> but I am not sure.
>> > > > >>>>
>> > > > >>>> If this HTTP-based type of approach is sufficiently performant and it
>> > > > >>>> works in a sufficient proportion of the envisioned use cases, then
>> > > > >>>> perhaps the proposed spec / protocol could be based on this approach.
>> > > > >>>> If so, then we could refocus this discussion on which best practices
>> > > > >>>> to incorporate / recommend, such as:
>> > > > >>>> - server should not return the result data in the body of a response
>> > > > >>>> to a query request; instead server should return a response body that
>> > > > >>>> gives URI(s) at which clients can GET the result data
>> > > > >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
>> > > > >>>> recommendations about chunk size
>> > > > >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
>> > > > >>>> request result ranges (or not?)
>> > > > >>>> - recommendations about compression
>> > > > >>>> - recommendations about TCP receive window size
>> > > > >>>> - recommendation to open multiple TCP connections on very fast
>> > > > >>>> networks (e.g. >25 Gbps) where a CPU thread could be the throughput
>> > > > >>>> bottleneck
>> > > > >>>>
>> > > > >>>> On the other hand, if the performance and functionality of this
>> > > > >>>> HTTP-based type of approach is not sufficient, then we might consider
>> > > > >>>> fundamentally different approaches.
>> > > > >>>>
>> > > > >>>> Ian
>> > > > >>
>> > > > >
>>
>
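For readers following the framing question raised by Antoine and Ian's reply
about keeping Arrow payloads separate from application-specific fields, here is
a minimal sketch of that separation using only the Python standard library.
The request field travels as a query parameter, the response metadata travels
in a header, and the body carries nothing but the payload. The `/data` path,
the `limit` parameter, the `X-App-Metadata` header, and the placeholder payload
bytes are all assumptions made for illustration and were not proposed in this
thread; a real server would write record batches with an Arrow library. The
media type `application/vnd.apache.arrow.stream` is the registered type for the
Arrow IPC streaming format.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Placeholder bytes standing in for an Arrow IPC stream (assumption: a real
# server would serialize record batches with an Arrow library instead).
FAKE_ARROW_PAYLOAD = b"placeholder-arrow-ipc-stream-bytes"
ARROW_STREAM_TYPE = "application/vnd.apache.arrow.stream"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Application-specific request fields arrive as query parameters.
        query = parse_qs(urlparse(self.path).query)
        limit = int(query.get("limit", ["0"])[0])

        self.send_response(200)
        self.send_header("Content-Type", ARROW_STREAM_TYPE)
        self.send_header("Content-Length", str(len(FAKE_ARROW_PAYLOAD)))
        # Application-specific response fields go in a hypothetical header,
        # so the body stays a pure Arrow payload.
        self.send_header("X-App-Metadata", json.dumps({"limit": limit}))
        self.end_headers()
        self.wfile.write(FAKE_ARROW_PAYLOAD)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Start a local server, issue one GET, return (type, metadata, body)."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/data?limit=10")
        resp = conn.getresponse()
        content_type = resp.getheader("Content-Type")
        metadata = json.loads(resp.getheader("X-App-Metadata"))
        body = resp.read()
        conn.close()
        return content_type, metadata, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Because the body contains only the payload, a client could hand it directly to
an Arrow stream reader without any extra framing layer or delimiter; whether
headers are roomy enough for a given application's metadata is a separate
question that this sketch does not settle.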
