Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
Thanks to everyone who has contributed to this work so far. We now have simple HTTP client/server examples in 10 languages, all tested and verified to interoperate:
https://github.com/apache/arrow-experiments/tree/main/http/get_simple

There is an umbrella issue tracking the next planned tasks:
https://github.com/apache/arrow/issues/40465

Ian
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
I'd be happy to contribute C# and Ruby examples. I'll work on those this week.
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
Update -- turns out there was already a Rust client/server -- linked to the ticket now
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
I sadly don't have time to help with this directly, however, I did file a ticket with the request to help with a Rust prototype [1]. Hopefully we'll get a taker

[1] https://github.com/apache/arrow-rs/issues/5496
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
Update on recent progress in this Arrow-over-HTTP project:

I cleaned up the minimal examples of HTTP clients and servers and moved them into a directory in the Arrow Experiments repo:
https://github.com/apache/arrow-experiments/tree/main/http

So far there are client examples in six languages and server examples in two languages (Python and Go). They all have READMEs describing how to use them.

I have an open PR that adds a third server example in Java. Reviews appreciated:
https://github.com/apache/arrow-experiments/pull/4

I would like to see minimal client and server examples in a few more languages (especially Rust) before we move on to developing richer types of examples. Is anyone interested in contributing additional minimal examples?

Thanks,
Ian

On Wed, Dec 6, 2023 at 2:29 PM Ian Cook wrote:
> I just remembered that there is an unused "Arrow Experiments" repo [1]
> which Wes created a few years ago [2]. That seems like a more
> appropriate place to open PRs like this one. If there are no
> objections, I will start using that repo for these Arrow-over-HTTP
> PRs.
>
> [1] https://github.com/apache/arrow-experiments
> [2] https://lists.apache.org/thread/cw14s874pwplzf9ycnvfwtwq0xq17npg
>
> Ian
>
> On Wed, Dec 6, 2023 at 1:45 PM Ian Cook wrote:
> > Antoine,
> >
> > Thank you for taking a look. I agree; these are basic examples intended
> > to prove the concept and answer fundamental questions. Next I intend
> > to expand the set of examples to cover more complex cases.
> >
> > > This might necessitate some kind of framing layer, or a
> > > standardized delimiter.
> >
> > I am interested to hear more perspectives on this. My perspective is
> > that we should recommend using HTTP conventions to keep clean
> > separation between the Arrow-formatted binary data payloads and the
> > various application-specific fields. This can be achieved by encoding
> > application-specific fields in URI paths, query parameters, headers,
> > or separate parts of multipart/form-data messages.
> >
> > Ian
> >
> > On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou wrote:
> > > Hi,
> > >
> > > While this looks like a nice start, I would expect more precise
> > > recommendations for writing non-trivial services. Especially, one
> > > question is how to send both an application-specific POST request and an
> > > Arrow stream, or an application-specific GET response and an Arrow
> > > stream. This might necessitate some kind of framing layer, or a
> > > standardized delimiter.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 05/12/2023 at 21:10, Ian Cook wrote:
> > > > This is a continuation of the discussion entitled "[DISCUSS] Protocol
> > > > for exchanging Arrow data over REST APIs". See the previous messages at
> > > > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
> > > >
> > > > To inform this discussion, I created some basic Arrow-over-HTTP client
> > > > and server examples here:
> > > > https://github.com/apache/arrow/pull/39081
> > > >
> > > > My intention is to expand and improve this set of examples (with your
> > > > help) until they reflect a set of conventions that we are comfortable
> > > > documenting as recommendations.
> > > >
> > > > Please take a look and add comments / suggestions in the PR.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington wrote:
> > > > > I also think a set of best practices for Arrow over HTTP would be a
> > > > > valuable resource for the community...even if it never becomes a
> > > > > specification of its own, it will be beneficial for API developers and
> > > > > consumers of those APIs to have a place to look to understand how
> > > > > Arrow can help improve throughput/latency/maybe other things. Possibly
> > > > > something like httpbin.org but for requests/responses that use Arrow
> > > > > would be helpful as well. Thank you Ian for leading this effort!
> > > > >
> > > > > It has mostly been covered already, but in the (ubiquitous) situation
> > > > > where a response contains some schema/table and some non-schema/table
> > > > > information there is some tension between throughput (best served by a
> > > > > JSON response plus one or more IPC stream responses) and latency (best
> > > > > served by a single HTTP response? JSON? IPC with metadata/header?). In
> > > > > addition to Antoine's list, I would add:
> > > > >
> > > > > - How to serve the same table in multiple requests (e.g., to saturate
> > > > >   a network connection, or because separate worker nodes are generating
> > > > >   results anyway).
> > > > > - How to inline a small schema/table into a single request with other
> > > > >   metadata (I have seen this done as base64-encoded IPC in JSON, but
> > > > >   perhaps there is a better way)
> > > > >
> > > > > If anybody is interested in experimenting, I repurposed a previous
> > > > > experiment I had as a flask app that can stream
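
For readers unfamiliar with the "base64-encoded IPC in JSON" approach mentioned in the quoted list above, here is a minimal sketch of what it might look like. It is not taken from any of the examples in the thread. The field name `arrow_ipc_base64` and both function names are hypothetical, and the payload bytes are only a stand-in for a real Arrow IPC stream, which would normally be produced by an Arrow library.

```python
import base64
import json
from typing import Dict, Tuple


def inline_arrow_in_json(ipc_bytes: bytes, metadata: Dict[str, object]) -> str:
    """Wrap binary IPC bytes and application metadata in one JSON document.

    The key "arrow_ipc_base64" is a hypothetical name chosen for this sketch.
    """
    envelope = dict(metadata)
    # JSON cannot carry raw bytes, so the payload is base64-encoded to ASCII.
    envelope["arrow_ipc_base64"] = base64.b64encode(ipc_bytes).decode("ascii")
    return json.dumps(envelope)


def extract_arrow_from_json(document: str) -> Tuple[bytes, Dict[str, object]]:
    """Reverse of inline_arrow_in_json: return (ipc_bytes, remaining metadata)."""
    envelope = json.loads(document)
    ipc_bytes = base64.b64decode(envelope.pop("arrow_ipc_base64"))
    return ipc_bytes, envelope


if __name__ == "__main__":
    # Stand-in bytes only; a real service would serialize a table here.
    fake_ipc = b"\xff\xff\xff\xff\x00\x00\x00\x00"
    doc = inline_arrow_in_json(fake_ipc, {"query_id": "abc123", "row_count": 0})
    payload, meta = extract_arrow_from_json(doc)
    print(meta, len(payload))
```

Base64 expands the payload by roughly one third and forces the receiver to decode before it can read the data, which is presumably why the quoted message asks whether there is a better way for anything but small tables.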
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
On Wed, Dec 6, 2023 at 7:45 PM Ian Cook wrote:
> I am interested to hear more perspectives on this. My perspective is
> that we should recommend using HTTP conventions to keep clean
> separation between the Arrow-formatted binary data payloads and the
> various application-specific fields. This can be achieved by encoding
> application-specific fields in URI paths, query parameters, headers,
> or separate parts of multipart/form-data messages.

Submitting large binary data in POST messages via multipart/form-data is usually not very performant. In theory, the message boundary has to be constructed by verifying that it does not collide with the content of the data itself, which for huge files means traversing the whole file in search of bytes matching the boundary. Many implementations are optimistic, relying on the fact that there is very little chance that a long enough randomly generated boundary will be contained in the message. But this is not guaranteed to be true, and I would refrain from suggesting an approach that has a chance, however remote, of being slow or not working.

Also, most HTTP servers tend to implement a maximum request time to reduce the risk of exhausting the available connections with broken (or malicious) clients that leave the connection open for too long. So uploading a 1 GB file in a single POST is at serious risk of failing in most deployments.

There is also the issue that multipart/form-data is subject to a maximum transferred data size, because the HTTP server frequently saves the content of files in a temporary file before forwarding it to the server-side application. This exposes the system to an out-of-disk error if a client uploads too much data and no limit is configured.

So I would suggest that any recommended approach to submitting Arrow data via HTTP rely on Content-Range and chunked uploads to transmit the data, thus reducing the risk of timeouts or size limits and making it possible to simply resend a chunk when one of those failures occurs.
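
To make the Content-Range suggestion above concrete, here is a minimal sketch of how a client might split a serialized Arrow payload into independently resendable chunks. It only illustrates the idea. The function name `iter_upload_chunks` is hypothetical, the code performs no network I/O, and it assumes a server that has agreed to accept partial uploads carrying a Content-Range header, which plain HTTP servers do not do by default. The media type `application/vnd.apache.arrow.stream` is the one registered for the Arrow IPC stream format.

```python
from typing import Dict, Iterator, Tuple


def iter_upload_chunks(
    payload: bytes, chunk_size: int
) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) pairs, one per chunk of the payload.

    Each pair could be sent as its own request, so a failed chunk can be
    resent without retransmitting the rest of the data.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(payload)
    for start in range(0, total, chunk_size):
        body = payload[start:start + chunk_size]
        end = start + len(body) - 1  # Content-Range end offset is inclusive
        headers = {
            "Content-Type": "application/vnd.apache.arrow.stream",
            "Content-Length": str(len(body)),
            "Content-Range": f"bytes {start}-{end}/{total}",
        }
        yield headers, body


if __name__ == "__main__":
    # Stand-in bytes; a real client would pass a serialized Arrow IPC stream.
    data = bytes(range(10))
    for headers, body in iter_upload_chunks(data, 4):
        print(headers["Content-Range"], len(body))
```

Because each request stays small, it is less likely to hit a server's request timeout or body size limit, and the receiver can use the byte offsets to reassemble the stream or detect a missing chunk.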
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
hi all,

I was just catching up on e-mail threads and wanted to give a few historical comments on this.

When we were assembling the Arrow PMC and committing to do the project in 2015, standardizing Arrow-over-REST was always something that was on the TODO list. At that time we didn't have the IPC protocol yet, so that was the fundamental design/engineering that had to take place.

I agree that having example code and well-documented patterns for using REST+Arrow in production would make it easier for people to adopt Arrow in their systems for transport, and it would have been better to do this years ago to help onboard users into the ecosystem faster (and make the "getting started" part of this less of a DIY affair).

When Jacques and I did the original design / prototyping for Flight-on-gRPC (2018), the goal was not to convey that as "the Preferred Way" for network transport (which could steer people away from directly using HTTP, though perhaps that was an unintentional consequence because of our effort developing and promoting Flight), but rather to establish generic patterns for creating distributed Arrow services using gRPC, and to optimize the serialization aspect (i.e. avoiding extra protobuf encoding/decoding steps which would be present if you naively used gRPC and put the Arrow IPC format in a protobuf message as a blob).

In any case, I think having a "Rosetta stone"-type setup of starter code for building HTTP services that send and receive Arrow would be a help to developers/users who want to adopt Arrow in their systems.

Thanks
Wes
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
I just remembered that there is an unused "Arrow Experiments" repo [1] which Wes created a few years ago [2]. That seems like a more appropriate place to open PRs like this one. If there are no objections, I will start using that repo for these Arrow-over-HTTP PRs.

[1] https://github.com/apache/arrow-experiments
[2] https://lists.apache.org/thread/cw14s874pwplzf9ycnvfwtwq0xq17npg

Ian

On Wed, Dec 6, 2023 at 1:45 PM Ian Cook wrote:
>
> Antoine,
>
> Thank you for taking a look. I agree: these are basic examples intended
> to prove the concept and answer fundamental questions. Next I intend
> to expand the set of examples to cover more complex cases.
>
> > This might necessitate some kind of framing layer, or a
> > standardized delimiter.
>
> I am interested to hear more perspectives on this. My perspective is
> that we should recommend using HTTP conventions to keep clean
> separation between the Arrow-formatted binary data payloads and the
> various application-specific fields. This can be achieved by encoding
> application-specific fields in URI paths, query parameters, headers,
> or separate parts of multipart/form-data messages.
>
> Ian
>
> On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou wrote:
> >
> > Hi,
> >
> > While this looks like a nice start, I would expect more precise
> > recommendations for writing non-trivial services. Especially, one
> > question is how to send both an application-specific POST request and an
> > Arrow stream, or an application-specific GET response and an Arrow
> > stream. This might necessitate some kind of framing layer, or a
> > standardized delimiter.
> >
> > Regards
> >
> > Antoine.
> >
> > On 05/12/2023 at 21:10, Ian Cook wrote:
> > > This is a continuation of the discussion entitled "[DISCUSS] Protocol for
> > > exchanging Arrow data over REST APIs". See the previous messages at
> > > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
Antoine,

Thank you for taking a look. I agree: these are basic examples intended to prove the concept and answer fundamental questions. Next I intend to expand the set of examples to cover more complex cases.

> This might necessitate some kind of framing layer, or a
> standardized delimiter.

I am interested to hear more perspectives on this. My perspective is that we should recommend using HTTP conventions to keep a clean separation between the Arrow-formatted binary data payloads and the various application-specific fields. This can be achieved by encoding application-specific fields in URI paths, query parameters, headers, or separate parts of multipart/form-data messages.

Ian

On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou wrote:
>
> Hi,
>
> While this looks like a nice start, I would expect more precise
> recommendations for writing non-trivial services. Especially, one
> question is how to send both an application-specific POST request and an
> Arrow stream, or an application-specific GET response and an Arrow
> stream. This might necessitate some kind of framing layer, or a
> standardized delimiter.
>
> Regards
>
> Antoine.
>
> On 05/12/2023 at 21:10, Ian Cook wrote:
> > This is a continuation of the discussion entitled "[DISCUSS] Protocol for
> > exchanging Arrow data over REST APIs". See the previous messages at
> > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
> >
> > To inform this discussion, I created some basic Arrow-over-HTTP client and
> > server examples here:
> > https://github.com/apache/arrow/pull/39081
> >
> > My intention is to expand and improve this set of examples (with your help)
> > until they reflect a set of conventions that we are comfortable documenting
> > as recommendations.
> >
> > Please take a look and add comments / suggestions in the PR.
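To make the separation described in this reply concrete, here is a minimal Python sketch, offered as one possible reading rather than an agreed convention. It keeps application-specific fields in the URI path, the query parameters, and a header, and leaves the request body as nothing but Arrow IPC stream bytes. The URL, the `X-Request-Id` header, the helper name `build_arrow_request`, and the placeholder payload are all hypothetical; `application/vnd.apache.arrow.stream` is the registered media type for the Arrow IPC streaming format. The request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Registered media type for the Arrow IPC streaming format.
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def build_arrow_request(base_url, table, params, request_id, arrow_bytes):
    """Build a POST request whose body is only Arrow IPC stream bytes.

    Application-specific fields travel outside the body: `table` in the
    URI path, `params` in the query string, and `request_id` in a
    hypothetical X-Request-Id header.
    """
    url = f"{base_url}/tables/{quote(table, safe='')}?{urlencode(params)}"
    headers = {
        "Content-Type": ARROW_STREAM,
        "Accept": ARROW_STREAM,
        "X-Request-Id": request_id,
    }
    return Request(url, data=arrow_bytes, headers=headers, method="POST")


# Placeholder bytes standing in for a real IPC stream (for example, one
# produced by an Arrow library's stream writer).
payload = b"<arrow ipc stream bytes>"
req = build_arrow_request(
    "http://localhost:8000", "logs", {"day": "2023-11-01"}, "abc123", payload
)
print(req.get_method(), req.full_url)
```

Because nothing application-specific is mixed into the body, a receiver could hand the body directly to an Arrow stream reader without any extra framing.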
Re: [DISCUSS] Conventions for transporting Arrow data over HTTP
Hi,

While this looks like a nice start, I would expect more precise recommendations for writing non-trivial services. In particular, one question is how to send both an application-specific POST request and an Arrow stream, or an application-specific GET response and an Arrow stream. This might necessitate some kind of framing layer, or a standardized delimiter.

Regards

Antoine.

On 05/12/2023 at 21:10, Ian Cook wrote:
> This is a continuation of the discussion entitled "[DISCUSS] Protocol for
> exchanging Arrow data over REST APIs". See the previous messages at
> https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
>
> To inform this discussion, I created some basic Arrow-over-HTTP client and
> server examples here:
> https://github.com/apache/arrow/pull/39081
>
> My intention is to expand and improve this set of examples (with your help)
> until they reflect a set of conventions that we are comfortable documenting
> as recommendations.
>
> Please take a look and add comments / suggestions in the PR.
>
> Thanks,
> Ian
>
> On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington wrote:
>
> > I also think a set of best practices for Arrow over HTTP would be a
> > valuable resource for the community...even if it never becomes a
> > specification of its own, it will be beneficial for API developers and
> > consumers of those APIs to have a place to look to understand how
> > Arrow can help improve throughput/latency/maybe other things. Possibly
> > something like httpbin.org but for requests/responses that use Arrow
> > would be helpful as well. Thank you Ian for leading this effort!
> >
> > It has mostly been covered already, but in the (ubiquitous) situation
> > where a response contains some schema/table and some non-schema/table
> > information there is some tension between throughput (best served by a
> > JSON response plus one or more IPC stream responses) and latency (best
> > served by a single HTTP response? JSON? IPC with metadata/header?).
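One framing layer that HTTP already provides for the question raised in this message (sending application-specific fields together with an Arrow stream) is multipart/form-data, where each part carries its own Content-Type. The sketch below is a minimal, hand-rolled illustration under stated assumptions, not a proposed convention: the part names `metadata` and `data`, the filename, the helper `build_multipart`, and the placeholder payload are hypothetical, and a real client would normally let its HTTP library build the body.

```python
import json
import uuid
from typing import Tuple

# Registered media type for the Arrow IPC streaming format.
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def build_multipart(metadata: dict, arrow_bytes: bytes) -> Tuple[str, bytes]:
    """Return (Content-Type header value, body) for a two-part request.

    Part 1 holds application-specific fields as JSON; part 2 holds the
    Arrow IPC stream bytes untouched. The boundary is the delimiter.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="metadata"',
        b"Content-Type: application/json",
        b"",
        json.dumps(metadata).encode("utf-8"),
        delimiter,
        b'Content-Disposition: form-data; name="data"; filename="data.arrows"',
        b"Content-Type: " + ARROW_STREAM.encode("ascii"),
        b"",
        arrow_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    body = b"\r\n".join(lines)
    return f"multipart/form-data; boundary={boundary}", body


# Placeholder bytes standing in for a real Arrow IPC stream.
content_type, body = build_multipart({"query": "status = 500"}, b"\x00\x01ARROW")
print(content_type, len(body))
```

The trade-off is that the receiver must parse the multipart envelope before it can start reading the Arrow stream, whereas keeping application fields in headers or query parameters leaves the body as a bare stream.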
[DISCUSS] Conventions for transporting Arrow data over HTTP
This is a continuation of the discussion entitled "[DISCUSS] Protocol for exchanging Arrow data over REST APIs". See the previous messages at https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.

To inform this discussion, I created some basic Arrow-over-HTTP client and server examples here:
https://github.com/apache/arrow/pull/39081

My intention is to expand and improve this set of examples (with your help) until they reflect a set of conventions that we are comfortable documenting as recommendations.

Please take a look and add comments / suggestions in the PR.

Thanks,
Ian

On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington wrote:

> I also think a set of best practices for Arrow over HTTP would be a
> valuable resource for the community...even if it never becomes a
> specification of its own, it will be beneficial for API developers and
> consumers of those APIs to have a place to look to understand how
> Arrow can help improve throughput/latency/maybe other things. Possibly
> something like httpbin.org but for requests/responses that use Arrow
> would be helpful as well. Thank you Ian for leading this effort!
>
> It has mostly been covered already, but in the (ubiquitous) situation
> where a response contains some schema/table and some non-schema/table
> information there is some tension between throughput (best served by a
> JSON response plus one or more IPC stream responses) and latency (best
> served by a single HTTP response? JSON? IPC with metadata/header?). In
> addition to Antoine's list, I would add:
>
> - How to serve the same table in multiple requests (e.g., to saturate
> a network connection, or because separate worker nodes are generating
> results anyway).
> - How to inline a small schema/table into a single request with other
> metadata (I have seen this done as base64-encoded IPC in JSON, but
> perhaps there is a better way)
>
> If anybody is interested in experimenting, I repurposed a previous
> experiment I had as a flask app that can stream IPC to a client:
>
> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
>
> > - recommendations about compression
>
> Just a note that there is also Content-Encoding: gzip (for consumers
> like Arrow JS that don't currently support buffer compression but that
> can leverage the facilities of the browser/http library)
>
> Cheers!
>
> -dewey
>
> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei wrote:
> >
> > Hi,
> >
> > > But how is the performance?
> >
> > It's faster than the original JSON based API.
> >
> > I implemented Apache Arrow support for a C# client. So I
> > measured only with Apache Arrow C# but the Apache Arrow
> > based API is faster than the JSON based API.
> >
> > > Have you measured the throughput of this approach to see
> > > if it is comparable to using Flight SQL?
> >
> > Sorry. I didn't measure the throughput. In this case, the elapsed
> > time of one request/response pair is more important than
> > throughput. And it was faster than the JSON based API, with
> > good enough performance.
> >
> > I couldn't compare to a Flight SQL based approach because
> > Groonga doesn't support Flight SQL yet.
> >
> > > Is this approach able to saturate a fast network
> > > connection?
> >
> > I think that we can't measure this with the Groonga case
> > because the Groonga case doesn't send data without
> > stopping. Here is one of the request patterns:
> >
> > 1. Groonga has log data partitioned by day
> > 2. Groonga does full text search against one partition (2023-11-01)
> > 3. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 4. Groonga does full text search against the next partition (2023-11-02)
> > 5. Groonga sends the result to client as Apache Arrow
> >    streaming format record batches
> > 6. ...
> >
> > In this case, the result data isn't being sent continuously. (search
> > -> send -> search -> send -> ...) So it doesn't saturate a
> > fast network connection.
> >
> > (3. and 4. can run in parallel but that's not implemented yet.)
> >
> > If we optimize this approach, this approach may be able to
> > saturate a fast network connection.
> >
> > > And what about the case in which the server wants to begin sending batches
> > > to the client before the total number of result batches / records is known?
> >
> > Ah, sorry. I forgot to explain this case. Groonga uses the
> > above approach for it.
> >
> > > - server should not return the result data in the body of a response to a
> > > query request; instead server should return a response body that gives
> > > URI(s) at which clients can GET the result data
> >
> > If we want to do this, the standard "Location" HTTP header
> > may be suitable.
> >
> > > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > > recommendations about chunk size
> >
> > Ah, sorry. I forgot to explain this case too. Groonga uses
> > "Transfer-Encoding: chunked". But recommended chunk size may
> > be case-by-case... If a server can