ianmcook commented on code in PR #35:
URL: https://github.com/apache/arrow-experiments/pull/35#discussion_r1843910337


##########
http/get_compressed/README.md:
##########
@@ -20,3 +20,160 @@
 # HTTP GET Arrow Data: Compression Examples
 
 This directory contains examples of HTTP servers/clients that transmit/receive data in the Arrow IPC streaming format and use compression (in various ways) to reduce the size of the transmitted data.
+
+Since we re-use the [Arrow IPC format][ipc] for transferring Arrow data over
+HTTP and both Arrow IPC and HTTP standards support compression on their own,
+there are at least two approaches to this problem:
+
+1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed
+   array buffers.
+2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed
+   array buffers.
+
+Applying both IPC buffer and HTTP compression to the same data is not
+recommended. The extra CPU overhead of decompressing the data twice is
+not worth any possible gains that double compression might bring. If
+compression ratios are unambiguously more important than reducing CPU
+overhead, then a different compression algorithm that optimizes for that can
+be chosen.
+
+This table shows the support for different compression algorithms in HTTP and
+Arrow IPC:
+
+| Format             | HTTP Support    | IPC Support     |
+| ------------------ | --------------- | --------------- |
+| gzip (GZip)        | X               |                 |
+| deflate (DEFLATE)  | X               |                 |
+| br (Brotli)        | X[^2]           |                 |
+| zstd (Zstandard)   | X[^2]           | X               |
+| lz4 (LZ4)          |                 | X               |
+
+Since not all Arrow IPC implementations support compression, HTTP compression
+based on accepted formats negotiated with the client is a great way to increase
+the chances of efficient data transfer.
+
+Servers may check the `Accept-Encoding` header of the client and choose the
+compression format in this order of preference: `zstd`, `br`, `gzip`,
+`identity` (no compression). If the client does not specify a preference, the
+only constraint on the server is the availability of the compression algorithm
+in the server environment.
+
+## Arrow IPC Compression
+
+When IPC buffer compression is preferred and servers can't assume all clients
+support it[^3], clients may be asked to explicitly list the supported compression
+algorithms in the request headers. The `Accept` header can be used for this
+since `Accept-Encoding` (and `Content-Encoding`) is used to control compression
+of the entire HTTP response stream and instruct HTTP clients (like browsers) to
+decompress the response before giving data to the application or saving the
+data.
+
+    Accept: application/vnd.apache.arrow.ipc; codecs="zstd, lz4"
+
+This is similar to clients requesting video streams by specifying the
+container format and the codecs they support
+(e.g. `Accept: video/webm; codecs="vp8, vorbis"`).
+
+The server is allowed to choose any of the listed codecs, or not compress the
+IPC buffers at all. Uncompressed IPC buffers should always be acceptable to
+clients.
+
+If a server adopts this approach and a client does not specify any codecs in
+the `Accept` header, the server can fall back to checking the `Accept-Encoding`
+header to pick a compression algorithm for the entire HTTP response stream.
+
+To make debugging easier, servers may include the chosen compression codec(s)
+in the `Content-Type` header of the response (quotes are optional):
+
+    Content-Type: application/vnd.apache.arrow.ipc; codecs=zstd
+
+This is not necessary for correct decompression because the payload already
+contains information that tells the IPC reader how to decompress the buffers,
+but it can help developers understand what is going on.
+
+When programmatically checking if the `Content-Type` header contains a specific
+format, it is important to use a parser that can handle parameters or look
+only at the media type part of the header. This is not specific to the
+Arrow IPC format, but a general rule for all media types. For example,
+`application/json; charset=utf-8` should match `application/json`.
+
+## HTTP/1.1 Response Compression
+
+HTTP/1.1 offers an elaborate way for clients to specify their preferred
+content encoding (read: compression algorithm) using the `Accept-Encoding`
+header.[^1]
+
+At least the Python server (in `python/`) implements a fully compliant
+parser for the `Accept-Encoding` header. Application servers may choose
+to implement a simpler check of the `Accept-Encoding` header or assume
+that the client accepts the chosen compression scheme when talking
+to that server.
+
+Here is an example of a header that a client may send and what it means:
+
+    Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0
+
+This header says that the client prefers that the server compress the
+response with `zstd`, but if that is not possible, then `br` and `gzip`
+are acceptable (in that order because 0.8 is greater than 0.5). The client
+does not want the response to be uncompressed. This is communicated by
+`"identity"` being listed with `q=0`.
+
+To tell the server the client only accepts `zstd` responses and nothing
+else, not even uncompressed responses, the client would send:
+
+    Accept-Encoding: zstd, *;q=0
+
+RFC 2616[^1] specifies the rules for how a server should interpret the
+`Accept-Encoding` header:
+
+    A server tests whether a content-coding is acceptable, according to
+    an Accept-Encoding field, using these rules:
+
+       1. If the content-coding is one of the content-codings listed in
+          the Accept-Encoding field, then it is acceptable, unless it is
+          accompanied by a qvalue of 0. (As defined in section 3.9, a
+          qvalue of 0 means "not acceptable.")
+
+       2. The special "*" symbol in an Accept-Encoding field matches any
+          available content-coding not explicitly listed in the header
+          field.
+
+       3. If multiple content-codings are acceptable, then the acceptable
+          content-coding with the highest non-zero qvalue is preferred.
+
+       4. The "identity" content-coding is always acceptable, unless
+          specifically refused because the Accept-Encoding field includes
+          "identity;q=0", or because the field includes "*;q=0" and does
+          not explicitly include the "identity" content-coding. If the
+          Accept-Encoding field-value is empty, then only the "identity"
+          encoding is acceptable.
+
+If you're targeting web browsers, check the browser compatibility table for
+compression algorithms on MDN Web Docs[^2].
+
+Another important rule is that if the server compresses the response, it
+must include a `Content-Encoding` header in the response.
+
+    If the content-coding of an entity is not "identity", then the
+    response MUST include a Content-Encoding entity-header (section
+    14.11) that lists the non-identity content-coding(s) used.
+
+Since not all servers implement the full `Accept-Encoding` header parsing
+logic, clients tend to stick to simple header values like
+`Accept-Encoding: identity` when no compression is desired, and
+`Accept-Encoding: gzip, deflate, zstd, br` when the client supports different
+compression formats and is indifferent to which one the server chooses. Clients
+should expect uncompressed responses as well in these cases. The only way to
+force a "406 Not Acceptable" response when no compression is available is to
+send `identity;q=0` or `*;q=0` at the end of the `Accept-Encoding`
+header. But that relies on the server implementing the full `Accept-Encoding`
+handling logic.
+
+
+[^1]: [Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.](https://www.rfc-editor.org/rfc/rfc2616#section-14.3)
+[^2]: [MDN Web Docs: Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#browser_compatibility)
+[^3]: Web applications using the JavaScript Arrow implementation don't have
+    access to the compression APIs to decompress `zstd` and `lz4` IPC buffers.

Review Comment:
   ```suggestion
   [^3]: [Arrow Columnar Format: Compression](https://arrow.apache.org/docs/format/Columnar.html#compression)
   [^4]: Web applications using the JavaScript Arrow implementation don't have
       access to the compression APIs to decompress `zstd` and `lz4` IPC buffers.
   [^5]: [Arrow Implementation Status: IPC Format](https://arrow.apache.org/docs/status.html#ipc-format)
   ```
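
For reference, the `Accept-Encoding` rules quoted in the hunk above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the parser the PR ships in `python/`: the names `parse_accept_encoding` and `choose_encoding` are hypothetical, a malformed qvalue is treated as `q=0`, ties are broken by the server's own preference order, and a missing header is taken to mean the server may pick freely.

```python
from typing import Optional, Sequence


def parse_accept_encoding(header: str) -> dict[str, float]:
    """Parse an Accept-Encoding value into a {content-coding: qvalue} map."""
    qvalues: dict[str, float] = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, *params = (part.strip() for part in item.split(";"))
        q = 1.0  # a coding listed without a qvalue defaults to q=1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed qvalue means "not acceptable"
        qvalues[coding.lower()] = q
    return qvalues


def choose_encoding(header: Optional[str], available: Sequence[str]) -> Optional[str]:
    """Pick a content-coding for the response.

    `available` lists the server's compressed codings in order of preference
    and does not include "identity". Returns None when nothing is acceptable,
    in which case the server would answer 406 Not Acceptable.
    """
    if header is None:
        # No header at all: the client stated no preference.
        return available[0] if available else "identity"

    qvalues = parse_accept_encoding(header)
    default_q = qvalues.get("*", 0.0)  # rule 2: "*" matches unlisted codings

    # Rules 1 and 3: the acceptable coding with the highest non-zero qvalue.
    best, best_q = None, 0.0
    for coding in available:
        q = qvalues.get(coding.lower(), default_q)
        if q > best_q:  # strict comparison keeps the server's order on ties
            best, best_q = coding, q

    # Rule 4: identity is acceptable unless refused explicitly or via "*;q=0".
    if "identity" in qvalues:
        identity_q = qvalues["identity"]
    elif "*" in qvalues:
        identity_q = qvalues["*"]
    else:
        # Implicitly acceptable, used only as a fallback.
        return best if best is not None else "identity"

    return "identity" if identity_q > best_q else best
```

With the first example header from the README, a server that supports `zstd`, `br` and `gzip` would pick `zstd`. A server that only has `gzip` and `br` would pick `br`, and a server with no compression available would get `None` back because of `identity;q=0`.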

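
Similarly, the README's advice to compare only the media type part of `Content-Type`, and to read the optional `codecs` parameter, could look like the sketch below. It is a simplified illustration: `parse_media_type` and `ipc_codecs` are hypothetical helpers, and the parsing assumes a single media type per header value with no `;` inside quoted parameter values.

```python
ARROW_IPC = "application/vnd.apache.arrow.ipc"


def parse_media_type(header: str) -> tuple[str, dict[str, str]]:
    """Split a header value into (media type, parameters).

    Simplification: assumes quoted parameter values never contain ';'.
    """
    media_type, *params = header.split(";")
    parsed: dict[str, str] = {}
    for param in params:
        name, sep, value = param.partition("=")
        if sep:
            # Quotes around the value are optional, so strip them if present.
            parsed[name.strip().lower()] = value.strip().strip('"')
    return media_type.strip().lower(), parsed


def ipc_codecs(header: str) -> list[str]:
    """Return the codecs listed for an Arrow IPC media type, or [] if none."""
    media_type, params = parse_media_type(header)
    if media_type != ARROW_IPC:
        return []
    return [c.strip().lower() for c in params.get("codecs", "").split(",") if c.strip()]
```

For example, `parse_media_type("application/json; charset=utf-8")[0]` is `"application/json"`, so a check against the bare media type still matches when parameters are present.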


