nkemnitz opened a new issue, #774:
URL: https://github.com/apache/arrow-rs-object-store/issues/774

   **Describe the bug**
   
   `ObjectStore::get` on the GCS backend fails with `Error::Generic { store: 
"GCS", source: Header { source: MissingContentLength } }` for any object stored 
with `Content-Encoding: gzip`. GCS serves these via `Transfer-Encoding: 
chunked` with **no `Content-Length`**, and object_store treats a missing 
`Content-Length` as fatal.
   
   It manifests in two ways, and crucially **a client cannot fully avoid it**:
   
   - **Default reads (no `Accept-Encoding`):** GCS applies *decompressive 
transcoding* — it decompresses the object server-side and streams the result 
chunked, with no `Content-Length`. Every gzip object fails, at any size.
   - **Even with `Accept-Encoding: gzip`** (which returns the raw stored 
bytes): objects whose **stored size exceeds ~8 MiB** are *still* served chunked 
with no `Content-Length` (empirically the cutover is between 9 and 10 MB), so 
they fail too. Only *small* objects read with `Accept-Encoding: gzip` succeed.
   
   So whether you receive the transcoded (uncompressed) bytes or the raw 
(compressed) bytes, the response is chunked with no `Content-Length` — and 
object_store rejects it before returning any data. `head()` and range reads on 
gzip objects fail as well.
   
   These are valid HTTP/1.1 responses: a chunked body is self-delimiting, and 
[RFC 9112 §6.2](https://www.rfc-editor.org/rfc/rfc9112#section-6.2) states a 
sender **MUST NOT** send `Content-Length` together with `Transfer-Encoding`;
   [RFC 9112 §6.3](https://www.rfc-editor.org/rfc/rfc9112#section-6.3) says the 
chunked framing determines the body length and overrides any `Content-Length`. 
object_store is therefore requiring a header the spec forbids on exactly these 
responses.
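
To make the framing rule concrete, here is a minimal, std-only sketch of how a client could decide how a response body is delimited, following RFC 9112 §6.3. The `BodyFraming` enum and `body_framing` helper are hypothetical names for illustration; they are not part of object_store or of any HTTP library.

```rust
/// How the length of an HTTP/1.1 response body is determined.
#[derive(Debug, PartialEq)]
enum BodyFraming {
    /// `Transfer-Encoding` ends in `chunked`: the chunk framing delimits the body.
    Chunked,
    /// A valid `Content-Length` and no `Transfer-Encoding`.
    Length(u64),
    /// No usable length information: the body runs until the connection closes.
    UntilClose,
}

/// Hypothetical helper: pick the framing from `(name, value)` header pairs.
fn body_framing(headers: &[(&str, &str)]) -> Result<BodyFraming, String> {
    // Header names are case-insensitive.
    let find = |name: &str| {
        headers
            .iter()
            .find(|(k, _)| k.eq_ignore_ascii_case(name))
            .map(|(_, v)| v.trim())
    };

    // Transfer-Encoding takes precedence over any Content-Length.
    if let Some(te) = find("transfer-encoding") {
        // `chunked` must be the final coding for the body to be self-delimiting.
        let last = te.rsplit(',').next().unwrap_or("").trim();
        if last.eq_ignore_ascii_case("chunked") {
            return Ok(BodyFraming::Chunked);
        }
        return Ok(BodyFraming::UntilClose);
    }

    match find("content-length") {
        Some(v) => v
            .parse::<u64>()
            .map(BodyFraming::Length)
            .map_err(|e| format!("invalid Content-Length {v:?}: {e}")),
        // A missing Content-Length is not an error for a response.
        None => Ok(BodyFraming::UntilClose),
    }
}
```

Fed the headers from the GCS responses described above (`transfer-encoding: chunked`, no `content-length`), this returns `BodyFraming::Chunked` rather than an error.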
   
   **To Reproduce**
   
   Observe the offending response with no credentials against Google's public 
demo data (the `?cb=` cache-buster forces origin past the public edge cache):
   
   ```console
   $ curl -sD - -o /dev/null \
       "https://storage.googleapis.com/neuroglancer-public-data/kasthuri2011/ground_truth/6_6_30/5376-5440_6656-6720_896-960?cb=$RANDOM" \
       | grep -iE "HTTP/|content-length|transfer-encoding|x-goog-stored-content-encoding"
   
   HTTP/2 200
   transfer-encoding: chunked
   x-goog-stored-content-encoding: gzip
   # (no content-length)
   ```
   
   object_store then fails on that same public object (authenticated read, using
either a service account or ambient credentials on GCP; verified with
object_store 0.13.1 and main `de0029a`):
   
   ```rust
   use object_store::gcp::GoogleCloudStorageBuilder;
   use object_store::{path::Path, ObjectStore, ObjectStoreExt};
   
   let store = GoogleCloudStorageBuilder::new()
       .with_bucket_name("neuroglancer-public-data")
       .with_service_account_path(sa) // or workload identity on GCP
       .build()?;
   store
       .get(&Path::from(
           "kasthuri2011/ground_truth/6_6_30/5376-5440_6656-6720_896-960",
       ))
       // Err: Generic { store: "GCS", source: Header { source: MissingContentLength } }
       .await?;
   ```
   
   For a self-contained reproduction, upload a gzip-encoded object of any size
to your own bucket (one larger than ~8 MiB fails even with `Accept-Encoding: gzip`):
   
   ```console
   head -c 20000000 /dev/urandom | gzip \
     | gsutil -h "Content-Encoding:gzip" cp - gs://YOUR_BUCKET/big.gz
   ```
   
   > **Note:** the bundled `fake-gcs-server` does **not** emulate decompressive
   > transcoding (it always returns a `Content-Length`), so it cannot reproduce
   > this. The crate's own `MockServer` can: push a response with a chunked body
   > and no `Content-Length` header.
   
   **Expected behavior**
   
   A full GET of a valid, self-delimiting (chunked / HTTP-2) response should 
succeed by reading the body to completion rather than requiring a 
`Content-Length` header that the HTTP spec forbids on chunked responses.
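
As an illustration of "reading the body to completion", the sketch below decodes a chunked body with only the standard library and recovers the size from the bytes actually read. It is an assumption-laden example, not object_store's code: `read_chunked_body` is a hypothetical name, and in a real client the HTTP library normally performs this decoding, so the caller only needs to consume the body stream until it ends.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical helper: decode an HTTP/1.1 chunked body from `reader`,
/// stopping at the terminating zero-size chunk.
fn read_chunked_body<R: BufRead>(mut reader: R) -> io::Result<Vec<u8>> {
    let bad = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());
    let mut body = Vec::new();
    let mut line = String::new();

    loop {
        // Chunk header: hex size, optional ";extensions", then CRLF.
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "missing chunk header",
            ));
        }
        let size_field = line.trim_end().split(';').next().unwrap_or("").trim();
        let size = usize::from_str_radix(size_field, 16)
            .map_err(|_| bad("invalid chunk size"))?;

        if size == 0 {
            // Last chunk: skip optional trailer fields up to the blank line.
            loop {
                line.clear();
                let n = reader.read_line(&mut line)?;
                if n == 0 || line.trim_end().is_empty() {
                    return Ok(body);
                }
            }
        }

        // Chunk data, followed by CRLF. (A production decoder would cap `size`.)
        let start = body.len();
        body.resize(start + size, 0);
        reader.read_exact(&mut body[start..])?;
        let mut crlf = [0u8; 2];
        reader.read_exact(&mut crlf)?;
        if crlf != *b"\r\n" {
            return Err(bad("chunk data not followed by CRLF"));
        }
    }
}
```

The body length is simply `body.len()` once the final chunk has been seen, so no `Content-Length` header is needed to know the size after a full read.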
   
   **Additional context**
   
   - Affects gzip only; `br`/`zstd`/uncompressed objects are served identity 
(with `Content-Length`) and read fine.
   - Root cause: `header_meta` requires `CONTENT_LENGTH` unconditionally ([`header.rs#L144`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/header.rs#L144), in [`header_meta` L114](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/header.rs#L114)), and the GET path derives `ObjectMeta.size`/`range` from it before streaming ([`get.rs#L314`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/get.rs#L314) → [`#L333`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/get.rs#L333)).
   - Wire evidence (>8 MiB gzip object, authenticated, `Accept-Encoding: gzip`):
     `Transfer-Encoding: chunked`, no `Content-Length`, 
`x-goog-stored-content-encoding: gzip`,
     `x-goog-stored-content-length: <n>`.
   
   <details><summary>Prior art — every other major GCS client reads to EOF 
(commit-pinned)</summary>
   
   | client | behavior | source |
   |---|---|---|
   | google-cloud-storage (Python) | streams via `response.iter_content()`; `x-goog-stored-content-length` only a retry heuristic | [download.py#L145](https://github.com/googleapis/python-storage/blob/ab4997ce0f7b85947e84b226bd0edf6d714a946a/google/cloud/storage/_media/requests/download.py#L145), [#L163](https://github.com/googleapis/python-storage/blob/ab4997ce0f7b85947e84b226bd0edf6d714a946a/google/cloud/storage/_media/requests/download.py#L163) |
   | cloud.google.com/go/storage (Go) | `Reader.Remain()` returns `-1` for chunked / `Decompressed`; CRC skipped on transcoding | [reader.go#L436](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/reader.go#L436), [#L74](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/reader.go#L74), [http_client.go#L1455](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/http_client.go#L1455) |
   | google-cloud-cpp (Apache Arrow C++) | reads to EOF via `HasUnreadData()`; **falls back to `x-goog-stored-content-length`** for size | [object_read_source.cc#L114](https://github.com/googleapis/google-cloud-cpp/blob/149ca440cc492a66e612e2e1f1fb385136530110/google/cloud/storage/internal/rest/object_read_source.cc#L114), [#L55](https://github.com/googleapis/google-cloud-cpp/blob/149ca440cc492a66e612e2e1f1fb385136530110/google/cloud/storage/internal/rest/object_read_source.cc#L55) |
   | TensorStore (C++ GCS kvstore) | accumulates body via libcurl callback to EOF; enables Accept-Encoding | [gcs_key_value_store.cc#L578](https://github.com/google/tensorstore/blob/613280f459520c7dddc9aa11a41412a0c2a6b913/tensorstore/kvstore/gcs_http/gcs_key_value_store.cc#L578) |
   | gcsfs (fsspec) | reads the aiohttp stream to EOF | [core.py#L118](https://github.com/fsspec/gcsfs/blob/f707e61fa75dcb4dc6b7bad0bc2321d425336a3a/gcsfs/core.py#L118) |
   | rclone (Go) | GCS backend returns `res.Body`, reads to EOF | [googlecloudstorage.go#L1385](https://github.com/rclone/rclone/blob/59c86b01bb39624650badd39f3acfd20be2b743b/backend/googlecloudstorage/googlecloudstorage.go#L1385), [issue #2658](https://github.com/rclone/rclone/issues/2658) |
   | smart_open (Python) | same bug class, fixed by delegating to google-cloud-storage | [issue #422](https://github.com/piskvorky/smart_open/issues/422) |
   
   </details>
   
   :robot: helped writing the report.

