yongda-fan opened a new issue, #9564:
URL: https://github.com/apache/arrow-datafusion/issues/9564
### Describe the bug
When reading a compressed JSON file with `repartition_file_scans = true`
(the default), DataFusion tries to decompress the file using parallel reads.
This fails with `ArrowError(IoError("invalid gzip header", Custom { kind:
InvalidInput, error: "invalid gzip header" }), None)` because a byte range
that starts in the middle of the file does not begin with a gzip header.
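
To illustrate why only the first partition can be decoded, here is a minimal, standard-library-only sketch. It is not DataFusion's actual partitioning code: `split_ranges` and `starts_with_gzip_header` are hypothetical helpers, and the in-memory buffer is a stand-in for a real gzip file. The only fact taken from the gzip format is that a gzip member starts with the magic bytes `0x1f 0x8b`.

```rust
use std::ops::Range;

/// First two bytes of every gzip member (ID1 = 0x1f, ID2 = 0x8b).
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];

/// Hypothetical helper: split `len` bytes into at most `n` contiguous,
/// roughly equal byte ranges, similar in spirit to a repartitioned file scan.
fn split_ranges(len: usize, n: usize) -> Vec<Range<usize>> {
    assert!(n > 0, "need at least one partition");
    let chunk = (len + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| (i * chunk).min(len)..((i + 1) * chunk).min(len))
        .filter(|r| !r.is_empty())
        .collect()
}

/// Returns true if the slice begins with the gzip magic bytes.
fn starts_with_gzip_header(bytes: &[u8]) -> bool {
    bytes.starts_with(&GZIP_MAGIC)
}

fn main() {
    // Stand-in for a gzip file: magic bytes plus two more header bytes,
    // followed by arbitrary payload bytes (not real compressed data).
    let mut file: Vec<u8> = vec![0x1f, 0x8b, 0x08, 0x00];
    file.extend(std::iter::repeat(0xAAu8).take(96));

    // Each range is what an independent reader would hand to its decoder.
    for range in split_ranges(file.len(), 4) {
        let ok = starts_with_gzip_header(&file[range.clone()]);
        println!("{:?}: gzip header present = {}", range, ok);
    }
}
```

In this sketch only the range starting at offset 0 begins with the magic bytes. A gzip decoder given any later range sees arbitrary bytes where it expects a header, which is consistent with the `invalid gzip header` error above.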
### To Reproduce
```rust
let df = ctx.read_json(
    "C:/path/to/file.gz",
    NdJsonReadOptions::default()
        .file_compression_type(FileCompressionType::GZIP)
        .file_extension("gz")
        .schema(&s3_user_schema())
).await.unwrap();
```
### Expected behavior
The data should be read correctly, without errors.
### Additional context
By putting a print statement before the `JsonOpener`, we can see:
```
GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location:
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size:
149873338, e_tag: Some("0-612f2a7828b80-8eee2ba"), version: None }, range:
0..4685075 }
GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location:
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size:
149873338, e_tag: Some("0-612f2a7828b80-8eee2ba"), version: None }, range:
37468620..42151887 }
GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location:
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size:
149873338, e_tag: Some("0-6.......
```
This suggests that the compressed JSON file is indeed being read in parallel,
with each reader given a different byte range of the same file.