[I] Parallel read for json compressed files when it should not [arrow-datafusion]

via GitHub Mon, 11 Mar 2024 17:53:50 -0700


yongda-fan opened a new issue, #9564:
URL: https://github.com/apache/arrow-datafusion/issues/9564


   ### Describe the bug
   
   When reading a compressed JSON file with `repartition_file_scans = true` 
(the default), DataFusion tries to decompress the file using parallel reads. 
This causes `ArrowError(IoError("invalid gzip header", Custom { kind: 
InvalidInput, error: "invalid gzip header" }), None)`, because every byte range 
after the first starts in the middle of the file, where there is no gzip 
header.
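
To illustrate why a mid-file byte range fails, here is a minimal, std-only sketch. It is not DataFusion code: `split_ranges`, `starts_with_gzip_header`, and `fake_gzip_file` are hypothetical helpers, and the "file" is a hand-written 10-byte gzip header followed by filler bytes rather than real compressed data. The only fact taken from the gzip format is that a gzip member begins with the magic bytes `0x1f 0x8b`.

```rust
use std::ops::Range;

/// The two magic bytes that open every gzip member (RFC 1952).
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];

/// Hypothetical helper: split `size` bytes into at most `n` contiguous ranges,
/// roughly how a parallel scan might divide an uncompressed file.
fn split_ranges(size: usize, n: usize) -> Vec<Range<usize>> {
    assert!(n > 0, "need at least one partition");
    let chunk = (size + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| (i * chunk).min(size)..((i + 1) * chunk).min(size))
        .filter(|r| !r.is_empty())
        .collect()
}

/// Returns true if `bytes` begins with the gzip magic number.
fn starts_with_gzip_header(bytes: &[u8]) -> bool {
    bytes.starts_with(&GZIP_MAGIC)
}

/// Hypothetical stand-in for a gzip file: a 10-byte gzip header followed by
/// 30 filler bytes in place of the deflate stream (assumption: not real data).
fn fake_gzip_file() -> Vec<u8> {
    let mut file = vec![0x1f, 0x8b, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0xff];
    file.extend(std::iter::repeat(0xAA).take(30));
    file
}
```

Splitting `fake_gzip_file()` with `split_ranges(file.len(), 4)` gives four ranges, and only the slice for the first range passes `starts_with_gzip_header`. A gzip decoder handed any of the other slices has no header to parse, which is consistent with the `invalid gzip header` error reported here.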
   
   ### To Reproduce
   
   ```rust
   let df = ctx
       .read_json(
           "C:/path/to/file.gz",
           NdJsonReadOptions::default()
               .file_compression_type(FileCompressionType::GZIP)
               .file_extension("gz")
               .schema(&s3_user_schema()),
       )
       .await
       .unwrap();
   ```
   
   ### Expected behavior
   
   The data should be read correctly, without errors.
   
   ### Additional context
   
   By putting a print statement before the `JsonOpener`, we can see:
   ```
   GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location: 
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size: 
149873338, e_tag: Some("0-612f2a7828b80-8eee2ba"), version: None }, range: 
0..4685075 }
   GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location: 
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size: 
149873338, e_tag: Some("0-612f2a7828b80-8eee2ba"), version: None }, range: 
37468620..42151887 }
   GetResult { payload: GetResultPayload(File), meta: ObjectMeta { location: 
Path { raw: "C:/some/path/foo.gz" }, last_modified: 2024-03-06T00:01:02Z, size: 
149873338, e_tag: Some("0-6.......
   ```
   which suggests that it is indeed reading a compressed JSON file in parallel.
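
One possible shape for the expected behavior, sketched with std only: plan a single whole-file range whenever the file is compressed, and split into byte ranges only for uncompressed input. This is an assumption about how a fix could look, not DataFusion's actual planner. The `Compression` enum and `plan_ranges` function are hypothetical names introduced for illustration.

```rust
use std::ops::Range;

/// Hypothetical compression tag (not DataFusion's `FileCompressionType`).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    Uncompressed,
    Gzip,
}

/// Hypothetical planner: choose the byte ranges to scan for one file.
/// A compressed file is never split, because a decoder can only start
/// at the beginning of the stream.
fn plan_ranges(size: u64, compression: Compression, target_partitions: u64) -> Vec<Range<u64>> {
    if size == 0 {
        return Vec::new();
    }
    if compression != Compression::Uncompressed || target_partitions <= 1 {
        return vec![0..size]; // one reader for the whole file
    }
    let chunk = (size + target_partitions - 1) / target_partitions; // ceiling division
    (0..target_partitions)
        .map(|i| (i * chunk).min(size)..((i + 1) * chunk).min(size))
        .filter(|r| !r.is_empty())
        .collect()
}
```

With the file size from the log above, `plan_ranges(149_873_338, Compression::Gzip, 4)` yields the single range `0..149873338`, while the same call with `Compression::Uncompressed` yields four contiguous ranges that together cover the file.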


