[GitHub] [arrow-datafusion] metesynnada opened a new pull request, #4525: Avoid reading the entire file in ChunkedStore

GitBox Tue, 06 Dec 2022 04:39:27 -0800


metesynnada opened a new pull request, #4525:
URL: https://github.com/apache/arrow-datafusion/pull/4525


   # Which issue does this PR close?
   
   Closes #4524.
   
   # Rationale for this change
   
   We would like to keep memory usage low while processing queries that involve 
only pipeline-able operators. The first step toward this is to read files and 
byte streams incrementally. However, the current `ChunkedStore` implementation 
materializes all of the data in memory *before* creating the `Result<Bytes>` 
stream. This PR fixes the implementation so that the file is actually read in 
chunks, not just output in chunks.
   
   # What changes are included in this PR?
   
   We changed the `get` method of `ChunkedStore` so that it actually reads the 
file in chunks, instead of reading the whole file and then splitting it into 
chunks.
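
To illustrate the difference between the two behaviors, here is a minimal sketch using only the Rust standard library. It is not the code in this PR: the real `ChunkedStore` works with `object_store`'s async byte streams, while this sketch uses a synchronous `std::io::Read` source. The names `ChunkedReader` and `read_all_then_split` are hypothetical and exist only for this illustration.

```rust
use std::io::{self, Read};

/// Hypothetical helper: yields the source in pieces of at most `chunk_size`
/// bytes, pulling each piece from the reader only when it is requested.
pub struct ChunkedReader<R: Read> {
    reader: R,
    chunk_size: usize,
    done: bool,
}

impl<R: Read> ChunkedReader<R> {
    pub fn new(reader: R, chunk_size: usize) -> Self {
        assert!(chunk_size > 0, "chunk_size must be non-zero");
        Self { reader, chunk_size, done: false }
    }
}

impl<R: Read> Iterator for ChunkedReader<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let mut buf = Vec::with_capacity(self.chunk_size);
        // `take` caps this step at `chunk_size` bytes, so at most one chunk
        // is held in memory per call; `read_to_end` stops at the cap or EOF.
        let limit = self.chunk_size as u64;
        match self.reader.by_ref().take(limit).read_to_end(&mut buf) {
            Ok(0) => {
                // Nothing left to read: the source is exhausted.
                self.done = true;
                None
            }
            Ok(_) => Some(Ok(buf)),
            Err(e) => {
                // Surface the error once, then stop iterating.
                self.done = true;
                Some(Err(e))
            }
        }
    }
}

/// Hypothetical contrast case: reads everything first, then splits.
/// Peak memory grows with the size of the whole input.
pub fn read_all_then_split<R: Read>(
    mut reader: R,
    chunk_size: usize,
) -> io::Result<Vec<Vec<u8>>> {
    let mut all = Vec::new();
    reader.read_to_end(&mut all)?;
    Ok(all.chunks(chunk_size).map(|c| c.to_vec()).collect())
}
```

Both functions produce the same sequence of chunks for the same input. The difference is that `ChunkedReader` holds roughly one chunk at a time, while `read_all_then_split` buffers the entire input before it emits anything.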
   
   # Are these changes tested?
   
   Chunked byte stream conversion is tested for
   
   - CSV
   - JSON
   - AVRO
   - Byte array.
   
   # Are there any user-facing changes?
   
   Users can now use `ChunkedStore` to read various formats incrementally. 
`ChunkedStore` can also be combined with the `AmazonS3` store for increased 
reading performance.
   
   # Discussion for future work
   
   **(1)** `ChunkedStore`'s current use case seems to be mostly subsumed by 
[arrow_json](https://docs.rs/arrow-json/latest/arrow_json/#) and 
[arrow_csv](https://docs.rs/arrow-csv/latest/arrow_csv/#). They can also *read 
and output* files in chunks if we supply `false` to `with_collect_statistics` 
(and `1` to `target_partition` in certain cases, like reading FIFO files).
   
   Therefore, `ChunkedStore` may not be required anymore. If we are not missing 
something and this is indeed the case, we can discuss deprecating it in the 
future.
   
   **(2)** For byte streams, the entire data has to be loaded into memory 
unless a schema is defined, because `infer_schema` operates on the entire 
dataset.
   
   We probably want to fix this too.


