metesynnada opened a new pull request, #4525: URL: https://github.com/apache/arrow-datafusion/pull/4525
# Which issue does this PR close?

Closes #4524.

# Rationale for this change

We would like to keep memory usage low while processing queries that involve only pipeline-able operators. The very first step toward this is reading files and byte streams incrementally. However, the current `ChunkedStore` implementation materializes the whole data in memory *before* creating the `Result<Bytes>` stream. This PR fixes the implementation so that the entire file is actually read in chunks, not just emitted in chunks.

# What changes are included in this PR?

We changed the `get` method of `ChunkedStore` so that it actually reads the file in chunks, rather than reading the whole file and then splitting it into chunks.

# Are these changes tested?

Chunked byte stream conversion is tested for:
- CSV
- JSON
- AVRO
- Byte arrays

# Are there any user-facing changes?

A user can now use `ChunkedStore` for incremental reading of various file types. The user can even wrap the `AmazonS3` store with `ChunkedStore` for better reading performance.

# Discussion for future work

**(1)** `ChunkedStore`'s current use case seems to be mostly subsumed by [arrow_json](https://docs.rs/arrow-json/latest/arrow_json/#) and [arrow_csv](https://docs.rs/arrow-csv/latest/arrow_csv/#). They can also *read and output* files in chunks if we supply `false` to `with_collect_statistics` (and `1` to `target_partition` in certain cases, like reading FIFO files). Therefore, `ChunkedStore` may not be required anymore. If we are not missing something and this is indeed the case, we can discuss deprecating it in the future.

**(2)** For byte streams, one has to load the entire data into memory unless one supplies a schema; i.e. `infer_schema` operates on the entire dataset. We probably want to fix this too.
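The chunk-by-chunk reading approach described above can be sketched in plain Rust using only `std::io`. Note this is an illustrative sketch, not the actual `ChunkedStore` implementation: the function name `read_in_chunks`, the synchronous `Read` source, and the fixed chunk size are all assumptions for demonstration (the real store works over async byte streams).

```rust
use std::io::{Cursor, Read};

// Illustrative helper: consume a reader in fixed-size chunks instead of
// materializing the whole payload up front. Each iteration allocates one
// chunk-sized buffer, so peak memory is O(chunk_size), not O(file_size).
fn read_in_chunks<R: Read>(mut reader: R, chunk_size: usize) -> std::io::Result<Vec<Vec<u8>>> {
    let mut chunks = Vec::new();
    loop {
        let mut buf = vec![0u8; chunk_size];
        let mut filled = 0;
        // Fill the buffer as far as possible; a short read does not imply EOF.
        while filled < chunk_size {
            let n = reader.read(&mut buf[filled..])?;
            if n == 0 {
                break; // EOF reached mid-chunk
            }
            filled += n;
        }
        if filled == 0 {
            break; // nothing left to read
        }
        buf.truncate(filled); // last chunk may be shorter than chunk_size
        chunks.push(buf);
    }
    Ok(chunks)
}

fn main() -> std::io::Result<()> {
    let data = b"0123456789abcdef"; // 16 bytes
    let chunks = read_in_chunks(Cursor::new(&data[..]), 5)?;
    assert_eq!(chunks.len(), 4); // chunks of 5 + 5 + 5 + 1 bytes
    assert_eq!(chunks[3], b"f");
    Ok(())
}
```

The key property, mirrored by this PR's change to `get`, is that no step ever holds more than one chunk of the source in memory at a time.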
