[GitHub] [arrow-rs] tustvold opened a new pull request #1154: POC: Async parquet reader

GitBox Tue, 11 Jan 2022 02:21:23 -0800


tustvold opened a new pull request #1154:
URL: https://github.com/apache/arrow-rs/pull/1154



   **Proof of concept, tests are currently extremely limited**
   
   # Which issue does this PR close?
   
   Closes #111.
   
   # Rationale for this change
   
   See the ticket. In particular, I wanted to confirm that it is possible to 
create an async parquet reader without any major changes to the parquet crate. 
This seems to be a frequent ask from the community, and I think we could 
support it without any major churn.
   
   # What changes are included in this PR?
   
   Adds a layer of indirection to `array_reader` to abstract it away from 
files. _I think this change may stand on its own merits_.
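
   To illustrate the kind of indirection meant here, a minimal sketch follows. It is not the code in this PR: `ColumnChunkSource`, `SeekSource`, `BufferedSource` and `total_chunk_bytes` are hypothetical names, and raw byte vectors stand in for the real page readers. The point is only that the consumer depends on a trait, so chunks can come from a file or from memory.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical abstraction: anything that can hand back the bytes of a
/// column chunk, regardless of where those bytes live.
trait ColumnChunkSource {
    fn num_row_groups(&self) -> usize;
    fn column_chunk(&mut self, row_group: usize, column: usize) -> io::Result<Vec<u8>>;
}

/// Source backed by a seekable reader (e.g. a file).
/// `ranges[row_group][column]` is an assumed (offset, length) pair.
struct SeekSource<R> {
    reader: R,
    ranges: Vec<Vec<(u64, usize)>>,
}

impl<R: Read + Seek> ColumnChunkSource for SeekSource<R> {
    fn num_row_groups(&self) -> usize {
        self.ranges.len()
    }

    fn column_chunk(&mut self, row_group: usize, column: usize) -> io::Result<Vec<u8>> {
        let (offset, len) = self.ranges[row_group][column];
        self.reader.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        self.reader.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// Source whose chunks were already fetched into memory.
struct BufferedSource {
    chunks: Vec<Vec<Vec<u8>>>,
}

impl ColumnChunkSource for BufferedSource {
    fn num_row_groups(&self) -> usize {
        self.chunks.len()
    }

    fn column_chunk(&mut self, row_group: usize, column: usize) -> io::Result<Vec<u8>> {
        Ok(self.chunks[row_group][column].clone())
    }
}

/// Stand-in for an array reader: it only sees the trait, never a file.
fn total_chunk_bytes(source: &mut dyn ColumnChunkSource, columns: usize) -> io::Result<usize> {
    let mut total = 0;
    for row_group in 0..source.num_row_groups() {
        for column in 0..columns {
            total += source.column_chunk(row_group, column)?.len();
        }
    }
    Ok(total)
}
```

   With this shape, `total_chunk_bytes` behaves the same whether it is handed a `SeekSource` wrapping a file or a `BufferedSource` filled by some other means.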
   
   It then adds a `ParquetRecordBatchStream`, which is a `Stream` that yields 
`RecordBatch`. Under the hood, this reads row groups into memory asynchronously 
and then feeds them into the existing non-async decoders.
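
   The control flow described above can be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the implementation in this PR: `InMemorySource`, `RowGroupMeta`, `BatchReader`, `decode_row_group` and `block_on` are hypothetical, byte vectors stand in for `RecordBatch`, and an async `next_batch` method stands in for `Stream::poll_next` because `std` has no stable `Stream` trait.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical async byte source (stand-in for an async file or object store).
struct InMemorySource {
    data: Vec<u8>,
}

impl InMemorySource {
    /// Asynchronously fetch `len` bytes starting at `start`.
    async fn fetch(&self, start: usize, len: usize) -> Vec<u8> {
        self.data[start..start + len].to_vec()
    }
}

/// Hypothetical row-group descriptor: the byte range it occupies.
struct RowGroupMeta {
    offset: usize,
    len: usize,
}

/// Synchronous "decoder": turns a fully buffered row group into batches.
/// `batch_size` must be non-zero.
fn decode_row_group(buf: &[u8], batch_size: usize) -> Vec<Vec<u8>> {
    buf.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Yields batches one at a time; only the IO step is async.
struct BatchReader {
    source: InMemorySource,
    row_groups: std::vec::IntoIter<RowGroupMeta>,
    pending: VecDeque<Vec<u8>>,
    batch_size: usize,
}

impl BatchReader {
    fn new(source: InMemorySource, row_groups: Vec<RowGroupMeta>, batch_size: usize) -> Self {
        Self {
            source,
            row_groups: row_groups.into_iter(),
            pending: VecDeque::new(),
            batch_size,
        }
    }

    async fn next_batch(&mut self) -> Option<Vec<u8>> {
        loop {
            // Drain batches already decoded from the current row group.
            if let Some(batch) = self.pending.pop_front() {
                return Some(batch);
            }
            // Otherwise buffer the next row group (async), then decode it (sync).
            let row_group = self.row_groups.next()?;
            let buf = self.source.fetch(row_group.offset, row_group.len).await;
            self.pending.extend(decode_row_group(&buf, self.batch_size));
        }
    }
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor for the demo; real code would use a runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Drive the reader to completion and collect every batch.
fn collect_batches(mut reader: BatchReader) -> Vec<Vec<u8>> {
    block_on(async move {
        let mut out = Vec::new();
        while let Some(batch) = reader.next_batch().await {
            out.push(batch);
        }
        out
    })
}
```

   Only `fetch` is awaited; `decode_row_group` stays an ordinary function, which is why the decoders would not need async counterparts in a design like this.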
   
   The [parquet docs](https://parquet.apache.org/documentation/latest/) 
describe the column chunk as the unit of IO, so I think buffering compressed 
row groups in memory is not an impractical approach. It also avoids having to 
maintain sync and async versions of all the decoders, readers, etc.
   
   # Are there any user-facing changes?
   
   The only changes are to `array_reader`, which since #1133 no longer has 
stability guarantees.
   


