alamb commented on issue #111: URL: https://github.com/apache/arrow-rs/issues/111#issuecomment-917611415
The approach that @jorgecarleitao took in https://github.com/jorgecarleitao/arrow2/pull/260 is quite clever. Rather than a single struct that can read parquet files both synchronously and asynchronously, I think he effectively added a second API that reads the required portions of the file into memory buffers and then shares the encoding/decoding logic with the serialized reader.

Thus, one idea for adding async support to the `parquet` crate might be to follow this example and create a new reader like `AsyncFileReader` (alongside the existing `SerializedFileReader`) that handles the I/O to fetch the required parts (e.g. the bytes that contain metadata, or encoded pages) and then calls into the existing encoder/decoder logic.

Something like:

```
          ┌────────────────────────────┐
          │ Existing common encoding + │
          │decoding logic that operates│
          │     on bytes in memory     │
          └────────────────────────────┘
                        ▲
           ┌────────────┴──────────┐
           │                       │
           │                       │
      .─────────.             .─────────.
   ,─'           '─.       ,─'           '─.
  ;    Logic to     :     ;  new logic to   :
  :   read bytes    ;     :   read bytes    ;
   ╲ synchronously ╱       ╲asynchronously ╱
    '─.         ,─'         '─.         ,─'
       `───────'               `───────'
           ▲                       ▲
      ┌────┘                       └──────┐
      │                                   │
      │                                   │
┌───────────────────────┐     ┌───────────────────────┐
│ SerializedFileReader  │     │    AsyncFileReader    │
└───────────────────────┘     └───────────────────────┘
    existing parquet                     new
          crate                    entrypoint for
                                    async reader
```

Here is the current read API: https://docs.rs/parquet/5.3.0/parquet/file/reader/index.html

cc @yjshen
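To make the split in the diagram a bit more concrete, here is a minimal, std-only sketch. It is not the `parquet` crate's API and not what arrow2 does: `SyncReader`, `AsyncReader`, `InMemorySource`, `decode_footer`, `fake_file` and the tiny `block_on` executor are all hypothetical names made up for illustration. The only format fact it relies on is that a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The point is just that both readers do their own I/O and then call the same in-memory decoding function.

```rust
use std::future::Future;
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::ops::Range;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

const FOOTER_SIZE: usize = 8; // 4-byte metadata length + 4-byte magic
const MAGIC: &[u8; 4] = b"PAR1";

/// Shared decoding logic: only ever sees bytes that are already in memory.
fn decode_footer(tail: &[u8]) -> Result<u32, String> {
    if tail.len() != FOOTER_SIZE {
        return Err(format!("expected {} footer bytes, got {}", FOOTER_SIZE, tail.len()));
    }
    if tail[4..] != MAGIC[..] {
        return Err("missing PAR1 magic".to_string());
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Hypothetical synchronous path (a stand-in, not the real SerializedFileReader).
struct SyncReader<R: Read + Seek> {
    inner: R,
}

impl<R: Read + Seek> SyncReader<R> {
    fn metadata_len(&mut self) -> Result<u32, String> {
        let mut tail = [0u8; FOOTER_SIZE];
        self.inner
            .seek(SeekFrom::End(-(FOOTER_SIZE as i64)))
            .map_err(|e| e.to_string())?;
        self.inner.read_exact(&mut tail).map_err(|e| e.to_string())?;
        decode_footer(&tail) // same function as the async path
    }
}

/// Hypothetical async byte source; a real one might fetch ranges from object storage.
struct InMemorySource {
    data: Vec<u8>,
}

impl InMemorySource {
    async fn fetch(&self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.data
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| "range out of bounds".to_string())
    }
}

/// Hypothetical asynchronous path: awaits the bytes, then reuses `decode_footer`.
struct AsyncReader {
    source: InMemorySource,
}

impl AsyncReader {
    async fn metadata_len(&self) -> Result<u32, String> {
        let len = self.source.data.len();
        let start = len
            .checked_sub(FOOTER_SIZE)
            .ok_or_else(|| "file too small".to_string())?;
        let tail = self.source.fetch(start..len).await?;
        decode_footer(&tail)
    }
}

/// Toy executor so the sketch needs no async runtime; it simply polls in a loop.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Builds placeholder bytes that only have a valid footer tail (not a real Parquet file).
fn fake_file(metadata_len: u32) -> Vec<u8> {
    let mut bytes = vec![0u8; 16];
    bytes.extend_from_slice(&metadata_len.to_le_bytes());
    bytes.extend_from_slice(MAGIC);
    bytes
}

fn sync_metadata_len(bytes: Vec<u8>) -> Result<u32, String> {
    SyncReader { inner: Cursor::new(bytes) }.metadata_len()
}

fn async_metadata_len(bytes: Vec<u8>) -> Result<u32, String> {
    block_on(AsyncReader { source: InMemorySource { data: bytes } }.metadata_len())
}
```

Calling `sync_metadata_len(fake_file(n))` and `async_metadata_len(fake_file(n))` should give the same answer, since the only difference between the two paths is how the last eight bytes get into memory. In a real implementation the async source would presumably be a trait so that different backends can plug in, and the shared logic would be the existing page and metadata decoders rather than this toy footer check.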
