[GitHub] [arrow-rs] alamb commented on issue #111: async parquet reader

GitBox Sun, 12 Sep 2021 03:52:03 -0700


alamb commented on issue #111:
URL: https://github.com/apache/arrow-rs/issues/111#issuecomment-917611415



   The approach that @jorgecarleitao took in 
https://github.com/jorgecarleitao/arrow2/pull/260 is quite clever. Rather than 
building a single struct that can read parquet files both synchronously and 
asynchronously, I think he effectively added a second API that reads the 
required portions of the file into memory buffers and then reuses the 
encoding/decoding logic shared with the serialized reader. 
   
   Thus, one idea for adding async support to the `parquet` crate might be to 
follow this example and create a new reader like `AsyncFileReader` (alongside 
the existing `SerializedFileReader`) that handles the I/O to fetch the required 
parts (e.g. the bytes that contain metadata or encoded pages) and then calls 
into the existing encoder/decoder logic.
   
   Something like:
   
   ```
                  ┌────────────────────────────┐                
                  │ Existing common encoding + │                
                  │decoding logic that operates│                
                  │     on bytes in memory     │                
                  └────────────────────────────┘                
                                 ▲                              
                    ┌────────────┴──────────┐                   
                    │                       │                   
                    │                       │                   
               .─────────.             .─────────.              
            ,─'           '─.       ,─'           '─.           
           ;    Logic to     :     ;  new logic to   :          
           :   read bytes    ;     :   read bytes    ;          
            ╲ synchronously ╱       ╲asynchronously ╱           
             '─.         ,─'         '─.         ,─'            
                `───────'               `───────'               
                    ▲                       ▲                   
               ┌────┘                       └──────┐            
               │                                   │            
               │                                   │            
   ┌───────────────────────┐           ┌───────────────────────┐
   │ SerializedFileReader  │           │    AsyncFileReader    │
   └───────────────────────┘           └───────────────────────┘
                                                                
       existing parquet                           new           
             crate                       entrypoint for async   
                                                reader          
   ```
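
   To make the split in the diagram concrete, here is a rough, std-only sketch
(not taken from either crate). All names in it (`decode_footer_len`,
`footer_len_sync`, `footer_len_async`, `InMemorySource`, `block_on`) are
hypothetical and exist only for illustration. As a stand-in for the shared
decoding logic it parses the last 8 bytes of a parquet file (a 4-byte
little-endian footer length followed by the `PAR1` magic), and the "async"
source is an in-memory buffer standing in for something like an object store
client:

```rust
use std::future::Future;
use std::io::{self, Cursor, Read, Seek, SeekFrom};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

const MAGIC: &[u8] = b"PAR1";

/// Shared logic: works only on bytes already in memory, so it does not
/// care how those bytes were fetched.
fn decode_footer_len(tail: &[u8]) -> Option<u32> {
    if tail.len() != 8 || !tail.ends_with(MAGIC) {
        return None;
    }
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Synchronous entry point: blocking I/O, then the shared decoder.
fn footer_len_sync<R: Read + Seek>(mut reader: R) -> io::Result<Option<u32>> {
    reader.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    reader.read_exact(&mut tail)?;
    Ok(decode_footer_len(&tail))
}

/// Hypothetical async byte source (an in-memory buffer for this sketch).
struct InMemorySource {
    data: Vec<u8>,
}

impl InMemorySource {
    /// Returns the last `n` bytes (or fewer if the buffer is shorter).
    async fn fetch_tail(&self, n: usize) -> Vec<u8> {
        let start = self.data.len().saturating_sub(n);
        self.data[start..].to_vec()
    }
}

/// Asynchronous entry point: awaits the I/O, then the same shared decoder.
async fn footer_len_async(source: &InMemorySource) -> Option<u32> {
    let tail = source.fetch_tail(8).await;
    decode_footer_len(&tail)
}

/// Tiny busy-polling executor so the sketch runs without an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Builds a fake file: magic, 16 placeholder bytes, footer length, magic.
fn sample_file(footer_len: u32) -> Vec<u8> {
    let mut file = Vec::new();
    file.extend_from_slice(MAGIC);
    file.extend_from_slice(&[0u8; 16]);
    file.extend_from_slice(&footer_len.to_le_bytes());
    file.extend_from_slice(MAGIC);
    file
}

fn main() {
    let file = sample_file(16);
    let sync_len = footer_len_sync(Cursor::new(file.clone())).unwrap();
    let async_len = block_on(footer_len_async(&InMemorySource { data: file }));
    // Both entry points go through the same decoder, so they must agree.
    assert_eq!(sync_len, async_len);
    println!("footer length: {:?}", sync_len);
}
```

   The point is only that the two entry points differ in how they obtain the
bytes; a real `AsyncFileReader` would presumably need the same treatment for
metadata and page reads.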
   
   Here is the current read API:
    https://docs.rs/parquet/5.3.0/parquet/file/reader/index.html
   
   cc @yjshen 


