alamb opened a new issue, #6946:
URL: https://github.com/apache/arrow-rs/issues/6946

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The async_reader [`ParquetRecordBatchStreamBuilder`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/type.ParquetRecordBatchStreamBuilder.html) is awesome and fast.
   
   However, it is currently quite aggressive when reading and will buffer an entire row group in memory during the read. For data reorganization operations that read all columns, such as merging multiple files together, we see significant memory used buffering this entire input.
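
   For reference, here is a minimal sketch of this kind of merge (the input paths are placeholders and a real merge would interleave batches by sort key; the reader and writer APIs shown are the existing `parquet` crate ones):

   ```rust
   // Sketch: merge several parquet files with the async reader. Input paths
   // are placeholders; a real merge would interleave batches by sort key.
   use futures::TryStreamExt;
   use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
   use parquet::arrow::AsyncArrowWriter;
   use tokio::fs::File;

   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Open one async stream per input. Today each stream buffers an entire
       // row group (all pages of all projected columns) while it is decoded.
       let mut streams = Vec::new();
       for path in ["input_0.parquet", "input_1.parquet"] {
           let file = File::open(path).await?;
           let stream = ParquetRecordBatchStreamBuilder::new(file)
               .await?
               .with_batch_size(8192)
               .build()?;
           streams.push(stream);
       }

       let schema = streams[0].schema().clone();
       let output = File::create("merged.parquet").await?;
       let mut writer = AsyncArrowWriter::try_new(output, schema, None)?;

       // Copy all batches to the output; memory use is dominated by the
       // buffered input row groups, not by this loop.
       for mut stream in streams {
           while let Some(batch) = stream.try_next().await? {
               writer.write(&batch).await?;
           }
       }
       writer.close().await?;
       Ok(())
   }
   ```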
   
   For example, we have some files at InfluxData that contain 1M rows each with 10 columns. Using the default `WriterProperties` with a 1M row group size and 20k rows per page (sketched below) results in a parquet file with:
   1. a single 1M row group
   2. 10 column chunks, each with 50 pages
   3. a total file size of 60MB
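
   The writer configuration described above is roughly the following (a sketch; anything not shown is left at its default):

   ```rust
   use parquet::file::properties::WriterProperties;

   fn main() {
       // ~1M rows per row group and ~20k rows per data page, as described
       // above; all other writer settings stay at their defaults.
       let props = WriterProperties::builder()
           .set_max_row_group_size(1_000_000)
           .set_data_page_row_count_limit(20_000)
           .build();
       assert_eq!(props.max_row_group_size(), 1_000_000);
   }
   ```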
   
   Merging 10 such files therefore requires 600MB of memory just to buffer the parquet data, 100 such files require a 6GB buffer, and so on.
   
   We can reduce the memory required by reducing the number of files merged concurrently (the fan-in), as well as by reducing the number of rows in each row group. However, I also think there is potential for improvement in the parquet reader itself so that it does not buffer the entire row group while reading.
   
   The root cause of this memory use is that all the pages of the current row group are read into memory via the [`InMemoryRowGroup`](https://github.com/apache/arrow-rs/blob/2b452048e46ae7e8bf2b2a4b504b535197d56c21/parquet/src/arrow/async_reader/mod.rs#L797).
   
   In pictures, if we have a parquet file on disk/object store:
   ```
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃ ┌──────────────┐  ┌──────────────┐     ┌──────────────┐ ┃
   ┃ │    Page 1    │  │    Page 1    │     │    Page 1    │ ┃
   ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃
   ┃ │    Page 2    │  │    Page 2    │ ... │    Page 2    │ ┃
   ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃
   ┃ │     ...      │  │     ...      │     │     ...      │ ┃
   ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃
   ┃ │    Page N    │  │    Page N    │     │    Page N    │ ┃
   ┃ └──────────────┘  └──────────────┘     └──────────────┘ ┃
   ┃   ColumnChunk       ColumnChunk          ColumnChunk    ┃
   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Row Group 1 ━┛
   ```
   
   Reading it requires loading all pages of all columns being read into the `InMemoryRowGroup`:
   
   ```
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃
   ┃ ┃ ┌──────────────┐  ┌──────────────┐     ┌──────────────┐ ┃ ┃
   ┃ ┃ │    Page 1    │  │    Page 1    │     │    Page 1    │ ┃ ┃
   ┃ ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃ ┃
   ┃ ┃ │    Page 2    │  │    Page 2    │ ... │    Page 2    │ ┃ ┃
   ┃ ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃ ┃
   ┃ ┃ │     ...      │  │     ...      │     │     ...      │ ┃ ┃
   ┃ ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃ ┃
   ┃ ┃ │    Page N    │  │    Page N    │     │    Page N    │ ┃ ┃
   ┃ ┃ └──────────────┘  └──────────────┘     └──────────────┘ ┃ ┃
   ┃ ┃   ColumnChunk       ColumnChunk          ColumnChunk    ┃ ┃
   ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Row Group 1 ━┛ ┃
   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ InMemoryRowGroup ━┛
   ```
   
   
   
   **Describe the solution you'd like**
   
   TL;DR: only read a subset of the pages into memory at any time, and perform additional I/O to fetch the others as they are needed.
   
   ```
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃
   ┃ ┃ ┌──────────────┐  ┌──────────────┐     ┌──────────────┐ ┃ ┃
   ┃ ┃ │    Page 1    │  │    Page 1    │     │    Page 1    │ ┃ ┃
   ┃ ┃ ├──────────────┤  ├──────────────┤     ├──────────────┤ ┃ ┃
   ┃ ┃ │    Page 2    │  │    Page 2    │ ... │    Page 2    │ ┃ ┃
   ┃ ┃ └──────────────┘  └──────────────┘     └──────────────┘ ┃ ┃
   ┃ ┃   ColumnChunk       ColumnChunk          ColumnChunk    ┃ ┃
   ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Row Group 1 ━┛ ┃
   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ InMemoryRowGroup ━┛
   ```
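
   To illustrate the idea only (this is not the actual `InMemoryRowGroup` code; the `PageWindow` type and the `fetch_ranges` helper below are hypothetical stand-ins for `ObjectStore::get_ranges`-style I/O), the reader could keep a bounded window of pages buffered per column chunk and refill it as decoding progresses:

   ```rust
   use std::collections::VecDeque;
   use std::future::Future;
   use std::ops::Range;

   use bytes::Bytes;

   /// Hypothetical per-column-chunk page buffer: holds at most roughly
   /// `max_buffered_bytes` of page data and refills itself with more I/O.
   struct PageWindow {
       /// Byte ranges of pages not yet fetched (known from the page
       /// locations in the offset index).
       remaining: VecDeque<Range<u64>>,
       /// Pages currently buffered in memory, in decode order.
       buffered: VecDeque<Bytes>,
       /// Soft limit, i.e. the proposed `row_group_page_buffer_size`.
       max_buffered_bytes: u64,
   }

   impl PageWindow {
       /// Pop the next page, fetching another window of pages when the buffer
       /// runs dry. `fetch_ranges` stands in for `ObjectStore::get_ranges`.
       async fn next_page<F, Fut>(&mut self, fetch_ranges: F) -> Option<Bytes>
       where
           F: FnOnce(Vec<Range<u64>>) -> Fut,
           Fut: Future<Output = Vec<Bytes>>,
       {
           if self.buffered.is_empty() && !self.remaining.is_empty() {
               // Take page ranges until the buffer budget is reached, always
               // taking at least one page (a single huge page may exceed it).
               let mut window = Vec::new();
               let mut bytes = 0;
               while let Some(range) = self.remaining.pop_front() {
                   let len = range.end - range.start;
                   if !window.is_empty() && bytes + len > self.max_buffered_bytes {
                       self.remaining.push_front(range);
                       break;
                   }
                   bytes += len;
                   window.push(range);
               }
               // One get_ranges-style request fetches the whole window.
               self.buffered.extend(fetch_ranges(window).await);
           }
           self.buffered.pop_front()
       }
   }
   ```

   With a window like this, the number of requests per column chunk grows as the buffer budget shrinks, which is exactly the tradeoff quantified in the table below.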
   
   **Describe alternatives you've considered**
   
   Of course there is a tradeoff between buffer usage and the number of I/O requests, so there is unlikely to be one setting that works for all use cases. For the example above, the two extremes, plus an intermediate point, are:
   
   | Target | I/Os required | Buffer required |
   |--------|--------|--------|
   | Minimize I/O requests (behavior today) | 1 request (`ObjectStore::get_ranges` call) | 60MB buffer |
   | Minimize buffering | 500 requests (one for each page) | ~1MB buffer (60MB / 50 pages) |
   | Intermediate buffering | 10 requests (get 5 pages per column per request) | 10MB buffer |
   
   So I would propose adding a new optional setting, such as `row_group_page_buffer_size`, that the reader would respect: it would try to limit its buffer usage to `row_group_page_buffer_size`, ideally while still fetching multiple pages per request.
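
   For illustration, usage might look something like the following (`with_row_group_page_buffer_size` does not exist today; it is just the proposed setting spelled as a builder method):

   ```rust
   use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
   use tokio::fs::File;

   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("input_0.parquet").await?;
       // Hypothetical/proposed setting: cap buffered page data at ~10MB per
       // row group and issue additional range requests for the later pages.
       let _stream = ParquetRecordBatchStreamBuilder::new(file)
           .await?
           .with_row_group_page_buffer_size(10 * 1024 * 1024)
           .build()?;
       Ok(())
   }
   ```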
   
   Large pages (e.g. large string data) are likely to make it impossible to always stay within a particular buffer size.
   
   
   
   **Additional context**
   

