[I] [DISCUSSION] Parquet Metadata Improvements [arrow-rs]

via GitHub Fri, 26 Jul 2024 05:59:50 -0700


alamb opened a new issue, #6129:
URL: https://github.com/apache/arrow-rs/issues/6129


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   As we work on various features of Parquet metadata it is becoming clear that 
working with the current code organization is challenging.
   
   I wanted to write down some of my thoughts about how it all fits together.
   
   Here are some challenges:
   
   1. The naming is confusing: https://github.com/apache/arrow-rs/issues/6097
   2. There is no easy way to write the metadata to bytes outside the context of a parquet file: https://github.com/apache/arrow-rs/pull/6000
   3. It is complicated to understand how to read optional parts of the metadata that are not inlined (e.g. OffsetIndexes): https://github.com/apache/arrow-rs/pull/5887
   4. If we ever wanted to speed up metadata handling (e.g. https://github.com/apache/arrow-rs/issues/5854), it would be hard with the current structure
   5. There is not always a 1-1 correspondence between `file::metadata` and the thrift structures in `format::metadata`
   
   **Describe the solution you'd like**
   I would like to propose:
   1. We continue to clarify the distinction between `file::metadata` and 
`format::metadata` 
   2. Improve the API to translate back and forth between them and bytes and 
de-emphasize the conversion between thrift structures
   
   
   Maybe this is clear to others, but it is not to me.
   
   Here is how I see the structures involved:
   
   ```text
                                 ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐              ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                                   ┌──────────────────┐                   ┌───────────────────────┐
                                 │ │   ColumnIndex    │  │              │ │    ParquetMetaData    │ │
                                   └──────────────────┘                   └───────────────────────┘
   ┌──────────────┐              │ ┌──────────────────┐  │              │ ┌───────────────────────┐ │
   │   ..0x24..   │  ◀────────▶    │   OffsetIndex    │     ◀────────▶    │    ParquetMetaData    │
   └──────────────┘              │ └──────────────────┘  │              │ └───────────────────────┘ │
                                           ...                                       ...
    bytes                        │ ┌──────────────────┐  │              │ ┌──────────────────┐      │
    (thrift encoded)               │  FileMetaData*   │                   │  FileMetaData*   │
                                 │ └──────────────────┘  │              │ └──────────────────┘      │
                                 └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘              └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

                                  format::meta structures                 file::metadata structures

                                           * Same name, different struct
   ```
   
   I would like to focus on improving the API for going back and forth between
   bytes and the `file::metadata` structures:
   
   
   
   ```
                                                   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                                                     ┌───────────────────────┐
   ┌──────────────┐                                │ │    ParquetMetaData    │ │
   │   ..0x24..   │          ◀────────▶              └───────────────────────┘
   └──────────────┘                                │ ┌───────────────────────┐ │
                                                     │    ParquetMetaData    │
                         Would like to focus       │ └───────────────────────┘ │
    bytes                on this API to/from
    (thrift encoded)     bytes and the             │ ┌──────────────────┐      │
                         file::metadata              │  FileMetaData*   │
                                                   │ └──────────────────┘      │
                                                   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

                                                     file::metadata structures
   ```
   
   **Describe alternatives you've considered**
   I think we probably need at least two different APIs:
   
   # Reading
   1. One that reads from `[u8]` buffered in memory ([decode_footer](https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_footer.html) and [decode_metadata](https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_metadata.html))
   2. One that reads from an `AsyncReader` or something equivalent ([`MetadataLoader`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html) may be enough, or may need some more information)
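
To make the in-memory reading case concrete, here is a minimal, std-only sketch of the first step such an API has to perform: finding the thrift-encoded metadata from the 8-byte footer (a 4-byte little-endian length followed by the `PAR1` magic). This is not the crate's implementation; `metadata_range` and `example_file` are hypothetical names used only for illustration, and decoding the thrift bytes themselves is out of scope here.

```rust
use std::ops::Range;

/// Magic bytes at the end of a Parquet file with a plain (non-encrypted) footer.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";
/// Footer size: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;

/// Hypothetical helper: returns the byte range of the thrift-encoded
/// metadata inside `file`, which holds the complete file (or its tail).
fn metadata_range(file: &[u8]) -> Result<Range<usize>, String> {
    if file.len() < FOOTER_SIZE {
        return Err(format!("buffer too small: {} bytes", file.len()));
    }
    if !file.ends_with(&PARQUET_MAGIC) {
        return Err("missing PAR1 magic at end of buffer".to_string());
    }
    let footer_start = file.len() - FOOTER_SIZE;
    let f = &file[footer_start..];
    // The metadata length is stored as a little-endian 32-bit integer.
    let len = u32::from_le_bytes([f[0], f[1], f[2], f[3]]) as usize;
    if len > footer_start {
        return Err(format!(
            "metadata length {} exceeds the {} bytes available",
            len, footer_start
        ));
    }
    Ok(footer_start - len..footer_start)
}

/// Hypothetical toy buffer: 3 data bytes, 5 bytes standing in for the
/// thrift-encoded metadata, then the 8-byte footer.
fn example_file() -> Vec<u8> {
    let mut file = b"xyz".to_vec();
    file.extend_from_slice(b"META!");
    file.extend_from_slice(&5u32.to_le_bytes());
    file.extend_from_slice(&PARQUET_MAGIC);
    file
}
```

Calling `metadata_range(&example_file())` yields the range covering the `META!` placeholder; a real reader would hand that slice to the thrift decoder.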
   
   # Writing 
   1. Writes to `[u8]` (https://github.com/apache/arrow-rs/issues/6002)
   2. Writes to an `AsyncWriter` perhaps
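
As a rough illustration of the writing side, the sketch below appends already-encoded metadata bytes followed by the 8-byte footer (4-byte little-endian length, then `PAR1`) to any `std::io::Write` sink, so the same function covers a `Vec<u8>` and other synchronous writers. It assumes the thrift encoding has been done elsewhere; `write_metadata_with_footer` is a hypothetical name, not an existing arrow-rs API.

```rust
use std::io::{self, Write};

/// Magic bytes at the end of a Parquet file with a plain (non-encrypted) footer.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper: writes `encoded_metadata` (assumed to be
/// thrift-encoded already) followed by the 8-byte footer to `out`.
fn write_metadata_with_footer<W: Write>(encoded_metadata: &[u8], mut out: W) -> io::Result<()> {
    // The footer stores the metadata length in 32 bits.
    if encoded_metadata.len() > u32::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "metadata does not fit in a 32-bit length",
        ));
    }
    let len = encoded_metadata.len() as u32;
    out.write_all(encoded_metadata)?;
    out.write_all(&len.to_le_bytes())?;
    out.write_all(&PARQUET_MAGIC)?;
    Ok(())
}
```

Passing `&mut Vec<u8>` as `out` gives the "writes to `[u8]`" variant; an async variant would need a different trait bound and is not sketched here.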
   
   
   **Additional context**
   

