alamb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2273302430

   As a reminder, here is what the metadata looks like:
   
   ```text
   ┌──────────────────────┐                                
   │                      │                                
   │         ...          │                                
   │                      │                                
   │┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │                                
   │     ColumnIndex     ◀│─ ─ ─                           
   ││    (Optional)     │ │     │                          
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │                                
   │┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │     │ FileMetadata             
   │     OffsetIndex      │       contains embedded        
   ││    (Optional)     │◀┼ ─   │ offsets to               
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │  │    ColumnIndex and          
   │╔═══════════════════╗ │     │ OffsetIndex              
   │║                   ║ │  │                             
   │║                   ║ ┼ ─   │                          
   │║   FileMetadata    ║ │                                
   │║                   ║ ┼ ─ ─ ┘                          
   │║                   ║ │                                
   │╚═══════════════════╝ │                                
   │┌───────────────────┐ │                                
   ││  metadata length  │ │ length of FileMetadata  (only) 
   │└───────────────────┘ │                                
   │┌───────────────────┐ │                                
   ││      'PAR1'       │ │ Parquet Magic Bytes            
   │└───────────────────┘ │                                
   └──────────────────────┘                                
                                                           
        Output Buffer                                      
   ```
   
   
   
   
   ## How to read this today
   
   Using the code in https://github.com/apache/arrow-rs/pull/6081 as an 
example, here is the best way I have come up with for reading metadata without 
firing up a parquet file reader:
   
   Note this DOES NOT read the `ColumnIndex` / `OffsetIndex`, even if they 
are present.
   
   ```rust
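   use std::io::Read;
   use std::path::Path;
   
   use parquet::file::footer::{decode_footer, decode_metadata};
   use parquet::file::metadata::ParquetMetaData;
   
   /// Reads the metadata from a file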
   ///
   /// This function reads the format written by `write_metadata_to_file`
   fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
       let mut file = std::fs::File::open(file).unwrap();
       // This API is kind of awkward compared to the writer
       let mut buffer = Vec::new();
       file.read_to_end(&mut buffer).unwrap();
       let len = buffer.len();
   
       let mut footer = [0; 8];
       footer.copy_from_slice(&buffer[len - 8..len]);
   
       let md_length = decode_footer(&footer).unwrap();
       // note this also doesn't contain the OffsetIndex or ColumnIndex
       // the FileMetadata occupies the `md_length` bytes just before the 8-byte footer
       let metadata_buffer = &buffer[len - 8 - md_length..len - 8];
       decode_metadata(metadata_buffer).unwrap()
   }
   ```
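
For reference, the footer arithmetic in the function above can be sketched with only the standard library. This is a minimal sketch, not the parquet crate's implementation: `file_metadata_range` is a hypothetical helper, and it assumes an unencrypted file whose last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic, as in the diagram.

```rust
use std::ops::Range;

/// Footer size: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper (not part of the parquet crate): returns the byte range
/// of the serialized `FileMetadata` within `buffer`, or `None` if the buffer is
/// too short, the magic does not match, or the declared length does not fit.
fn file_metadata_range(buffer: &[u8]) -> Option<Range<usize>> {
    // The footer is the last 8 bytes of the buffer.
    let footer_start = buffer.len().checked_sub(FOOTER_SIZE)?;
    let footer = &buffer[footer_start..];

    // Bytes 4..8 of the footer must be the magic.
    if footer[4..] != PARQUET_MAGIC[..] {
        return None;
    }

    // Bytes 0..4 of the footer hold the length of FileMetadata (only).
    let md_length =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;

    // FileMetadata sits immediately before the footer.
    let md_start = footer_start.checked_sub(md_length)?;
    Some(md_start..footer_start)
}
```

Slicing `buffer` with the returned range yields the bytes one would then hand to `decode_metadata`.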
   
   ## Proposed API
   
   Here is how I would like to interact with the data (this would apply equally 
to metadata stored in in-memory blobs):
   
   ```rust
   /// Reads the metadata from a file
   ///
   /// This function reads the format written by `write_metadata_to_file`
   fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
       let mut file = std::fs::File::open(file).unwrap();
       // This API is kind of awkward compared to the writer
       let mut buffer = Vec::new();
       file.read_to_end(&mut buffer).unwrap();
   
       let decoder = ParquetMetaDataDecoder::new()
           // read the OffsetIndex and ColumnIndex, if present, populating
           // ParquetMetaData::column_index and ParquetMetaData::offset_index
           .with_page_index(true);
   
       decoder.decode(&buffer).unwrap()
   }
   ```
   
   
   ## Nuances
   1. Is this sufficient to coordinate / decode the footer from a parquet file 
itself?
   2. Since the `FileMetadata` structure has pointers / offsets into the 
buffer, if you don't have the entire file in memory you need to update the 
offsets relative to the slice you do have.
   3. How will we work this into the parquet metadata loader (which may need to 
fetch multiple buffers)?
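
To make the second nuance concrete, here is a minimal sketch of the offset adjustment using only the standard library. `range_in_slice` is a hypothetical helper, not an existing parquet API; it assumes the caller knows the absolute file position (`slice_start`) at which the in-memory slice begins.

```rust
use std::ops::Range;

/// Hypothetical helper: translate an absolute file range (`offset`, `length`),
/// such as the location of a ColumnIndex or OffsetIndex recorded in the
/// metadata, into a range within a slice that begins at file position
/// `slice_start` and is `slice_len` bytes long.
///
/// Returns `None` if the requested range is not fully contained in the slice.
fn range_in_slice(
    offset: u64,
    length: u64,
    slice_start: u64,
    slice_len: usize,
) -> Option<Range<usize>> {
    // Rebase the absolute offset onto the start of the slice.
    let start = offset.checked_sub(slice_start)?;
    let end = start.checked_add(length)?;

    // The whole range must fit inside the bytes we actually hold.
    if end > slice_len as u64 {
        return None;
    }

    // Both bounds are <= slice_len, so the casts cannot truncate.
    Some((start as usize)..(end as usize))
}
```

For example, if the slice holds file bytes 1000..1100, an index stored at absolute offset 1010 with length 20 maps to `10..30` within the slice, while anything that starts before byte 1000 or ends after byte 1100 would require fetching another buffer.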
   
   
   


