friendlymatthew opened a new issue, #9860:
URL: https://github.com/apache/arrow-rs/issues/9860

   I'm curious what the right way is to support custom, application-defined 
byte blocks in Parquet files.
   
   There are use cases where it's valuable to embed arbitrary byte ranges, such 
as bloom filter extensions, custom statistics, **secondary indexes**, or 
application-specific metadata, directly inside a Parquet file.
   
   Today, `key_value_metadata` can store small values, but it isn't designed 
for large binary blobs with efficient random access (readers must deserialize 
the entire footer to access them).
   
   I prototyped one approach that adds a new Thrift field to `FileMetaData`:
   
   ```thrift
   // field 10 of FileMetaData
   struct CustomBlock {
     1: required string name
     2: required i64 offset
     3: required i64 length
     4: optional string block_type
   }
   ```
   
   Blocks are written after the bloom filters and the column and offset 
indexes, but before the Thrift footer. Readers that don't recognize field 10 
skip it, so the file stays valid.
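
To make the layout concrete, here is a minimal sketch of the write and read path using only the Rust standard library. It is not the prototype's code: the `CustomBlock` struct is a hand-written mirror of the Thrift definition above, `write_block`, `read_block`, and `round_trip` are hypothetical helpers, and the "data pages" and "footer" are placeholder bytes rather than real Parquet structures.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom, Write};

/// Hypothetical Rust mirror of the proposed `CustomBlock` Thrift struct.
#[derive(Debug, Clone, PartialEq)]
struct CustomBlock {
    name: String,
    offset: i64,
    length: i64,
    block_type: Option<String>,
}

/// Append `payload` at the writer's current position and return its descriptor.
fn write_block<W: Write + Seek>(
    w: &mut W,
    name: &str,
    block_type: Option<&str>,
    payload: &[u8],
) -> io::Result<CustomBlock> {
    let offset = w.stream_position()?;
    w.write_all(payload)?;
    Ok(CustomBlock {
        name: name.to_string(),
        offset: offset as i64,
        length: payload.len() as i64,
        block_type: block_type.map(str::to_string),
    })
}

/// Seek straight to the block and read exactly `length` bytes.
fn read_block<R: Read + Seek>(r: &mut R, block: &CustomBlock) -> io::Result<Vec<u8>> {
    if block.offset < 0 || block.length < 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "negative offset or length",
        ));
    }
    r.seek(SeekFrom::Start(block.offset as u64))?;
    let mut buf = vec![0u8; block.length as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// Write placeholder file contents with one custom block, then read it back.
fn round_trip() -> io::Result<(CustomBlock, Vec<u8>)> {
    let mut file = Cursor::new(Vec::new());
    file.write_all(b"PAR1")?; // leading magic bytes
    file.write_all(&[0u8; 16])?; // placeholder for data pages, bloom filters, indexes
    let block = write_block(&mut file, "my_index", Some("secondary_index"), b"index-bytes")?;
    file.write_all(b"<footer>")?; // placeholder for the Thrift footer
    let bytes = read_block(&mut file, &block)?;
    Ok((block, bytes))
}

fn main() -> io::Result<()> {
    let (block, bytes) = round_trip()?;
    println!(
        "{} ({:?}) at offset {} with {} bytes: {}",
        block.name,
        block.block_type,
        block.offset,
        block.length,
        String::from_utf8_lossy(&bytes)
    );
    Ok(())
}
```

In this sketch the reader needs only the descriptor from the footer to fetch a block with a single seek and read, which is the random-access property that `key_value_metadata` lacks.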
   
   This seems to work, but it raises some open questions: is this the right 
mechanism, and is this even the right layer to change?
   
   # Motivation
   The main use case I'm exploring is storing precomputed secondary indexes 
inside Parquet files so that query engines can seek directly to them without 
external sidecar files. Keeping everything in one file reduces write 
amplification and simplifies cache invalidation.

