Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

via GitHub Sat, 21 Jun 2025 02:31:04 -0700


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993484604


   > wow this is so cool!
   > 
   > I have a question (and I think it's worth adding to the comment for people 
like me that's not familiar with parquet internals): How does it ensure that 
this extra index can be safely ignored by other readers? If another parquet 
reader implementation decides to do a sequential whole file scan, will it read 
into the extra custom data?
   
   Thank you for the review and great question, @2010YOUY01!
   
   **Short answer:**  
   Because we write our custom index after the data pages but *before* the 
Parquet footer, and never modify the existing metadata schema, Parquet readers 
will still:
   
   1.  Seek to the **end of file** and read the last 8 bytes, which consist of: 
 
       - A 4‑byte little‑endian footer length  
       - The magic marker `PAR1`  
   2.  Jump back by that length to parse the Thrift‑encoded footer (and its 
key‑value list).  
   
   Any bytes we write *ahead* of the footer (i.e. after the data pages but 
before writing footer and magic) are never touched by steps (1) and (2): 
readers do not scan forward from the file start. They locate the footer via 
the trailer magic and length, and then read column chunks only at the offsets 
recorded in the footer metadata, so bytes that no offset points to are never 
visited.  
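   
   As a rough illustration of steps (1) and (2), here is a minimal sketch that 
uses only the Rust standard library. The function name `locate_footer` is 
hypothetical and is not part of the `parquet` crate; real readers also decode 
the Thrift footer, which is left out here.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic marker that ends every Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";
/// Trailer size: 4-byte little-endian footer length + 4-byte magic.
const TRAILER_LEN: u64 = 8;

/// Hypothetical helper: returns `(footer_start, footer_len)` in bytes.
/// It only inspects the last 8 bytes of the input, so whatever sits
/// between the data pages and the footer is never read here.
pub fn locate_footer<R: Read + Seek>(reader: &mut R) -> io::Result<(u64, u64)> {
    // Step 1: seek to the end of the file and read the 8-byte trailer.
    let file_len = reader.seek(SeekFrom::End(0))?;
    if file_len < TRAILER_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too short"));
    }
    reader.seek(SeekFrom::Start(file_len - TRAILER_LEN))?;
    let mut trailer = [0u8; 8];
    reader.read_exact(&mut trailer)?;
    if trailer[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }

    // Step 2: jump back by the footer length to find where the footer starts.
    let footer_len =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as u64;
    let footer_start = (file_len - TRAILER_LEN)
        .checked_sub(footer_len)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "footer length too large"))?;
    Ok((footer_start, footer_len))
}
```

   The returned range covers only the footer, so the custom blob stored in 
front of it is outside anything this lookup reads.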
   
   **Why key/value metadata is safe:**  
   - We only **add** two new keys (`distinct_index_offset` and 
`distinct_index_length`) into the existing footer metadata map.  
   - All standard readers will see unknown keys and either ignore them or 
surface them as “extra metadata,” but they will not attempt to deserialize our 
custom binary blob.  
   - On our side, we:
   
      1. Read the Parquet footer as usual.  
      2. Extract our two key/value entries for offset+length.  
      3. `seek(offset)` + `read_exact(length)` to load the custom index and 
deserialize it.  
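   
   The three steps above could look roughly like the sketch below. It assumes 
the footer key/value metadata has already been collected into a 
`HashMap<String, String>` and that both values are stored as decimal strings; 
the helper names `read_index_bytes` and `parse_u64` are hypothetical, and 
deserializing the returned bytes is left to the caller.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: parse a decimal `u64` stored under `key`.
/// Storing the values as decimal strings is an assumption of this sketch.
fn parse_u64(kv: &HashMap<String, String>, key: &str) -> io::Result<u64> {
    let raw = kv.get(key).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::NotFound,
            format!("missing metadata key `{}`", key),
        )
    })?;
    raw.parse::<u64>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Hypothetical helper: load the raw bytes of the custom index using the
/// two keys named in the comment above.
pub fn read_index_bytes<R: Read + Seek>(
    reader: &mut R,
    kv: &HashMap<String, String>,
) -> io::Result<Vec<u8>> {
    // Step 2: extract offset and length from the key/value metadata.
    let offset = parse_u64(kv, "distinct_index_offset")?;
    let length = parse_u64(kv, "distinct_index_length")?;
    let length = usize::try_from(length)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    // Step 3: seek(offset) + read_exact(length).
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

   A reader that does not know these two keys never calls anything like 
`read_index_bytes`, which is why the blob stays invisible to it.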
   
   Because every compliant Parquet reader must interpret the `PAR1` magic and 
footer length, none of them will ever “spill over” into our blob or treat it as 
data pages.
   
   I’ll add these details to the code comments. We’re also planning a blog 
post on Parquet indexing internals, as suggested by @alamb. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


