zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993484604
> wow this is so cool!
>
> I have a question (and I think it's worth adding to the comment for people like me that's not familiar with parquet internals): How does it ensure that this extra index can be safely ignored by other readers? If another parquet reader implementation decides to do a sequential whole file scan, will it read into the extra custom data?

Thank you for the review and great question, @2010YOUY01!

**Short answer:** Because we write our custom index *before* the Parquet footer and never modify the existing metadata schema, Parquet readers will still:

1. Seek to the **end of file** and read the last 8 bytes, which consist of:
   - A 4-byte little-endian footer length
   - The magic marker `PAR1`
2. Jump back by that length to parse the Thrift-encoded footer (and its key-value list).

Any bytes written *ahead* of the footer (i.e. after the data pages but before the footer and magic) are simply skipped by steps 1 and 2. Readers do not scan forward from the start of the file: they locate the footer via the trailer magic and length, and then read only the byte ranges that the footer metadata points to. Our blob is not referenced by any row group or column chunk, so it is never visited.

**Why key/value metadata is safe:**

- We only **add** two new keys (`distinct_index_offset` and `distinct_index_length`) to the existing footer metadata map.
- Standard readers will see unknown keys and either ignore them or surface them as “extra metadata,” but they will not attempt to deserialize our custom binary blob.
- On our side, we:
  1. Read the Parquet footer as usual.
  2. Extract our two key/value entries for offset and length.
  3. `seek(offset)` + `read_exact(length)` to load the custom index and deserialize it.

Because every compliant Parquet reader finds the footer through the `PAR1` magic and footer length, and reaches data pages only through offsets recorded in that footer, none of them will “spill over” into our blob or treat it as data pages.

I’ll add these details to the code comments. We’re also planning a blog post on Parquet indexing internals, as suggested by @alamb. Thanks!
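To make the layout concrete, here is a minimal sketch of the two read paths described above, using only the Rust standard library on a toy byte layout. It is not the code from this PR: `locate_footer`, `read_custom_index`, and `build_toy_file` are hypothetical names, the footer is treated as opaque bytes instead of real Thrift, and the offset and length are passed in directly where the real reader would take them from the `distinct_index_offset` and `distinct_index_length` entries.

```rust
use std::io::{self, Read, Seek, SeekFrom};

const MAGIC: &[u8; 4] = b"PAR1";

/// What any reader does first: read the 8-byte trailer and return
/// (footer_start, footer_len). Nothing before the footer is touched.
fn locate_footer<R: Read + Seek>(r: &mut R) -> io::Result<(u64, u64)> {
    let file_len = r.seek(SeekFrom::End(0))?;
    // Smallest possible layout: leading magic + footer length + trailing magic.
    if file_len < 12 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too small"));
    }
    r.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    r.read_exact(&mut trailer)?;
    if trailer[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    // First 4 trailer bytes: footer length, little-endian.
    let footer_len =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as u64;
    if footer_len > file_len - 12 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad footer length"));
    }
    Ok((file_len - 8 - footer_len, footer_len))
}

/// What the index-aware reader does in addition: seek to the recorded
/// offset and read exactly `length` bytes of the custom blob.
fn read_custom_index<R: Read + Seek>(
    r: &mut R,
    offset: u64,
    length: usize,
) -> io::Result<Vec<u8>> {
    r.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// Toy layout (assumption, simplified): magic, data pages, custom index,
/// footer, footer length, magic. Returns the bytes and the index offset.
fn build_toy_file(data: &[u8], index: &[u8], footer: &[u8]) -> (Vec<u8>, u64) {
    let mut f = Vec::new();
    f.extend_from_slice(MAGIC);
    f.extend_from_slice(data);
    let index_offset = f.len() as u64;
    f.extend_from_slice(index);
    f.extend_from_slice(footer);
    f.extend_from_slice(&(footer.len() as u32).to_le_bytes());
    f.extend_from_slice(MAGIC);
    (f, index_offset)
}
```

Wrapping the bytes from `build_toy_file` in a `std::io::Cursor` and calling `locate_footer` returns a footer range that starts right after the custom index, which is why a reader that knows nothing about the index never lands inside it.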
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org