Re: [PR] [POC] Metadata index for Parquet files [arrow-rs]

via GitHub Tue, 28 Oct 2025 12:59:33 -0700


alamb commented on code in PR #8714:
URL: https://github.com/apache/arrow-rs/pull/8714#discussion_r2470876229



##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -669,9 +696,46 @@ fn read_row_group(
     Ok(row_group)
 }
 
+/// Extract the metadata index from the footer bytes. `buf` should contain the entire footer.
+pub(crate) fn get_metadata_index(buf: &[u8]) -> Result<Option<MetaIndex>> {
+    // TODO(ets): need constants to get rid of magic numbers
+    if buf.len() < 13 {
+        return Ok(None);
+    }
+    // check the last 4 bytes to see if we have the full footer or not
+    let magic = &buf[buf.len() - 4..];
+    let buf = if magic == "PAR1".as_bytes() {
+        &buf[0..buf.len() - 8]
+    } else {
+        buf
+    };
+
+    // check for PARI followed by 0.
+    if buf[buf.len() - 1] != 0 {
+        return Ok(None);
+    }
+    let magic = &buf[buf.len() - 5..buf.len() - 1];
+    if magic != "PARI".as_bytes() {

Review Comment:
   > The advantage of using field 32767 is that it's already included in the 
footer length. No changes necessary to code used to fetch the footer in as few 
GETs as possible. The footer just gets a bit larger.
   
   So the argument goes something like this: even if you have optimistically 
fetched a bunch of bytes in the hope of reading the entire footer in the first 
read, you could not guarantee that a second fetch would get the index too (you 
would have to do a second optimistic fetch or something).
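
A rough sketch of why the footer length matters here, assuming the standard Parquet trailer (4-byte little-endian metadata length followed by `PAR1`). The names `TRAILER_LEN`, `footer_len`, and `missing_bytes` are hypothetical and not part of this PR. Since an index stored in a thrift field of the metadata is counted by that length, a reader that has prefetched a suffix of the file can tell from the trailer alone whether it already holds everything, index included:

```rust
/// Size of the Parquet trailer: 4-byte little-endian metadata length + "PAR1".
/// (Hypothetical constant name, used only in this sketch.)
const TRAILER_LEN: usize = 8;

/// Read the metadata length from the last 8 bytes of a prefetched file suffix.
/// Returns `None` if the suffix is too short or does not end in "PAR1".
fn footer_len(suffix: &[u8]) -> Option<usize> {
    if suffix.len() < TRAILER_LEN {
        return None;
    }
    let trailer = &suffix[suffix.len() - TRAILER_LEN..];
    if &trailer[4..] != b"PAR1" {
        return None;
    }
    let len = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    Some(len as usize)
}

/// How many more bytes must be fetched so that the whole footer (and anything
/// embedded in it, such as an index field) is in memory. Zero means the
/// optimistic first read was enough.
fn missing_bytes(suffix: &[u8]) -> Option<usize> {
    let needed = footer_len(suffix)? + TRAILER_LEN;
    Some(needed.saturating_sub(suffix.len()))
}
```

If the index instead lived outside the region covered by the metadata length, `missing_bytes` returning zero would say nothing about whether the index had been fetched, which is the extra round trip the quoted comment is trying to avoid.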


