Re: [PR] [POC] Metadata index for Parquet files [arrow-rs]

via GitHub Mon, 27 Oct 2025 14:08:31 -0700


etseidl commented on code in PR #8714:
URL: https://github.com/apache/arrow-rs/pull/8714#discussion_r2467093357



##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -669,9 +696,46 @@ fn read_row_group(
     Ok(row_group)
 }
 
+/// Extract the metadata index from the footer bytes. `buf` should contain the entire footer.
+pub(crate) fn get_metadata_index(buf: &[u8]) -> Result<Option<MetaIndex>> {
+    // TODO(ets): need constants to get rid of magic numbers
+    if buf.len() < 13 {
+        return Ok(None);
+    }
+    // check the last 4 bytes to see if we have the full footer or not
+    let magic = &buf[buf.len() - 4..];
+    let buf = if magic == "PAR1".as_bytes() {
+        &buf[0..buf.len() - 8]
+    } else {
+        buf
+    };
+
+    // check for PARI followed by 0.
+    if buf[buf.len() - 1] != 0 {
+        return Ok(None);
+    }
+    let magic = &buf[buf.len() - 5..buf.len() - 1];
+    if magic != "PARI".as_bytes() {

Review Comment:
   I just wanted to use the existing protocol extension recently added to 
parquet-format. It makes testing easy and shouldn't conflict with the 
flatbuffers stuff: they'll see the field but not recognize the footer, so they 
should ignore it.
   
   This brings up an issue I ran across while doing the remodel: the thrift 
implementation of skip for binary fields uses `read_string`. So when you try to 
skip a pure binary field that isn't all valid UTF-8, it throws an error. 56.0 
can't read my footer (nor will it be able to read the flatbuffer one). The 
Python thrift parser seems to suffer from the same problem. I guess it's time 
for a PR there, but that doesn't really help us with backwards compatibility.
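
   To make the skip problem concrete, here is a minimal sketch, not the actual 
thrift crate code. It assumes a compact-protocol style binary field (a varint 
length followed by that many bytes); all function names are hypothetical. One 
skip validates the payload as UTF-8, the other only advances past it.

```rust
/// Read an unsigned LEB128 varint, returning (value, bytes consumed).
fn read_uvarint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Locate the payload of a length-prefixed binary field as (start, end).
fn payload_range(buf: &[u8]) -> Result<(usize, usize), String> {
    let (len, start) = read_uvarint(buf).ok_or_else(|| "truncated varint".to_string())?;
    let remaining = (buf.len() - start) as u64;
    if len > remaining {
        return Err("truncated payload".to_string());
    }
    // `len <= remaining` guarantees the cast cannot truncate.
    Ok((start, start + len as usize))
}

/// Hypothetical skip that mirrors the problem: it decodes the payload as a
/// string, so arbitrary binary data makes the skip itself fail.
fn skip_binary_via_string(buf: &[u8]) -> Result<usize, String> {
    let (start, end) = payload_range(buf)?;
    std::str::from_utf8(&buf[start..end]).map_err(|e| e.to_string())?;
    Ok(end)
}

/// Hypothetical skip that only advances past the payload without looking at it.
fn skip_binary_by_length(buf: &[u8]) -> Result<usize, String> {
    let (_, end) = payload_range(buf)?;
    Ok(end)
}
```

   With a payload such as `[0xff, 0xfe, b'P', b'A']`, the string-based skip 
returns an error while the length-based skip just returns the offset of the 
next field, which is the behavior an older reader would need in order to ignore 
an unknown binary field.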



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
