etseidl commented on code in PR #8714:
URL: https://github.com/apache/arrow-rs/pull/8714#discussion_r2466945034
##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -669,9 +696,46 @@ fn read_row_group(
Ok(row_group)
}
+/// Extract the metadata index from the footer bytes. `buf` should contain the
+/// entire footer.
+pub(crate) fn get_metadata_index(buf: &[u8]) -> Result<Option<MetaIndex>> {
+ // TODO(ets): need constants to get rid of magic numbers
+ if buf.len() < 13 {
+ return Ok(None);
+ }
+ // check the last 4 bytes to see if we have the full footer or not
+ let magic = &buf[buf.len() - 4..];
+ let buf = if magic == "PAR1".as_bytes() {
+ &buf[0..buf.len() - 8]
+ } else {
+ buf
+ };
+
+ // check for PARI followed by 0.
+ if buf[buf.len() - 1] != 0 {
+ return Ok(None);
+ }
+ let magic = &buf[buf.len() - 5..buf.len() - 1];
+ if magic != "PARI".as_bytes() {
Review Comment:
This uses the same approach as the flatbuffers proposal. The index becomes
binary field 32767 in the FileMetaData struct. Old readers will simply skip
this field without parsing it, then hit end-of-struct and return. I'm modifying
the current reader to search backwards a short distance for the "PARI" marker
and, if it is found, to parse that field first to get the index. That part
seems to be working 😄
So the file layout is:
```
(.. data pages ..)
(.. page indexes ..)
(.. current footer, not terminated ..)
(.. field 32767 marker ..)
(.. thrift encoded MetaIndex|len|PARI ..)
(.. 0 (to end FileMetaData) ..)
(.. PAR1 w/ len ..)
```
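To make the trailer check concrete, here is a minimal sketch of pulling the MetaIndex bytes out of a footer buffer laid out as above. It is not the PR's code. The function name `meta_index_bytes` is hypothetical, and the width and byte order of `len` (a 4-byte little-endian `u32`) are assumptions, since the layout above does not state them. It also checks fixed offsets, as the quoted hunk does, rather than searching backwards.

```rust
/// Trailing magic of a complete Parquet file.
const PAR1: &[u8] = b"PAR1";
/// Marker that closes the index field in the layout above.
const PARI: &[u8] = b"PARI";
/// ASSUMPTION: `len` is a 4-byte little-endian u32; the comment does not say.
const LEN_SIZE: usize = 4;

/// Hypothetical helper: returns the thrift-encoded MetaIndex bytes if `buf`
/// ends with the trailer sketched above, or `None` otherwise.
fn meta_index_bytes(buf: &[u8]) -> Option<&[u8]> {
    // Drop the 4-byte footer length and "PAR1" if the full footer was passed.
    let buf = if buf.len() >= 8 && buf.ends_with(PAR1) {
        &buf[..buf.len() - 8]
    } else {
        buf
    };

    // The last byte must be the thrift stop byte (0) that ends FileMetaData.
    let (&stop, rest) = buf.split_last()?;
    if stop != 0 {
        return None;
    }

    // "PARI" must sit directly in front of the stop byte.
    let rest = rest.strip_suffix(PARI)?;

    // Read the assumed length field, then slice out the payload before it.
    if rest.len() < LEN_SIZE {
        return None;
    }
    let (rest, l) = rest.split_at(rest.len() - LEN_SIZE);
    let len = u32::from_le_bytes([l[0], l[1], l[2], l[3]]) as usize;
    let start = rest.len().checked_sub(len)?;
    Some(&rest[start..])
}
```

Returning `None` for every mismatch keeps files written without the index on the normal footer path, which matches the `Ok(None)` early returns in the hunk.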