etseidl commented on code in PR #8714:
URL: https://github.com/apache/arrow-rs/pull/8714#discussion_r2467093357
##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -669,9 +696,46 @@ fn read_row_group(
Ok(row_group)
}
+/// Extract the metadata index from the footer bytes. `buf` should contain the entire footer.
+pub(crate) fn get_metadata_index(buf: &[u8]) -> Result<Option<MetaIndex>> {
+ // TODO(ets): need constants to get rid of magic numbers
+ if buf.len() < 13 {
+ return Ok(None);
+ }
+ // check the last 4 bytes to see if we have the full footer or not
+ let magic = &buf[buf.len() - 4..];
+ let buf = if magic == "PAR1".as_bytes() {
+ &buf[0..buf.len() - 8]
+ } else {
+ buf
+ };
+
+ // check for PARI followed by 0.
+ if buf[buf.len() - 1] != 0 {
+ return Ok(None);
+ }
+ let magic = &buf[buf.len() - 5..buf.len() - 1];
+ if magic != "PARI".as_bytes() {
Review Comment:
I just wanted to use the existing protocol extension recently added to
parquet-format. It makes things easy for testing, and it shouldn't conflict
with the flatbuffers stuff: they'll see the field but not recognize the footer,
so they should ignore it.
This brings up an issue I ran across while doing the remodel: the thrift
implementation of skip for binary fields uses `read_string`. So when you try to
skip a pure binary field that isn't all valid UTF-8, it throws an error. 56.0
can't read my footer (nor will it be able to read the flatbuffer one). It seems
the Python thrift parser suffers from the same problem. I guess it's time for a
PR there, but that doesn't really help us with backwards compatibility.
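To illustrate the skip problem, here is a minimal standalone sketch, not the
actual thrift or parquet code. It assumes a binary field encoded the way the
compact protocol does it (a ULEB128 length followed by the payload). The names
`read_varint`, `payload_range`, `skip_binary_via_string` and `skip_binary_raw`
are hypothetical and only serve the example.

```rust
use std::convert::TryFrom;

/// Decode a ULEB128 varint, the length prefix written before binary and
/// string values. Returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Locate the payload of a length-prefixed field: (start, end) offsets.
fn payload_range(buf: &[u8]) -> Result<(usize, usize), String> {
    let (len, start) =
        read_varint(buf).ok_or_else(|| "bad length varint".to_string())?;
    let len = usize::try_from(len).map_err(|e| e.to_string())?;
    let end = start
        .checked_add(len)
        .filter(|&end| end <= buf.len())
        .ok_or_else(|| "truncated field".to_string())?;
    Ok((start, end))
}

/// Hypothetical skip that goes through a string read: the payload is
/// validated as UTF-8 even though the caller is about to discard it.
fn skip_binary_via_string(buf: &[u8]) -> Result<usize, String> {
    let (start, end) = payload_range(buf)?;
    std::str::from_utf8(&buf[start..end])
        .map_err(|e| format!("invalid UTF-8: {}", e))?;
    Ok(end)
}

/// Hypothetical skip that only advances past the payload, so the byte
/// contents are irrelevant.
fn skip_binary_raw(buf: &[u8]) -> Result<usize, String> {
    let (_, end) = payload_range(buf)?;
    Ok(end)
}
```

With a field such as `[4, 0xff, 0xfe, 0x00, 0x01]`, the string-based skip
returns an error while the raw skip simply advances past all 5 bytes. That is
the shape of the failure an older reader hits on a binary footer extension.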
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]