tustvold commented on code in PR #8733:
URL: https://github.com/apache/arrow-rs/pull/8733#discussion_r2483677316


##########
parquet/src/column/reader.rs:
##########
@@ -403,7 +465,27 @@ where
     /// Returns false if there's no page left.
     fn read_new_page(&mut self) -> Result<bool> {
         loop {
-            match self.page_reader.get_next_page()? {
+            let page_result = match self.page_reader.get_next_page() {
+                Ok(page) => page,
+                Err(err) => {
+                    return match err {
+                        ParquetError::General(message)
+                            if message
+                                .starts_with("Invalid offset in sparse column chunk data:") =>
+                        {
+                            let metadata = self.page_reader.peek_next_page()?;
+                            // Some writers omit data pages for sparse column chunks and encode the gap
+                            // as a reader-visible error. Use the metadata peek to synthesise a page of
+                            // null definition levels so downstream consumers see consistent row counts.
+                            self.try_create_synthetic_page(metadata)?;

Review Comment:
   Additionally, I think it would imply that the predicate pushdown is 
"reversing" earlier forms of pushdown and relying on the IO implementation 
having chosen to do a sparse read. This feels unfortunate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

