tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913082760


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   RecordReader is a bit of an odd cookie, so let me try to explain what it is doing.
   
   In the absence of repetition levels, it can simply read batch_size levels and the corresponding number of values.
   
   However, if repetition levels are present, it will likely need to read more than batch_size levels in order to read batch_size actual records (rows).
   
   To achieve this, it reads into its internal buffer and then splits off the data corresponding to batch_size rows, leaving the excess behind.
   
   It is this excess data, which has been read into its buffers but not yet yielded to the caller, that we must consume here.
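
To make the buffering concrete, here is a rough sketch of the idea. It is not the actual RecordReader code: `ToyRecordBuffer` and its `split_off` method are hypothetical names for illustration, and its `count_records` is a toy that only shares a name with the method in the diff. It assumes the usual Parquet convention that a repetition level of 0 marks the start of a new record, and it ignores definition levels, values, and end-of-data handling.

```rust
/// Hypothetical, simplified stand-in for the buffering described above.
/// It tracks repetition levels only; the real reader also buffers
/// definition levels and values.
struct ToyRecordBuffer {
    /// Repetition levels that have been read but not yet yielded.
    rep_levels: Vec<i16>,
}

impl ToyRecordBuffer {
    /// Count up to `max_records` complete records in the buffer.
    ///
    /// Returns `(records, levels)`, where `levels` is the number of
    /// buffered levels those records span. A record only counts as
    /// complete once the start of the next record (a level of 0) is seen.
    fn count_records(&self, max_records: usize) -> (usize, usize) {
        let mut records = 0;
        let mut end = 0;
        // Skip index 0: it starts the first record rather than ending one.
        for (idx, &level) in self.rep_levels.iter().enumerate().skip(1) {
            if records == max_records {
                break;
            }
            if level == 0 {
                records += 1;
                end = idx;
            }
        }
        (records, end)
    }

    /// Split off the levels for up to `max_records` complete records,
    /// leaving the excess in the buffer.
    fn split_off(&mut self, max_records: usize) -> (usize, Vec<i16>) {
        let (records, levels) = self.count_records(max_records);
        // `Vec::split_off` returns the tail; the head stays in `self`.
        let excess = self.rep_levels.split_off(levels);
        let yielded = std::mem::replace(&mut self.rep_levels, excess);
        (records, yielded)
    }
}
```

With buffered levels `[0, 1, 1, 0, 1, 0, 1, 1]` and a request for 2 records, the sketch yields `[0, 1, 1, 0, 1]` and keeps `[0, 1, 1]` behind, since nothing yet confirms that the third record is complete. That leftover is the kind of excess a skip has to account for before touching the underlying pages.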



