tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768116795
##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {
Review comment:
Users of `RecordReader` call `read_records` and then call
`consume_rep_levels` and friends to split the data out. As a result, it should
only ever buffer a little more than the `batch_size` passed to `read_records`.
I agree this API is not particularly intuitive; I created #1032 in part
because I felt these APIs were clearly not designed for external consumption. I
believe the funkiness arises because `ArrayReader` wants to be able to stitch
together column chunks from multiple row groups (i.e. multiple `PageReader`s)
into the same `RecordBatch`.
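
For context, a rough sketch of that call pattern. The module paths and exact
signatures are assumptions based on the crate-internal API around the time of
this PR, and `next_page_reader` is a hypothetical stand-in for however the
caller obtains the next row group's column chunk; it is not the actual
`ArrayReader` code.

```rust
// Sketch only: paths/signatures are assumptions, `next_page_reader` is hypothetical.
use parquet::arrow::record_reader::RecordReader;
use parquet::column::page::PageReader;
use parquet::data_type::DataType;
use parquet::errors::Result;

fn read_batch<T: DataType>(
    record_reader: &mut RecordReader<T>,
    mut next_page_reader: impl FnMut() -> Option<Box<dyn PageReader>>,
    batch_size: usize,
) -> Result<()> {
    let mut records_read = 0;
    while records_read < batch_size {
        // Buffer up to the remaining number of records from the current column chunk
        let read = record_reader.read_records(batch_size - records_read)?;
        records_read += read;

        if read == 0 {
            // Current chunk exhausted: stitch in the next row group's chunk, if any
            match next_page_reader() {
                Some(page_reader) => record_reader.set_page_reader(page_reader)?,
                None => break,
            }
        }
    }

    // Split the buffered data out; after this the reader should hold little
    // more than `batch_size` worth of values
    let _rep_levels = record_reader.consume_rep_levels()?;
    let _def_levels = record_reader.consume_def_levels()?;
    let _values = record_reader.consume_record_data()?;
    Ok(())
}
```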
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]