yordan-pavlov commented on a change in pull request #692:
URL: https://github.com/apache/arrow-rs/pull/692#discussion_r697005231



##########
File path: parquet/src/arrow/arrow_array_reader.rs
##########
@@ -777,31 +829,118 @@ impl ValueDecoder for LevelValueDecoder {
         &mut self,
         num_values: usize,
         read_bytes: &mut dyn FnMut(&[u8], usize),
-    ) -> Result<usize> {
+    ) -> Result<(usize, usize)> {
         let value_size = std::mem::size_of::<i16>();
         let mut total_values_read = 0;
-        while total_values_read < num_values {
-            let values_to_read = std::cmp::min(
-                num_values - total_values_read,
-                self.level_value_buffer.len(),
-            );
-            let values_read = match self
-                .level_decoder
-                .get(&mut self.level_value_buffer[..values_to_read])
-            {
-                Ok(values_read) => values_read,
-                Err(e) => return Err(e),
-            };
-            if values_read > 0 {
-                let level_value_bytes =
-                    &self.level_value_buffer.to_byte_slice()[..values_read * 
value_size];
-                read_bytes(level_value_bytes, values_read);
-                total_values_read += values_read;
-            } else {
-                break;
+
+        // When reading repetition levels, num_values will likely be less than 
the array values
+        // needed, e.g. a list array with [[0, 1], [2, 3]] has 2 values, but 
needs 4 level values
+        // to be read. We have to count the number of records read by checking 
where repetition = 0 to denote
+        // the start of a list slot.
+        // Thus we separate the logic for repetitions and definitions,
+        // as we do not need to do this for them.
+        // In the example above, we could have `num_values` being 1 because we 
want to only read
+        // one value, but we will return that we need to read 2 values to fill 
that 1 list slot.
+        match self.level_type {
+            LevelType::Definition => {
+                while total_values_read < num_values {
+                    let values_to_read = std::cmp::min(
+                        num_values - total_values_read,
+                        self.level_value_buffer.len(),
+                    );
+                    let values_read = match self
+                        .level_decoder
+                        .get(&mut self.level_value_buffer[..values_to_read])
+                    {
+                        Ok(values_read) => values_read,
+                        Err(e) => return Err(e),
+                    };
+                    if values_read > 0 {
+                        let level_value_bytes = 
&self.level_value_buffer.to_byte_slice()
+                            [..values_read * value_size];
+                        read_bytes(level_value_bytes, values_read);
+                        total_values_read += values_read;
+                    } else {
+                        break;
+                    }
+                }
+                Ok((total_values_read, total_values_read))
+            }
+            LevelType::Repetition => {
+                let mut carried_over_levels = 0;
+                let mut record_values_read = 0;
+                while total_values_read < num_values {

Review comment:
       should this be `while record_values_read < num_values {` ? It's not very 
clear what the intention is here? - is it to read enough repetition level 
values for `num_values` full list of values? It's also not very clear which 
variables represent lists of values  and which variables represent values 
within lists; maybe some more comments will help?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to