tustvold commented on a change in pull request #1180:
URL: https://github.com/apache/arrow-rs/pull/1180#discussion_r788857097



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -167,7 +167,13 @@ where
                 break;
             }
 
-            let batch_size = max(num_records - records_read, MIN_BATCH_SIZE);
+            // If repetition levels present, we don't know how much more to read

Review comment:
      I think this change warrants some explanation:
   
    Parquet effectively flattens nested data when storing it, and stores 
sufficient information in the definition and repetition levels to reconstruct 
the original data. This presents a challenge for readers that want to read a 
certain number of rows, as a row count can't be mapped to a value count in the 
presence of repeated fields: a single row may contain multiple values from the 
underlying leaf column.
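    To make the row/value distinction concrete, here is a minimal sketch (the helper name `count_records` is illustrative, not the actual parquet-rs API): a new record begins wherever the repetition level is 0, so the record count of a flattened run of values is the number of zero levels.

```rust
// Illustrative helper: count complete records in a run of repetition levels.
// A repetition level of 0 marks the start of a new record; levels > 0 mark
// additional values repeated within the same record.
fn count_records(rep_levels: &[i16]) -> usize {
    rep_levels.iter().filter(|&&l| l == 0).count()
}

fn main() {
    // Three rows of a repeated leaf column, flattened to five leaf values:
    // row 1 = [a, b], row 2 = [c], row 3 = [d, e]
    let rep_levels = [0i16, 1, 0, 0, 1];
    assert_eq!(count_records(&rep_levels), 3);
}
```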
   
    RecordReader handles this by reading data into an internal buffer and then 
splitting off the requested number of semantic records. Previously this logic 
would always read at least MIN_BATCH_SIZE values, even when the exact number of 
values to read was known because there are no repetition levels.
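    A sketch of the changed batch-size choice (names and signature are illustrative, not the exact parquet-rs internals): with repetition levels we can't map records to values, so we read at least MIN_BATCH_SIZE; without them one record is one value, so we can read exactly what remains.

```rust
const MIN_BATCH_SIZE: usize = 1024;

// Illustrative sketch of the batch-size computation after this change.
// `has_rep_levels` stands in for "the column is nested/repeated".
fn batch_size(num_records: usize, records_read: usize, has_rep_levels: bool) -> usize {
    let remaining = num_records - records_read;
    if has_rep_levels {
        // Record count doesn't map to value count: over-read to amortize cost.
        remaining.max(MIN_BATCH_SIZE)
    } else {
        // One record == one value: read exactly the remainder.
        remaining
    }
}

fn main() {
    assert_eq!(batch_size(100, 40, true), 1024); // padded up to MIN_BATCH_SIZE
    assert_eq!(batch_size(100, 40, false), 60);  // exact remainder
}
```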
   
    Unfortunately this can result in a situation where the buffer is never 
fully drained. Consider a client requesting batches of 1024 records (the same 
as MIN_BATCH_SIZE). When crossing a row group boundary, the reader first reads 
the remainder `r` from the current row group, then 1024 values from the next, 
leaving `1024 + r` in the buffer. The client then splits off only the 1024 
records it actually wanted. This repeats indefinitely, with the buffer always 
containing at least `r` values.
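    The scenario above can be simulated with a toy model of the old read loop (illustrative only, not the real parquet-rs code): each request for `n` records reads `max(n - buffered, MIN_BATCH_SIZE)` values, then splits off `n`, and the residual `r` is reproduced on every request.

```rust
const MIN_BATCH_SIZE: usize = 1024;

// Toy model of the old behavior: returns how many values remain buffered
// after servicing a request for `n` records, starting with `buffered` values.
fn leftover_after_request(buffered: usize, n: usize) -> usize {
    // Old logic always read at least MIN_BATCH_SIZE more values.
    let read = n.saturating_sub(buffered).max(MIN_BATCH_SIZE);
    buffered + read - n
}

fn main() {
    // Crossing a row group boundary leaves r = 100 values in the buffer.
    let mut buffered = 100;
    for _ in 0..5 {
        buffered = leftover_after_request(buffered, 1024);
    }
    // The residual never drains: r values remain after every request.
    assert_eq!(buffered, 100);
}
```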
   
    Aside from wasting cycles splitting and shuffling buffers unnecessarily, 
this prevents dictionary preservation from functioning correctly: the buffer is 
never emptied, so a new dictionary can never be registered.
   
    With this change, the degenerate case still exists for repeated fields (as 
it did before), but it is avoided in the more common case of a non-repeated 
field.



