tustvold commented on a change in pull request #1180:
URL: https://github.com/apache/arrow-rs/pull/1180#discussion_r788857097
##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -167,7 +167,13 @@ where
break;
}
- let batch_size = max(num_records - records_read, MIN_BATCH_SIZE);
+ // If repetition levels present, we don't know how much more to
read
Review comment:
This change I think warrants some explanation:
Parquet effectively flattens nested data when storing it, and stores
sufficient information in the definition levels and repetition levels to
reconstruct the original data. This presents a challenge for readers that want
to read a certain number of rows, as a row count can't be mapped to a value
count in the event of repeated fields - a single row might contain multiple
values from the underlying leaf column.
RecordReader acts to handle this, by reading data to an internal buffer and
then splitting off a certain number of semantic records. Previously this logic
would always read MIN_BATCH_SIZE rows, even if the exact number of values to
read is actually known as there are no repetition levels.
Unfortunately this can result in a situation where the buffers are never
fully drained, consider the scenario where the user is requesting batches of
1024 (the same as MIN_BATCH_SIZE). When transitioning across a row group
boundary, this will some remainder from the row group `r`, before reading 1024
from the next row group, leaving `1024 + r` in the buffer. The client will the
only split off the `1024` they actually wanted. This will continue
indefinitely, with the buffer continuing to contain at least `r` values.
Aside from wasting cycles splitting and shuffling buffers unnecessarily,
this prevents dictionary preservation from functioning correctly as the buffer
will never be emptied allowing a new dictionary to be registered.
With this change, this degenerate case still exists with repeated fields (as
it did before) but is avoided for the more common case of a non-repeated field.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]