tustvold commented on a change in pull request #1180: URL: https://github.com/apache/arrow-rs/pull/1180#discussion_r788857097
########## File path: parquet/src/arrow/record_reader.rs ########## @@ -167,7 +167,13 @@ where break; } - let batch_size = max(num_records - records_read, MIN_BATCH_SIZE); + // If repetition levels present, we don't know how much more to read Review comment: This change I think warrants some explanation: Parquet effectively flattens nested data when storing it, and stores sufficient information in the definition levels and repetition levels to reconstruct the original data. This presents a challenge for readers that want to read a certain number of rows, as a row count can't be mapped to a value count in the event of repeated fields - a single row might contain multiple values from the underlying leaf column. RecordReader acts to handle this, by reading data to an internal buffer and then splitting off a certain number of semantic records. Previously this logic would always read MIN_BATCH_SIZE rows, even if the exact number of values to read is actually known as there are no repetition levels. Unfortunately this can result in a situation where the buffers are never fully drained, consider the scenario where the user is requesting batches of 1024 (the same as MIN_BATCH_SIZE). When transitioning across a row group boundary, this will read some remainder from the row group `r`, before reading 1024 from the next row group, leaving `1024 + r` in the buffer. The client will then only split off the `1024` they actually wanted. This will continue indefinitely, with the buffer continuing to contain at least `r` values. Aside from wasting cycles splitting and shuffling buffers unnecessarily, this prevents dictionary preservation from functioning correctly as the buffer will never be emptied allowing a new dictionary to be registered. With this change, this degenerate case still exists with repeated fields (as it did before) but is avoided for the more common case of a non-repeated field. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org