[GitHub] [arrow-rs] tustvold commented on a change in pull request #1180: Preserve dictionary encoding when decoding parquet into Arrow arrays (#171)

GitBox Thu, 20 Jan 2022 07:09:04 -0800


tustvold commented on a change in pull request #1180:
URL: https://github.com/apache/arrow-rs/pull/1180#discussion_r788857097




##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -167,7 +167,13 @@ where
                 break;
             }
 
-            let batch_size = max(num_records - records_read, MIN_BATCH_SIZE);
+            // If repetition levels present, we don't know how much more to 
read

Review comment:
       This change I think warrants some explanation:
   
   Parquet effectively flattens nested data when storing it, and stores 
sufficient information in the definition levels and repetition levels to 
reconstruct the original data. This presents a challenge for readers that want 
to read a certain number of rows, as a row count can't be mapped to a value 
count in the event of repeated fields - a single row might contain multiple 
values from the underlying leaf column.
   
   RecordReader acts to handle this, by reading data to an internal buffer and 
then splitting off a certain number of semantic records. Previously this logic 
would always read MIN_BATCH_SIZE rows, even if the exact number of values to 
read is actually known as there are no repetition levels.
   
   Unfortunately this can result in a situation where the buffers are never 
fully drained, consider the scenario where the user is requesting batches of 
1024 (the same as MIN_BATCH_SIZE). When transitioning across a row group 
boundary, this will some remainder from the row group `r`, before reading 1024 
from the next row group, leaving `1024 + r` in the buffer. The client will the 
only split off the `1024` they actually wanted. This will continue 
indefinitely, with the buffer continuing to contain at least `r` values.
   
   Aside from wasting cycles splitting and shuffling buffers unnecessarily, 
this prevents dictionary preservation from functioning correctly as the buffer 
will never be emptied allowing a new dictionary to be registered. 
   
   With this change, this degenerate case still exists with repeated fields (as 
it did before) but is avoided for the more common case of a non-repeated field.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1180: Preserve dictionary encoding when decoding parquet into Arrow arrays (#171)

Reply via email to