nevi-me opened a new pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381
# Which issue does this PR close?
Closes #257.
# Rationale for this change
Parquet splits batches into row groups, which are normally determined by a
`max_row_group_size` setting.
The Arrow writer could not respect this setting because we could not slice into structs and nested arrays correctly.
The issue is that `array.slice(offset: usize, len: usize)` does not propagate the slice into child data and recalculate child offsets, so only the top-level array data is sliced.
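To illustrate why the slice must be propagated, here is a minimal, self-contained sketch (not the arrow-rs implementation) of an Arrow-style list array stored as an offsets buffer over a flat child values buffer. Slicing the list rows requires mapping through the offsets to find the correct child range; reusing the parent's slice bounds directly would expose the wrong child values.

```rust
// Illustrative sketch of Arrow's variable-size list layout: `offsets`
// has one more entry than the number of rows, and row i's values live
// in values[offsets[i]..offsets[i + 1]].
fn slice_list(offsets: &[i32], values: &[i32], start: usize, len: usize) -> Vec<Vec<i32>> {
    // The row slice [start, start + len) maps to the child range
    // [offsets[start], offsets[start + len]) — this lookup through the
    // offsets is the propagation step that a naive top-level slice skips.
    (start..start + len)
        .map(|i| values[offsets[i] as usize..offsets[i + 1] as usize].to_vec())
        .collect()
}

fn main() {
    // Three lists: [1, 2], [3], [4, 5, 6]
    let offsets = [0, 2, 3, 6];
    let values = [1, 2, 3, 4, 5, 6];
    // Slicing rows 1..3 should yield [3] and [4, 5, 6].
    println!("{:?}", slice_list(&offsets, &values, 1, 2));
}
```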
# What changes are included in this PR?
We use the `LevelInfo` struct to keep track of its array's offset and
length. This allows us to track nested arrays' offsets, and calculate the
correct list offsets and lengths.
We then use `arrow::array::slice` to perform zero-copy slices of a batch, limiting the row group size correctly.
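The splitting step can be sketched as follows; this is a hedged illustration with hypothetical names, not the exact arrow-rs internals. Given a batch's row count and `max_row_group_size`, each resulting `(offset, length)` pair corresponds to one zero-copy slice (e.g. `batch.slice(offset, length)`) written as its own row group.

```rust
// Compute the (offset, length) pairs for splitting `num_rows` rows into
// row groups of at most `max_row_group_size` rows each. The final group
// may be shorter than the maximum.
fn row_group_slices(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    (0..num_rows)
        .step_by(max_row_group_size)
        .map(|offset| (offset, max_row_group_size.min(num_rows - offset)))
        .collect()
}

fn main() {
    // 10 rows with a max row group size of 4 → groups of 4, 4, and 2 rows.
    println!("{:?}", row_group_slices(10, 4));
}
```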
I have changed all writer tests to use a max row group size, ensuring that
we aren't introducing bugs when slicing.
Note that this is related to #225, but I don't think it quite covers all its
use-cases.
If we have a sliced record batch per #343, we would need to account for each array's individual offset, as there is no guarantee that all child arrays of a record batch start from the same offset.
# Are there any user-facing changes?
No. All changes are crate-internal.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]