tustvold commented on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991936023
> It's just that the reader must be aware that for the same column, some of the returned record batches might contain a dictionary array and others might contain a plain array
I think it would be pretty confusing, and would break quite a lot of code, if `ParquetRecordBatchReader` returned an iterator of `RecordBatch` with varying schemas.
> I wonder if the returned record batches could be split so that a single record batch only contains column data based on a single encoding (only dictionary or plain encoded values for each column)

> […] might be of different length than requested and the requested batch size should be treated more as a max length
The encoding is per-page, and there is no relationship between which rows belong to which pages across column chunks. To get around this, the current logic requires that every `ArrayReader` return `batch_size` rows unless its `PageIterator` has been exhausted. This ensures that all children of a `StructArrayReader`, `MapArrayReader`, etc. produce `ArrayRef` with the same number of rows.
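The same-row-count invariant is what lets a parent reader zip its children together. A minimal sketch of what goes wrong otherwise, using toy arrays rather than the actual reader code:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Utf8, false),
    ]));

    // Two "child" arrays of differing lengths, as would happen if one
    // ArrayReader stopped at a page boundary and another did not.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let b: ArrayRef = Arc::new(StringArray::from(vec!["x", "y"]));

    // Assembling them into a single batch fails: all columns must have the
    // same number of rows, which is why every ArrayReader is required to
    // return batch_size rows until its PageIterator is exhausted.
    assert!(RecordBatch::try_new(schema, vec![a, b]).is_err());
}
```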
This is why I originally proposed a `delimit_row_groups` option: a workaround that still ensures all `ArrayReader` return `ArrayRef` with the same number of rows, but without producing `ArrayRef` that span row groups, and therefore dictionaries.
However, I think my current plan sidesteps the need for this (rough sketches of both steps follow below):

* Compute new dictionaries for the returned `DictionaryArray` if:
  * it spans a row group, and therefore a dictionary boundary, or
  * one or more of its pages uses plain encoding
* Modify `ParquetRecordBatchReader` to not request `RecordBatch` spanning row groups, likely by reducing the `batch_size` passed to `ArrayReader::next_batch`
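A rough sketch of the second bullet; `rows_left_in_row_group` is hypothetical bookkeeping for illustration, not an actual arrow-rs field:

```rust
// Clamp the requested batch size so a RecordBatch never spans a row group
// boundary. The returned batch may be shorter than batch_size, but it will
// never straddle two dictionaries.
fn next_batch_size(batch_size: usize, rows_left_in_row_group: usize) -> usize {
    batch_size.min(rows_left_in_row_group)
}
```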
This avoids the need for config options, or changes to APIs with ambiguous termination criteria, whilst ensuring that most workloads only compute dictionaries for the sections of the parquet file where the dictionary encoding was incomplete. This should make it strictly better than current master, which always recomputes the dictionaries after first fully materializing the values.
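For the first bullet above, one way to recompute a dictionary is arrow's `cast` kernel. A hedged sketch, assuming the values have already been materialized into a plain array (toy data, not the reader internals):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // Values materialized from a mix of plain and dictionary encoded pages.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a", "c"]));

    // Re-encode them under a single new dictionary. This is the fallback
    // cost described above: it only needs to be paid for sections of the
    // file where the dictionary encoding was incomplete.
    let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let dict = cast(&plain, &dict_type)?;
    assert_eq!(dict.data_type(), &dict_type);
    Ok(())
}
```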