tustvold commented on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991936023
> It's just that the reader must be aware that for the same column, some of the returned record batches might contain a dictionary array and others might contain a plain array
I think it would be pretty confusing, and would break quite a lot of code, if `ParquetRecordBatchReader` returned an iterator of `RecordBatch` with varying schemas.
> I wonder if the returned record batches could be split so that a single record batch only contains column data based on a single encoding (only dictionary or plain encoded values for each column)

> […] might be of different length than requested and the requested batch size should be treated more as a max length
The encoding is per-page, and there is no relationship between which rows belong to which pages across column chunks. To get around this, the current logic requires that every `ArrayReader` return `batch_size` rows unless its `PageIterator` has been exhausted. This ensures that all children of a `StructArrayReader`, `MapArrayReader`, etc. produce `ArrayRef` with the same number of rows.
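The same-row-count invariant is what lets a parent reader zip its children together. A minimal sketch of what goes wrong otherwise, using toy arrays rather than the actual reader code:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Utf8, false),
    ]));

    // Two "child" arrays of differing lengths, as would happen if one
    // ArrayReader stopped at a page boundary and another did not.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let b: ArrayRef = Arc::new(StringArray::from(vec!["x", "y"]));

    // Assembling them into a single batch fails: all columns must have the
    // same number of rows, which is why every ArrayReader is required to
    // return batch_size rows until its PageIterator is exhausted.
    assert!(RecordBatch::try_new(schema, vec![a, b]).is_err());
}
```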
This is why I originally proposed a `delimit_row_groups` option: a workaround that still ensures all `ArrayReader` return `ArrayRef` with the same number of rows, but without producing `ArrayRef` that span row groups, and therefore dictionaries.
However, I think my current plan sidesteps the need for this (rough sketches of both steps follow below):

* Compute new dictionaries for the returned `DictionaryArray` if:
  * it spans a row group, and therefore a dictionary boundary, or
  * one or more of its pages uses plain encoding
* Modify `ParquetRecordBatchReader` to not request `RecordBatch` spanning row groups, likely by reducing the `batch_size` passed to `ArrayReader::next_batch`
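A rough sketch of the second bullet; `rows_left_in_row_group` is hypothetical bookkeeping for illustration, not an actual arrow-rs field:

```rust
// Clamp the requested batch size so a RecordBatch never spans a row group
// boundary. The returned batch may be shorter than batch_size, but it will
// never straddle two dictionaries.
fn next_batch_size(batch_size: usize, rows_left_in_row_group: usize) -> usize {
    batch_size.min(rows_left_in_row_group)
}
```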
This avoids the need for config options, or changes to APIs with ambiguous termination criteria, whilst ensuring that most workloads only compute dictionaries for the sections of the parquet file where the dictionary encoding was incomplete. This should make it strictly better than current master, which always recomputes the dictionaries after first fully materializing the values.
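For the first bullet above, one way to recompute a dictionary is arrow's `cast` kernel. A hedged sketch, assuming the values have already been materialized into a plain array (toy data, not the reader internals):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // Values materialized from a mix of plain and dictionary encoded pages.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a", "c"]));

    // Re-encode them under a single new dictionary. This is the fallback
    // cost described above: it only needs to be paid for sections of the
    // file where the dictionary encoding was incomplete.
    let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let dict = cast(&plain, &dict_type)?;
    assert_eq!(dict.data_type(), &dict_type);
    Ok(())
}
```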