tustvold commented on issue #171: URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-989744366
> I became a father Congratulations :tada: > can't remember if a single column can have pages encoded using dictionary and non dictionary encoding Unfortunately they definitely can, it is in fact explicitly highlighted in the arrow C++ blog post on the same topic [here](https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/). IIRC it occurs when the dictionary grows too large for a single page. > What is the reason for defaulting this one to false? For any string dictionary column this is likely to almost always be what is wanted. > I wonder if any (or both) of the proposed two new config values The problem I was trying to avoid is `ArrayReader` doesn't document its termination condition, and so some codepaths may be making the assumption that if the length of the returned Array is less than the batch size, the reader is exhausted. If I changed the `ArrayReader` to delimit row groups by default, this would then break such code. However, having looked again I'm confident that I can avoid needing to expose `delimit_row_groups`, as `ParquetRecordBatchReader` is a proper `Iterator` and so doesn't have the same problem. I'm not sure yet if it will be simpler to have a default off `delimit_row_groups` on the `ArrayReader` impls or just have `ParquetRecordBatchReader` only give them one row group at a time, but it should be possible to avoid exposing this to users in the higher-level APIs :smile: My reasoning for having `preserve_dictionaries` off by default was that theoretically if you had a column chunk with mixed encodings it would have worked before, and now might not. I guess I should just implement the handling of these PLAIN pages and avoid it needing to be an opt-in config... :thinking: > performance as a result of dictionary array preservation depends very much on upstream processing Indeed, if the upstream is just going to fully materialize the dictionary in order to do anything with it, there is limited benefit to using dictionaries at all :laughing: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
