[GitHub] [arrow-rs] tustvold commented on issue #171: Implement returning dictionary arrays from parquet reader

GitBox Thu, 09 Dec 2021 03:01:26 -0800


tustvold commented on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-989744366



   > I became a father
   
   Congratulations :tada:
   
   >  can't remember if a single column can have pages encoded using dictionary 
and non dictionary encoding
   
   Unfortunately they definitely can, it is in fact explicitly highlighted in 
the arrow C++ blog post on the same topic 
[here](https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/). 
IIRC it occurs when the dictionary grows too large for a single page.
   
   > What is the reason for defaulting this one to false? For any string 
dictionary column this is likely to almost always be what is wanted.
   
   > I wonder if any (or both) of the proposed two new config values 
   
   The problem I was trying to avoid is `ArrayReader` doesn't document its 
termination condition, and so some codepaths may be making the assumption that 
if the length of the returned Array is less than the batch size, the reader is 
exhausted. If I changed the `ArrayReader` to delimit row groups by default, 
this would then break such code.
   
   However, having looked again I'm confident that I can avoid needing to 
expose `delimit_row_groups`, as `ParquetRecordBatchReader` is a proper 
`Iterator` and so doesn't have the same problem. I'm not sure yet if it will be 
simpler to have a default off `delimit_row_groups` on the `ArrayReader` impls 
or just have  `ParquetRecordBatchReader` only give them one row group at a 
time, but it should be possible to avoid exposing this to users in the 
higher-level APIs :smile: 
   
   My reasoning for having `preserve_dictionaries` off by default was that 
theoretically if you had a column chunk with mixed encodings it would have 
worked before, and now might not. I guess I should just implement the handling 
of these PLAIN pages and avoid it needing to be an opt-in config... :thinking: 
   
   > performance as a result of dictionary array preservation depends very much 
on upstream processing
   
   Indeed, if the upstream is just going to fully materialize the dictionary in 
order to do anything with it, there is limited benefit to using dictionaries at 
all :laughing: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] tustvold commented on issue #171: Implement returning dictionary arrays from parquet reader

Reply via email to