alippai opened a new issue, #38818:
URL: https://github.com/apache/arrow/issues/38818

   ### Describe the enhancement requested
   
   I have a sorted column of type `pa.utf8()`.
   If I write it directly with the Parquet `use_dictionary` option set, 
`read_table(..., filters=[('column', '=', 'value')])` is fast.
   If I first convert the column to a dictionary using `pc.dictionary_encode()` 
and save it the same way, the same filter is 10-20x slower.
   
   Checking the files with parquet-cli, the metadata and the row groups are 
almost identical (e.g. the stats). However, the pages differ wildly, which I 
guess is the reason for the speed difference: when saving the 
dictionary-encoded column, each dictionary page contains every entry, whereas 
when saving the raw strings, each dictionary page contains only a dozen 
entries.
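   
   A minimal sketch of the two write paths, assuming hypothetical data and a 
hypothetical filter value (only the API calls are taken from the report 
above):
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.parquet as pq

   # Hypothetical sorted utf8 column with many repeated values.
   values = pa.array(
       sorted(f"key_{i % 100:03d}" for i in range(1_000_000)), type=pa.utf8()
   )

   # Path 1: write the plain utf8 column; the Parquet writer
   # dictionary-encodes the pages itself.
   pq.write_table(pa.table({"column": values}), "plain.parquet",
                  use_dictionary=True)

   # Path 2: dictionary-encode in Arrow first, then write.
   encoded = pc.dictionary_encode(values)
   pq.write_table(pa.table({"column": encoded}), "dict.parquet",
                  use_dictionary=True)

   # Same pushed-down filter on both files; the second file is
   # reported to read 10-20x slower.
   pq.read_table("plain.parquet", filters=[("column", "=", "key_050")])
   pq.read_table("dict.parquet", filters=[("column", "=", "key_050")])
   ```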
   
   ### Component(s)
   
   Parquet, Python

