alippai opened a new issue, #38818: URL: https://github.com/apache/arrow/issues/38818
### Describe the enhancement requested

I have a sorted column of type `pa.utf8()`. If I write it directly with the parquet `use_dictionary` option set, `read_table(..., filters=[('column', '=', 'value')])` is fast. If I first convert the column to dictionary type with `pc.dictionary_encode()` and save it the same way, the same filter is 10-20x slower.

Inspecting the files with parquet-cli, the file metadata and the row groups are almost identical (e.g. the stats match). The pages, however, differ wildly, and I suspect that explains the speed difference: when saving the dictionary-encoded column, each dictionary page contains every entry, whereas when saving the raw strings, each dictionary page contains only a dozen entries.

### Component(s)

Parquet, Python