[
https://issues.apache.org/jira/browse/ARROW-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche closed ARROW-15855.
-----------------------------------------
Fix Version/s: (was: 8.0.0)
Assignee: (was: Alenka Frim)
Resolution: Duplicate
> [Python] Add dictionary_pagesize_limit to Parquet writer
> --------------------------------------------------------
>
> Key: ARROW-15855
> URL: https://issues.apache.org/jira/browse/ARROW-15855
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
> Reporter: Xinyu Zeng
> Priority: Major
>
> Although the Python Parquet API is a wrapper around C++, some tuning knobs
> are not exposed in Python, for example dictionary_pagesize_limit. The
> dictionary page size will easily exceed the limit when one or more of the
> following hold: 1. The row_group_size is relatively large, e.g. the default
> is 64M. 2. The size per entry is large, e.g. a large string column. 3. The
> data is not very repetitive. If this parameter cannot be tuned, dictionary
> encoding may not be fully utilized. In C++, however, this parameter can be
> tuned to an optimal setting.
>
> Other parameters are also not exposed in Python, for example
> max_statistics_size.
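
The three conditions in the description can be illustrated with a rough, library-free estimate. This is only a sketch: the helper names are hypothetical, the 1 MiB default limit is an assumption about the C++ writer's default, and the size estimate assumes the dictionary stores each distinct string once with a 4-byte length prefix. It does not call the Parquet writer itself.

```python
# Rough estimate of a string column's dictionary page size within one row
# group, compared against a configurable limit. Helper names are hypothetical.

ASSUMED_DEFAULT_LIMIT = 1024 * 1024  # assumption: 1 MiB default limit


def estimated_dictionary_bytes(values):
    """Sum over distinct values of (4-byte length prefix + UTF-8 bytes)."""
    distinct = set(values)
    return sum(4 + len(v.encode("utf-8")) for v in distinct)


def dictionary_fits(values, limit=ASSUMED_DEFAULT_LIMIT):
    """True if the estimated dictionary stays within the limit."""
    return estimated_dictionary_bytes(values) <= limit


# Highly repetitive data: three distinct short strings across 300,000 rows.
low_cardinality = ["red", "green", "blue"] * 100_000

# Large entries with little repetition: 50,000 distinct 64-byte strings.
high_cardinality = [f"user-{i:08d}-" + "x" * 50 for i in range(50_000)]

print(estimated_dictionary_bytes(low_cardinality))
print(estimated_dictionary_bytes(high_cardinality))
print(dictionary_fits(high_cardinality))
print(dictionary_fits(high_cardinality, limit=4 * 1024 * 1024))
```

In this sketch the second column overruns the assumed default limit but fits once the limit is raised, which is the kind of tuning the issue asks to make available from Python.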
--
This message was sent by Atlassian Jira
(v8.20.1#820001)