Xinyu Zeng created ARROW-15855:
----------------------------------

             Summary: [Python] Add dictionary_pagesize_limit to Parquet writer
                 Key: ARROW-15855
                 URL: https://issues.apache.org/jira/browse/ARROW-15855
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
            Reporter: Xinyu Zeng
             Fix For: 7.0.0


Although the Python Parquet API is a wrapper around the C++
implementation, some tuning knobs are not exposed in Python, for
example dictionary_pagesize_limit_. The dictionary page can easily
exceed this limit when one or more of the following hold:

1. The row_group_size is relatively large, e.g. the default of 64M.
2. Each entry is large, e.g. a column of large strings.
3. The data has few repeated values.

If this parameter cannot be tuned, dictionary encoding may not be
fully utilized for such columns. In C++, however, the parameter can
be set to a value that suits the data.
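
To illustrate the conditions above, here is a rough sketch that estimates
how large a dictionary page would be for a string column in one row group.
It is not the writer's actual logic. It assumes a 1 MiB limit and assumes
each dictionary entry costs a 4-byte length prefix plus its UTF-8 bytes;
the names estimate_dictionary_bytes, exceeds_limit and ASSUMED_LIMIT_BYTES
are made up for this example.

```python
# Assumed limit for illustration only (1 MiB); not read from any library.
ASSUMED_LIMIT_BYTES = 1024 * 1024


def estimate_dictionary_bytes(values):
    """Estimate the dictionary page size for the strings in one row group.

    Assumption: each distinct value is stored once, as a 4-byte length
    prefix followed by its UTF-8 bytes.
    """
    distinct = set(values)
    return sum(4 + len(v.encode("utf-8")) for v in distinct)


def exceeds_limit(values, limit=ASSUMED_LIMIT_BYTES):
    """Return True if the estimated dictionary would pass the limit."""
    return estimate_dictionary_bytes(values) > limit


# Highly repetitive, small entries: the dictionary stays tiny.
repetitive = ["red", "green", "blue"] * 100_000

# Large entries with no repetition: 20,000 distinct 100-character strings.
varied = [f"{i:0100d}" for i in range(20_000)]

print(estimate_dictionary_bytes(repetitive), exceeds_limit(repetitive))
print(estimate_dictionary_bytes(varied), exceeds_limit(varied))
```

Under these assumptions the second column passes the limit even though it
has far fewer rows than the first, which is the case where a tunable
limit would help.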

Other parameters are also not exposed in Python, for example
max_statistics_size.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
