[ 
https://issues.apache.org/jira/browse/ARROW-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyu Zeng updated ARROW-15855:
-------------------------------
    Description: 
Although the Python Parquet API is a wrapper around C++, some tuning knobs are 
not exposed in Python, for example dictionary_pagesize_limit_. The dictionary 
page size will easily exceed the limit when one or more of the following 
hold: 1. The row_group_size is relatively large, e.g. the default is 64M. 2. 
The size per entry is large, e.g. a large string column. 3. The data has few 
repeated values. If this parameter cannot be tuned, dictionary encoding may 
not be fully utilized. In C++, however, this parameter can be tuned to a 
suitable setting.
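
To illustrate the three conditions, here is a rough sketch that estimates the 
dictionary size for one row group of a string column. It does not call Arrow. 
The helper names, the 4-byte length prefix per entry, and the 1 MiB limit are 
assumptions made for illustration; check the actual default in your Arrow 
version.

```python
# Rough estimate of a dictionary page for a string column in one row group.
# Assumption: each distinct value costs its UTF-8 length plus a 4-byte
# length prefix. Real writers may differ in detail.
ASSUMED_DICT_PAGESIZE_LIMIT = 1024 * 1024  # assumed 1 MiB, not read from Arrow


def estimated_dictionary_bytes(values):
    """Sum the estimated encoded size of the distinct values."""
    distinct = set(values)
    return sum(4 + len(v.encode("utf-8")) for v in distinct)


def exceeds_limit(values, limit=ASSUMED_DICT_PAGESIZE_LIMIT):
    """Return True if the estimated dictionary is larger than the limit."""
    return estimated_dictionary_bytes(values) > limit


# Highly repetitive short strings: the dictionary stays tiny.
repetitive = ["red", "green", "blue"] * 100_000

# Mostly unique, large strings: the dictionary grows with the row group.
unique_large = [f"{i:08d}" + "x" * 192 for i in range(10_000)]

print(exceeds_limit(repetitive))    # small dictionary, under the limit
print(exceeds_limit(unique_large))  # large dictionary, over the limit
```

In the second case a writer would likely stop dictionary-encoding partway 
through the row group, which is why being able to raise the limit matters.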

 

Other parameters are also not exposed in Python, for example 
max_statistics_size.

  was:Although the python Parquet api is a wrapper of c++, there are some 
tuning knobs not included in python. For example, dictionary_pagesize_limit_. 
The dictionary page size will easily exceed the limit when any or many of the 
following happen: 1. The row_group_size is relatively large e.g. the default is 
64M. 2. The size per entry is large e.g large string column 3. the 
repeatability of data is not so high. This may result in the dictionary 
encoding not being fully utilized if this parameter cannot be tuned. In C++, 
however, this parameter can be tuned to the optimized setting. There are also 
other parameters not exposed in python, for example, max_statistics_size.


> [Python]Add dictionary_pagesize_limit to Parquet writer
> -------------------------------------------------------
>
>                 Key: ARROW-15855
>                 URL: https://issues.apache.org/jira/browse/ARROW-15855
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Xinyu Zeng
>            Priority: Major
>             Fix For: 7.0.0
>
>
> Although the Python Parquet API is a wrapper around C++, some tuning knobs 
> are not exposed in Python, for example dictionary_pagesize_limit_. The 
> dictionary page size will easily exceed the limit when one or more of the 
> following hold: 1. The row_group_size is relatively large, e.g. the default 
> is 64M. 2. The size per entry is large, e.g. a large string column. 3. The 
> data has few repeated values. If this parameter cannot be tuned, dictionary 
> encoding may not be fully utilized. In C++, however, this parameter can be 
> tuned to a suitable setting.
>  
> Other parameters are also not exposed in Python, for example 
> max_statistics_size.



