[
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276149#comment-17276149
]
Gabor Szadovszky commented on PARQUET-1805:
-------------------------------------------
[~yumwang], I think this performance issue is not related to this jira but to
the whole bloom filter feature (PARQUET-41). If you turn on the writing of
bloom filters for all the columns, it will impact writing performance. (You may
check the related configuration parameters at
https://github.com/apache/parquet-mr/tree/master/parquet-hadoop for details.)
I am not an expert on this feature, and maybe we can improve the writing
performance, but generating bloom filters will have a performance impact. It is
up to the user to decide whether this impact is worth the potential benefit at
read time. That's why it is highly recommended to specify exactly which columns
require bloom filters and also to specify the other bloom filter parameters.
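As a rough illustration of limiting bloom filters to selected columns, the sketch below builds the set of properties one might pass to the writer configuration. The property names (`parquet.bloom.filter.enabled`, `parquet.bloom.filter.expected.ndv`, `parquet.bloom.filter.max.bytes`, and the `#column.path` suffix) are taken from my reading of the parquet-hadoop README linked above and should be checked against the version in use. The helper `bloom_filter_properties` and the column names are hypothetical.

```python
from typing import Dict, Optional


def bloom_filter_properties(
    expected_ndv_by_column: Dict[str, Optional[int]],
    max_bytes: Optional[int] = None,
) -> Dict[str, str]:
    """Build Hadoop-style properties that enable bloom filters per column.

    Hypothetical helper: the property names are assumed from the
    parquet-hadoop README and may differ between releases.
    """
    # Keep bloom filters off globally so only the listed columns pay the write cost.
    props = {"parquet.bloom.filter.enabled": "false"}
    for column, ndv in expected_ndv_by_column.items():
        # The "#<column path>" suffix scopes a property to a single column.
        props[f"parquet.bloom.filter.enabled#{column}"] = "true"
        if ndv is not None:
            # Expected number of distinct values, used to size the filter.
            props[f"parquet.bloom.filter.expected.ndv#{column}"] = str(ndv)
    if max_bytes is not None:
        # Upper bound on the size of a single bloom filter bitset.
        props["parquet.bloom.filter.max.bytes"] = str(max_bytes)
    return props


if __name__ == "__main__":
    # "user_id" and "country" are made-up column names for illustration.
    example = bloom_filter_properties({"user_id": 1_000_000, "country": None})
    for key, value in example.items():
        print(f"{key}={value}")
```

Each resulting key/value pair would then be set on the Hadoop configuration used by the writer, so that columns not listed here are written without bloom filters.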
[~junjie], any comments on this?
> Refactor the configuration for bloom filters
> --------------------------------------------
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.