[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters

Gabor Szadovszky (Jira) Mon, 01 Feb 2021 00:49:05 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276149#comment-17276149
 ]


Gabor Szadovszky commented on PARQUET-1805:
-------------------------------------------

[~yumwang], I think this performance issue is related not to this Jira issue 
but to the whole bloom filter feature (PARQUET-41). If you turn on bloom filter 
writing for all columns, it will impact write performance. (See the related 
configuration parameters at 
https://github.com/apache/parquet-mr/tree/master/parquet-hadoop for details.)

I am not an expert on this feature, and we may be able to improve the write 
performance, but generating bloom filters will have a performance impact. It is 
up to the user to decide whether this impact is worth the potential benefit at 
read time. That is why it is highly recommended to specify exactly which 
columns need bloom filters, and to set the other bloom filter parameters as 
well.
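
As a rough illustration of the per-column approach, the sketch below builds 
the properties one might set so that only selected columns get bloom filters. 
It assumes the property names and the "#column.path" suffix described in the 
parquet-hadoop README linked above; the class, the method and the "user_id" 
column are hypothetical, and the expected distinct-value count is a made-up 
number. The result is a plain map that would still have to be copied into the 
Hadoop configuration used by the writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of parquet-mr: it only assembles property
// key/value pairs and does not touch any Parquet or Hadoop API.
class BloomFilterConfigSketch {

    // Property names as assumed from the parquet-hadoop README.
    static final String ENABLED = "parquet.bloom.filter.enabled";
    static final String EXPECTED_NDV = "parquet.bloom.filter.expected.ndv";

    // Maps each column path to its expected number of distinct values and
    // returns the properties that enable bloom filters for those columns only.
    static Map<String, String> bloomFilterProperties(Map<String, Long> expectedNdvByColumn) {
        Map<String, String> props = new LinkedHashMap<>();
        // Keep bloom filters off globally so unlisted columns add no write cost.
        props.put(ENABLED, "false");
        for (Map.Entry<String, Long> entry : expectedNdvByColumn.entrySet()) {
            String column = entry.getKey();
            // Column-specific override: "<property>#<column.path>".
            props.put(ENABLED + "#" + column, "true");
            props.put(EXPECTED_NDV + "#" + column, Long.toString(entry.getValue()));
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, Long> columns = new LinkedHashMap<>();
        columns.put("user_id", 1_000_000L); // hypothetical column and NDV
        bloomFilterProperties(columns)
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each resulting pair could then be applied with something like 
conf.set(key, value) on the job's Hadoop Configuration before the files are 
written.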

[~junjie], any comments on this?

> Refactor the configuration for bloom filters
> --------------------------------------------
>
>                 Key: PARQUET-1805
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1805
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> Refactor the Hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
