[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016790#comment-16016790 ]
Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~rdblue],

The number of distinct values in each column always increases over time. We may need to care more about the distinct values within a window, such as a row group or a page. Take a telecom company as an example: it produces about one row group (256 MB) every minute, and almost none of the records in that window are repeated. A smaller window may also contain fewer repeated values, although it needs more metadata overall.

As for the effect of the BF, it depends on how much of a query's time is spent on the HDFS scan, in other words, on the data scale. With a one-day telecom data workload on an 8-node cluster, a query takes about 5-6 minutes without the BF and 10+ seconds with it.

For optimization, I agree with your point that a bloom filter becomes useless with the wrong configuration. Users need to understand their data clearly and set the correct parameters. In the future we should also consider setting the bloom filter parameters dynamically based on sampling, or changing the parameters at run time, etc. For now, a 'static' BF should be a good option for users who know their data.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
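
To illustrate the point in the comment above that a statically configured bloom filter depends on knowing the distinct-value count per row group, here is a minimal sketch using the textbook bloom filter formulas. It is not the implementation proposed in the pull request, and the numbers (100,000 expected distinct values, a 1% target false-positive rate, 10,000,000 actual distinct values) are hypothetical and chosen only for illustration.

```python
import math


def bloom_size_bits(expected_ndv: int, target_fpp: float) -> int:
    """Bits for a classic bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_ndv * math.log(target_fpp) / (math.log(2) ** 2))


def optimal_hash_count(size_bits: int, expected_ndv: int) -> int:
    """Number of hash functions: k = (m / n) * ln 2, at least 1."""
    return max(1, round(size_bits / expected_ndv * math.log(2)))


def approx_fpp(size_bits: int, hash_count: int, actual_ndv: int) -> float:
    """Approximate false-positive rate: (1 - e^(-k * n / m))^k."""
    return (1.0 - math.exp(-hash_count * actual_ndv / size_bits)) ** hash_count


# Hypothetical configuration: the user expects 100,000 distinct values
# per row group and targets a 1% false-positive rate.
expected_ndv = 100_000
size_bits = bloom_size_bits(expected_ndv, 0.01)
hash_count = optimal_hash_count(size_bits, expected_ndv)

# Filter used as configured versus fed 100x more distinct values than planned.
well_sized_fpp = approx_fpp(size_bits, hash_count, expected_ndv)
undersized_fpp = approx_fpp(size_bits, hash_count, 10_000_000)

print(f"bits={size_bits}, hashes={hash_count}")
print(f"fpp as configured: {well_sized_fpp:.4f}")
print(f"fpp with 100x more distinct values: {undersized_fpp:.4f}")
```

Under these assumptions, the filter sized for the expected count stays close to its target rate, while the same filter holding far more distinct values answers "maybe present" for nearly every lookup and so filters out almost no row groups.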