[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338706#comment-16338706 ]
Junjie Chen commented on PARQUET-41:
------------------------------------

In parquet-mr, when dictionary encoding is enabled, the values writer is a FallbackValuesWriter, which consists of a DictionaryValuesWriter and a fallback writer. The DictionaryValuesWriter first decides whether it should keep using dictionary encoding by checking the dictionary size against maxDictionaryBytes, which defaults to DEFAULT_PAGE_SIZE in ParquetProperties.java. The FallbackValuesWriter then also checks whether the dictionary indices plus the dictionary data are larger than the raw data; see [line 123|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L123] and [line 130|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L130]. So a column with large cardinality should fall back to plain encoding.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>            Priority: Major
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)