[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690099#comment-17690099 ]
ASF GitHub Bot commented on PARQUET-2244:
-----------------------------------------

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434018036

   > I don't have a strong opinion on whether to keep or revert the fix. The fix won't cause any correctness issue on the engine side because the engine will filter again.

   Same here. This is not a correctness issue. Rather, we should be careful with NULL behavior, since different engines may make different assumptions. I'd say we might lose some optimization with this fix, but it is much safer now.


> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2244
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2244
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>   Affects Versions: 1.12.2
>            Reporter: Yujiang Zhong
>            Assignee: Yujiang Zhong
>            Priority: Major
>
> Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on
> optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dict encoded, `c1` has
> exactly two distinct values: ['foo', null], and the predicate is `c1 not
> in ('foo', 'bar')`.
> The dictionary filter may skip this row-group even though it should not be
> skipped, because there are nulls in the column.
>
> This is a bug similar to #1510.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
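
The `c1` example from the issue description can be sketched as follows. This is only an illustration of the reasoning, not the parquet-mr implementation: the function names and the `may_contain_nulls` flag are hypothetical. It assumes that nulls are not stored in the dictionary page, and that rows where `c1` is null may satisfy `notIn`, which, as the comment above notes, not every engine assumes.

```python
from typing import Iterable, Set


def can_drop_not_in_naive(dictionary: Set[str], excluded: Iterable[str]) -> bool:
    """Flawed rule: drop the row group when every dictionary value is excluded.

    The dictionary only lists non-null values, so this ignores null rows.
    """
    return dictionary <= set(excluded)


def can_drop_not_in_safe(
    dictionary: Set[str], excluded: Iterable[str], may_contain_nulls: bool
) -> bool:
    """Conservative rule: never drop a row group that may contain nulls."""
    if may_contain_nulls:
        # Null rows might match the predicate, so the row group must be kept.
        return False
    return dictionary <= set(excluded)


# Column c1 holds 'foo' and null; the dictionary therefore contains only 'foo'.
c1_dictionary = {"foo"}
predicate_values = ["foo", "bar"]

naive_drop = can_drop_not_in_naive(c1_dictionary, predicate_values)
safe_drop = can_drop_not_in_safe(c1_dictionary, predicate_values, may_contain_nulls=True)
```

Under these assumptions the naive rule drops the row group and loses the null rows, while the conservative rule keeps it. The cost is that row groups which may contain nulls are never skipped for `notIn`, which matches the lost optimization mentioned in the comment.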