[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

ASF GitHub Bot (Jira) Thu, 16 Feb 2023 18:44:46 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690099#comment-17690099
 ]


ASF GitHub Bot commented on PARQUET-2244:
-----------------------------------------

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434018036

   > I don't have a strong opinion on whether to keep or revert the fix. The 
fix won't cause any correctness issue on the engine side because the engine 
will filter again.
   
   Same here.
   
   This is not a correctness issue. Rather, we should be careful with NULL 
behavior, since different engines may have different assumptions. I'd say we 
might lose some optimization opportunities with this fix, but it is much safer 
now.




> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2244
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2244
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Yujiang Zhong
>            Assignee: Yujiang Zhong
>            Priority: Major
>
> The dictionary filter may skip row-groups incorrectly when evaluating `notIn` 
> on optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dictionary-encoded. `c1` 
> has exactly two distinct values, ['foo', null], and the predicate is 
> `c1 not in ('foo', 'bar')`. 
> The dictionary filter may then skip this row-group even though it should not 
> be skipped, because there are nulls in the column.
>  
> This is a bug similar to #1510.
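
The example in the description can be sketched as a small standalone Java program. This is only an illustration under stated assumptions, not the parquet-mr implementation: the class name `DictionaryNotInSketch`, both method names, and the `mayContainNulls` flag are hypothetical, and the dictionary is modelled as a plain set of strings. It relies on the fact that nulls are not stored in the dictionary page, so a check that looks only at dictionary values cannot see them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryNotInSketch {

    // Hypothetical null-unaware check: drop the row-group when every
    // dictionary value is excluded by the notIn set. Nulls are invisible
    // here because they never appear in the dictionary.
    public static boolean canDropIgnoringNulls(Set<String> dictionary,
                                               Set<String> notInValues) {
        return notInValues.containsAll(dictionary);
    }

    // Hypothetical conservative check: if the column may contain nulls,
    // keep the row-group and let the engine decide how to treat them.
    public static boolean canDropNullAware(Set<String> dictionary,
                                           Set<String> notInValues,
                                           boolean mayContainNulls) {
        if (mayContainNulls) {
            return false;
        }
        return notInValues.containsAll(dictionary);
    }

    public static void main(String[] args) {
        // Column c1 holds 'foo' and null; only 'foo' is in the dictionary.
        Set<String> dictionary = new HashSet<>(Arrays.asList("foo"));
        // Predicate: c1 not in ('foo', 'bar')
        Set<String> notInValues = new HashSet<>(Arrays.asList("foo", "bar"));

        // Null-unaware check wrongly decides the row-group can be skipped.
        System.out.println(canDropIgnoringNulls(dictionary, notInValues));
        // Null-aware check keeps the row-group because nulls are present.
        System.out.println(canDropNullAware(dictionary, notInValues, true));
        // Without nulls, skipping is still allowed.
        System.out.println(canDropNullAware(dictionary, notInValues, false));
    }
}
```

The conservative variant gives up the skip whenever nulls may be present, which matches the trade-off discussed in the comment above: some optimization is lost, but the result does not depend on how a particular engine treats NULL.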



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
