[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690334#comment-17690334
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434445841

   Then I will not open a revert PR. Thanks, everyone, for commenting on this!




> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> ---
>
>                 Key: PARQUET-2244
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2244
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Yujiang Zhong
>            Assignee: Yujiang Zhong
>            Priority: Major
>
> Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on 
> optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dictionary-encoded, `c1` 
> has exactly two distinct values: ['foo', null], and the predicate is 
> `c1 not in ('foo', 'bar')`. 
> The dictionary filter may then skip this row-group even though it should not 
> be skipped, because there are nulls in the column.
>  
> This is a bug similar to #1510.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690099#comment-17690099
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434018036

   > I don't have a strong opinion on whether to keep or revert the fix. The 
   > fix won't cause any correctness issue on the engine side because the engine 
   > will filter again.
   
   Same here.
   
   This is not a correctness issue. Rather, we should be careful with NULL 
behavior since different engines may have different assumptions. I'd say we 
might lose some optimization with this fix, but it is much safer now.






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690097#comment-17690097
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

huaxingao commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434016401

   I don't have a strong opinion on whether to keep or revert the fix. The fix 
won't cause any correctness issue on the engine side because the engine will 
filter again. 






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689729#comment-17689729
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1433005114

   > I don't know if there is a downstream that relies on Parquet judging 
   > `value <> null` as TRUE instead of UNKNOWN, I guess that might be in some 
   > non-ANSI standard engines.
   
   I just want to make sure it's safe to revert this. I did some searching and 
found this: 
https://stackoverflow.com/questions/129077/null-values-inside-not-in-clause
   It seems SQL Server 2005 has a switch, `ansi_nulls`, that controls whether 
the result of `value <> null` is `TRUE` or `UNKNOWN`; when `ansi_nulls` is off, 
`3 <> null` is true. I didn't find such a switch in engines like Hive or Spark, 
though. Do you have any comments on this? Or do you think it's worth caring 
about? @gszadovszky @wgtmac @huaxingao
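
The difference between the two modes can be modeled roughly as below. This is a simplified sketch in Java, not SQL Server's implementation; `AnsiNullsSketch`, `notEquals`, and the `ansiNulls` flag are hypothetical names chosen for illustration, and a `null` return value stands for UNKNOWN.

```java
import java.util.Objects;

// Hypothetical model of `a <> b` with and without ANSI NULL semantics.
final class AnsiNullsSketch {
    // Returns TRUE or FALSE, or null to represent UNKNOWN.
    static Boolean notEquals(Integer a, Integer b, boolean ansiNulls) {
        // ANSI behavior: any comparison involving NULL is UNKNOWN.
        if (ansiNulls && (a == null || b == null)) {
            return null;
        }
        // Non-ANSI behavior (assumed here): NULL is compared like an ordinary value.
        return !Objects.equals(a, b);
    }
}
```

In this model, `notEquals(3, null, true)` is UNKNOWN while `notEquals(3, null, false)` is TRUE, which is the case a downstream engine would have to rely on for the fix to matter.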






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689602#comment-17689602
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

gszadovszky commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432685172

   It seems I pushed it too quickly. Sorry for not giving you time to give 
feedback, @huaxingao and @wgtmac.
   @zhongyujiang, feel free to put up another PR with the revert. 






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689557#comment-17689557
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432596607

   I didn't think about comparisons with non-null values before submitting this 
PR. I don't know if there is a downstream that relies on Parquet judging 
`value <> null` as TRUE instead of UNKNOWN; I guess that might be the case in 
some non-ANSI standard engines. If there is no such situation, I think this PR 
can be reverted safely.






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689546#comment-17689546
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432572977

   I haven't encountered any trouble caused by this situation in practice. I 
found this while reading the code: when evaluating `notIn`, the dictionary 
filter returns `BLOCK_MIGHT_MATCH` when the column isn't in the file, which 
means all values are null (see L450-L453), but it does not consider whether 
there may be null values when the column does exist. I think that's 
inconsistent, so I opened this fix.
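
The inconsistency described above can be sketched as follows. This is a simplified, hypothetical model in Java and not the actual dictionary filter code in parquet-mr: the class `NotInDictionaryCheck`, the method `canDrop`, and the `mayContainNulls` flag are assumptions made for illustration, and the two constants only mimic the `BLOCK_MIGHT_MATCH` result named above.

```java
import java.util.Set;

// Hypothetical simplification of a dictionary-based row-group check for notIn.
final class NotInDictionaryCheck {
    static final boolean BLOCK_MIGHT_MATCH = false;  // keep the row group
    static final boolean BLOCK_CANNOT_MATCH = true;  // drop the row group

    // dictionary: distinct non-null values of the column in this row group.
    // notInValues: the literals listed in the notIn predicate.
    // mayContainNulls: assumed to be known from column metadata.
    static boolean canDrop(Set<String> dictionary, Set<String> notInValues, boolean mayContainNulls) {
        // Conservative choice: nulls never appear in the dictionary, so if the
        // column may contain them, keep the row group.
        if (mayContainNulls) {
            return BLOCK_MIGHT_MATCH;
        }
        // Every value in the row group is excluded by the predicate.
        return notInValues.containsAll(dictionary) ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
    }
}
```

With the example from the issue, `canDrop(Set.of("foo"), Set.of("foo", "bar"), true)` keeps the row group, whereas the same call with `mayContainNulls` set to `false` drops it.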






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689522#comment-17689522
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432527608

   > I did a quick test using Spark
   > 
   > ```
   > import spark.implicits._  // needed for toDF outside spark-shell
   > Seq("A", "A", null).toDF("column").repartition(1).write.mode("overwrite").parquet("t")
   > spark.read.parquet("t").where("NOT (column <=> 'A')").show  // this returns null
   > spark.read.parquet("t").where("NOT (column = 'A')").show    // this returns empty
   > spark.read.parquet("t").where("column IN ('A')").show       // this returns A A
   > spark.read.parquet("t").where("column NOT IN ('A')").show   // this returns empty
   > ```
   > 
   > If we only have `A` and `null` for `column` and we have the predicate 
   > `column not in ('A')`, should we return empty instead of null?
   
   IIUC, 
   - `col IN (A, B)` is equivalent to `col = A OR col = B`
   - `col NOT IN (A, B)` is equivalent to `col <> A AND col <> B`, where 
`col <> A` means `col IS NOT NULL AND col != A`
   
   So my answer to your question above is "empty". @huaxingao 
   
   It seems that we lose the chance to skip the row group with this fix. 
@gszadovszky 
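
The equivalences above can be sketched with SQL's three-valued logic. The following is a minimal illustration in Java, not code from parquet-mr or Spark; the class `ThreeValuedLogic` and its method names are hypothetical, and a `null` `Boolean` stands in for UNKNOWN.

```java
import java.util.List;

// Illustrative only: SQL three-valued logic, with a null Boolean meaning UNKNOWN.
final class ThreeValuedLogic {
    // col = value: UNKNOWN when either side is NULL.
    static Boolean eq(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return a.equals(b);
    }

    // NOT UNKNOWN is still UNKNOWN.
    static Boolean not(Boolean x) {
        return x == null ? null : !x;
    }

    // TRUE wins over UNKNOWN; UNKNOWN wins over FALSE.
    static Boolean or(Boolean x, Boolean y) {
        if (Boolean.TRUE.equals(x) || Boolean.TRUE.equals(y)) {
            return true;
        }
        if (x == null || y == null) {
            return null;
        }
        return false;
    }

    // col IN (v1, v2, ...) is col = v1 OR col = v2 OR ...
    static Boolean in(String col, List<String> values) {
        Boolean result = false;
        for (String v : values) {
            result = or(result, eq(col, v));
        }
        return result;
    }

    // col NOT IN (...) is NOT (col IN (...)).
    static Boolean notIn(String col, List<String> values) {
        return not(in(col, values));
    }

    // A WHERE clause keeps a row only when the predicate is TRUE.
    static boolean keptByWhere(Boolean predicate) {
        return Boolean.TRUE.equals(predicate);
    }
}
```

Under this model, `notIn(null, List.of("A"))` is UNKNOWN rather than TRUE, so a `WHERE` clause drops the null row, which is consistent with the empty result of the Spark test for `column NOT IN ('A')`.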






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689493#comment-17689493
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

huaxingao commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432481988

   I did a quick test using Spark
   
   ```
   import spark.implicits._  // needed for toDF outside spark-shell
   Seq("A", "A", null).toDF("column").repartition(1).write.mode("overwrite").parquet("t")
   spark.read.parquet("t").where("NOT (column <=> 'A')").show  // this returns null
   spark.read.parquet("t").where("NOT (column = 'A')").show    // this returns empty
   spark.read.parquet("t").where("column IN ('A')").show       // this returns A A
   spark.read.parquet("t").where("column NOT IN ('A')").show   // this returns empty
   ```
   
   If we only have `A` and `null` for `column` and we have the predicate 
`column not in ('A')`, should we return empty instead of null?






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689045#comment-17689045
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431180145

   @gszadovszky Thanks for reviewing and the quick merge!






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689035#comment-17689035
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

gszadovszky merged PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689034#comment-17689034
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

gszadovszky commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431153336

   @shangxinli, it might require a backport and releases on the branches where 
`In` and `NotIn` were released.






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689002#comment-17689002
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431067528

   @huaxingao @gszadovszky Can you help review this? Thanks!






[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688999#comment-17688999
 ] 

ASF GitHub Bot commented on PARQUET-2244:
-

zhongyujiang opened a new pull request, #1028:
URL: https://github.com/apache/parquet-mr/pull/1028

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-2244
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does



