[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690334#comment-17690334 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434445841

   Then I will not open a revert PR. Thanks, everyone, for commenting on this!

> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> ------------------------------------------------------------------------
>
> Key: PARQUET-2244
> URL: https://issues.apache.org/jira/browse/PARQUET-2244
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: Yujiang Zhong
> Assignee: Yujiang Zhong
> Priority: Major
>
> Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on
> optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dictionary-encoded, `c1` has
> exactly two distinct values: ['foo', null], and the predicate is `c1 not
> in ('foo', 'bar')`.
> Now the dictionary filter may skip this row-group even though it should not be
> skipped, because there are nulls in the column.
>
> This is a bug similar to #1510.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690099#comment-17690099 ]

ASF GitHub Bot commented on PARQUET-2244:

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434018036

   > I don't have a strong opinion on whether to keep or revert the fix. The fix won't cause any correctness issue on the engine side because engine will filter again.

   Same here. This is not a correctness issue. Rather, we should be careful with NULL behavior, since different engines may make different assumptions. I'd say we might lose some optimization with this fix, but it is much safer now.
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690097#comment-17690097 ]

ASF GitHub Bot commented on PARQUET-2244:

huaxingao commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434016401

   I don't have a strong opinion on whether to keep or revert the fix. The fix won't cause any correctness issue on the engine side, because the engine will filter again.
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689729#comment-17689729 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1433005114

   > I don't know if there is a downstream that relies on Parquet judge value <> null as TRUE instead of UNKNOWN, I guess that might be in some non-ansi standard engines.

   I just want to make sure it's safe to revert this. I did some searching and found this:
   https://stackoverflow.com/questions/129077/null-values-inside-not-in-clause
   It seems SQL Server 2005 has a switch, `ansi_nulls`, that controls whether the result of `value <> null` is `TRUE` or `UNKNOWN`: when `ansi_nulls` is off, `3 <> null` is true. I didn't find such a switch in engines like Hive or Spark, though.
   Do you have any comments on this? Or do you think it's worth caring about? @gszadovszky @wgtmac @huaxingao
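For illustration only, the two behaviors mentioned in the comment above can be modeled in Java, with a `null` `Boolean` standing in for SQL's UNKNOWN. The class name `AnsiNullsSketch`, the method `notEqual` and the `ansiNulls` flag are hypothetical; this is not how SQL Server or Parquet implements comparisons, only a small model of the behavior described in the linked question.

```java
// Hypothetical model of `left <> right` under two NULL-handling modes.
// A null Boolean result represents SQL's UNKNOWN.
final class AnsiNullsSketch {

    static Boolean notEqual(Integer left, Integer right, boolean ansiNulls) {
        if (left == null || right == null) {
            if (ansiNulls) {
                // ANSI behavior: any comparison with NULL is UNKNOWN.
                return null;
            }
            // Non-ANSI behavior (assumed): NULL is treated like an ordinary
            // value, so the operands differ unless both are NULL.
            return (left == null) != (right == null);
        }
        return !left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(notEqual(3, null, true));   // prints "null" (UNKNOWN)
        System.out.println(notEqual(3, null, false));  // prints "true"
    }
}
```

A `WHERE` clause keeps only rows whose predicate is TRUE, so the first mode drops the row and the second keeps it. That difference is what a downstream engine could in principle depend on.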
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689602#comment-17689602 ]

ASF GitHub Bot commented on PARQUET-2244:

gszadovszky commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432685172

   It seems I pushed it too quickly. Sorry for not giving you time to give feedback, @huaxingao and @wgtmac.
   @zhongyujiang, feel free to put up another PR with the revert.
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689557#comment-17689557 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432596607

   I didn't think about comparisons with non-null values before submitting this PR. I don't know whether any downstream relies on Parquet evaluating `value <> null` as TRUE instead of UNKNOWN; I guess that might happen in some non-ANSI-standard engines. If there is no such situation, I think this PR can be reverted safely.
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689546#comment-17689546 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432572977

   I haven't encountered any trouble caused by this situation in practice. I found it while reading the code: when evaluating `notIn`, the dictionary filter returns `BLOCK_MIGHT_MATCH` when the column isn't in the file, which means all values are null (see L450-L453), but it does not consider whether there may be null values when the column does exist. I think that is inconsistent, so I opened this fix.
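The inconsistency described in this comment can be sketched as follows. This is a simplified, hypothetical model and not the parquet-mr `DictionaryFilter` code: only the name `BLOCK_MIGHT_MATCH` comes from the comment, while `BLOCK_CANNOT_MATCH`, the `Verdict` enum, the `notIn` signature and the `nullAware` flag are assumptions made for illustration. Treating a chunk with nulls as "might match" is one plausible way to make the two cases consistent.

```java
import java.util.Set;

// Hypothetical, simplified model of the row-group decision for notIn.
final class NotInDictionarySketch {

    enum Verdict { BLOCK_MIGHT_MATCH, BLOCK_CANNOT_MATCH }

    /**
     * @param dictionary  distinct non-null values of the column chunk,
     *                    or null if the column is absent from the file
     * @param hasNulls    whether the column chunk contains null values
     * @param notInValues values listed in the notIn predicate
     * @param nullAware   true models a filter that also checks for nulls
     */
    static Verdict notIn(Set<String> dictionary, boolean hasNulls,
                         Set<String> notInValues, boolean nullAware) {
        if (dictionary == null) {
            // Column missing from the file: every value is null, keep the row group.
            return Verdict.BLOCK_MIGHT_MATCH;
        }
        if (nullAware && hasNulls) {
            // Column present but contains nulls: treated like the missing-column case.
            return Verdict.BLOCK_MIGHT_MATCH;
        }
        // Every non-null value is excluded by the predicate: nothing can match.
        return notInValues.containsAll(dictionary)
                ? Verdict.BLOCK_CANNOT_MATCH
                : Verdict.BLOCK_MIGHT_MATCH;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("foo");
        Set<String> excluded = Set.of("foo", "bar");
        System.out.println(notIn(null, true, excluded, false)); // BLOCK_MIGHT_MATCH
        System.out.println(notIn(dict, true, excluded, false)); // BLOCK_CANNOT_MATCH
        System.out.println(notIn(dict, true, excluded, true));  // BLOCK_MIGHT_MATCH
    }
}
```

The second call mirrors the example from the issue (`c1` holds only 'foo' and null, predicate `c1 not in ('foo', 'bar')`): without the null check the row group is skipped, with it the row group is kept.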
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689522#comment-17689522 ]

ASF GitHub Bot commented on PARQUET-2244:

wgtmac commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432527608

   > I did a quick test using Spark
   >
   > ```
   > Seq("A", "A", null).toDF("column").repartition(1).write.mode("overwrite").parquet("t")
   > spark.read.parquet("t").where("NOT (column <=> 'A')").show  // this returns null
   > spark.read.parquet("t").where("NOT (column = 'A')").show  // this returns empty
   > spark.read.parquet("t").where("column IN ('A')").show  // this returns A A
   > spark.read.parquet("t").where("column NOT IN ('A')").show  // this returns empty
   > ```
   >
   > If we only has `A` and `null` for `column` and we have predicate `column not in ('A')`, should we return empty instead of null?

   IIUC,
   - `col IN (A, B)` is equal to `col = A OR col = B`
   - `col NOT IN (A, B)` is equal to `col <> A AND col <> B`, where `col <> A` means `col IS NOT NULL and col != A`

   So my answer to your question above is empty. @huaxingao

   It seems that we lose the chance to skip the row group with this fix. @gszadovszky
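The rewrite of `NOT IN` into a conjunction of `<>` comparisons can be checked with a small model of SQL's three-valued logic. The sketch below is illustrative only: the `Tri` enum and the helper names are made up here, and it models standard ANSI semantics rather than any particular engine.

```java
import java.util.List;

// Illustrative model of three-valued logic for `col NOT IN (v1, ..., vn)`.
final class NotInThreeValued {

    enum Tri { TRUE, FALSE, UNKNOWN }

    // col <> value: UNKNOWN if either operand is null.
    static Tri notEqual(String col, String value) {
        if (col == null || value == null) {
            return Tri.UNKNOWN;
        }
        return col.equals(value) ? Tri.FALSE : Tri.TRUE;
    }

    // Kleene AND: FALSE dominates, then UNKNOWN, otherwise TRUE.
    static Tri and(Tri a, Tri b) {
        if (a == Tri.FALSE || b == Tri.FALSE) {
            return Tri.FALSE;
        }
        if (a == Tri.UNKNOWN || b == Tri.UNKNOWN) {
            return Tri.UNKNOWN;
        }
        return Tri.TRUE;
    }

    // col NOT IN (v1, ..., vn)  ==  col <> v1 AND ... AND col <> vn
    static Tri notIn(String col, List<String> values) {
        Tri result = Tri.TRUE;
        for (String v : values) {
            result = and(result, notEqual(col, v));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = List.of("foo", "bar");
        System.out.println(notIn("foo", values)); // FALSE
        System.out.println(notIn("baz", values)); // TRUE
        System.out.println(notIn(null, values));  // UNKNOWN
    }
}
```

Because a `WHERE` clause keeps only rows that evaluate to TRUE, a null column value (UNKNOWN) is filtered out just like 'foo' (FALSE). Under these semantics a row group containing only 'foo' and null has no matching rows, which is why skipping it loses nothing.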
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689493#comment-17689493 ]

ASF GitHub Bot commented on PARQUET-2244:

huaxingao commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432481988

   I did a quick test using Spark:

   ```
   // Write a single-file Parquet table with the values A, A, null.
   Seq("A", "A", null).toDF("column").repartition(1).write.mode("overwrite").parquet("t")
   spark.read.parquet("t").where("NOT (column <=> 'A')").show  // this returns the null row
   spark.read.parquet("t").where("NOT (column = 'A')").show    // this returns empty
   spark.read.parquet("t").where("column IN ('A')").show       // this returns A A
   spark.read.parquet("t").where("column NOT IN ('A')").show   // this returns empty
   ```

   If we only have `A` and `null` for `column` and the predicate is `column not in ('A')`, should we return empty instead of null?
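The four results reported above are what standard null semantics predict. As a rough cross-check that needs no Spark session, the sketch below re-evaluates the same four predicates over the rows A, A, null in plain Java. All names are hypothetical, a `null` `Boolean` stands for UNKNOWN, and the code models the semantics only; it says nothing about how Spark or Parquet evaluates filters internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical model of the four predicates from the Spark test.
final class SparkNullSemanticsSketch {

    // col = lit: UNKNOWN (null) if either side is null.
    static Boolean eq(String col, String lit) {
        return (col == null || lit == null) ? null : col.equals(lit);
    }

    // col <=> lit: null-safe equality, never UNKNOWN.
    static boolean nullSafeEq(String col, String lit) {
        return Objects.equals(col, lit);
    }

    // NOT: UNKNOWN stays UNKNOWN.
    static Boolean not(Boolean b) {
        return b == null ? null : !b;
    }

    // col IN (lits): TRUE on any match, otherwise UNKNOWN if a comparison was UNKNOWN.
    static Boolean in(String col, List<String> lits) {
        boolean sawUnknown = false;
        for (String lit : lits) {
            Boolean r = eq(col, lit);
            if (r == null) {
                sawUnknown = true;
            } else if (r) {
                return true;
            }
        }
        return sawUnknown ? null : Boolean.FALSE;
    }

    // WHERE keeps only rows whose predicate is TRUE.
    static List<String> where(List<String> rows, Function<String, Boolean> pred) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (Boolean.TRUE.equals(pred.apply(row))) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A", "A", null);
        System.out.println(where(rows, c -> not(nullSafeEq(c, "A"))));   // [null]
        System.out.println(where(rows, c -> not(eq(c, "A"))));           // []
        System.out.println(where(rows, c -> in(c, List.of("A"))));       // [A, A]
        System.out.println(where(rows, c -> not(in(c, List.of("A")))));  // []
    }
}
```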
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689045#comment-17689045 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431180145

   @gszadovszky Thanks for reviewing and the quick merge!
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689035#comment-17689035 ]

ASF GitHub Bot commented on PARQUET-2244:

gszadovszky merged PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689034#comment-17689034 ]

ASF GitHub Bot commented on PARQUET-2244:

gszadovszky commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431153336

   @shangxinli, it might require a backport and releases on the branches where `In` and `NotIn` were released.
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689002#comment-17689002 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang commented on PR #1028:
URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1431067528

   @huaxingao @gszadovszky Can you help review this? Thanks!
[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688999#comment-17688999 ]

ASF GitHub Bot commented on PARQUET-2244:

zhongyujiang opened a new pull request, #1028:
URL: https://github.com/apache/parquet-mr/pull/1028

   Make sure you have checked _all_ steps below.

   ### Jira

   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-2244
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

   ### Tests

   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

   ### Commits

   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"

   ### Documentation

   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does