[ https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-29162:
-------------------------------
    Description:
I propose the following expression rewrite optimizations:

{code}
NOT isnull(x)     -> isnotnull(x)
NOT isnotnull(x)  -> isnull(x)
{code}

This might seem contrived, but I saw negated versions of these expressions appear in a user-written query after that query had undergone optimization. For example:

{code}
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain(true)

== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
      +- BatchScan[_1#4, _2#5] ParquetScan Location: InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
{code}

This rewrite is also useful for query canonicalization.

  was:
I propose the following expression rewrite optimizations:

{code}
NOT isnull(x)     -> isnotnull(x)
NOT isnotnull(x)  -> isnull(x)
{code}

This might seem contrived, but I saw negated versions of these expressions appear in a user-written query after that query had undergone optimization.
For example:

{code}
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain(true)

== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
      +- BatchScan[_1#4, _2#5] ParquetScan Location: InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
{code}


> Simplify NOT(isnull(x)) and NOT(isnotnull(x))
> ---------------------------------------------
>
>                 Key: SPARK-29162
>                 URL: https://issues.apache.org/jira/browse/SPARK-29162
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> I propose the following expression rewrite optimizations:
> {code}
> NOT isnull(x)     -> isnotnull(x)
> NOT isnotnull(x)  -> isnull(x)
> {code}
> This might seem contrived, but I saw negated versions of these expressions
> appear in a user-written query after that query had undergone optimization.
> For example:
> {code}
> spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", false), ("null", null))).write.parquet("/tmp/bools")
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain(true)
>
> == Parsed Logical Plan ==
> 'Filter NOT ('isnull('_2) OR ('_2 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
>
> == Analyzed Logical Plan ==
> _1: string, _2: boolean
> Filter NOT (isnull(_2#5) OR (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
>
> == Optimized Logical Plan ==
> Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
>
> == Physical Plan ==
> *(1) Project [_1#4, _2#5]
> +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
>    +- *(1) ColumnarToRow
>       +- BatchScan[_1#4, _2#5] ParquetScan Location: InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
> {code}
> This rewrite is also useful for query canonicalization.
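
The rewrite proposed in this issue can be sketched on a toy expression tree. The case classes and the `SimplifyNullChecks` object below are hypothetical stand-ins invented for illustration; they are not Spark's Catalyst classes, and this is not the change that was made to the optimizer. The sketch relies on one fact: `isnull` and `isnotnull` always evaluate to true or false, never to null, so negating one gives exactly the other.

```scala
// Hypothetical toy expression tree (NOT Spark's Catalyst API).
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class IsNull(child: Expr) extends Expr
final case class IsNotNull(child: Expr) extends Expr
final case class Not(child: Expr) extends Expr
final case class And(left: Expr, right: Expr) extends Expr
final case class Or(left: Expr, right: Expr) extends Expr

object SimplifyNullChecks {
  // Bottom-up rewrite: simplify the children first, then the node itself.
  def simplify(e: Expr): Expr = e match {
    case Not(child) =>
      simplify(child) match {
        case IsNull(x)    => IsNotNull(x) // NOT isnull(x)    -> isnotnull(x)
        case IsNotNull(x) => IsNull(x)    // NOT isnotnull(x) -> isnull(x)
        case other        => Not(other)   // any other negation is kept
      }
    case IsNull(c)    => IsNull(simplify(c))
    case IsNotNull(c) => IsNotNull(simplify(c))
    case And(l, r)    => And(simplify(l), simplify(r))
    case Or(l, r)     => Or(simplify(l), simplify(r))
    case a: Attr      => a
  }
}
```

Applied to `And(IsNotNull(Attr("_2")), Not(IsNull(Attr("_2"))))`, which mirrors the redundant pair in the optimized plan above, `simplify` yields two identical `IsNotNull(Attr("_2"))` conjuncts. Collapsing those duplicates would be a separate step that this sketch does not attempt.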