[spark] branch branch-3.1 updated: [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 57ed4df  [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode
57ed4df is described below

commit 57ed4df53e3dba01a20ab38ba8098b48594054fd
Author: suqilong
AuthorDate: Wed Dec 9 01:21:13 2020 -0600

    [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode

    ### What changes were proposed in this pull request?
    This change makes InterruptedIOException be treated like InterruptedException when closing YarnClientSchedulerBackend, so that no "YARN application has exited unexpectedly ..." error is logged.

    ### Why are the changes needed?
    In YARN client mode, stopping YarnClientSchedulerBackend first interrupts the YARN application monitor thread. MonitorThread.run() catches InterruptedException so that it can respond gracefully to the stop request. However, client.monitorApplication can also throw InterruptedIOException if a Hadoop RPC call is in flight when the interrupt arrives. In that case the monitor thread does not realize it was interrupted: it reports a failed YARN application instead, and "Failed to contact YARN for application x; YARN application has exited unexpectedly with state x" is logged at error level, which is very confusing for users.

    ### Does this PR introduce _any_ user-facing change?
    Yes.

    ### How was this patch tested?
    Very simple patch; no new test seems necessary.

    Closes #30617 from sqlwindspeaker/yarn-client-interrupt-monitor.
Authored-by: suqilong
Signed-off-by: Mridul Muralidharan gmail.com>
(cherry picked from commit 48f93af9f3d40de5bf087eb1a06c1b9954b2ad76)
Signed-off-by: Mridul Muralidharan
---
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala    | 2 +-
 .../apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala  | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 7f791e0..618faef 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -1069,7 +1069,7 @@ private[spark] class Client(
         logError(s"Application $appId not found.")
         cleanupStagingDir()
         return YarnAppReport(YarnApplicationState.KILLED, FinalApplicationStatus.KILLED, None)
-      case NonFatal(e) =>
+      case NonFatal(e) if !e.isInstanceOf[InterruptedIOException] =>
         val msg = s"Failed to contact YARN for application $appId."
         logError(msg, e)
         // Don't necessarily clean up staging dir because status is unknown
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
index cb0de5a..8a55e61 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.scheduler.cluster
 
+import java.io.InterruptedIOException
+
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.hadoop.yarn.api.records.YarnApplicationState
@@ -121,7 +123,8 @@ private[spark] class YarnClientSchedulerBackend(
           allowInterrupt = false
           sc.stop()
         } catch {
-          case e: InterruptedException => logInfo("Interrupting monitor thread")
+          case _: InterruptedException | _: InterruptedIOException =>
+            logInfo("Interrupting monitor thread")
         }
       }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
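The widened catch clause in the patch above can be illustrated in isolation. The sketch below is a simplified stand-in for the monitor loop, not the actual Spark code; the names `runMonitor` and `poll` are invented for illustration:

```scala
import java.io.InterruptedIOException

// Simplified stand-in for the monitor loop: `poll` represents the
// client.monitorApplication RPC call, which may surface an interrupt
// either as InterruptedException or as InterruptedIOException.
def runMonitor(poll: () => Unit): String = {
  try {
    poll()
    "finished"
  } catch {
    // The fix: match both interrupt signals in one case clause, so an
    // interrupted RPC is treated as a graceful stop, not a YARN failure.
    case _: InterruptedException | _: InterruptedIOException =>
      "interrupted"
  }
}

// An interrupt surfacing from blocked I/O is now handled gracefully.
println(runMonitor(() => throw new InterruptedIOException("RPC interrupted")))
```

Before the patch, only `InterruptedException` matched, so an `InterruptedIOException` escaped to the generic error path.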
[spark] branch master updated: [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 48f93af  [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode
48f93af is described below

commit 48f93af9f3d40de5bf087eb1a06c1b9954b2ad76
Author: suqilong
AuthorDate: Wed Dec 9 01:21:13 2020 -0600

    [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode

    ### What changes were proposed in this pull request?
    This change makes InterruptedIOException be treated like InterruptedException when closing YarnClientSchedulerBackend, so that no "YARN application has exited unexpectedly ..." error is logged.

    ### Why are the changes needed?
    In YARN client mode, stopping YarnClientSchedulerBackend first interrupts the YARN application monitor thread. MonitorThread.run() catches InterruptedException so that it can respond gracefully to the stop request. However, client.monitorApplication can also throw InterruptedIOException if a Hadoop RPC call is in flight when the interrupt arrives. In that case the monitor thread does not realize it was interrupted: it reports a failed YARN application instead, and "Failed to contact YARN for application x; YARN application has exited unexpectedly with state x" is logged at error level, which is very confusing for users.

    ### Does this PR introduce _any_ user-facing change?
    Yes.

    ### How was this patch tested?
    Very simple patch; no new test seems necessary.

    Closes #30617 from sqlwindspeaker/yarn-client-interrupt-monitor.
Authored-by: suqilong
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala    | 2 +-
 .../apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala  | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 7f791e0..618faef 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -1069,7 +1069,7 @@ private[spark] class Client(
         logError(s"Application $appId not found.")
         cleanupStagingDir()
         return YarnAppReport(YarnApplicationState.KILLED, FinalApplicationStatus.KILLED, None)
-      case NonFatal(e) =>
+      case NonFatal(e) if !e.isInstanceOf[InterruptedIOException] =>
         val msg = s"Failed to contact YARN for application $appId."
         logError(msg, e)
         // Don't necessarily clean up staging dir because status is unknown
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
index cb0de5a..8a55e61 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.scheduler.cluster
 
+import java.io.InterruptedIOException
+
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.hadoop.yarn.api.records.YarnApplicationState
@@ -121,7 +123,8 @@ private[spark] class YarnClientSchedulerBackend(
           allowInterrupt = false
           sc.stop()
         } catch {
-          case e: InterruptedException => logInfo("Interrupting monitor thread")
+          case _: InterruptedException | _: InterruptedIOException =>
+            logInfo("Interrupting monitor thread")
         }
       }
[spark] branch branch-3.1 updated: [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new fa50fa1  [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types
fa50fa1 is described below

commit fa50fa1bc17c1904121b35bad381063d9c7c8e70
Author: Kent Yao
AuthorDate: Wed Dec 9 06:44:10 2020 +

    [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types

    ### What changes were proposed in this pull request?
    Add a migration guide entry for the CHAR/VARCHAR types.

    ### Why are the changes needed?
    For migration.

    ### Does this PR introduce _any_ user-facing change?
    Doc change only.

    ### How was this patch tested?
    Passing CI.

    Closes #30654 from yaooqinn/SPARK-33641-F.

Authored-by: Kent Yao
Signed-off-by: Wenchen Fan
(cherry picked from commit c88eddac3bf860d04bba91fc913f8b2069a94153)
Signed-off-by: Wenchen Fan
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 2c86e7a..2bc04a0 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -54,6 +54,8 @@ license: |
 
   - In Spark 3.1, creating or altering a view will capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.useCurrentConfigsForView` to `true`.
 
+  - Since Spark 3.1, CHAR/CHARACTER and VARCHAR types are supported in the table schema. Table scan/insertion will respect the char/varchar semantic. If char/varchar is used in places other than table schema, an exception will be thrown (CAST is an exception that simply treats char/varchar as string like before). To restore the behavior before Spark 3.1, which treats them as STRING types and ignores a length parameter, e.g. `CHAR(4)`, you can set `spark.sql.legacy.charVarcharAsString` to [...]
+
 ## Upgrading from Spark SQL 3.0 to 3.0.1
 
   - In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.
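As a rough illustration of the char/varchar semantic described in the new migration-guide entry, the sketch below models CHAR(n) padding in plain Scala. Both helpers (`charWrite`, `legacyWrite`) are invented for illustration and are not Spark APIs; this is only a model of the behavior, not Spark's implementation:

```scala
// Invented model of the CHAR(n) semantic: values are space-padded to
// the declared length when stored in a CHAR(n) column.
def charWrite(value: String, n: Int): String = value.padTo(n, ' ')

// Invented model of the legacy behavior restored by setting
// spark.sql.legacy.charVarcharAsString: CHAR(n) is treated as STRING
// and the length parameter is ignored.
def legacyWrite(value: String): String = value

println(charWrite("ab", 4))        // "ab  " -- padded to length 4
println(legacyWrite("ab").length)  // 2 -- plain STRING behavior
```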
[spark] branch master updated (29fed23 -> c88edda)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 29fed23  [SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier
     add c88edda  [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated (f021f6d -> 29fed23)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from f021f6d  [MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance
     add 29fed23  [SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala  | 5 +++--
 .../org/apache/spark/sql/catalyst/plans/logical/statements.scala | 5 -
 .../org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala | 7 +++
 .../org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala    | 2 +-
 .../spark/sql/catalyst/analysis/ResolveSessionCatalog.scala      | 7 ++-
 .../sql/execution/datasources/v2/DataSourceV2Strategy.scala      | 3 +++
 .../org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala    | 9 +
 7 files changed, 17 insertions(+), 21 deletions(-)
[spark] branch branch-3.1 updated: [MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new b0a70ab  [MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance
b0a70ab is described below

commit b0a70abed383cedd59f28cec75aa898df0c0b4bd
Author: Weichen Xu
AuthorDate: Wed Dec 9 11:18:09 2020 +0800

    [MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance

    ### What changes were proposed in this pull request?
    Loosen the LogisticRegression test error tolerance.

    ### Why are the changes needed?
    When we switch BLAS versions, some of the tests fail because the error tolerance in the tests is too strict.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    N/A

    Closes #30587 from WeichenXu123/fix_lor_test.

Authored-by: Weichen Xu
Signed-off-by: Weichen Xu
(cherry picked from commit f021f6d3c72e1c84637798b4ddcb7e208fdfbf46)
Signed-off-by: Weichen Xu
---
 .../ml/classification/LogisticRegressionSuite.scala | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index d0b282d..d2814b4 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -1548,9 +1548,9 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest {
     val interceptsExpected1 = Vectors.dense(
       1.152482448372, 3.591773288423673, 5.079685953744937)
 
-    checkCoefficientsEquivalent(model1.coefficientMatrix, coefficientsExpected1)
+    checkBoundedMLORCoefficientsEquivalent(model1.coefficientMatrix, coefficientsExpected1)
     assert(model1.interceptVector ~== interceptsExpected1 relTol 0.01)
-    checkCoefficientsEquivalent(model2.coefficientMatrix, coefficientsExpected1)
+    checkBoundedMLORCoefficientsEquivalent(model2.coefficientMatrix, coefficientsExpected1)
     assert(model2.interceptVector ~== interceptsExpected1 relTol 0.01)
 
     // Bound constrained optimization with bound on both side.
@@ -1585,9 +1585,9 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest {
       isTransposed = true)
     val interceptsExpected3 = Vectors.dense(1.0, 2.0, 2.0)
 
-    checkCoefficientsEquivalent(model3.coefficientMatrix, coefficientsExpected3)
+    checkBoundedMLORCoefficientsEquivalent(model3.coefficientMatrix, coefficientsExpected3)
     assert(model3.interceptVector ~== interceptsExpected3 relTol 0.01)
-    checkCoefficientsEquivalent(model4.coefficientMatrix, coefficientsExpected3)
+    checkBoundedMLORCoefficientsEquivalent(model4.coefficientMatrix, coefficientsExpected3)
     assert(model4.interceptVector ~== interceptsExpected3 relTol 0.01)
 
     // Bound constrained optimization with infinite bound on both side.
@@ -1621,9 +1621,9 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest {
     val interceptsExpected5 = Vectors.dense(
       -2.2231282183460723, 0.3669496747012527, 1.856178543644802)
 
-    checkCoefficientsEquivalent(model5.coefficientMatrix, coefficientsExpected5)
+    checkBoundedMLORCoefficientsEquivalent(model5.coefficientMatrix, coefficientsExpected5)
     assert(model5.interceptVector ~== interceptsExpected5 relTol 0.01)
-    checkCoefficientsEquivalent(model6.coefficientMatrix, coefficientsExpected5)
+    checkBoundedMLORCoefficientsEquivalent(model6.coefficientMatrix, coefficientsExpected5)
     assert(model6.interceptVector ~== interceptsExpected5 relTol 0.01)
   }
 
@@ -1719,9 +1719,9 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest {
       1.7524631428961193, 1.2292565990448736, 1.3433784431904323, 1.5846063017678864),
       isTransposed = true)
 
-    checkCoefficientsEquivalent(model1.coefficientMatrix, coefficientsExpected)
+    checkBoundedMLORCoefficientsEquivalent(model1.coefficientMatrix, coefficientsExpected)
     assert(model1.interceptVector.toArray === Array.fill(3)(0.0))
-    checkCoefficientsEquivalent(model2.coefficientMatrix, coefficientsExpected)
+    checkBoundedMLORCoefficientsEquivalent(model2.coefficientMatrix, coefficientsExpected)
     assert(model2.interceptVector.toArray === Array.fill(3)(0.0))
   }
 
@@ -2953,16 +2953,17 @@ object LogisticRegressionSuite {
   }
 
   /**
+   * Note: This method is only used in Bounded MLOR (without regularization) test
    * When no regularization is applied, the multinomial coefficients lack identifiability
    * because we do not use a pivot class. We can add any constant value to
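The `~== ... relTol 0.01` assertions loosened above come from Spark's test utilities. A minimal sketch of what a relative-tolerance comparison does (an assumption about its shape for illustration, not the actual Spark implementation):

```scala
// Sketch of a relative-tolerance check: two values agree if their
// difference is small relative to their magnitude.
def relTolEq(a: Double, b: Double, tol: Double): Boolean = {
  val diff = math.abs(a - b)
  val scale = math.max(math.abs(a), math.abs(b))
  if (scale == 0.0) diff == 0.0 else diff / scale <= tol
}

// Widening the tolerance makes assertions robust to the small numeric
// differences that appear when switching BLAS implementations.
println(relTolEq(1.0000, 1.0050, 0.01)) // within 1% => true
println(relTolEq(1.0, 1.1, 0.01))       // 10% apart => false
```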
[spark] branch master updated (3ac70f1 -> f021f6d)
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 3ac70f1  [SPARK-33695][BUILD] Upgrade to jackson to 2.10.5 and jackson-databind to 2.10.5.1
     add f021f6d  [MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance

No new revisions were added by this update.

Summary of changes:
 .../ml/classification/LogisticRegressionSuite.scala | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)
[spark] branch master updated (6fd2345 -> 3ac70f1)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 6fd2345  [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
     add 3ac70f1  [SPARK-33695][BUILD] Upgrade to jackson to 2.10.5 and jackson-databind to 2.10.5.1

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 16 
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 16 
 pom.xml                                 |  5 +++--
 3 files changed, 19 insertions(+), 18 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new eae6a3e  [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
eae6a3e is described below

commit eae6a3e9dc75912e2fbe80d86f05cce629de8022
Author: Wenchen Fan
AuthorDate: Tue Dec 8 11:41:35 2020 -0800

    [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++

    ### What changes were proposed in this pull request?
    Currently, Spark treats 0.0 and -0.0 as semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set. The comparison expressions in Spark take care of the special floating numbers and implement the correct semantic. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
    1. GROUP BY
    2. join keys
    3. window partition keys

    This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++

    ### Why are the changes needed?
    Fix the query result.

    ### Does this PR introduce _any_ user-facing change?
    Yes, the result of HyperLogLog++ becomes correct now.

    ### How was this patch tested?
    A new test case, and a few more test cases that pass before this PR, to improve test coverage.

    Closes #30673 from cloud-fan/bug.
Authored-by: Wenchen Fan
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 6fd234503cf1e85715ccd3bda42f29dae1daa71b)
Signed-off-by: Dongjoon Hyun
---
 .../optimizer/NormalizeFloatingNumbers.scala       | 45 ++-
 .../catalyst/util/HyperLogLogPlusPlusHelper.scala  |  8 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 90 ++
 .../aggregate/HyperLogLogPlusPlusSuite.scala       | 24 +-
 4 files changed, 144 insertions(+), 23 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
index f0cf671..59265221 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
@@ -134,6 +134,28 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
     case _ => throw new IllegalStateException(s"fail to normalize $expr")
   }
+
+  val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
+    val f = input.asInstanceOf[Float]
+    if (f.isNaN) {
+      Float.NaN
+    } else if (f == -0.0f) {
+      0.0f
+    } else {
+      f
+    }
+  }
+
+  val DOUBLE_NORMALIZER: Any => Any = (input: Any) => {
+    val d = input.asInstanceOf[Double]
+    if (d.isNaN) {
+      Double.NaN
+    } else if (d == -0.0d) {
+      0.0d
+    } else {
+      d
+    }
+  }
 }
 
 case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes {
@@ -143,27 +165,8 @@ case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with E
   override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(FloatType, DoubleType))
 
   private lazy val normalizer: Any => Any = child.dataType match {
-    case FloatType => (input: Any) => {
-      val f = input.asInstanceOf[Float]
-      if (f.isNaN) {
-        Float.NaN
-      } else if (f == -0.0f) {
-        0.0f
-      } else {
-        f
-      }
-    }
-
-    case DoubleType => (input: Any) => {
-      val d = input.asInstanceOf[Double]
-      if (d.isNaN) {
-        Double.NaN
-      } else if (d == -0.0d) {
-        0.0d
-      } else {
-        d
-      }
-    }
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
   }
 
   override def nullSafeEval(input: Any): Any = {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
index ea619c6..6471a74 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
@@ -22,6 +22,7 @@ import java.util
 
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.XxHash64Function
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers.{DOUBLE_NORMALIZER, FLOAT_NORMALIZER}
 import org.apache.spark.sql.types._
 
 // A helper class for HyperLogLogPlusPlus.
@@
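The FLOAT_NORMALIZER / DOUBLE_NORMALIZER functions added in the diff above can be reproduced standalone. This sketch mirrors the logic from the patch, lifted out of Spark's classes for illustration:

```scala
// Mirrors the normalizers in NormalizeFloatingNumbers: collapse every
// NaN to the canonical NaN and -0.0 to 0.0, so that value-based
// hashing/grouping (e.g. in HyperLogLog++) treats them as identical.
val floatNormalizer: Float => Float = f =>
  if (f.isNaN) Float.NaN else if (f == -0.0f) 0.0f else f

val doubleNormalizer: Double => Double = d =>
  if (d.isNaN) Double.NaN else if (d == -0.0d) 0.0d else d

// -0.0 and 0.0 compare equal (==) but have different bit patterns,
// which is why bit-level hashing must see the normalized value.
val bitsBefore = java.lang.Double.doubleToRawLongBits(-0.0d)
val bitsAfter  = java.lang.Double.doubleToRawLongBits(doubleNormalizer(-0.0d))
println(bitsBefore != bitsAfter) // true: -0.0 was rewritten to 0.0
```

Without this normalization, 0.0 and -0.0 hash to different HLL++ registers and are counted as distinct values.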
[spark] branch branch-3.1 updated: [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 1093c0f  [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
1093c0f is described below

commit 1093c0f3cbd107eccba77dc39dae60e5019578cc
Author: Wenchen Fan
AuthorDate: Tue Dec 8 11:41:35 2020 -0800

    [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++

    ### What changes were proposed in this pull request?
    Currently, Spark treats 0.0 and -0.0 as semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set. The comparison expressions in Spark take care of the special floating numbers and implement the correct semantic. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
    1. GROUP BY
    2. join keys
    3. window partition keys

    This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++

    ### Why are the changes needed?
    Fix the query result.

    ### Does this PR introduce _any_ user-facing change?
    Yes, the result of HyperLogLog++ becomes correct now.

    ### How was this patch tested?
    A new test case, and a few more test cases that pass before this PR, to improve test coverage.

    Closes #30673 from cloud-fan/bug.
Authored-by: Wenchen Fan
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 6fd234503cf1e85715ccd3bda42f29dae1daa71b)
Signed-off-by: Dongjoon Hyun
---
 .../optimizer/NormalizeFloatingNumbers.scala       | 45 ++-
 .../catalyst/util/HyperLogLogPlusPlusHelper.scala  |  8 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 90 ++
 .../aggregate/HyperLogLogPlusPlusSuite.scala       | 24 +-
 4 files changed, 144 insertions(+), 23 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
index 4434c29..ac8766c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala
@@ -143,6 +143,28 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
     case _ => throw new IllegalStateException(s"fail to normalize $expr")
   }
+
+  val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
+    val f = input.asInstanceOf[Float]
+    if (f.isNaN) {
+      Float.NaN
+    } else if (f == -0.0f) {
+      0.0f
+    } else {
+      f
+    }
+  }
+
+  val DOUBLE_NORMALIZER: Any => Any = (input: Any) => {
+    val d = input.asInstanceOf[Double]
+    if (d.isNaN) {
+      Double.NaN
+    } else if (d == -0.0d) {
+      0.0d
+    } else {
+      d
+    }
+  }
 }
 
 case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes {
@@ -152,27 +174,8 @@ case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with E
   override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(FloatType, DoubleType))
 
   private lazy val normalizer: Any => Any = child.dataType match {
-    case FloatType => (input: Any) => {
-      val f = input.asInstanceOf[Float]
-      if (f.isNaN) {
-        Float.NaN
-      } else if (f == -0.0f) {
-        0.0f
-      } else {
-        f
-      }
-    }
-
-    case DoubleType => (input: Any) => {
-      val d = input.asInstanceOf[Double]
-      if (d.isNaN) {
-        Double.NaN
-      } else if (d == -0.0d) {
-        0.0d
-      } else {
-        d
-      }
-    }
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
   }
 
   override def nullSafeEval(input: Any): Any = {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
index ea619c6..6471a74 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
@@ -22,6 +22,7 @@ import java.util
 
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.XxHash64Function
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers.{DOUBLE_NORMALIZER, FLOAT_NORMALIZER}
 import org.apache.spark.sql.types._
 
 // A helper class for HyperLogLogPlusPlus.
@@ -88,7
[spark] branch master updated (c001dd4 -> 6fd2345)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c001dd4  [SPARK-33675][INFRA][FOLLOWUP] Schedule branch-3.1 snapshot at master branch
     add 6fd2345  [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++

No new revisions were added by this update.

Summary of changes:
 .../optimizer/NormalizeFloatingNumbers.scala       | 45 ++-
 .../catalyst/util/HyperLogLogPlusPlusHelper.scala  |  8 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 90 ++
 .../aggregate/HyperLogLogPlusPlusSuite.scala       | 24 +-
 4 files changed, 144 insertions(+), 23 deletions(-)
[spark] branch master updated (c05ee06 -> a093d6f)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c05ee06  [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
     add a093d6f  [MINOR] Spelling sql/core

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/ui/static/spark-sql-viz.js |  8 +++---
 .../main/scala/org/apache/spark/sql/Dataset.scala  | 10 +++
 .../spark/sql/execution/DataSourceScanExec.scala   |  6 ++--
 .../apache/spark/sql/execution/ExplainUtils.scala  |  8 +++---
 .../ExternalAppendOnlyUnsafeRowArray.scala         |  2 +-
 .../spark/sql/execution/SparkSqlParser.scala       | 14 +-
 .../sql/execution/WholeStageCodegenExec.scala      |  2 +-
 .../adaptive/AdaptiveSparkPlanHelper.scala         |  2 +-
 .../command/InsertIntoDataSourceDirCommand.scala   |  2 +-
 .../apache/spark/sql/execution/command/ddl.scala   |  4 +--
 .../spark/sql/execution/command/tables.scala       |  2 +-
 .../sql/execution/datasources/DataSource.scala     |  2 +-
 .../datasources/FileFormatDataWriter.scala         | 14 +-
 .../execution/datasources/FileFormatWriter.scala   |  2 +-
 .../execution/datasources/PartitioningUtils.scala  |  2 +-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |  2 +-
 .../spark/sql/execution/joins/HashedRelation.scala |  4 +--
 .../sql/execution/python/ExtractPythonUDFs.scala   |  6 ++--
 .../streaming/CompactibleFileStreamLog.scala       |  2 +-
 .../sql/execution/streaming/StreamExecution.scala  |  2 +-
 .../state/FlatMapGroupsWithStateExecHelper.scala   |  2 +-
 .../org/apache/spark/sql/internal/HiveSerDe.scala  |  2 +-
 .../spark/sql/streaming/DataStreamWriter.scala     |  4 +--
 .../spark/sql/Java8DatasetAggregatorSuite.java     | 16 +--
 .../spark/sql/JavaDatasetAggregatorSuite.java      | 24 
 .../inputs/ansi/decimalArithmeticOperations.sql    |  2 +-
 .../sql-tests/inputs/postgreSQL/create_view.sql    |  2 +-
 .../org/apache/spark/sql/CachedTableSuite.scala    |  8 +++---
 .../org/apache/spark/sql/DataFrameSuite.scala      |  2 +-
 .../org/apache/spark/sql/DatasetCacheSuite.scala   | 13 +
 .../apache/spark/sql/DatasetPrimitiveSuite.scala   |  8 +++---
 .../scala/org/apache/spark/sql/DatasetSuite.scala  | 32 +++---
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |  6 ++--
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala |  6 ++--
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   | 10 +++
 .../spark/sql/SparkSessionExtensionSuite.scala     | 18 ++--
 .../org/apache/spark/sql/TPCDSTableStats.scala     |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 12 
 .../sql/execution/SQLWindowFunctionSuite.scala     |  2 +-
 .../spark/sql/execution/SparkSqlParserSuite.scala  |  2 +-
 .../sql/execution/WholeStageCodegenSuite.scala     |  4 +--
 .../adaptive/AdaptiveQueryExecSuite.scala          |  8 +++---
 .../sql/execution/arrow/ArrowConvertersSuite.scala |  2 +-
 .../spark/sql/execution/command/DDLSuite.scala     | 12 
 .../execution/command/PlanResolutionSuite.scala    | 16 +--
 .../execution/datasources/DataSourceSuite.scala    |  4 +--
 .../execution/datasources/SchemaPruningSuite.scala |  8 +++---
 .../parquet/ParquetInteroperabilitySuite.scala     |  2 +-
 .../parquet/ParquetPartitionDiscoverySuite.scala   |  4 +--
 .../datasources/parquet/ParquetQuerySuite.scala    |  4 +--
 .../exchange/EnsureRequirementsSuite.scala         |  2 +-
 .../sql/execution/metric/SQLMetricsSuite.scala     |  2 +-
 .../execution/streaming/HDFSMetadataLogSuite.scala |  2 +-
 .../sql/execution/ui/SparkPlanInfoSuite.scala      |  6 ++--
 .../sql/internal/ExecutorSideSQLConfSuite.scala    |  4 +--
 .../org/apache/spark/sql/jdbc/JDBCSuite.scala      | 12 
 .../spark/sql/sources/BucketedReadSuite.scala      | 18 ++--
 .../sql/sources/CreateTableAsSelectSuite.scala     |  2 +-
 .../apache/spark/sql/sources/TableScanSuite.scala  |  6 ++--
 .../sql/streaming/FileStreamSourceSuite.scala      |  4 +--
 .../apache/spark/sql/streaming/StreamSuite.scala   |  8 +++---
 .../streaming/test/DataStreamTableAPISuite.scala   |  8 +++---
 .../org/apache/spark/sql/test/SQLTestData.scala    |  4 +--
 .../apache/spark/sql/test/SharedSparkSession.scala |  2 +-
 64 files changed, 208 insertions(+), 205 deletions(-)
[spark] branch master updated: [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c05ee06  [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
c05ee06 is described below

commit c05ee06f5b711dd261dc94a01b4ba4ffccdf2ea0
Author: Terry Kim
AuthorDate: Tue Dec 8 14:07:58 2020 +

    [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier

    ### What changes were proposed in this pull request?

    This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier. This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or the [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

    ### Why are the changes needed?

    To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Updated existing tests.

    Closes #30636 from imback82/drop_view_v2.
Authored-by: Terry Kim
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 17 ++--
 .../sql/catalyst/analysis/CheckAnalysis.scala      | 13 +++--
 .../sql/catalyst/analysis/ResolveCatalogs.scala    |  5
 ...ble.scala => ResolveCommandsWithIfExists.scala} | 14 +-
 .../sql/catalyst/analysis/v2ResolutionPlans.scala  | 13 +
 .../spark/sql/catalyst/parser/AstBuilder.scala     |  9 ---
 .../sql/catalyst/plans/logical/statements.scala    |  7 -
 .../sql/catalyst/plans/logical/v2Commands.scala    | 15 +--
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 17 +++-
 .../catalyst/analysis/ResolveSessionCatalog.scala  |  5 ++--
 .../datasources/v2/DataSourceV2Strategy.scala      |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 14 +-
 .../spark/sql/execution/command/DDLSuite.scala     |  5 ++--
 .../execution/command/PlanResolutionSuite.scala    | 31 +++---
 .../spark/sql/hive/execution/HiveDDLSuite.scala    |  3 ++-
 15 files changed, 118 insertions(+), 52 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 680ec98..6b0cf4b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -150,7 +150,7 @@ object AnalysisContext {
  * [[UnresolvedRelation]]s into fully typed objects using information in a [[SessionCatalog]].
  */
 class Analyzer(override val catalogManager: CatalogManager)
-  extends RuleExecutor[LogicalPlan] with CheckAnalysis with LookupCatalog with SQLConfHelper {
+  extends RuleExecutor[LogicalPlan] with CheckAnalysis with SQLConfHelper {

   private val v1SessionCatalog: SessionCatalog = catalogManager.v1SessionCatalog

@@ -277,7 +277,7 @@ class Analyzer(override val catalogManager: CatalogManager)
         TypeCoercion.typeCoercionRules ++
         extendedResolutionRules : _*),
     Batch("Post-Hoc Resolution", Once,
-      Seq(ResolveNoopDropTable) ++
+      Seq(ResolveCommandsWithIfExists) ++
         postHocResolutionRules: _*),
     Batch("Normalize Alter Table", Once, ResolveAlterTableChanges),
     Batch("Remove Unresolved Hints", Once,
@@ -889,6 +889,11 @@ class Analyzer(override val catalogManager: CatalogManager)
           u.failAnalysis(s"${ident.quoted} is a temp view. '$cmd' expects a table")
         }
         u
+      case u @ UnresolvedView(ident, _, _) =>
+        lookupTempView(ident).map { _ =>
+          ResolvedView(ident.asIdentifier, isTemp = true)
+        }
+        .getOrElse(u)
       case u @ UnresolvedTableOrView(ident, cmd, allowTempView) =>
         lookupTempView(ident)
           .map { _ =>
@@ -1113,6 +1118,14 @@ class Analyzer(override val catalogManager: CatalogManager)
         case table => table
       }.getOrElse(u)

+      case u @ UnresolvedView(identifier, cmd, relationTypeMismatchHint) =>
+        lookupTableOrView(identifier).map {
+          case v: ResolvedView => v
+          case _ =>
+            u.failAnalysis(s"${identifier.quoted} is a table. '$cmd' expects a view." +
+              relationTypeMismatchHint.map(" " + _).getOrElse(""))
+        }.getOrElse(u)
+
       case u @
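The two `UnresolvedView` cases in the hunks above encode a fixed lookup order: temporary views are checked first, then the catalog, and an identifier that turns out to name a table fails analysis with an "expects a view" error. A minimal Python sketch of that order follows; the function name `resolve_view`, the tuple results, and the one-level string-keyed catalog are illustrative stand-ins, not Spark's API:

```python
class AnalysisError(Exception):
    """Stand-in for Spark's analysis-time failure (failAnalysis)."""

def resolve_view(ident, temp_views, catalog):
    """Resolve a DROP VIEW identifier: temp views shadow catalog objects."""
    if ident in temp_views:
        return ("ResolvedView", ident, True)   # isTemp = True
    kind = catalog.get(ident)
    if kind == "view":
        return ("ResolvedView", ident, False)  # isTemp = False
    if kind == "table":
        # Mirrors: "<ident> is a table. 'DROP VIEW' expects a view."
        raise AnalysisError(f"{ident} is a table. 'DROP VIEW' expects a view.")
    return ("Unresolved", ident)               # left unresolved, like .getOrElse(u)
```

A temp view with the same name as a table wins the lookup, which is exactly the "temp view first" consistency the PR description refers to.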
[spark] branch master updated (99613cd -> 2b30dde)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 99613cd  [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
     add 2b30dde  [SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/parser/SqlBase.g4    |  2 +-
 .../spark/sql/catalyst/analysis/Analyzer.scala     |  2 ++
 .../spark/sql/catalyst/parser/AstBuilder.scala     | 15
 .../sql/catalyst/plans/logical/statements.scala    |  9
 .../sql/catalyst/plans/logical/v2Commands.scala    | 20 ++--
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 20 ++--
 .../datasources/v2/DataSourceV2Strategy.scala      |  3 +++
 .../execution/command/ShowTablesParserSuite.scala  | 27 ++
 .../sql/execution/command/v2/ShowTablesSuite.scala |  7 +++---
 9 files changed, 67 insertions(+), 38 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new ea7c2a1  [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
ea7c2a1 is described below

commit ea7c2a15a02d8c8cf3e3f5a1260da76829d59596
Author: luluorta
AuthorDate: Tue Dec 8 20:45:25 2020 +0900

    [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar

    ### What changes were proposed in this pull request?

    The `LikeSimplification` rule does not work correctly for many cases whose patterns contain escape characters, for example:

    `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
    `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`

    For simplicity, this PR makes the rule simply be skipped if `pattern` contains any `escapeChar`.

    ### Why are the changes needed?

    The rule corrupts query results.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added a unit test.

    Closes #30625 from luluorta/SPARK-33677.
Authored-by: luluorta
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit 99613cd5815b2de12274027dee0c0a6c0c57bd95)
Signed-off-by: Takeshi Yamamuro
---
 .../spark/sql/catalyst/optimizer/expressions.scala | 18 +---
 .../optimizer/LikeSimplificationSuite.scala        | 48 ++
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 14 +++
 3 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
index 7773f5c..7cbeb47 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
@@ -525,27 +525,33 @@ object LikeSimplification extends Rule[LogicalPlan] {
   private val equalTo = "([^_%]*)".r

   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-    case Like(input, Literal(pattern, StringType), escapeChar) =>
+    case l @ Like(input, Literal(pattern, StringType), escapeChar) =>
       if (pattern == null) {
         // If pattern is null, return null value directly, since "col like null" == null.
         Literal(null, BooleanType)
       } else {
-        val escapeStr = String.valueOf(escapeChar)
         pattern.toString match {
-          case startsWith(prefix) if !prefix.endsWith(escapeStr) =>
+          // There are three different situations when pattern containing escapeChar:
+          // 1. pattern contains invalid escape sequence, e.g. 'm\aca'
+          // 2. pattern contains escaped wildcard character, e.g. 'ma\%ca'
+          // 3. pattern contains escaped escape character, e.g. 'ma\\ca'
+          // Although there are patterns can be optimized if we handle the escape first, we just
+          // skip this rule if pattern contains any escapeChar for simplicity.
+          case p if p.contains(escapeChar) => l
+          case startsWith(prefix) =>
             StartsWith(input, Literal(prefix))
           case endsWith(postfix) =>
             EndsWith(input, Literal(postfix))
           // 'a%a' pattern is basically same with 'a%' && '%a'.
           // However, the additional `Length` condition is required to prevent 'a' match 'a%a'.
-          case startsAndEndsWith(prefix, postfix) if !prefix.endsWith(escapeStr) =>
+          case startsAndEndsWith(prefix, postfix) =>
             And(GreaterThanOrEqual(Length(input), Literal(prefix.length + postfix.length)),
               And(StartsWith(input, Literal(prefix)), EndsWith(input, Literal(postfix))))
-          case contains(infix) if !infix.endsWith(escapeStr) =>
+          case contains(infix) =>
             Contains(input, Literal(infix))
           case equalTo(str) =>
             EqualTo(input, Literal(str))
-          case _ => Like(input, Literal.create(pattern, StringType), escapeChar)
+          case _ => l
         }
       }
   }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
index 436f62e..1812dce 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
@@ -116,4 +116,52 @@ class LikeSimplificationSuite extends PlanTest {
     val optimized2 = Optimize.execute(originalQuery2.analyze)
     comparePlans(optimized2, originalQuery2.analyze)
   }
+
+  test("SPARK-33677: LikeSimplification should be skipped if pattern
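The hunk above replaces the per-case `!prefix.endsWith(escapeStr)` guards with one early bail-out (`case p if p.contains(escapeChar) => l`). Why the old guards were insufficient is easiest to see by desugaring LIKE semantics: `'maacaa' ESCAPE 'a'` denotes the literal string `maca`, yet without the bail-out the rule would simplify it to an equality test against `maacaa`. The following Python sketch models the semantics and the fixed rule; it is illustrative only, not Spark's Scala implementation:

```python
import re

def like_to_regex(pattern: str, escape: str) -> str:
    """Desugar a SQL LIKE pattern into an anchored regex, honoring ESCAPE."""
    out, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char matches literally
            i += 2
        elif c == "%":
            out.append(".*")   # % matches any sequence of characters
            i += 1
        elif c == "_":
            out.append(".")    # _ matches any single character
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "^" + "".join(out) + "$"

def simplify_like(pattern: str, escape: str):
    """Mimic the fixed rule: bail out whenever the pattern contains escapeChar."""
    if escape in pattern:
        return None  # SPARK-33677: fall back to full LIKE evaluation
    if pattern.endswith("%") and "%" not in pattern[:-1] and "_" not in pattern:
        return ("startsWith", pattern[:-1])
    if pattern.startswith("%") and "%" not in pattern[1:] and "_" not in pattern:
        return ("endsWith", pattern[1:])
    if "%" not in pattern and "_" not in pattern:
        return ("equalTo", pattern)
    return None  # other shapes left to the real rule's remaining cases
```

With the bail-out removed, `simplify_like("maacaa", "a")` would hit the `equalTo` branch and compare against the raw pattern `maacaa`, which is the result corruption the PR fixes.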
[spark] branch master updated (031c5ef -> 99613cd)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 031c5ef  [SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default
     add 99613cd  [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/expressions.scala | 18 +---
 .../optimizer/LikeSimplificationSuite.scala        | 48 ++
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 14 +++
 3 files changed, 74 insertions(+), 6 deletions(-)
[spark] branch branch-3.1 updated: [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 54a73ab  [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
54a73ab is described below

commit 54a73ab7bb062b0fad3b8925d03bb4dca9fdc17a
Author: luluorta
AuthorDate: Tue Dec 8 20:45:25 2020 +0900

    [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar

    ### What changes were proposed in this pull request?

    The `LikeSimplification` rule does not work correctly for many cases whose patterns contain escape characters, for example:

    `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
    `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`

    For simplicity, this PR makes the rule simply be skipped if `pattern` contains any `escapeChar`.

    ### Why are the changes needed?

    The rule corrupts query results.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added a unit test.

    Closes #30625 from luluorta/SPARK-33677.
Authored-by: luluorta
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit 99613cd5815b2de12274027dee0c0a6c0c57bd95)
Signed-off-by: Takeshi Yamamuro
---
 .../spark/sql/catalyst/optimizer/expressions.scala | 18 +---
 .../optimizer/LikeSimplificationSuite.scala        | 48 ++
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 14 +++
 3 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
index 1b1e2ad..b2fc334 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
@@ -543,27 +543,33 @@ object LikeSimplification extends Rule[LogicalPlan] {
   private val equalTo = "([^_%]*)".r

   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-    case Like(input, Literal(pattern, StringType), escapeChar) =>
+    case l @ Like(input, Literal(pattern, StringType), escapeChar) =>
       if (pattern == null) {
         // If pattern is null, return null value directly, since "col like null" == null.
         Literal(null, BooleanType)
       } else {
-        val escapeStr = String.valueOf(escapeChar)
         pattern.toString match {
-          case startsWith(prefix) if !prefix.endsWith(escapeStr) =>
+          // There are three different situations when pattern containing escapeChar:
+          // 1. pattern contains invalid escape sequence, e.g. 'm\aca'
+          // 2. pattern contains escaped wildcard character, e.g. 'ma\%ca'
+          // 3. pattern contains escaped escape character, e.g. 'ma\\ca'
+          // Although there are patterns can be optimized if we handle the escape first, we just
+          // skip this rule if pattern contains any escapeChar for simplicity.
+          case p if p.contains(escapeChar) => l
+          case startsWith(prefix) =>
             StartsWith(input, Literal(prefix))
           case endsWith(postfix) =>
             EndsWith(input, Literal(postfix))
           // 'a%a' pattern is basically same with 'a%' && '%a'.
           // However, the additional `Length` condition is required to prevent 'a' match 'a%a'.
-          case startsAndEndsWith(prefix, postfix) if !prefix.endsWith(escapeStr) =>
+          case startsAndEndsWith(prefix, postfix) =>
             And(GreaterThanOrEqual(Length(input), Literal(prefix.length + postfix.length)),
               And(StartsWith(input, Literal(prefix)), EndsWith(input, Literal(postfix))))
-          case contains(infix) if !infix.endsWith(escapeStr) =>
+          case contains(infix) =>
             Contains(input, Literal(infix))
           case equalTo(str) =>
             EqualTo(input, Literal(str))
-          case _ => Like(input, Literal.create(pattern, StringType), escapeChar)
+          case _ => l
         }
       }
   }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
index 436f62e..1812dce 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala
@@ -116,4 +116,52 @@ class LikeSimplificationSuite extends PlanTest {
     val optimized2 = Optimize.execute(originalQuery2.analyze)
     comparePlans(optimized2, originalQuery2.analyze)
   }
+
+  test("SPARK-33677: LikeSimplification should be skipped if pattern