[spark] branch master updated (fab4ceb -> b425156)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from fab4ceb  [SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for adding new functions
  add b425156  [SPARK-38162][SQL] Optimize one row plan in normal and AQE Optimizer

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala    |   2 +-
 .../catalyst/optimizer/OptimizeOneRowPlan.scala    |  49 ++
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |  13 ++-
 .../sql/catalyst/rules/RuleIdCollection.scala      |   1 +
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |   2 +-
 .../catalyst/optimizer/EliminateSortsSuite.scala   |  10 --
 .../optimizer/OptimizeOneRowPlanSuite.scala        | 104 +
 .../sql/execution/adaptive/AQEOptimizer.scala      |   5 +-
 .../adaptive/AdaptiveQueryExecSuite.scala          |  54 +++
 9 files changed, 219 insertions(+), 21 deletions(-)
 create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlan.scala
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlanSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
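The `OptimizeOneRowPlan` rule referenced above eliminates operators that become redundant when a plan emits at most one row. A minimal sketch of the idea in plain Python — the `Plan` class and rule shape here are hypothetical illustrations, not Spark's actual Catalyst API, and the real rule also covers aggregates and distinct:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    """A toy logical plan node: name, an optional upper bound on output rows, one child."""
    name: str
    max_rows: Optional[int] = None
    child: Optional["Plan"] = None

def optimize_one_row_plan(plan: Plan) -> Plan:
    # Recurse bottom-up first, then drop a Sort whose child emits at most one
    # row: ordering a single row cannot change the query result.
    if plan.child is not None:
        plan = Plan(plan.name, plan.max_rows, optimize_one_row_plan(plan.child))
    if plan.name == "Sort" and plan.child is not None and plan.child.max_rows == 1:
        return plan.child
    return plan

limit1 = Plan("Limit", max_rows=1, child=Plan("Scan"))
optimized = optimize_one_row_plan(Plan("Sort", max_rows=1, child=limit1))
print(optimized.name)  # the Sort over a one-row child is removed -> "Limit"
```

The same reasoning lets the AQE optimizer apply the rule again at runtime, once more accurate row counts are known.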
[spark] branch master updated: [SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for adding new functions
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fab4ceb  [SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for adding new functions
fab4ceb is described below

commit fab4ceb157baac870f6d50b942084bb9b2cd4ad2
Author: Wenchen Fan
AuthorDate: Wed Feb 23 15:32:00 2022 +0800

    [SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for adding new functions

    ### What changes were proposed in this pull request?

    This PR improves `RuntimeReplaceable` so that it can
    1. Customize the type coercion behavior instead of always inheriting it from the replacement expression. This is useful for expressions like `ToBinary`, whose replacement expression can be `Cast`, which does not have type coercion.
    2. Support aggregate functions.

    This PR also adds a guideline for adding new SQL functions with `RuntimeReplaceable` and `ExpressionBuilder`. See https://github.com/apache/spark/pull/35534/files#diff-6c6ba3e220b9d155160e4e25305fdd3a4835b7ce9eba230a7ae70bdd97047313R330

    ### Why are the changes needed?

    Since we keep adding new functions, it's better to make `RuntimeReplaceable` more useful and set up a standard for adding functions.

    ### Does this PR introduce _any_ user-facing change?

    Improves error messages of some functions.

    ### How was this patch tested?

    existing tests

    Closes #35534 from cloud-fan/refactor.

    Authored-by: Wenchen Fan
    Signed-off-by: Wenchen Fan
---
 .../spark/examples/extensions/AgeExample.scala     |  13 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala      |   4 +
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  63 +++-
 .../sql/catalyst/analysis/TimeTravelSpec.scala     |   2 +-
 .../sql/catalyst/expressions/Expression.scala      |  81 +++--
 .../spark/sql/catalyst/expressions/TryEval.scala   |  51 ++-
 .../catalyst/expressions/aggregate/CountIf.scala   |  35 +--
 .../catalyst/expressions/aggregate/RegrCount.scala |  19 +-
 ...{UnevaluableAggs.scala => boolAggregates.scala} |  41 +--
 .../expressions/collectionOperations.scala         |  53 +++-
 .../catalyst/expressions/datetimeExpressions.scala | 343 ++---
 .../catalyst/expressions/intervalExpressions.scala |  10 +-
 .../sql/catalyst/expressions/mathExpressions.scala |  97 +++---
 .../spark/sql/catalyst/expressions/misc.scala      |  91 +++---
 .../sql/catalyst/expressions/nullExpressions.scala |  54 +---
 .../catalyst/expressions/regexpExpressions.scala   |  19 +-
 .../catalyst/expressions/stringExpressions.scala   | 207 ++---
 .../sql/catalyst/optimizer/finishAnalysis.scala    |  21 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala     |   2 +-
 .../spark/sql/catalyst/trees/TreePatterns.scala    |   3 -
 .../apache/spark/sql/catalyst/util/package.scala   |   4 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  24 +-
 .../spark/sql/errors/QueryExecutionErrors.scala    |   8 +-
 .../expressions/DateExpressionsSuite.scala         |   8 +-
 .../scala/org/apache/spark/sql/functions.scala     |   4 +-
 .../sql-functions/sql-expression-schema.md         |  20 +-
 .../sql-tests/inputs/string-functions.sql          |   9 +-
 .../resources/sql-tests/results/ansi/map.sql.out   |   4 +-
 .../results/ansi/string-functions.sql.out          |  28 +-
 .../results/ceil-floor-with-scale-param.sql.out    |  14 +-
 .../resources/sql-tests/results/extract.sql.out    |   4 +-
 .../resources/sql-tests/results/group-by.sql.out   |  12 +-
 .../test/resources/sql-tests/results/map.sql.out   |   4 +-
 .../sql-tests/results/string-functions.sql.out     |  28 +-
 .../sql-tests/results/timestamp-ltz.sql.out        |   2 +-
 .../sql-tests/results/udf/udf-group-by.sql.out     |   8 +-
 .../apache/spark/sql/DataFrameAggregateSuite.scala |   3 +-
 37 files changed, 657 insertions(+), 736 deletions(-)

diff --git a/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala b/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
index d25f220..e484024 100644
--- a/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
@@ -18,14 +18,15 @@ package org.apache.spark.examples.extensions

 import org.apache.spark.sql.catalyst.expressions.{CurrentDate, Expression, RuntimeReplaceable, SubtractDates}
+import org.apache.spark.sql.catalyst.trees.UnaryLike

 /**
  * How old are you in days?
  */
-case class AgeExample(birthday: Expression, child: Expression) extends RuntimeReplaceable {
-
-  def this(birthday: Expression) = this(birthday, SubtractDates(CurrentDate(), birthday))
-  override def exprsReplaced: Seq[Expression] = Seq(birthday)
-
-  override prote
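The core idea of `RuntimeReplaceable` is that a user-facing expression is never evaluated itself; a rewrite pass swaps it for an equivalent expression before execution. A toy Python analog of that pattern — the class names and the `replace` pass below are hypothetical illustrations, not Spark's Catalyst trait, which is defined in Scala:

```python
# Toy expression tree: NullIf(a, b) is "runtime replaceable" -- it exists only
# for the user-facing API and is rewritten to If(Equal(a, b), None, a) before
# evaluation, so it never needs its own eval logic.

class Expr: ...

class Literal(Expr):
    def __init__(self, value): self.value = value
    def eval(self): return self.value

class Equal(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def eval(self): return self.l.eval() == self.r.eval()

class If(Expr):
    def __init__(self, cond, t, f): self.cond, self.t, self.f = cond, t, f
    def eval(self): return self.t.eval() if self.cond.eval() else self.f.eval()

class NullIf(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    @property
    def replacement(self) -> Expr:
        return If(Equal(self.l, self.r), Literal(None), self.l)
    def eval(self):
        raise NotImplementedError("replaced before execution")

def replace(expr: Expr) -> Expr:
    # The rewrite pass: substitute the replacement, leave other nodes alone.
    return expr.replacement if isinstance(expr, NullIf) else expr

print(replace(NullIf(Literal(1), Literal(1))).eval())  # None
print(replace(NullIf(Literal(1), Literal(2))).eval())  # 1
```

This keeps the user-facing surface small while reusing existing expressions for execution, which is the motivation for funneling new SQL functions through `RuntimeReplaceable` in the guideline above.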
[GitHub] [spark-website] AngersZhuuuu commented on pull request #380: Fix wrong issue link
AngersZh commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048486813

   > oh wait, we should also update generated HTMLs too.

   updated and double checked.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark-website] AngersZhuuuu commented on pull request #380: Fix wrong issue link
AngersZh commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048485159

   > oh wait, we should also update generated HTMLs too.

   OK, let me change it
[GitHub] [spark-website] HyukjinKwon commented on pull request #380: Fix wrong issue link
HyukjinKwon commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048484840

   oh wait, we should also update generated HTMLs too.
[GitHub] [spark-website] AngersZhuuuu opened a new pull request #380: Fix wrong issue link
AngersZh opened a new pull request #380:
URL: https://github.com/apache/spark-website/pull/380

   Fix wrong issue link
[spark] branch branch-3.2 updated: [SPARK-38297][PYTHON] Explicitly cast the return value at DataFrame.to_numpy in POS
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a0d2be5  [SPARK-38297][PYTHON] Explicitly cast the return value at DataFrame.to_numpy in POS
a0d2be5 is described below

commit a0d2be565486367abd6b637c98634c35420994ce
Author: Hyukjin Kwon
AuthorDate: Wed Feb 23 14:12:39 2022 +0900

    [SPARK-38297][PYTHON] Explicitly cast the return value at DataFrame.to_numpy in POS

    ### What changes were proposed in this pull request?

    The MyPy build currently fails as below:

    ```
    starting mypy annotations test...
    annotations failed mypy checks:
    python/pyspark/pandas/generic.py:585: error: Incompatible return value type (got "Union[ndarray[Any, Any], ExtensionArray]", expected "ndarray[Any, Any]")  [return-value]
    Found 1 error in 1 file (checked 324 source files)
    1
    ```

    https://github.com/apache/spark/runs/5298261168?check_suite_focus=true

    I tried to reproduce it locally by matching NumPy and MyPy versions but failed. So I decided to work around the problem first by explicitly casting to make MyPy happy.

    ### Why are the changes needed?

    To make the build pass.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    CI in this PR should verify if it's fixed.

    Closes #35617 from HyukjinKwon/SPARK-38297.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit b46b74ce0521d1d5e7c09cadad0e9639e31214cb)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/generic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index cdd8f67..c26b516 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -573,7 +573,7 @@ class Frame(object, metaclass=ABCMeta):
         >>> ps.Series(['a', 'b', 'a']).to_numpy()
         array(['a', 'b', 'a'], dtype=object)
         """
-        return self.to_pandas().values
+        return cast(np.ndarray, self._to_pandas().values)

     @property
     def values(self) -> np.ndarray:
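The workaround works because `typing.cast` is purely an annotation for the type checker; it performs no conversion and no runtime check. A small sketch of the behavior the commit relies on — the helper function here is a made-up example, not Spark code:

```python
from typing import List, Union, cast

def to_list(x: Union[List[int], tuple]) -> List[int]:
    # `cast` does not convert the value; it only tells the static type
    # checker to treat it as the given type, exactly the role played by
    # cast(np.ndarray, ...) in the diff above.
    if isinstance(x, tuple):
        x = list(x)
    return cast(List[int], x)

print(to_list((1, 2, 3)))        # [1, 2, 3]
print(cast(int, "not an int"))   # not an int -- cast never changes the value
```

Because `cast` is a no-op at runtime, the fix changes only what MyPy sees, never the value returned to users.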
[spark] branch master updated (43e93b5 -> b46b74c)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 43e93b5  [SPARK-38241][K8S][TESTS] Close KubernetesClient in K8S integrations tests
  add b46b74c  [SPARK-38297][PYTHON] Explicitly cast the return value at DataFrame.to_numpy in POS

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/generic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (2534217 -> 43e93b5)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 2534217  [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` profile
  add 43e93b5  [SPARK-38241][K8S][TESTS] Close KubernetesClient in K8S integrations tests

No new revisions were added by this update.

Summary of changes:
 .../deploy/k8s/integrationtest/backend/cloud/KubeConfigBackend.scala   | 3 +++
 .../k8s/integrationtest/backend/minikube/MinikubeTestBackend.scala     | 3 +++
 2 files changed, 6 insertions(+)
[spark] branch master updated: [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` profile
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2534217  [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` profile
2534217 is described below

commit 25342179447914d76123b8d3ae7bddf34e4bcfba
Author: yangjie01
AuthorDate: Tue Feb 22 18:47:11 2022 -0800

    [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` profile

    ### What changes were proposed in this pull request?

    [SPARK-1189](https://github.com/apache/spark/pull/33/files) introduced a Maven dependency on `commons-net`, and `org.apache.commons.net.util.Base64` was used in `SparkSaslServer`. However, `SparkSaslServer` has since changed to use `io.netty.handler.codec.base64.Base64`, and there is no explicit dependency on `commons-net` in Spark code, so this PR removes the dependency.

    After this PR, Spark with the `hadoop-3` profile no longer needs `commons-net`, but Spark with `hadoop-2` still needs it because `hadoop-2.7.4` uses `commons-net` directly.

    ### Why are the changes needed?

    Remove an unnecessary Maven dependency.

    ### Does this PR introduce _any_ user-facing change?

    The `commons-net` jar no longer exists in Spark-Client with hadoop-3.x.

    ### How was this patch tested?

    Pass GA

    Closes #35582 from LuciferYang/SPARK-38260.

    Authored-by: yangjie01
    Signed-off-by: Dongjoon Hyun
---
 core/pom.xml                          | 4 
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 -
 pom.xml                               | 5 -
 3 files changed, 10 deletions(-)

diff --git a/core/pom.xml b/core/pom.xml
index ac429fc..3d09591 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -251,10 +251,6 @@
       RoaringBitmap
-      commons-net
-      commons-net
-
-
       org.scala-lang.modules
       scala-xml_${scala.binary.version}

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 73644ee..2de677e 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -49,7 +49,6 @@ commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.12.0//commons-lang3-3.12.0.jar
 commons-logging/1.1.3//commons-logging-1.1.3.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
-commons-net/3.1//commons-net-3.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
 commons-text/1.9//commons-text-1.9.jar
 compress-lzf/1.0.3//compress-lzf-1.0.3.jar

diff --git a/pom.xml b/pom.xml
index 23e567c..d1e391c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -804,11 +804,6 @@
         0.9.23
-        commons-net
-        commons-net
-        3.1
-
-
         io.netty
         netty-all
         4.1.74.Final
[spark] branch master updated (4d75d47 -> ceb32c9)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 4d75d47  [SPARK-38062][CORE] Avoid resolving placeholder hostname for FallbackStorage in BlockManagerDecommissioner
  add ceb32c9  [SPARK-38272][K8S][TESTS] Use `docker-desktop` instead of `docker-for-desktop` for Docker K8S IT deployMode and context name

No new revisions were added by this update.

Summary of changes:
 resource-managers/kubernetes/integration-tests/README.md          | 8 
 .../integration-tests/scripts/setup-integration-test-env.sh       | 2 +-
 .../apache/spark/deploy/k8s/integrationtest/TestConstants.scala   | 1 +
 .../k8s/integrationtest/backend/IntegrationTestBackend.scala      | 2 +-
 .../integrationtest/backend/docker/DockerForDesktopBackend.scala  | 2 +-
 5 files changed, 8 insertions(+), 7 deletions(-)
[spark] branch master updated (a11f799 -> 4d75d47)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from a11f799  [SPARK-38121][PYTHON][SQL][FOLLOW-UP] Make df.sparkSession return the session that created DataFrame when SQLContext is used
  add 4d75d47  [SPARK-38062][CORE] Avoid resolving placeholder hostname for FallbackStorage in BlockManagerDecommissioner

No new revisions were added by this update.

Summary of changes:
 .../spark/storage/BlockManagerDecommissioner.scala | 31 +-
 .../spark/storage/FallbackStorageSuite.scala       | 14 +++---
 2 files changed, 22 insertions(+), 23 deletions(-)
[spark] branch master updated (27dbf6f -> a11f799)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 27dbf6f  [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3
  add a11f799  [SPARK-38121][PYTHON][SQL][FOLLOW-UP] Make df.sparkSession return the session that created DataFrame when SQLContext is used

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)
[spark] branch master updated: [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 27dbf6f  [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3
27dbf6f is described below

commit 27dbf6fe67c81887ee656a69fc327f3cb5ae56f2
Author: bjornjorgensen
AuthorDate: Tue Feb 22 13:02:14 2022 -0800

    [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3

    ### What changes were proposed in this pull request?

    Upgrade postgresql 42.3.0 to 42.3.3.
    [Postgresql changelog 42.3.3](https://jdbc.postgresql.org/documentation/changelog.html#version_42.3.3)

    ### Why are the changes needed?

    [CVE-2022-21724](https://nvd.nist.gov/vuln/detail/CVE-2022-21724) and the [Arbitrary File Write Vulnerability](https://github.com/advisories/GHSA-673j-qm5f-xpv8). Upgrading postgresql from 42.3.0 to 42.3.3 resolves these issues.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    All tests must pass.

    Closes #35614 from bjornjorgensen/postgresql-from-42.3.0-to-42.3.3.

    Authored-by: bjornjorgensen
    Signed-off-by: Dongjoon Hyun
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index 788cf8c..23e567c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1181,7 +1181,7 @@
       org.postgresql
       postgresql
-      42.3.0
+      42.3.3
       test
[spark] branch master updated (43822cd -> bd44611)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader
  add bd44611  [SPARK-38290][SQL] Fix JsonSuite and ParquetIOSuite under ANSI mode

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/json/JsonSuite.scala | 39 +-
 .../datasources/parquet/ParquetIOSuite.scala       |  7 +++-
 2 files changed, 29 insertions(+), 17 deletions(-)
[spark] branch master updated: [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader
43822cd is described below

commit 43822cdd228a3ba49c47637c525d731d00772f64
Author: Andy Grove
AuthorDate: Tue Feb 22 08:42:47 2022 -0600

    [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader

    Signed-off-by: Andy Grove

    ### What changes were proposed in this pull request?

    When parsing unquoted JSON `NaN` and `Infinity` values for floating-point columns, we get the expected behavior shown below, where valid values are returned when the parsing option `allowNonNumericNumbers` is enabled and `null` otherwise.

    | Value     | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
    | --------- | --------------------------- | ---------------------------- |
    | NaN       | Double.NaN                  | null                         |
    | +INF      | Double.PositiveInfinity     | null                         |
    | +Infinity | Double.PositiveInfinity     | null                         |
    | Infinity  | Double.PositiveInfinity     | null                         |
    | -INF      | Double.NegativeInfinity     | null                         |
    | -Infinity | Double.NegativeInfinity     | null                         |

    However, when these values are quoted, we get the following unexpected behavior, due to a different code path being used that is inconsistent with Jackson's parsing and that ignores the `allowNonNumericNumbers` parser option.

    | Value       | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
    | ----------- | --------------------------- | ---------------------------- |
    | "NaN"       | Double.NaN                  | Double.NaN                   |
    | "+INF"      | null                        | null                         |
    | "+Infinity" | null                        | null                         |
    | "Infinity"  | Double.PositiveInfinity     | Double.PositiveInfinity      |
    | "-INF"      | null                        | null                         |
    | "-Infinity" | Double.NegativeInfinity     | Double.NegativeInfinity      |

    This PR updates the code path that handles quoted non-numeric numbers to make it consistent with the path that handles the unquoted values.

    ### Why are the changes needed?

    The current behavior does not match the documented behavior in https://spark.apache.org/docs/latest/sql-data-sources-json.html

    ### Does this PR introduce _any_ user-facing change?

    Yes, parsing of quoted `NaN` and `Infinity` values will now be consistent with the unquoted versions.

    ### How was this patch tested?

    Unit tests are updated.

    Closes #35573 from andygrove/SPARK-38060.

    Authored-by: Andy Grove
    Signed-off-by: Sean Owen
---
 docs/core-migration-guide.md                       |  2 ++
 .../spark/sql/catalyst/json/JacksonParser.scala    | 18 ++
 .../datasources/json/JsonParsingOptionsSuite.scala | 39 ++
 .../sql/execution/datasources/json/JsonSuite.scala |  6 
 4 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md
index 745b80d..588433c 100644
--- a/docs/core-migration-guide.md
+++ b/docs/core-migration-guide.md
@@ -26,6 +26,8 @@ license: |

 - Since Spark 3.3, Spark migrates its log4j dependency from 1.x to 2.x because log4j 1.x has reached end of life and is no longer supported by the community. Vulnerabilities reported after August 2015 against log4j 1.x were not checked and will not be fixed. Users should rewrite original log4j properties files using log4j2 syntax (XML, JSON, YAML, or properties format). Spark rewrites the `conf/log4j.properties.template` which is included in Spark distribution, to `conf/log4j2.properties [...]

+- Since Spark 3.3, when reading values from a JSON attribute defined as `FloatType` or `DoubleType`, the strings `"+Infinity"`, `"+INF"`, and `"-INF"` are now parsed to the appropriate values, in addition to the already supported `"Infinity"` and `"-Infinity"` variations. This change was made to improve consistency with Jackson's parsing of the unquoted versions of these values. Also, the `allowNonNumericNumbers` option is now respected so these strings will now be considered invalid if [...]
+
 ## Upgrading from Core 3.1 to 3.2

 - Since Spark 3.2, `spark.scheduler.allocation.file` supports read remote file using hadoop filesystem which means if the path has no scheme Spark will respect hadoop configuration to read it. To restore the behavior before Sp
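Python's standard-library `json` module offers a rough analog of the `allowNonNumericNumbers` switch for the unquoted tokens, which helps illustrate the lenient-versus-strict behavior discussed above. Note the deliberate difference: unlike Spark's JSON reader, stdlib `json` always treats a quoted `"NaN"` as a plain string, which is exactly the quoted-value inconsistency this PR fixes on the Spark side.

```python
import json
import math

# Default (lenient) behavior: the unquoted tokens NaN/Infinity/-Infinity are
# accepted and become Python floats, like allowNonNumericNumbers=true.
lenient = json.loads('{"v": NaN}')
assert math.isnan(lenient["v"])

def reject(token: str):
    # parse_constant is invoked only for NaN, Infinity and -Infinity; raising
    # here mimics allowNonNumericNumbers=false, where such input is invalid.
    raise ValueError(f"non-numeric number not allowed: {token}")

try:
    json.loads('{"v": Infinity}', parse_constant=reject)
except ValueError as e:
    print(e)  # non-numeric number not allowed: Infinity
```

Spark layers an extra rule on top of this: for `FloatType`/`DoubleType` columns it also coerces the quoted variants, and after this PR that coercion respects `allowNonNumericNumbers` too.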
[spark] branch branch-3.2 updated: [SPARK-38271] PoissonSampler may output more rows than MaxRows
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 7d36329  [SPARK-38271] PoissonSampler may output more rows than MaxRows
7d36329 is described below

commit 7d363294b7af212836e7a444ad82c716f3560278
Author: Ruifeng Zheng
AuthorDate: Tue Feb 22 21:04:43 2022 +0800

    [SPARK-38271] PoissonSampler may output more rows than MaxRows

    ### What changes were proposed in this pull request?

    When `replacement=true`, `Sample.maxRows` returns `None`.

    ### Why are the changes needed?

    The underlying implementation of `SampleExec` cannot guarantee that its number of output rows is <= `Sample.maxRows`:

    ```
    scala> val df = spark.range(0, 1000)
    df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

    scala> df.count
    res0: Long = 1000

    scala> df.sample(true, 0.99, 10).count
    res1: Long = 1004
    ```

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    existing testsuites

    Closes #35593 from zhengruifeng/fix_sample_maxRows.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Wenchen Fan
    (cherry picked from commit b68327968a7a5f7ac1afa9cc270204c9eaddcb75)
    Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala  |  6 +-
 .../spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala | 13 +
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 7f33f28..6748db5 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1344,7 +1344,11 @@ case class Sample(
       s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
   }

-  override def maxRows: Option[Long] = child.maxRows
+  override def maxRows: Option[Long] = {
+    // when withReplacement is true, PoissonSampler is applied in SampleExec,
+    // which may output more rows than child.maxRows.
+    if (withReplacement) None else child.maxRows
+  }

   override def output: Seq[Attribute] = child.output

   override protected def withNewChildInternal(newChild: LogicalPlan): Sample =

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
index 46e9dea..d3cbaa8 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
@@ -159,6 +159,19 @@ class CombiningLimitsSuite extends PlanTest {
     )
   }

+  test("SPARK-38271: PoissonSampler may output more rows than child.maxRows") {
+    val query = testRelation.select().sample(0, 0.2, true, 1)
+    assert(query.maxRows.isEmpty)
+    val optimized = Optimize.execute(query.analyze)
+    assert(optimized.maxRows.isEmpty)
+    // can not eliminate Limit since Sample.maxRows is None
+    checkPlanAndMaxRow(
+      query.limit(10),
+      query.limit(10),
+      10
+    )
+  }
+
   test("SPARK-33497: Eliminate Limit if Deduplicate max rows not larger than Limit") {
     checkPlanAndMaxRow(
       testRelation.deduplicate("a".attr).limit(10),
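Why sampling with replacement is unbounded: a Poisson sampler draws an independent count per input row and emits the row that many times, so there is no per-row cap and the output can exceed the input size, as in the `res1: Long = 1004` example above. A standalone Python sketch of the mechanism — this illustrates the technique only and is not Spark's `PoissonSampler` implementation:

```python
import math
import random

def poisson_knuth(lam: float, rng: random.Random) -> int:
    # Knuth's algorithm: multiply uniform draws until the running product
    # drops below e^-lam; the number of full multiplications before that
    # point is Poisson(lam)-distributed.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_with_replacement(rows, fraction: float, seed: int = 10):
    # Per-row Poisson draw: each input row is emitted `count` times, so the
    # total output size is NOT bounded by len(rows) -- hence maxRows = None.
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * poisson_knuth(fraction, rng))
    return out

counts = sample_with_replacement(range(1000), 0.99)
print(len(counts))  # around 990 on average, and can exceed 1000
```

Because the output has no upper bound, any optimizer rule that trusts `maxRows` (such as limit elimination) must see `None` here, which is exactly what the one-line fix enforces.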
[spark] branch master updated (c82e0fe -> b683279)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from c82e0fe  [SPARK-37422][PYTHON][MLLIB] Inline typehints for pyspark.mllib.feature
  add b683279  [SPARK-38271] PoissonSampler may output more rows than MaxRows

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala  |  6 +-
 .../spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala | 13 +
 2 files changed, 18 insertions(+), 1 deletion(-)
[spark] branch master updated (ef818ed -> c82e0fe)
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from ef818ed  [SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode
  add c82e0fe  [SPARK-37422][PYTHON][MLLIB] Inline typehints for pyspark.mllib.feature

No new revisions were added by this update.

Summary of changes:
 python/pyspark/mllib/feature.py  | 218 ---
 python/pyspark/mllib/feature.pyi | 169 --
 2 files changed, 155 insertions(+), 232 deletions(-)
 delete mode 100644 python/pyspark/mllib/feature.pyi
[spark] branch master updated: [SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ef818ed  [SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode
ef818ed is described below

commit ef818ed86ce41be55bd962a5c809974f957f8734
Author: Gengliang Wang
AuthorDate: Tue Feb 22 19:12:02 2022 +0800

    [SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode

    ### What changes were proposed in this pull request?

    Run datetime-parsing-invalid.sql under ANSI mode in SQLQueryTestSuite to improve test coverage. Also, we can simply set ANSI mode as off in DateFunctionsSuite, so that the test suite can pass after we set up a new test job with ANSI on.

    ### Why are the changes needed?

    Improve test coverage and fix DateFunctionsSuite under ANSI mode.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    UT

    Closes #35606 from gengliangwang/fixDateFuncSuite.

    Authored-by: Gengliang Wang
    Signed-off-by: Gengliang Wang
---
 .../inputs/ansi/datetime-parsing-invalid.sql       |   2 +
 .../results/ansi/datetime-parsing-invalid.sql.out  | 263 +
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |   6 +-
 3 files changed, 270 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql b/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql
new file mode 100644
index 000..70022f3
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql
@@ -0,0 +1,2 @@
+--IMPORT datetime-parsing-invalid.sql
+

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out
new file mode 100644
index 000..e6dd07b
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out
@@ -0,0 +1,263 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 29
+
+
+-- !query
+select to_timestamp('294248', 'y')
+-- !query schema
+struct<>
+-- !query output
+java.lang.ArithmeticException
+long overflow
+
+
+-- !query
+select to_timestamp('1', 'yy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('-12', 'yy')
+-- !query schema
+struct<>
+-- !query output
+java.time.format.DateTimeParseException
+Text '-12' could not be parsed at index 0. If necessary set spark.sql.ansi.enabled to false to bypass this error.
+
+
+-- !query
+select to_timestamp('123', 'yy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to parse '123' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('1', 'yyy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('1234567', 'yyy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select to_timestamp('366', 'D')
+-- !query schema
+struct<>
+-- !query output
+java.time.DateTimeException
+Invalid date 'DayOfYear 366' as '1970' is not a leap year. If necessary set spark.sql.ansi.enabled to false to bypass this error.
+
+
+-- !query
+select to_timestamp('9', 'DD')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to parse '9' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark
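The ANSI-mode behavior being exercised here — invalid datetime strings raise an error instead of silently becoming null — has a simple standard-library parallel. The `to_timestamp` wrapper below is a hypothetical illustration of the two modes, not Spark's function:

```python
from datetime import datetime
from typing import Optional

def to_timestamp(text: str, fmt: str, ansi: bool) -> Optional[datetime]:
    # strptime is strict: input that does not match the pattern, or that
    # names an impossible date, raises ValueError -- like ANSI mode.
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        if ansi:
            raise          # ANSI-style: surface the error to the caller
        return None        # non-ANSI style: invalid input becomes null

print(to_timestamp("2022-02-30", "%Y-%m-%d", ansi=False))  # None (Feb 30 is invalid)

try:
    to_timestamp("2022-02-30", "%Y-%m-%d", ansi=True)
except ValueError as e:
    print("ANSI-style failure:", e)
```

Testing both modes, as the new `ansi/datetime-parsing-invalid.sql.out` golden file does, pins down which of these two outcomes each invalid input produces.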
[spark] branch master updated (a103a49 -> 48b56c0)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from a103a49  [SPARK-38279][TESTS][3.2] Pin MarkupSafe to 2.0.1 fix linter failure
  add 48b56c0  [SPARK-38278][PYTHON] Add SparkContext.addArchive in PySpark

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.rst |  1 +
 python/pyspark/context.py                | 44 
 2 files changed, 45 insertions(+)