[spark] branch master updated: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2d175986906 [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4
2d175986906 is described below

commit 2d175986906b9ddf4b10b2b50d635b8bc07908fd
Author: Xinrong Meng
AuthorDate: Fri May 6 10:40:08 2022 +0900

[SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

### What changes were proposed in this pull request?
Adjust `GroupBy.std` to match pandas 1.4. Specifically, raise the TypeError when all aggregation columns are of unaccepted data types.

### Why are the changes needed?
Improve API compatibility with pandas.

### Does this PR introduce _any_ user-facing change?
Yes.
```py
>>> psdf = ps.DataFrame(
...     {
...         "A": [1, 2, 1, 2],
...         "B": [3.1, 4.1, 4.1, 3.1],
...         "C": ["a", "b", "b", "a"],
...         "D": [True, False, False, True],
...     }
... )
>>> psdf
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

### Before
>>> psdf.groupby('A')[['C']].std()
Empty DataFrame
Columns: []
Index: [1, 2]

### After
>>> psdf.groupby('A')[['C']].std()
...
TypeError: Unaccepted data types of aggregation columns; numeric or bool expected.
```

### How was this patch tested?
Unit tests.

Closes #36444 from xinrong-databricks/groupby.std.

Authored-by: Xinrong Meng
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/groupby.py            | 15 +--
 python/pyspark/pandas/tests/test_groupby.py | 19 +++
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 386b24c1916..20f7ec55660 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -640,6 +640,17 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
         """
         assert ddof in (0, 1)
 
+        # Raise the TypeError when all aggregation columns are of unaccepted data types
+        all_unaccepted = True
+        for _agg_col in self._agg_columns:
+            if isinstance(_agg_col.spark.data_type, (NumericType, BooleanType)):
+                all_unaccepted = False
+                break
+        if all_unaccepted:
+            raise TypeError(
+                "Unaccepted data types of aggregation columns; numeric or bool expected."
+            )
+
         return self._reduce_for_stat_function(
             F.stddev_pop if ddof == 0 else F.stddev_samp,
             accepted_spark_types=(NumericType,),
@@ -2756,9 +2767,9 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
         Parameters
         ----------
-        sfun : The aggregate function to apply per column
+        sfun : The aggregate function to apply per column.
         accepted_spark_types: Accepted spark types of columns to be aggregated;
-            default None means all spark types are accepted
+            default None means all spark types are accepted.
         bool_to_numeric: If True, boolean columns are converted to numeric columns, which
             are accepted for all statistical functions regardless of `accepted_spark_types`.
diff --git a/python/pyspark/pandas/tests/test_groupby.py b/python/pyspark/pandas/tests/test_groupby.py
index f645373eb3c..33f24a5e2be 100644
--- a/python/pyspark/pandas/tests/test_groupby.py
+++ b/python/pyspark/pandas/tests/test_groupby.py
@@ -1286,13 +1286,24 @@ class GroupByTest(PandasOnSparkTestCase, TestUtils):
             ps.DataFrame({"B": [3.1, 3.1], "D": [0, 0]}, index=pd.Index([1, 2], name="A")),
         )
 
-        # TODO: fix bug of `std` and re-enable the test below
-        # self._test_stat_func(lambda groupby_obj: groupby_obj.std(), check_exact=False)
-        self.assert_eq(psdf.groupby("A").std(), pdf.groupby("A").std(), check_exact=False)
+        with self.assertRaisesRegex(
+            TypeError, "Unaccepted data types of aggregation columns; numeric or bool expected."
+        ):
+            psdf.groupby("A")[["C"]].std()
+
+        self.assert_eq(
+            psdf.groupby("A").std().sort_index(),
+            pdf.groupby("A").std().sort_index(),
+            check_exact=False,
+        )
 
         # TODO: fix bug of `sum` and re-enable the test below
         # self._test_stat_func(lambda groupby_obj: groupby_obj.sum(), check_exact=False)
-        self.assert_eq(psdf.groupby("A").sum(), pdf.groupby("A").sum(), check_exact=False)
+        self.assert_eq(
+            psdf.groupby("A").sum().
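For a quick local check of the behavior described above, here is a minimal pandas-on-Spark sketch (assuming a running SparkSession and that `pyspark.pandas` is importable); it mirrors the DataFrame from the PR description:

```python
# Minimal sketch of the new GroupBy.std behavior, assuming a running SparkSession
# and the pandas API on Spark (pyspark.pandas).
import pyspark.pandas as ps

psdf = ps.DataFrame(
    {
        "A": [1, 2, 1, 2],
        "B": [3.1, 4.1, 4.1, 3.1],
        "C": ["a", "b", "b", "a"],
        "D": [True, False, False, True],
    }
)

# Numeric (and boolean) aggregation columns are still accepted as before.
print(psdf.groupby("A")[["B"]].std())

# With only the string column selected, the call now raises TypeError
# instead of silently returning an empty DataFrame, matching pandas 1.4.
try:
    psdf.groupby("A")[["C"]].std()
except TypeError as exc:
    print(exc)  # Unaccepted data types of aggregation columns; numeric or bool expected.
```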
[spark] branch master updated: [SPARK-39108][SQL] Show hints for try_add/try_subtract/try_multiply in int/long overflow errors
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c274812284a [SPARK-39108][SQL] Show hints for try_add/try_substract/try_multiply in int/long overflow errors c274812284a is described below commit c274812284a3b7ec725e6b8afc2e7ab0f91b923e Author: Gengliang Wang AuthorDate: Thu May 5 23:03:44 2022 +0300 [SPARK-39108][SQL] Show hints for try_add/try_substract/try_multiply in int/long overflow errors ### What changes were proposed in this pull request? Show hints for try_add/try_substract/try_multiply in int/long overflow errors ### Why are the changes needed? Better error message for resolving the overflow errors under ANSI mode. ### Does this PR introduce _any_ user-facing change? No, minor error message improvement ### How was this patch tested? UT Closes #36456 from gengliangwang/tryHint. Authored-by: Gengliang Wang Signed-off-by: Max Gekk --- .../scala/org/apache/spark/sql/catalyst/util/MathUtils.scala | 12 ++-- .../test/resources/sql-tests/results/postgreSQL/int4.sql.out | 12 ++-- .../test/resources/sql-tests/results/postgreSQL/int8.sql.out | 8 .../sql-tests/results/postgreSQL/window_part2.sql.out| 4 ++-- 4 files changed, 18 insertions(+), 18 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala index f96c9fba5a3..e5c87a41ea8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala @@ -27,32 +27,32 @@ object MathUtils { def addExact(a: Int, b: Int): Int = withOverflow(Math.addExact(a, b)) def addExact(a: Int, b: Int, errorContext: String): Int = -withOverflow(Math.addExact(a, b), errorContext = errorContext) +withOverflow(Math.addExact(a, b), hint = "try_add", errorContext = errorContext) def addExact(a: Long, b: Long): Long = withOverflow(Math.addExact(a, b)) def addExact(a: Long, b: Long, errorContext: String): Long = -withOverflow(Math.addExact(a, b), errorContext = errorContext) +withOverflow(Math.addExact(a, b), hint = "try_add", errorContext = errorContext) def subtractExact(a: Int, b: Int): Int = withOverflow(Math.subtractExact(a, b)) def subtractExact(a: Int, b: Int, errorContext: String): Int = -withOverflow(Math.subtractExact(a, b), errorContext = errorContext) +withOverflow(Math.subtractExact(a, b), hint = "try_subtract", errorContext = errorContext) def subtractExact(a: Long, b: Long): Long = withOverflow(Math.subtractExact(a, b)) def subtractExact(a: Long, b: Long, errorContext: String): Long = -withOverflow(Math.subtractExact(a, b), errorContext = errorContext) +withOverflow(Math.subtractExact(a, b), hint = "try_subtract", errorContext = errorContext) def multiplyExact(a: Int, b: Int): Int = withOverflow(Math.multiplyExact(a, b)) def multiplyExact(a: Int, b: Int, errorContext: String): Int = -withOverflow(Math.multiplyExact(a, b), errorContext = errorContext) +withOverflow(Math.multiplyExact(a, b), hint = "try_multiply", errorContext = errorContext) def multiplyExact(a: Long, b: Long): Long = withOverflow(Math.multiplyExact(a, b)) def multiplyExact(a: Long, b: Long, errorContext: String): Long = -withOverflow(Math.multiplyExact(a, b), errorContext = errorContext) +withOverflow(Math.multiplyExact(a, b), hint = "try_multiply", errorContext 
= errorContext) def negateExact(a: Int): Int = withOverflow(Math.negateExact(a)) diff --git a/sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out b/sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out index 6b42e31340f..a39cdbc340c 100755 --- a/sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out @@ -200,7 +200,7 @@ SELECT '' AS five, i.f1, i.f1 * smallint('2') AS x FROM INT4_TBL i struct<> -- !query output org.apache.spark.SparkArithmeticException -[ARITHMETIC_OVERFLOW] integer overflow. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. +[ARITHMETIC_OVERFLOW] integer overflow. To return NULL instead, use 'try_multiply'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. == SQL(line 1, position 25) == SELECT '' AS five, i.f1, i.f1 * smallint('2') AS x FROM INT4_TBL i @@ -223,7 +223,7 @@ SELECT '' AS five, i.f1, i.f1 * int('2') AS x FROM INT4_TBL i struct<
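To make the new hint concrete, here is a short, hedged PySpark sketch (assuming Spark 3.3+ with a SparkSession named `spark`; `try_multiply` is one of the documented try_* functions the hint points to):

```python
# With ANSI mode on, plain integer multiplication overflows and raises the error
# shown above; following the hint and using try_multiply returns NULL instead.
spark.conf.set("spark.sql.ansi.enabled", "true")

try:
    spark.sql("SELECT 2147483647 * 2 AS product").show()
except Exception as exc:
    # Expected message (roughly): [ARITHMETIC_OVERFLOW] integer overflow.
    # To return NULL instead, use 'try_multiply'. If necessary set
    # spark.sql.ansi.enabled to false ... to bypass this error.
    print(exc)

# NULL on overflow rather than an exception.
spark.sql("SELECT try_multiply(2147483647, 2) AS product").show()
```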
[spark] branch master updated: [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4b1c2fb7a27 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases 4b1c2fb7a27 is described below commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49 Author: Max Gekk AuthorDate: Thu May 5 20:10:06 2022 +0300 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases ### What changes were proposed in this pull request? Add missed dependencies to `dev/create-release/spark-rm/Dockerfile`. ### Why are the changes needed? To be able to build Spark releases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk Signed-off-by: Max Gekk --- dev/create-release/spark-rm/Dockerfile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile index ffd60c07af0..c6555e0463d 100644 --- a/dev/create-release/spark-rm/Dockerfile +++ b/dev/create-release/spark-rm/Dockerfile @@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y" # We should use the latest Sphinx version once this is fixed. # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx. # See also https://issues.apache.org/jira/browse/SPARK-35375. -ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0" +ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17" ARG GEM_PKGS="bundler:2.2.9" # Install extra needed repos and refresh. 
@@ -79,9 +79,9 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \ # Note that PySpark doc generation also needs pandoc due to nbsphinx $APT_INSTALL r-base r-base-dev && \ $APT_INSTALL libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev && \ - $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \ + $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf texlive-latex-extra && \ $APT_INSTALL libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev && \ - Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ + Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ Rscript -e "devtools::install_github('jimhester/lintr')" && \ Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')" && \ Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
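For context on the new pip pins above: the `markupsafe==2.0.1` pin is most likely needed because the already-pinned jinja2 2.11.3 imports `soft_unicode` from markupsafe, which markupsafe 2.1 removed, and `docutils<0.17` similarly avoids known incompatibilities with the pinned Sphinx 3.x (the commit message itself does not spell out the reasons). A small, hedged local check of the pinned pair:

```python
# Hedged helper: report installed versions of the newly pinned doc-build
# dependencies so they can be compared against the Dockerfile pins.
import importlib.metadata as importlib_metadata  # Python 3.8+

PINS = {"jinja2": "2.11.3", "markupsafe": "2.0.1"}

for package, pinned in PINS.items():
    try:
        installed = importlib_metadata.version(package)
    except importlib_metadata.PackageNotFoundError:
        installed = "not installed"
    print(f"{package}: installed={installed}, Dockerfile pin={pinned}")
```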
[spark] branch branch-3.3 updated: [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 6a61f95a359 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases 6a61f95a359 is described below commit 6a61f95a359e6aa9d09f8044019074dc7effcf30 Author: Max Gekk AuthorDate: Thu May 5 20:10:06 2022 +0300 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases ### What changes were proposed in this pull request? Add missed dependencies to `dev/create-release/spark-rm/Dockerfile`. ### Why are the changes needed? To be able to build Spark releases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk Signed-off-by: Max Gekk (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk --- dev/create-release/spark-rm/Dockerfile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile index ffd60c07af0..c6555e0463d 100644 --- a/dev/create-release/spark-rm/Dockerfile +++ b/dev/create-release/spark-rm/Dockerfile @@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y" # We should use the latest Sphinx version once this is fixed. # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx. # See also https://issues.apache.org/jira/browse/SPARK-35375. -ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0" +ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17" ARG GEM_PKGS="bundler:2.2.9" # Install extra needed repos and refresh. 
@@ -79,9 +79,9 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \ # Note that PySpark doc generation also needs pandoc due to nbsphinx $APT_INSTALL r-base r-base-dev && \ $APT_INSTALL libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev && \ - $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \ + $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf texlive-latex-extra && \ $APT_INSTALL libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev && \ - Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ + Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ Rscript -e "devtools::install_github('jimhester/lintr')" && \ Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')" && \ Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR] Remove unused import
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bf447046327 [MINOR] Remove unused import bf447046327 is described below commit bf447046327b80f176fd638db418d0513b9c2516 Author: panbingkun AuthorDate: Thu May 5 19:25:32 2022 +0300 [MINOR] Remove unused import ### What changes were proposed in this pull request? Remove unused import in `numerics`. ### Why are the changes needed? Cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #36454 from panbingkun/minor. Authored-by: panbingkun Signed-off-by: Max Gekk --- sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala | 1 - 1 file changed, 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala index fea792f08d0..c3d893d82fc 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala @@ -18,7 +18,6 @@ package org.apache.spark.sql.types import scala.math.Numeric._ -import scala.math.Ordering import org.apache.spark.sql.catalyst.util.{MathUtils, SQLOrderingUtil} import org.apache.spark.sql.errors.QueryExecutionErrors - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-37938][SQL][TESTS] Use error classes in the parsing errors of partitions
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 29ff671933e [SPARK-37938][SQL][TESTS] Use error classes in the parsing errors of partitions 29ff671933e is described below commit 29ff671933e3b432e69a26761bc79856f21b82c7 Author: panbingkun AuthorDate: Thu May 5 19:22:28 2022 +0300 [SPARK-37938][SQL][TESTS] Use error classes in the parsing errors of partitions ## What changes were proposed in this pull request? Migrate the following errors in QueryParsingErrors onto use error classes: - emptyPartitionKeyError => INVALID_SQL_SYNTAX - partitionTransformNotExpectedError => INVALID_SQL_SYNTAX - descColumnForPartitionUnsupportedError => UNSUPPORTED_FEATURE.DESC_TABLE_COLUMN_PARTITION - incompletePartitionSpecificationError => INVALID_SQL_SYNTAX ### Why are the changes needed? Porting parsing errors of partitions to new error framework, improve test coverage, and document expected error messages in tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running new test: ``` $ build/sbt "sql/testOnly *QueryParsingErrorsSuite*" ``` Closes #36416 from panbingkun/SPARK-37938. Authored-by: panbingkun Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 3 ++ .../spark/sql/errors/QueryParsingErrors.scala | 22 ++-- .../spark/sql/catalyst/parser/DDLParserSuite.scala | 2 +- .../resources/sql-tests/results/describe.sql.out | 2 +- .../spark/sql/errors/QueryErrorsSuiteBase.scala| 16 -- .../spark/sql/errors/QueryParsingErrorsSuite.scala | 60 ++ .../command/ShowPartitionsParserSuite.scala| 22 +--- .../command/TruncateTableParserSuite.scala | 21 +--- 8 files changed, 125 insertions(+), 23 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index 24b50c4209a..3a7bc757f73 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -206,6 +206,9 @@ "AES_MODE" : { "message" : [ "AES- with the padding by the function." ] }, + "DESC_TABLE_COLUMN_PARTITION" : { +"message" : [ "DESC TABLE COLUMN for a specific partition." ] + }, "DISTRIBUTE_BY" : { "message" : [ "DISTRIBUTE BY clause." 
] }, diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala index ed5773f4f82..1d15557c9d0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala @@ -77,7 +77,11 @@ object QueryParsingErrors extends QueryErrorsBase { } def emptyPartitionKeyError(key: String, ctx: PartitionSpecContext): Throwable = { -new ParseException(s"Found an empty partition key '$key'.", ctx) +new ParseException( + errorClass = "INVALID_SQL_SYNTAX", + messageParameters = +Array(s"Partition key ${toSQLId(key)} must set value (can't be empty)."), + ctx) } def combinationQueryResultClausesUnsupportedError(ctx: QueryOrganizationContext): Throwable = { @@ -243,7 +247,11 @@ object QueryParsingErrors extends QueryErrorsBase { def partitionTransformNotExpectedError( name: String, describe: String, ctx: ApplyTransformContext): Throwable = { -new ParseException(s"Expected a column reference for transform $name: $describe", ctx) +new ParseException( + errorClass = "INVALID_SQL_SYNTAX", + messageParameters = +Array(s"Expected a column reference for transform ${toSQLId(name)}: $describe"), + ctx) } def tooManyArgumentsForTransformError(name: String, ctx: ApplyTransformContext): Throwable = { @@ -298,12 +306,18 @@ object QueryParsingErrors extends QueryErrorsBase { } def descColumnForPartitionUnsupportedError(ctx: DescribeRelationContext): Throwable = { -new ParseException("DESC TABLE COLUMN for a specific partition is not supported", ctx) +new ParseException( + errorClass = "UNSUPPORTED_FEATURE", + messageParameters = Array("DESC_TABLE_COLUMN_PARTITION"), + ctx) } def incompletePartitionSpecificationError( key: String, ctx: DescribeRelationContext): Throwable = { -new ParseException(s"PARTITION specification is incomplete: `$key`", ctx) +new ParseException( + errorClass = "INVALID_SQL_SYNTAX", + messageParameters = Array(s"PARTITION specification is incomplete: ${toSQLId(key)}"), + ctx) } def computeStatisticsNot
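As a rough illustration of the user-visible effect of one migrated error (hypothetical table name; the exact message text may differ slightly from this sketch), the DESC-column-for-a-partition case now fails at parse time with the UNSUPPORTED_FEATURE error class:

```python
# Hedged sketch, assuming a SparkSession `spark`. `sales` is a hypothetical table
# name; the statement fails while parsing (before any table lookup), so the table
# does not need to exist.
try:
    spark.sql("DESC TABLE sales PARTITION (ds='2022-05-05') amount")
except Exception as exc:
    # Expect a ParseException carrying the UNSUPPORTED_FEATURE error class,
    # e.g. "... DESC TABLE COLUMN for a specific partition."
    print(exc)
```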
[spark] branch master updated (ba499b1dcc1 -> 215b1b9e518)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from ba499b1dcc1 [SPARK-39068][SQL] Make thriftserver and sparksql-cli support in-memory catalog add 215b1b9e518 [SPARK-30661][ML][PYTHON] KMeans blockify input vectors No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ml/linalg/BLAS.scala| 83 +++- .../org/apache/spark/ml/linalg/Matrices.scala | 72 ++-- .../scala/org/apache/spark/ml/linalg/Vectors.scala | 7 + .../org/apache/spark/ml/clustering/KMeans.scala| 428 +++-- .../org/apache/spark/mllib/clustering/KMeans.scala | 16 + .../apache/spark/ml/clustering/KMeansSuite.scala | 373 +- python/pyspark/ml/clustering.py| 48 ++- 7 files changed, 787 insertions(+), 240 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
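For context on the KMeans change above, a generic PySpark KMeans run is shown below; the blockification of input vectors is internal to `fit()`, and since the change summary does not name any new Python parameter, none is shown here.

```python
# Generic KMeans usage sketch (unchanged user-facing API); assumes PySpark is installed.
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),),
        (Vectors.dense([8.0, 9.0]),),
    ],
    ["features"],
)
model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())
```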
[spark] branch master updated (6689b97ec76 -> ba499b1dcc1)
This is an automated email from the ASF dual-hosted git repository. yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 6689b97ec76 [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) add ba499b1dcc1 [SPARK-39068][SQL] Make thriftserver and sparksql-cli support in-memory catalog No new revisions were added by this update. Summary of changes: .../spark/sql/hive/thriftserver/SparkSQLEnv.scala | 29 +++--- .../spark/sql/hive/thriftserver/CliSuite.scala | 20 +++ 2 files changed, 40 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
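For readers unfamiliar with the term, "in-memory catalog" refers to running the session catalog without Hive. A hedged sketch of what that selection looks like from a regular SparkSession is below, assuming the standard `spark.sql.catalogImplementation` switch is the relevant setting (the change summary above lists only the touched thriftserver/CLI files):

```python
# Hedged sketch: choose the in-memory catalog instead of Hive when building a session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("in-memory-catalog-demo")
    .config("spark.sql.catalogImplementation", "in-memory")  # the alternative is "hive"
    .getOrCreate()
)
spark.sql("CREATE TABLE demo_tbl (id INT) USING parquet")
spark.sql("SHOW TABLES").show()
```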
svn commit: r54275 - in /dev/spark/v3.3.0-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/deps/ _site/api/R/deps/bootstrap-5.1.0/ _site/api/R/deps/jquery-3.6.0/ _site/api
Author: maxgekk Date: Thu May 5 08:51:39 2022 New Revision: 54275 Log: Apache Spark v3.3.0-rc1 docs [This commit notification would consist of 2649 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r54273 - /dev/spark/v3.3.0-rc1-bin/
Author: maxgekk Date: Thu May 5 08:17:05 2022 New Revision: 54273 Log: Apache Spark v3.3.0-rc1 Added: dev/spark/v3.3.0-rc1-bin/ dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz (with props) dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.asc dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.sha512 dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz (with props) dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz.asc dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz.sha512 dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop2.tgz (with props) dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop2.tgz.asc dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop2.tgz.sha512 dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3-scala2.13.tgz (with props) dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3-scala2.13.tgz.asc dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3-scala2.13.tgz.sha512 dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3.tgz (with props) dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3.tgz.asc dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-hadoop3.tgz.sha512 dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-without-hadoop.tgz (with props) dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-without-hadoop.tgz.asc dev/spark/v3.3.0-rc1-bin/spark-3.3.0-bin-without-hadoop.tgz.sha512 dev/spark/v3.3.0-rc1-bin/spark-3.3.0.tgz (with props) dev/spark/v3.3.0-rc1-bin/spark-3.3.0.tgz.asc dev/spark/v3.3.0-rc1-bin/spark-3.3.0.tgz.sha512 Added: dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz == Binary file - no diff available. Propchange: dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.asc == --- dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.asc (added) +++ dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.asc Thu May 5 08:17:05 2022 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJHBAABCgAxFiEEgPuOvo66aFBJiXA0kbXcgV2/ENMFAmJzh6QTHG1heGdla2tA +YXBhY2hlLm9yZwAKCRCRtdyBXb8Q07HcEACkCSXRG7LXd0+/jBU49syIUIpOsUrN +bgbq90ifbo6eCidbhj4wJl5OZO7tKCsV2IrbQYRHVP0Lq7GTCw1Fg4/mY4QiLkhi +RWDizZrKrr9CbHXVFo7ZTlIiaxjnTOcIxauKRtu6rbIJdfIzZyRZwhAYerdK6WOx +atrcWfrY/MhKW/v6/25b8R4SWpLssNXaGj5RRqhs/cn/Kjwus8WkBDzQIibcE2ac +TJA+agMH2fkyC1sUaZOVEo1E68nUBV/vv5GyEtctjnESGDsh90/d+6X8L2cmME9H +YGUO91cT1byN3LCR0FDqMSTea8yh3HsdTQ4Ly+s1Ia7h5UCwnDlpFXTyHsHX9sv7 +osXKz4b1ejogjxHlCiPpFgZ+P3gNa31mpJWmOwMLE49Cgxcn7DdZUXTZaAwZmwhH +YURgYtpqrG+4oKpAOLGR+wx+2ZGv0a0QeLd4iTUEhxhiPFRw9QkNG5VUmHgz237b +ZJzz9Ef0wLbaS5F6ZySk0FBqHTPgCsPZS3ZtmdU76zg37mNPej2xotLrLon2TXhN +TJkcLI8azbRoqcrNSOWKjBWYbLJ3nG4bDNqEkqdi/QApiisnneuXX89w152SI8vF +/GoyJK0xs6rjCsUURXWUZ/kzeVQHxtXfBNLk967+TSOHVDaKFehhS0hJbRNUP0jp +O+gTjMZQfQh+Uw== +=saiU +-END PGP SIGNATURE- Added: dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.sha512 == --- dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.sha512 (added) +++ dev/spark/v3.3.0-rc1-bin/SparkR_3.3.0.tar.gz.sha512 Thu May 5 08:17:05 2022 @@ -0,0 +1,3 @@ +SparkR_3.3.0.tar.gz: 98A2665A 04513C1A BE26952E 7396E3B7 AF63715B B6CCFAF3 + CD8C04EC A9F2374F F9E159D3 635CA631 22E4DCEE 1F6B6FE9 + F91F2E18 C9518AAF 713DC95A 3D39D496 Added: dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz.asc == --- dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz.asc (added) +++ dev/spark/v3.3.0-rc1-bin/pyspark-3.3.0.tar.gz.asc Thu May 5 08:17:05 2022 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJHBAABCgAxFiEEgPuOvo66aFBJiXA0kbXcgV2/ENMFAmJzh6YTHG1heGdla2tA +YXBhY2hlLm9yZwAKCRCRtdyBXb8Q0+4LD/wMGUzSXVcBCbUsVYtEtmoWjqBDZks7 +wN0SrnaI4UNXKlV0/rRbSMGRnVuqdwAlwJsb2RYNS56wswgTz9bhUB9cUUiSWftp +Pf5XE9LqarekEF48kSYv6XOGCoXIA4wa9BdfzBF8Q43kCI4WTRibv9xaMv+F60or +0xwgLl+8666M0L+Jg2tzrdI+cnkf42j07pL1HfqCsoZJSjxFmgSexXigZj+oSw+p +4bTTofAWUfj+jILpPw8s7Vnf0Gvi7YEGpfchUv9oB8N1LzKLyS1HYNLGSAqbE1vm +CvG9X8IzWQr4wIVqWSMWnsfImJL7EcA+G1SrUZP//d5UitvbF3ZZ5tMUvPYqgfKz +S7kwyxuI1/uQ6CpJ5vxdrQQfRauYA4oWws4jWf2O6xOF5VIB1F0aF0//SLdauR+r +GX4aYzQF+2DG6pIGJWYfrE9I4U4/LQLbdVVawItNnMKjphxD3Vi1kn9ITzJAtpLE +75T9wPvlqSY7bLQlpBLd2+mModF2K+Gonr8Z06Xe0kr/R+tyrjrP5Oa++egLcaFo +ZCr+L6WvkW8XnCfzU7T7d7wNKlskw7sh9BqOluMr+YW9rL+CKEYiM4JZrlUZCT3R +rcLnVX47qigSw+WETHtMLA/TWYS6FQpKqs49cYbWAAT2K6mvmPiM1MupZSo6HgS+ +/KROoSIKLGVTRA==
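For anyone picking up these release-candidate artifacts, here is a small hedged helper (file names are hypothetical local downloads) for checking an artifact against its `.sha512` file, which appears to use the space-grouped `gpg --print-md` style rather than plain `sha512sum` output:

```python
# Hedged verification helper; paths point at hypothetical local copies of an
# artifact and its .sha512 file as listed in the commit above.
import hashlib
import re

def sha512_matches(artifact_path: str, sha512_path: str) -> bool:
    digest = hashlib.sha512()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # The .sha512 file looks like "NAME: 98A2665A 04513C1A ..."; keep only the hex.
    expected = open(sha512_path).read().split(":", 1)[1]
    expected = re.sub(r"\s+", "", expected).lower()
    return digest.hexdigest() == expected

print(sha512_matches("SparkR_3.3.0.tar.gz", "SparkR_3.3.0.tar.gz.sha512"))
```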
[spark] branch branch-3.3 updated: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 0f2e3ecb994 [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) 0f2e3ecb994 is described below commit 0f2e3ecb9943aec91204c168b6402f3e5de53ca2 Author: Hyukjin Kwon AuthorDate: Thu May 5 16:23:28 2022 +0900 [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436, that adds a legacy configuration. It's found that it can break a valid usacase (https://github.com/apache/spark/pull/33436/files#r863271189): ```scala import org.apache.spark.sql.types._ val ds = Seq("a,", "a,b").toDS spark.read.schema( StructType( StructField("f1", StringType, nullable = false) :: StructField("f2", StringType, nullable = false) :: Nil) ).option("mode", "DROPMALFORMED").csv(ds).show() ``` **Before:** ``` +---+---+ | f1| f2| +---+---+ | a| b| +---+---+ ``` **After:** ``` +---++ | f1| f2| +---++ | a|null| | a| b| +---++ ``` This PR adds a configuration to restore **Before** behaviour. ### Why are the changes needed? To avoid breakage of valid usecases. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided. ### How was this patch tested? Unittests were added. Closes #36435 from HyukjinKwon/SPARK-35912. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a) Signed-off-by: Hyukjin Kwon --- docs/sql-migration-guide.md| 2 +- .../main/scala/org/apache/spark/sql/internal/SQLConf.scala | 11 +++ .../main/scala/org/apache/spark/sql/DataFrameReader.scala | 13 +++-- .../spark/sql/execution/datasources/csv/CSVSuite.scala | 10 ++ .../spark/sql/execution/datasources/json/JsonSuite.scala | 14 +- 5 files changed, 46 insertions(+), 4 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index b6bfb0ed2be..a7757d6c9a0 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -30,7 +30,7 @@ license: | - Since Spark 3.3, the functions `lpad` and `rpad` have been overloaded to support byte sequences. When the first argument is a byte sequence, the optional padding pattern must also be a byte sequence and the result is a BINARY value. The default padding pattern in this case is the zero byte. To restore the legacy behavior of always returning string types, set `spark.sql.legacy.lpadRpadAlwaysReturnString` to `true`. - - Since Spark 3.3, Spark turns a non-nullable schema into nullable for API `DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String])` and `DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String])` when the schema is specified by the user and contains non-nullable fields. 
+ - Since Spark 3.3, Spark turns a non-nullable schema into nullable for API `DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String])` and `DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String])` when the schema is specified by the user and contains non-nullable fields. To restore the legacy behavior of respecting the nullability, set `spark.sql.legacy.respectNullabilityInTextDatasetConversion` to `true`. - Since Spark 3.3, when the date or timestamp pattern is not specified, Spark converts an input string to a date/timestamp using the `CAST` expression approach. The changes affect CSV/JSON datasources and parsing of partition values. In Spark 3.2 or earlier, when the date or timestamp pattern is not set, Spark uses the default patterns: `-MM-dd` for dates and `-MM-dd HH:mm:ss` for timestamps. After the changes, Spark still recognizes the pattern together with diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 76f3d1f5a84..b6230f71383 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/in
[spark] branch master updated: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6689b97ec76 [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) 6689b97ec76 is described below commit 6689b97ec76abe5bab27f02869f8f16b32530d1a Author: Hyukjin Kwon AuthorDate: Thu May 5 16:23:28 2022 +0900 [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436, that adds a legacy configuration. It's found that it can break a valid usacase (https://github.com/apache/spark/pull/33436/files#r863271189): ```scala import org.apache.spark.sql.types._ val ds = Seq("a,", "a,b").toDS spark.read.schema( StructType( StructField("f1", StringType, nullable = false) :: StructField("f2", StringType, nullable = false) :: Nil) ).option("mode", "DROPMALFORMED").csv(ds).show() ``` **Before:** ``` +---+---+ | f1| f2| +---+---+ | a| b| +---+---+ ``` **After:** ``` +---++ | f1| f2| +---++ | a|null| | a| b| +---++ ``` This PR adds a configuration to restore **Before** behaviour. ### Why are the changes needed? To avoid breakage of valid usecases. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided. ### How was this patch tested? Unittests were added. Closes #36435 from HyukjinKwon/SPARK-35912. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- docs/sql-migration-guide.md| 2 +- .../main/scala/org/apache/spark/sql/internal/SQLConf.scala | 11 +++ .../main/scala/org/apache/spark/sql/DataFrameReader.scala | 13 +++-- .../spark/sql/execution/datasources/csv/CSVSuite.scala | 10 ++ .../spark/sql/execution/datasources/json/JsonSuite.scala | 14 +- 5 files changed, 46 insertions(+), 4 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 32b90da1917..59b8d47d306 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -30,7 +30,7 @@ license: | - Since Spark 3.3, the functions `lpad` and `rpad` have been overloaded to support byte sequences. When the first argument is a byte sequence, the optional padding pattern must also be a byte sequence and the result is a BINARY value. The default padding pattern in this case is the zero byte. To restore the legacy behavior of always returning string types, set `spark.sql.legacy.lpadRpadAlwaysReturnString` to `true`. - - Since Spark 3.3, Spark turns a non-nullable schema into nullable for API `DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String])` and `DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String])` when the schema is specified by the user and contains non-nullable fields. + - Since Spark 3.3, Spark turns a non-nullable schema into nullable for API `DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String])` and `DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String])` when the schema is specified by the user and contains non-nullable fields. 
To restore the legacy behavior of respecting the nullability, set `spark.sql.legacy.respectNullabilityInTextDatasetConversion` to `true`. - Since Spark 3.3, when the date or timestamp pattern is not specified, Spark converts an input string to a date/timestamp using the `CAST` expression approach. The changes affect CSV/JSON datasources and parsing of partition values. In Spark 3.2 or earlier, when the date or timestamp pattern is not set, Spark uses the default patterns: `-MM-dd` for dates and `-MM-dd HH:mm:ss` for timestamps. After the changes, Spark still recognizes the pattern together with diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 8876d780799..4c0eccbf35d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -3025,6 +3025,17 @@ object SQLConf { .intConf .createOptional + val LEGACY_RE
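Putting the new escape hatch in one place: a short sketch (assuming a SparkSession `spark` on a build that includes this change) of restoring the pre-3.3 nullability behavior described above:

```python
# The flag defaults to false; setting it to true restores the legacy behavior of
# respecting user-specified non-nullable fields in schema(...).csv(dataset) and
# schema(...).json(dataset).
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")

# With the flag on, malformed rows under DROPMALFORMED are dropped again for a
# non-nullable schema (the "Before" output in the PR description); with it off
# (the default), the fields are treated as nullable and such rows are kept with nulls.
```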