[spark] branch master updated: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 072575c9e6f [SPARK-40557][CONNECT] Update generated proto files for Spark Connect 072575c9e6f is described below commit 072575c9e6fc304f09e01ad0ee180c8f309ede91 Author: Martin Grund AuthorDate: Tue Sep 27 10:55:47 2022 +0900 [SPARK-40557][CONNECT] Update generated proto files for Spark Connect ### What changes were proposed in this pull request? This patch cleans up the generated proto files from the initial Spark Connect import. The previous files had a Databricks specific go module path embedded in the generated Python descriptor. This is now removed. No new functionality added. ### Why are the changes needed? Cleanup. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The generated files are used during the regular testing for Spark Connect. Closes #37993 from grundprinzip/spark-connect-clean1. Authored-by: Martin Grund Signed-off-by: Hyukjin Kwon --- connect/src/main/buf.gen.yaml | 9 - python/pyspark/sql/connect/proto/base_pb2.py| 14 -- python/pyspark/sql/connect/proto/commands_pb2.py| 6 ++ python/pyspark/sql/connect/proto/expressions_pb2.py | 6 ++ python/pyspark/sql/connect/proto/relations_pb2.py | 10 +++--- python/pyspark/sql/connect/proto/types_pb2.py | 6 ++ 6 files changed, 13 insertions(+), 38 deletions(-) diff --git a/connect/src/main/buf.gen.yaml b/connect/src/main/buf.gen.yaml index 01e31d3c8b4..e3e15e549e8 100644 --- a/connect/src/main/buf.gen.yaml +++ b/connect/src/main/buf.gen.yaml @@ -26,15 +26,6 @@ plugins: out: gen/proto/python - remote: buf.build/grpc/plugins/python:v1.47.0-1 out: gen/proto/python - - remote: buf.build/protocolbuffers/plugins/go:v1.28.0-1 -out: gen/proto/go -opt: - - paths=source_relative - - remote: buf.build/grpc/plugins/go:v1.2.0-1 -out: gen/proto/go -opt: - - paths=source_relative - - require_unimplemented_servers=false - remote: buf.build/grpc/plugins/ruby:v1.47.0-1 out: gen/proto/ruby - remote: buf.build/protocolbuffers/plugins/ruby:v21.2.0-1 diff --git a/python/pyspark/sql/connect/proto/base_pb2.py b/python/pyspark/sql/connect/proto/base_pb2.py index 3adb77f77d6..910f9d644e3 100644 --- a/python/pyspark/sql/connect/proto/base_pb2.py +++ b/python/pyspark/sql/connect/proto/base_pb2.py @@ -28,16 +28,12 @@ from google.protobuf import symbol_database as _symbol_database _sym_db = _symbol_database.Default() -from pyspark.sql.connect.proto import ( -commands_pb2 as spark_dot_connect_dot_commands__pb2, -) -from pyspark.sql.connect.proto import ( -relations_pb2 as spark_dot_connect_dot_relations__pb2, -) +from pyspark.sql.connect.proto import commands_pb2 as spark_dot_connect_dot_commands__pb2 +from pyspark.sql.connect.proto import relations_pb2 as spark_dot_connect_dot_relations__pb2 DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile( - b'\n\x18spark/connect/base.proto\x12\rspark.connect\x1a\x1cspark/connect/commands.proto\x1a\x1dspark/connect/relations.proto"t\n\x04Plan\x12-\n\x04root\x18\x01 \x01(\x0b\x32\x17.spark.connect.RelationH\x00R\x04root\x12\x32\n\x07\x63ommand\x18\x02 \x01(\x0b\x32\x16.spark.connect.CommandH\x00R\x07\x63ommandB\t\n\x07op_type"\xdb\x01\n\x07Request\x12\x1b\n\tclient_id\x18\x01 \x01(\tR\x08\x63lientId\x12\x45\n\x0cuser_context\x18\x02 \x01(\x0b\x32".spark.connect.Request.UserContextR\x0buse [...] 
+ b'\n\x18spark/connect/base.proto\x12\rspark.connect\x1a\x1cspark/connect/commands.proto\x1a\x1dspark/connect/relations.proto"t\n\x04Plan\x12-\n\x04root\x18\x01 \x01(\x0b\x32\x17.spark.connect.RelationH\x00R\x04root\x12\x32\n\x07\x63ommand\x18\x02 \x01(\x0b\x32\x16.spark.connect.CommandH\x00R\x07\x63ommandB\t\n\x07op_type"\xdb\x01\n\x07Request\x12\x1b\n\tclient_id\x18\x01 \x01(\tR\x08\x63lientId\x12\x45\n\x0cuser_context\x18\x02 \x01(\x0b\x32".spark.connect.Request.UserContextR\x0buse [...] ) _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals()) @@ -45,9 +41,7 @@ _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, "spark.connect.base_pb2", gl if _descriptor._USE_C_DESCRIPTORS == False: DESCRIPTOR._options = None -DESCRIPTOR._serialized_options = ( - b"\n\036org.apache.spark.connect.protoP\001Z)github.com/databricks/spark-connect/proto" -) +DESCRIPTOR._serialized_options = b"\n\036org.apache.spark.connect.protoP\001" _RESPONSE_METRICS_METRICOBJECT_EXECUTIONMETRICSENTRY._options = None _RESPONSE_METRICS_METRICOBJECT_EXECUTIONMETRICSENTRY._serialized_options = b"8\001" _PLAN._serialized_start = 104 diff --git
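For anyone verifying the cleanup locally, a minimal sketch along these lines (assuming a master build of pyspark with the Spark Connect proto modules and their protobuf/grpcio dependencies on the path) inspects the regenerated descriptor's file options; after this change the Go module path should be gone while the Java package remains:

```python
# Hedged sketch: check the file-level options of the regenerated Python descriptor.
# Assumes pyspark (master) with the Spark Connect proto modules and protobuf installed.
from pyspark.sql.connect.proto import base_pb2

opts = base_pb2.DESCRIPTOR.GetOptions()
print(opts.java_package)  # expected: org.apache.spark.connect.proto
print(opts.go_package)    # expected: empty string after this cleanup
```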
[spark] branch master updated: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 211ce40888d [SPARK-40561][PS] Implement `min_count` in `GroupBy.min` 211ce40888d is described below commit 211ce40888dcaaa3c3ffbd316109e17d0caad4e3 Author: Ruifeng Zheng AuthorDate: Tue Sep 27 09:53:09 2022 +0800 [SPARK-40561][PS] Implement `min_count` in `GroupBy.min` ### What changes were proposed in this pull request? Implement `min_count` in `GroupBy.min` ### Why are the changes needed? for API coverage ### Does this PR introduce _any_ user-facing change? yes, new parameter `min_count` supported ``` >>> df.groupby("D").min(min_count=3).sort_index() A BC D a 1.0 False 3.0 b NaN None NaN ``` ### How was this patch tested? added UT and doctest Closes #37998 from zhengruifeng/ps_groupby_min. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/pandas/groupby.py| 41 ++--- python/pyspark/pandas/tests/test_groupby.py | 2 ++ 2 files changed, 40 insertions(+), 3 deletions(-) diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py index 6d36cfecce6..7085d2ec059 100644 --- a/python/pyspark/pandas/groupby.py +++ b/python/pyspark/pandas/groupby.py @@ -643,16 +643,23 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta): bool_to_numeric=True, ) -def min(self, numeric_only: Optional[bool] = False) -> FrameLike: +def min(self, numeric_only: Optional[bool] = False, min_count: int = -1) -> FrameLike: """ Compute min of group values. +.. versionadded:: 3.3.0 + Parameters -- numeric_only : bool, default False Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. +.. versionadded:: 3.4.0 +min_count : bool, default -1 +The required number of valid values to perform the operation. If fewer +than min_count non-NA values are present the result will be NA. + .. versionadded:: 3.4.0 See Also @@ -663,7 +670,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta): Examples >>> df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True], -..."C": [3, 4, 3, 4], "D": ["a", "b", "b", "a"]}) +..."C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]}) >>> df.groupby("A").min().sort_index() B C D A @@ -677,9 +684,37 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta): A 1 False 3 2 False 4 + +>>> df.groupby("D").min().sort_index() + A B C +D +a 1 False 3 +b 1 False 3 + + +>>> df.groupby("D").min(min_count=3).sort_index() + A BC +D +a 1.0 False 3.0 +b NaN None NaN """ +if not isinstance(min_count, int): +raise TypeError("min_count must be integer") + +if min_count > 0: + +def min(col: Column) -> Column: +return F.when( +F.count(F.when(~F.isnull(col), F.lit(0))) < min_count, F.lit(None) +).otherwise(F.min(col)) + +else: + +def min(col: Column) -> Column: +return F.min(col) + return self._reduce_for_stat_function( -F.min, accepted_spark_types=(NumericType, BooleanType) if numeric_only else None +min, accepted_spark_types=(NumericType, BooleanType) if numeric_only else None ) # TODO: sync the doc. 
diff --git a/python/pyspark/pandas/tests/test_groupby.py b/python/pyspark/pandas/tests/test_groupby.py index 4a57a3421df..f0b3a04be17 100644 --- a/python/pyspark/pandas/tests/test_groupby.py +++ b/python/pyspark/pandas/tests/test_groupby.py @@ -1401,8 +1401,10 @@ class GroupByTest(PandasOnSparkTestCase, TestUtils): def test_min(self): self._test_stat_func(lambda groupby_obj: groupby_obj.min()) +self._test_stat_func(lambda groupby_obj: groupby_obj.min(min_count=2)) self._test_stat_func(lambda groupby_obj: groupby_obj.min(numeric_only=None)) self._test_stat_func(lambda groupby_obj: groupby_obj.min(numeric_only=True)) +self._test_stat_func(lambda groupby_obj: groupby_obj.min(numeric_only=True, min_count=2)) def test_max(self): self._test_stat_func(lambda groupby_obj: groupby_obj.max())
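A minimal end-to-end sketch of the new parameter, using the same hypothetical data as the doctest above (requires pyspark with pandas-on-Spark available):

```python
# Hedged sketch of the new min_count parameter; data and column names are hypothetical.
import pyspark.pandas as ps

df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True],
                   "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})

# Plain min: every group has at least one valid value per column.
print(df.groupby("D").min().sort_index())

# With min_count=3, group "b" has fewer than 3 non-NA values, so its results become NA.
print(df.groupby("D").min(min_count=3).sort_index())
```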
svn commit: r57014 - /dev/spark/v3.3.1-rc2-bin/
Author: yumwang Date: Tue Sep 27 00:54:34 2022 New Revision: 57014 Log: Apache Spark v3.3.1-rc2 Added: dev/spark/v3.3.1-rc2-bin/ dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz (with props) dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.asc dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.sha512 dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz (with props) dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz.asc dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz.sha512 dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop2.tgz (with props) dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop2.tgz.asc dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop2.tgz.sha512 dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3-scala2.13.tgz (with props) dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3-scala2.13.tgz.asc dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3-scala2.13.tgz.sha512 dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3.tgz (with props) dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3.tgz.asc dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-hadoop3.tgz.sha512 dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-without-hadoop.tgz (with props) dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-without-hadoop.tgz.asc dev/spark/v3.3.1-rc2-bin/spark-3.3.1-bin-without-hadoop.tgz.sha512 dev/spark/v3.3.1-rc2-bin/spark-3.3.1.tgz (with props) dev/spark/v3.3.1-rc2-bin/spark-3.3.1.tgz.asc dev/spark/v3.3.1-rc2-bin/spark-3.3.1.tgz.sha512 Added: dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz == Binary file - no diff available. Propchange: dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.asc == --- dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.asc (added) +++ dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.asc Tue Sep 27 00:54:34 2022 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJHBAABCgAxFiEEhnJ9Q+c6QV9noLGhTmiz5s1HNlMFAmMxxpYTHHl1bXdhbmdA +YXBhY2hlLm9yZwAKCRBOaLPmzUc2Ux1eD/45yzmu90koq1UPSOU73/WYlSGRVPTd +afaGT5NEXqYyQbqkGIaDrd8ga/uKAXAmNiyZH4AsJn/Br6QDKR0sVgcG20z+QWn0 +xNfr/wRVKxGQlclNtIYjapTWPaa0WCLzyFseDN9NDwmLsoAcEaOulN+hnRlJ39ou +nTC8EjRnohu3qRnPpX10j2m5VNpp41+bSXjlwN+uID8iDF06YW08JqRixYsUBVgD +abv616xpkKgpFlg13iGathS/gybi0g5xRcHdCHfVx7IWnYDZlfGqnPURqoO3zTVO +pI43eJG+7T76zPuST1dm8E5g/4IZu7cUgP2AUEgNDu8X2WwqOEYFIqssfRkeAbEJ +onZ2TGr4hlwOGKLElIcDZuxR2iQlnzYdlVVujISCmDfv9KqKmGhSlRxrm4Zd8kQz +V11iHePNdBTuP2AcDqGq0wjnSFaBa3JMv0bSrYzZT1WJXz31foKo2yZgX0SZOVyE +tLvQTPH8NQWDQiMdcIaM3LmoQyMEjGGH7UGaa7/PfxxaCYm23KHr7ia+joGR8iMv +sR5CHHm/5erJYMvd0x9QqPuHGLgXpOlXEN/88sS+UI5r8d3/hFnrX6yGp5EhB9ND +IlTlA+3p7JHkEbWva5wrzuMZbDNUmjuq0RTxDY0/ocmZF2//BFwNSwIJXkM+mlFt +n7GlQtpMsSa2gQ== +=MLEK +-END PGP SIGNATURE- Added: dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.sha512 == --- dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.sha512 (added) +++ dev/spark/v3.3.1-rc2-bin/SparkR_3.3.1.tar.gz.sha512 Tue Sep 27 00:54:34 2022 @@ -0,0 +1 @@ +84f14789024788fa0cf688a2456a7f738da5f503a82d3d59c2042c725ee97d7d40f6be1b938e749aaad5c859784a440a579a21663767b79adee7cdc34db72f8a SparkR_3.3.1.tar.gz Added: dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz.asc == --- dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz.asc (added) +++ dev/spark/v3.3.1-rc2-bin/pyspark-3.3.1.tar.gz.asc Tue Sep 27 00:54:34 2022 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJHBAABCgAxFiEEhnJ9Q+c6QV9noLGhTmiz5s1HNlMFAmMxxpsTHHl1bXdhbmdA +YXBhY2hlLm9yZwAKCRBOaLPmzUc2U6X2EADOU45dZxwVH+8O5Fl96v3YI0wEfcdK +VEvqb+weYfucXwT6ecrbT8/gw/ZoK/jLiJDAJUiU4IfA1jdRRsTYAuYt22NFJZA9 +SwIkk5kUwbOcrlZ4J0bvHcKD7DTiMeHyfP0t3WnaTW1kVe9tTggbGSECu+tAU+5S +s8LkrNYHmm73Ujoc8uqjdXgs1KO4RmejRUa3YZ/BKLPx5kCucJ1YR1bGs+g2eLxJ +6gxBgM2HHewWB2fj644Vzotg+pbBLWfsFDmLyjK5IUvUBDuLx0mOZMtGkt0LwJi3 +/iN0+x53H0s46lc3/+3/i93QmxwPox+jrVFq7R9a9eclc/V6MV3JPIVneb0AqF9u +xJLTt9jiaOIYa4SlAoYdRVWYzE/9vhNtR+ARS5bFRpFn3jm69YrKhDLDjEDa8GLT +Hyc4U/1liBVl3eOjwyrChNxjR6kmvqDIiHM5SCFwOxRTjeozxqNcJNgtlNwSXTdv +QvCqZoUnzw8YqCb4D42deN3T5th4drpRu6tK+XizEYhDf2sf1stgowoCa/ZjksSp +IP3qn7y8T5N3h2GQ0xkHXDBvDdYhHMM81gsATgnNUUjgb+gOHVqW4WmYTn+f+dlG +GQogqkpwuOSrg7NIFBOP0PegOvZDZ9KyAGWKgtW1lwR4La5OsJLGlV5pSBpkqOR5 +o9kBjXpMJn8tDg== +=WNPp +-END PGP SIGNATURE- Added:
[spark] branch master updated: [SPARK-35242][SQL] Support changing session catalog's default database
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 576c4046742 [SPARK-35242][SQL] Support changing session catalog's default database 576c4046742 is described below commit 576c4046742e1fd5a0e93b87e9b29bac0df83788 Author: Gabor Roczei AuthorDate: Tue Sep 27 08:28:26 2022 +0800 [SPARK-35242][SQL] Support changing session catalog's default database ### What changes were proposed in this pull request? This PR is a follow-up PR for #32364. It has been closed by github-actions because it hasn't been updated in a while. The previous PR has added a new custom parameter (spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase) to configure the session catalog's default database which is required by some use cases where the user does not have access to the default database. Therefore I have created a new PR based on this and added these changes in addition: - Rebased / updated the previous PR to the latest master branch version - Deleted the DEFAULT_DATABASE static member from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala and refactored the code regarding this ### Why are the changes needed? If our user does not have any permissions for the Hive default database in Ranger, it will fail with the following error: ``` 22/08/26 18:36:21 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0 org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default]) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144) ``` The idea is that we introduce a new configuration parameter where we can set a different database name for the default database. Our user has enough permissions for this in Ranger. For example: ```spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db``` ### Does this PR introduce _any_ user-facing change? There will be a new configuration parameter as I mentioned above but the default value is "default" as it was previously. ### How was this patch tested? 1) With github action (all tests passed) https://github.com/roczei/spark/actions/runs/2934863118 2) Manually tested with Ranger + Hive Scenario a) hrt_10 does not have access to the default database in Hive: ``` [hrt_10quasar-thbnqr-2 ~]$ spark-shell Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/08/26 18:14:18 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist 22/08/26 18:14:30 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered! ... 
scala> spark.sql("use other") 22/08/26 18:18:47 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml 22/08/26 18:18:48 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist 22/08/26 18:18:48 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d 22/08/26 18:18:48 INFO SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d 22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled. 22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083 22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection. 22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1 22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore. 22/08/26 18:18:50 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class
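The same default-database override can also be set programmatically when building the session; a hedged sketch follows, where `other_db` stands in for whatever database the user actually has privileges on:

```python
# Hedged sketch: configure a non-"default" default database for the session catalog.
# "other_db" is a hypothetical database the user has access to in Ranger.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("custom-default-database")
    .config("spark.sql.catalog.spark_catalog.defaultDatabase", "other_db")
    .enableHiveSupport()
    .getOrCreate()
)

# Unqualified table names now resolve against other_db instead of the Hive default database.
spark.sql("SHOW TABLES").show()
```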
[spark] branch master updated (7e39d9bfef3 -> 7230f598688)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 7e39d9bfef3 [SPARK-40552][BUILD][INFRA] Upgrade `protobuf-python` to 4.21.6 add 7230f598688 [SPARK-40357][SQL] Migrate window type check failures onto error classes No new revisions were added by this update. Summary of changes: core/src/main/resources/error/error-classes.json | 50 + .../org/apache/spark/SparkThrowableHelper.scala| 3 +- .../sql/catalyst/analysis/TypeCheckResult.scala| 2 +- .../catalyst/expressions/windowExpressions.scala | 85 ++- .../results/postgreSQL/window_part3.sql.out| 15 ++- .../native/windowFrameCoercion.sql.out | 71 - .../sql-tests/results/udf/udf-window.sql.out | 99 +++-- .../resources/sql-tests/results/window.sql.out | 116 ++-- .../spark/sql/DataFrameWindowFramesSuite.scala | 118 +++-- 9 files changed, 484 insertions(+), 75 deletions(-)
[spark] branch master updated (778acd411e3 -> 7e39d9bfef3)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 778acd411e3 [SPARK-40478][DOCS] Add create datasource table options docs add 7e39d9bfef3 [SPARK-40552][BUILD][INFRA] Upgrade `protobuf-python` to 4.21.6 No new revisions were added by this update. Summary of changes: dev/create-release/spark-rm/Dockerfile | 2 +- dev/requirements.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated (8fdaf548bcc -> 778acd411e3)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8fdaf548bcc [SPARK-40560][SQL] Rename `message` to `messageTemplate` in the `STANDARD` format of errors add 778acd411e3 [SPARK-40478][DOCS] Add create datasource table options docs No new revisions were added by this update. Summary of changes: docs/sql-ref-syntax-ddl-create-table-datasource.md | 13 + 1 file changed, 13 insertions(+)
[spark] branch master updated: [SPARK-40560][SQL] Rename `message` to `messageTemplate` in the `STANDARD` format of errors
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8fdaf548bcc [SPARK-40560][SQL] Rename `message` to `messageTemplate` in the `STANDARD` format of errors 8fdaf548bcc is described below commit 8fdaf548bcc51630f7cfae8a17930c987b29fbd3 Author: Max Gekk AuthorDate: Mon Sep 26 14:31:55 2022 +0300 [SPARK-40560][SQL] Rename `message` to `messageTemplate` in the `STANDARD` format of errors ### What changes were proposed in this pull request? In the `STANDARD` format of error messages, rename the `message` field to `messageTemplate`. ### Why are the changes needed? Because the field contains a template of an error message, actually. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.thriftserver.CliSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly *ThriftServerWithSparkContextInBinarySuite" ``` Closes #37997 from MaxGekk/messageTemplate. Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala | 8 core/src/main/scala/org/apache/spark/SparkThrowableHelper.scala | 2 +- core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala| 6 +++--- .../scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala | 2 +- .../sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala | 2 +- 5 files changed, 10 insertions(+), 10 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala b/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala index 8d4ae3a877d..9d6dd9dde07 100644 --- a/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala +++ b/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala @@ -74,12 +74,12 @@ class ErrorClassesJsonReader(jsonFileURLs: Seq[URL]) { assert(errorInfo.subClass.isDefined == subErrorClass.isDefined) if (subErrorClass.isEmpty) { - errorInfo.messageFormat + errorInfo.messageTemplate } else { val errorSubInfo = errorInfo.subClass.get.getOrElse( subErrorClass.get, throw SparkException.internalError(s"Cannot find sub error class '$errorClass'")) - errorInfo.messageFormat + " " + errorSubInfo.messageFormat + errorInfo.messageTemplate + " " + errorSubInfo.messageTemplate } } @@ -102,7 +102,7 @@ private case class ErrorInfo( sqlState: Option[String]) { // For compatibility with multi-line error messages @JsonIgnore - val messageFormat: String = message.mkString("\n") + val messageTemplate: String = message.mkString("\n") } /** @@ -114,5 +114,5 @@ private case class ErrorInfo( private case class ErrorSubInfo(message: Seq[String]) { // For compatibility with multi-line error messages @JsonIgnore - val messageFormat: String = message.mkString("\n") + val messageTemplate: String = message.mkString("\n") } diff --git a/core/src/main/scala/org/apache/spark/SparkThrowableHelper.scala b/core/src/main/scala/org/apache/spark/SparkThrowableHelper.scala index d503f400d00..9073a73dec4 100644 --- a/core/src/main/scala/org/apache/spark/SparkThrowableHelper.scala +++ b/core/src/main/scala/org/apache/spark/SparkThrowableHelper.scala @@ -92,7 +92,7 @@ private[spark] object SparkThrowableHelper { if (errorSubClass != null) g.writeStringField("errorSubClass", 
errorSubClass) if (format == STANDARD) { val finalClass = errorClass + Option(errorSubClass).map("." + _).getOrElse("") -g.writeStringField("message", errorReader.getMessageTemplate(finalClass)) +g.writeStringField("messageTemplate", errorReader.getMessageTemplate(finalClass)) } val sqlState = e.getSqlState if (sqlState != null) g.writeStringField("sqlState", sqlState) diff --git a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala index 266683b1eca..191304bc353 100644 --- a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala @@ -128,7 +128,7 @@ class SparkThrowableSuite extends SparkFunSuite { test("Message format invariants") { val messageFormats = errorReader.errorInfoMap .filterKeys(!_.startsWith("_LEGACY_ERROR_TEMP_")) - .values.toSeq.flatMap { i => Seq(i.messageFormat) } + .values.toSeq.flatMap { i => Seq(i.messageTemplate) } checkCondition(messageFormats, s => s != null) checkIfUnique(messageFormats)
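For illustration, the STANDARD-format payload now carries the template under the new key; a hedged sketch of the shape follows, with the field names drawn from the writer code above and purely hypothetical values (additional fields may be present):

```python
# Hedged illustration only: the renamed field in the STANDARD error format,
# shown as a Python dict. All values below are hypothetical placeholders.
standard_error_payload = {
    "errorClass": "SOME_ERROR_CLASS",
    "messageTemplate": "Cannot resolve <objectName>.",
    "sqlState": "42000",
}
```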
[spark] branch master updated: [SPARK-40501][SQL] Add PushProjectionThroughLimit for Optimizer
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c02aef01bbc [SPARK-40501][SQL] Add PushProjectionThroughLimit for Optimizer c02aef01bbc is described below commit c02aef01bbc1530884c23850224293336a5430a2 Author: panbingkun AuthorDate: Mon Sep 26 15:50:07 2022 +0800 [SPARK-40501][SQL] Add PushProjectionThroughLimit for Optimizer ### What changes were proposed in this pull request? The pr aim to add `PushProjectionThroughLimit` for `Optimizer`, improve query performance. ### Why are the changes needed? "normalize" the order of project and limit operator **{project(..., limit(...)) => limit(..., project(...))}**, so that we can have more chances to merge adjacent projects or limits **{eg: by SpecialLimits}**. --- When I query a big table(size / per day: 10T, column size: 1219) with limit 1 A.Scenario 1(run sql in spark-sql) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect 1.SQL: select * from xxx where ..._day = '20220919' limit 1 https://user-images.githubusercontent.com/15246973/191184857-fa3d7f08-f0ea-4d70-a406-48828a3bb761.png;> 2.Spark UI: https://user-images.githubusercontent.com/15246973/191204519-86de05d3-40d4-4acd-9833-86bf318ecb72.png;> B.Scenario 2(run sql in spark-shell) - It took a long time to fetch out(still running after 20 minutes...) 1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show() https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png;> 2.Spark UI: https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png;> C.Scenario 3(run sql in spark-shell) - The results will be fetched soon 1.Code: spark.sql("select * from xxx where ..._day = '20220919'").show(1) https://user-images.githubusercontent.com/15246973/191215182-bf278f71-c3ee-4028-8372-c9ed69431b6f.png;> 2.Spark UI: https://user-images.githubusercontent.com/15246973/191215437-3a291b34-5257-485a-bc18-6c0e0865d7ce.png;> ## The diff between Scenario 2 and Scenario3 is focus on "Optimized Logical Plan" https://user-images.githubusercontent.com/15246973/191216863-367c764d-2aa0-4c79-b240-0ebb6735937f.png;> https://user-images.githubusercontent.com/15246973/191217175-02213b0c-5d09-4154-85f7-18751734afd0.png;> # After pr: Scenario 2(run sql in spark-shell) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect 1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show() https://user-images.githubusercontent.com/15246973/191880203-2f951644-b59b-4dc8-9e80-c02c458fc28b.png;> 2.Spark UI: https://user-images.githubusercontent.com/15246973/191219627-82655d01-1d5e-44ee-af49-e8da51c9ca72.png;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new UT & Pass GA. Closes #37941 from panbingkun/improve_shell_limit. 
Authored-by: panbingkun Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/optimizer/Optimizer.scala | 1 + .../optimizer/PushProjectionThroughLimit.scala | 39 ++ .../PushProjectionThroughLimitSuite.scala | 90 ++ 3 files changed, 130 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 3ac75554a2b..2664fd63806 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -81,6 +81,7 @@ abstract class Optimizer(catalogManager: CatalogManager) Seq( // Operator push down PushProjectionThroughUnion, +PushProjectionThroughLimit, ReorderJoin, EliminateOuterJoin, PushDownPredicates, diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushProjectionThroughLimit.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushProjectionThroughLimit.scala new file mode 100644 index 000..6280cc5e42c --- /dev/null +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushProjectionThroughLimit.scala @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0
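To check whether the new rule fires for a given query, inspecting the optimized logical plan is enough; a hedged PySpark sketch follows (the table is a hypothetical stand-in created from a range):

```python
# Hedged sketch: inspect the optimized plan for a SELECT ... LIMIT query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1000).createOrReplaceTempView("some_table")  # hypothetical stand-in table

df = spark.sql("SELECT * FROM some_table WHERE id > 10 LIMIT 1")

# After this change the optimized logical plan should show the Limit above the Project
# (limit(project(...))), which lets the planner use the CollectLimit execution path
# described in the commit message.
df.explain(True)
```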
[spark] branch master updated: [SPARK-40530][SQL] Add error-related developer APIs
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2bf26dfbd70 [SPARK-40530][SQL] Add error-related developer APIs 2bf26dfbd70 is described below commit 2bf26dfbd708cbbbe8a51dda6972296055661e07 Author: Wenchen Fan AuthorDate: Mon Sep 26 15:35:38 2022 +0800 [SPARK-40530][SQL] Add error-related developer APIs ### What changes were proposed in this pull request? Third-party Spark plugins may define their own errors using the same framework as Spark: put error definition in json files. This PR moves some error-related code to a new file and marks them as developer APIs, so that others can reuse them instead of writing its own json reader. ### Why are the changes needed? make it easier for Spark plugins to define errors. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #37969 from cloud-fan/error. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../org/apache/spark/ErrorClassesJSONReader.scala | 118 + .../org/apache/spark/SparkThrowableHelper.scala| 105 ++ .../org/apache/spark/SparkThrowableSuite.scala | 54 +++--- 3 files changed, 163 insertions(+), 114 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala b/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala new file mode 100644 index 000..8d4ae3a877d --- /dev/null +++ b/core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import java.net.URL + +import scala.collection.JavaConverters._ +import scala.collection.immutable.SortedMap + +import com.fasterxml.jackson.annotation.JsonIgnore +import com.fasterxml.jackson.core.`type`.TypeReference +import com.fasterxml.jackson.databind.json.JsonMapper +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.commons.text.StringSubstitutor + +import org.apache.spark.annotation.DeveloperApi + +/** + * A reader to load error information from one or more JSON files. Note that, if one error appears + * in more than one JSON files, the latter wins. Please read core/src/main/resources/error/README.md + * for more details. 
+ */ +@DeveloperApi +class ErrorClassesJsonReader(jsonFileURLs: Seq[URL]) { + assert(jsonFileURLs.nonEmpty) + + private def readAsMap(url: URL): SortedMap[String, ErrorInfo] = { +val mapper: JsonMapper = JsonMapper.builder() + .addModule(DefaultScalaModule) + .build() +mapper.readValue(url, new TypeReference[SortedMap[String, ErrorInfo]]() {}) + } + + // Exposed for testing + private[spark] val errorInfoMap = jsonFileURLs.map(readAsMap).reduce(_ ++ _) + + def getErrorMessage(errorClass: String, messageParameters: Map[String, String]): String = { +val messageTemplate = getMessageTemplate(errorClass) +val sub = new StringSubstitutor(messageParameters.asJava) +sub.setEnableUndefinedVariableException(true) +try { + sub.replace(messageTemplate.replaceAll("<([a-zA-Z0-9_-]+)>", "\\$\\{$1\\}")) +} catch { + case _: IllegalArgumentException => throw SparkException.internalError( +s"Undefined error message parameter for error class: '$errorClass'. " + + s"Parameters: $messageParameters") +} + } + + def getMessageTemplate(errorClass: String): String = { +val errorClasses = errorClass.split("\\.") +assert(errorClasses.length == 1 || errorClasses.length == 2) + +val mainErrorClass = errorClasses.head +val subErrorClass = errorClasses.tail.headOption +val errorInfo = errorInfoMap.getOrElse( + mainErrorClass, + throw SparkException.internalError(s"Cannot find main error class '$errorClass'")) +assert(errorInfo.subClass.isDefined == subErrorClass.isDefined) + +if (subErrorClass.isEmpty) { + errorInfo.messageFormat +} else { +
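The placeholder substitution performed by getErrorMessage above is compact enough to sketch outside Spark; the following hedged Python rendition mirrors the mechanism (rewrite `<param>` placeholders, fill them from the parameter map, and fail on undefined parameters) and is not a Spark API:

```python
# Hedged sketch: the placeholder-substitution mechanism used by getErrorMessage,
# re-expressed in Python for illustration only.
import re

def render_error(message_template: str, parameters: dict) -> str:
    # <paramName> placeholders in the template are replaced with values from the map,
    # mirroring the StringSubstitutor-based logic in ErrorClassesJsonReader.
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in parameters:
            raise ValueError(f"Undefined error message parameter: {name}")
        return parameters[name]
    return re.sub(r"<([a-zA-Z0-9_-]+)>", replace, message_template)

print(render_error("Cannot find main error class '<errorClass>'", {"errorClass": "FOO"}))
```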
[spark] branch master updated: [SPARK-40540][SQL] Migrate compilation errors onto error classes
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 39f60578dc3 [SPARK-40540][SQL] Migrate compilation errors onto error classes 39f60578dc3 is described below commit 39f60578dc3c189e7f5cfb2a74f0502408aa123e Author: Max Gekk AuthorDate: Mon Sep 26 10:26:27 2022 +0300 [SPARK-40540][SQL] Migrate compilation errors onto error classes ### What changes were proposed in this pull request? In the PR, I propose to migrate 100 compilation errors onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_10xx`. The error message will not include the error classes, so, in this way we will preserve the existing behaviour. ### Why are the changes needed? The migration on temporary error classes allows to gather statistics about errors and detect most popular error classes. After that we could prioritise the work on migration. The new error class name prefix `_LEGACY_ERROR_TEMP_` proposed here kind of marks the error as developer-facing, not user-facing. Developers can still get the error class programmatically via the `SparkThrowable` interface, so that they can build error infra with it. End users won't see the error class in the message. This allows us to do the error migration very quickly, and we can refine the error classes and mark them as user-facing later (naming them properly, adding tests, etc.). ### Does this PR introduce _any_ user-facing change? No. The error messages should be almost the same by default. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "test:testOnly *AnalysisErrorSuite" $ build/sbt "test:testOnly *.HiveUDAFSuite" $ build/sbt "test:testOnly *.DataFrameSuite" $ build/sbt "test:testOnly *SQLQuerySuite" ``` Closes #37973 from MaxGekk/legacy-error-temp-compliation. 
Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 502 +++ .../spark/sql/errors/QueryCompilationErrors.scala | 536 ++--- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 2 +- .../results/ansi/string-functions.sql.out | 32 +- .../results/ceil-floor-with-scale-param.sql.out| 32 +- .../sql-tests/results/change-column.sql.out| 34 +- .../results/columnresolution-negative.sql.out | 15 +- .../test/resources/sql-tests/results/count.sql.out | 7 +- .../sql-tests/results/csv-functions.sql.out| 24 +- .../sql-tests/results/group-by-ordinal.sql.out | 4 +- .../sql-tests/results/join-lateral.sql.out | 15 +- .../sql-tests/results/json-functions.sql.out | 72 ++- .../sql-tests/results/order-by-ordinal.sql.out | 45 +- .../test/resources/sql-tests/results/pivot.sql.out | 14 +- .../results/postgreSQL/window_part3.sql.out| 21 +- .../sql-tests/results/show_columns.sql.out | 8 +- .../results/sql-compatibility-functions.sql.out| 14 +- .../sql-tests/results/string-functions.sql.out | 32 +- .../sql-tests/results/table-aliases.sql.out| 30 +- .../results/table-valued-functions.sql.out | 2 +- .../sql-tests/results/timestamp-ntz.sql.out| 16 +- .../sql-tests/results/udaf/udaf-group-by.sql.out | 15 +- .../resources/sql-tests/results/udaf/udaf.sql.out | 31 +- .../sql-tests/results/udf/udf-pivot.sql.out| 14 +- .../sql-tests/results/udf/udf-udaf.sql.out | 31 +- .../sql-tests/results/udf/udf-window.sql.out | 7 +- .../resources/sql-tests/results/window.sql.out | 25 +- .../org/apache/spark/sql/DataFrameSuite.scala | 20 +- .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 41 +- .../spark/sql/connector/DataSourceV2SQLSuite.scala | 13 +- .../execution/command/DescribeTableSuiteBase.scala | 2 +- .../execution/command/v1/DescribeTableSuite.scala | 4 +- .../spark/sql/hive/execution/HiveUDAFSuite.scala | 18 +- 33 files changed, 1428 insertions(+), 250 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index 5694b2c9d0f..aff6576af73 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -1163,5 +1163,507 @@ "message" : [ "." ] + }, + "_LEGACY_ERROR_TEMP_1000" : { +"message" : [ + "LEGACY store assignment policy is disallowed in Spark data source V2. Please set the configuration to other values." +] + }, + "_LEGACY_ERROR_TEMP_1001" : { +"message" : [ + "USING column `` cannot be resolved on the side of the join.
[spark] branch master updated (b4014eb13a1 -> 0deba9fdd1a)
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b4014eb13a1 [SPARK-40330][PS] Implement `Series.searchsorted` add 0deba9fdd1a [SPARK-40554][PS] Make `ddof` in `DataFrame.sem` and `Series.sem` accept arbitrary integers No new revisions were added by this update. Summary of changes: python/pyspark/pandas/generic.py | 18 +- python/pyspark/pandas/tests/test_stats.py | 6 ++ 2 files changed, 19 insertions(+), 5 deletions(-)
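A minimal usage sketch of the relaxed `ddof` parameter in pandas-on-Spark (data hypothetical; requires a SparkSession with pandas-on-Spark available):

```python
# Hedged sketch: pandas-on-Spark sem() with an arbitrary integer ddof.
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Standard error of the mean with a non-default delta degrees of freedom.
print(psdf.sem(ddof=2))
print(psdf.a.sem(ddof=2))
```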