Re: [PR] [SPARK-47952][CORE][CONNECT] Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn [spark]

2024-04-30 Thread via GitHub
TakawaAkirayo commented on PR #46182: URL: https://github.com/apache/spark/pull/46182#issuecomment-2084451692 > One thing I'm wondering if it might work out of the box is the ability to specify an ephemeral port for the spark connect service and pick this up during startup. > > This
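The ephemeral-port idea raised above can be illustrated with a small, self-contained sketch (plain Python sockets, not the actual SparkConnectService code): binding to port 0 asks the OS for any free port, and the process can then read the assigned port back and advertise it at startup.

```python
import socket

def bind_ephemeral():
    # Port 0 tells the OS to pick any free ("ephemeral") port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    # getsockname() reveals the port the OS actually assigned, so a
    # service could publish it (e.g. via the Yarn tracking URL) on startup.
    host, port = server.getsockname()
    return server, host, port

if __name__ == "__main__":
    server, host, port = bind_ephemeral()
    print(f"listening on {host}:{port}")
    server.close()
```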

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084478693 https://github.com/HyukjinKwon/spark/actions/runs/8890321658 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] [SPARK-48055][PYTHON][CONNECT][TESTS] Enable `PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, test_vectorized_udf_struct_with_empty_partition}` [spark]

2024-04-30 Thread via GitHub
zhengruifeng commented on PR #46296: URL: https://github.com/apache/spark/pull/46296#issuecomment-2084487146 thanks @HyukjinKwon merged to master

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
nija-at commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084723062 @pan3793 - a user must use a lower (or same) version client to connect to a server ONLY.

Re: [PR] [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46207: URL: https://github.com/apache/spark/pull/46207#issuecomment-2084736603 Merged to master for Apache Spark 4.0.0.

Re: [PR] [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46207: URL: https://github.com/apache/spark/pull/46207#issuecomment-2084735200 Thank you all. Votes passed. - https://lists.apache.org/thread/65h92lc4mp1d6l6f00xfnlh586for05g

Re: [PR] [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun closed pull request #46207: [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default URL: https://github.com/apache/spark/pull/46207

Re: [PR] [SPARK-48040][CONNECT][WIP]Spark connect supports scheduler pool [spark]

2024-04-30 Thread via GitHub
xieshuaihu commented on PR #46278: URL: https://github.com/apache/spark/pull/46278#issuecomment-2084934296 @HyukjinKwon I added a new RPC to make the `setSchedulerPool` API less confusing. Please let me know if this PR is on the right track; if so, more unit tests will be added.

[PR] [WIP][SPARK-48058][PYTHON][CONNECT] `UserDefinedFunction.returnType` parse the DDL string [spark]

2024-04-30 Thread via GitHub
zhengruifeng opened a new pull request, #46300: URL: https://github.com/apache/spark/pull/46300 ### What changes were proposed in this pull request? `UserDefinedFunction.returnType` parse the DDL string ### Why are the changes needed? 1, the return type check is missing in
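To make the idea concrete, here is a deliberately simplified, hypothetical sketch of what "parsing a DDL string into a return type" means. It handles only flat `name TYPE` lists; PySpark's real parser (for nested structs, arrays, maps, etc.) is far more involved.

```python
def parse_ddl_schema(ddl: str):
    """Toy parser: 'a INT, b STRING' -> [('a', 'INT'), ('b', 'STRING')].

    Only flat comma-separated "name TYPE" pairs are supported; this is an
    illustration of the concept, not PySpark's actual DDL parser.
    """
    fields = []
    for part in ddl.split(","):
        name, _, dtype = part.strip().partition(" ")
        if not name or not dtype.strip():
            raise ValueError(f"cannot parse DDL field: {part!r}")
        fields.append((name, dtype.strip().upper()))
    return fields
```

Rejecting unparsable fields up front is the point of the PR title: a bad return-type string should fail at UDF definition time, not at execution time.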

Re: [PR] [SPARK-48055][PYTHON][CONNECT][TESTS] Enable `PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, test_vectorized_udf_struct_with_empty_partition}` [spark]

2024-04-30 Thread via GitHub
zhengruifeng closed pull request #46296: [SPARK-48055][PYTHON][CONNECT][TESTS] Enable `PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, test_vectorized_udf_struct_with_empty_partition}` URL: https://github.com/apache/spark/pull/46296

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
ulysses-you commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584223064 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
nija-at commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584275691 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584290019 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084628436 For now, I am testing old client -> newer server case only.

Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #10: URL: https://github.com/apache/spark-kubernetes-operator/pull/10#issuecomment-2084663882 Thanks. Let me consider more.

Re: [PR] [SPARK-48056][CONNECT][PYTHON] Re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received [spark]

2024-04-30 Thread via GitHub
grundprinzip commented on code in PR #46297: URL: https://github.com/apache/spark/pull/46297#discussion_r1584356038 ## python/pyspark/sql/tests/connect/client/test_client.py: ## @@ -340,6 +353,71 @@ def check(): eventually(timeout=1, catch_assertions=True)(check)()

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
bozhang2820 commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584347563 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on code in PR #45725: URL: https://github.com/apache/spark/pull/45725#discussion_r1584428501 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -441,6 +444,45 @@ public static int execICU(final UTF8String string,

Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
yaooqinn commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1584472110 ## sql/core/benchmarks/AggregateBenchmark-results.txt: ## @@ -2,147 +2,147 @@ aggregate without grouping

Re: [PR] [SPARK-48052][PYTHON][CONNECT] Recover `pyspark-connect` CI by parent classes [spark]

2024-04-30 Thread via GitHub
HyukjinKwon closed pull request #46294: [SPARK-48052][PYTHON][CONNECT] Recover `pyspark-connect` CI by parent classes URL: https://github.com/apache/spark/pull/46294

Re: [PR] [SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations cache connect plan properly [spark]

2024-04-30 Thread via GitHub
zhengruifeng commented on PR #46290: URL: https://github.com/apache/spark/pull/46290#issuecomment-2084480867 thanks @dongjoon-hyun merged to branch-3.4

Re: [PR] [SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations cache connect plan properly [spark]

2024-04-30 Thread via GitHub
zhengruifeng closed pull request #46290: [SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations cache connect plan properly URL: https://github.com/apache/spark/pull/46290

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584211520 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,127 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
nija-at commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584365823 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license

Re: [PR] [WIP][SPARK-48058][PYTHON][CONNECT] `UserDefinedFunction.returnType` parse the DDL string [spark]

2024-04-30 Thread via GitHub
zhengruifeng commented on code in PR #46300: URL: https://github.com/apache/spark/pull/46300#discussion_r1584596015 ## python/pyspark/sql/connect/udf.py: ## @@ -148,15 +150,35 @@ def __init__( ) self.func = func -self.returnType: DataType = ( -

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
ulysses-you commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584239353 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
pan3793 commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084616043 A basic question about the "backward compatibility" policy: does it mean that the user can - use a lower version client to connect to a higher version server? or - use a higher

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084662455 https://github.com/HyukjinKwon/spark/actions/runs/8891418687

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
pan3793 commented on PR #46298: URL: https://github.com/apache/spark/pull/46298#issuecomment-2084739947 > @pan3793 - user must use a lower (or same) version client to connect to a server ONLY. Makes sense, would be great to clarify that in the docs :)
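The compatibility rule stated in this thread (a client must be the same or a lower feature release than the server) can be sketched as a simple version check. The helper below is hypothetical and only compares `(major, minor)`; it is not part of the actual Spark Connect handshake.

```python
def is_supported_pair(client_version: str, server_version: str) -> bool:
    """True iff the client is the same or an older feature release.

    Versions are compared as (major, minor) tuples, e.g. "3.5.1" -> (3, 5);
    patch-level differences are ignored.
    """
    def key(version: str):
        major, minor = version.split(".")[:2]
        return (int(major), int(minor))
    return key(client_version) <= key(server_version)
```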

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

2024-04-30 Thread via GitHub
uros-db commented on code in PR #45725: URL: https://github.com/apache/spark/pull/45725#discussion_r1584432496 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -441,6 +444,45 @@ public static int execICU(final UTF8String string,

Re: [PR] [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
yaooqinn commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1584492317 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala: ## @@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends

[PR] [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests. test_grouped_with_empty_partition` [spark]

2024-04-30 Thread via GitHub
zhengruifeng opened a new pull request, #46299: URL: https://github.com/apache/spark/pull/46299 ### What changes were proposed in this pull request? Enable `GroupedApplyInPandasTests. test_grouped_with_empty_partition` ### Why are the changes needed? test coverage ###

Re: [PR] [SPARK-48030][SQL] SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [spark]

2024-04-30 Thread via GitHub
advancedxy commented on PR #46265: URL: https://github.com/apache/spark/pull/46265#issuecomment-2084474689 > LGTM. It would be nice to have something to configure this but I don't think it is super important. I feel the default value should be more than enough for most use cases? Similarly
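The caching pattern under discussion (reuse expensive per-schema objects such as a row ordering instead of rebuilding them for every wrapper) can be sketched with a bounded memo cache. This is a toy Python stand-in, not the actual `InternalRowComparableWrapper` change; the cache size of 16 is an illustrative default in the spirit of "more than enough for most use cases".

```python
from functools import lru_cache

@lru_cache(maxsize=16)  # small bound: distinct partition schemas are usually few
def derived_for_schema(schema: tuple):
    # Stand-in for expensive per-schema work (e.g. building an ordering
    # or a struct type). Keyed by the schema, so repeated lookups with
    # the same schema return the cached object instead of rebuilding it.
    return {"schema": schema, "comparator": tuple(sorted(schema))}
```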

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
grundprinzip commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584208938 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,127 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
nija-at commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584368597 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584496133 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

[PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon opened a new pull request, #46298: URL: https://github.com/apache/spark/pull/46298 ### What changes were proposed in this pull request? This PR is a tentative try to run Spark 3.5 tests with Python Client 3.5 against Spark Connect server 4.0. ### Why are the

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584287770 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584287016 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584298803 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

Re: [PR] [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46207: URL: https://github.com/apache/spark/pull/46207#issuecomment-2084643262 Hi, @cloud-fan , @yaooqinn , @ulysses-you . If you don't mind, could you participate in the vote? :)

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

2024-04-30 Thread via GitHub
miland-db commented on code in PR #45725: URL: https://github.com/apache/spark/pull/45725#discussion_r1584430498 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -441,6 +444,45 @@ public static int execICU(final UTF8String string,

Re: [PR] [SPARK-48052][PYTHON][CONNECT] Recover `pyspark-connect` CI by parent classes [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on PR #46294: URL: https://github.com/apache/spark/pull/46294#issuecomment-2084885540 Merged to master.

Re: [PR] [SPARK-48054][PYTHON][CONNECT][INFRA] Backward compatibility test for Spark Connect [spark]

2024-04-30 Thread via GitHub
HyukjinKwon commented on code in PR #46298: URL: https://github.com/apache/spark/pull/46298#discussion_r1584288878 ## .github/workflows/build_python_connect35.yml: ## @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

Re: [PR] [SPARK-48056][CONNECT][PYTHON] Re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received [spark]

2024-04-30 Thread via GitHub
grundprinzip commented on code in PR #46297: URL: https://github.com/apache/spark/pull/46297#discussion_r1584356448 ## python/pyspark/sql/tests/connect/client/test_client.py: ## @@ -340,6 +353,71 @@ def check(): eventually(timeout=1, catch_assertions=True)(check)()

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on PR #45725: URL: https://github.com/apache/spark/pull/45725#issuecomment-2084801665 thanks, merging to master!

Re: [PR] [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings [spark]

2024-04-30 Thread via GitHub
cloud-fan closed pull request #45725: [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings URL: https://github.com/apache/spark/pull/45725

[PR] [SPARK-48059][CORE] Implement the structured log framework on the java side [spark]

2024-04-30 Thread via GitHub
panbingkun opened a new pull request, #46301: URL: https://github.com/apache/spark/pull/46301 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

Re: [PR] [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on PR #45820: URL: https://github.com/apache/spark/pull/45820#issuecomment-2085478473 thanks, merging to master!

Re: [PR] [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings [spark]

2024-04-30 Thread via GitHub
cloud-fan closed pull request #45820: [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings URL: https://github.com/apache/spark/pull/45820

Re: [PR] [SPARK-48059][CORE] Implement the structured log framework on the java side [spark]

2024-04-30 Thread via GitHub
panbingkun commented on PR #46301: URL: https://github.com/apache/spark/pull/46301#issuecomment-2085118887 @gengliangwang In the migration of logs to the structured log framework, I found that we `missed` the corresponding implementation on the `java side`. This PR is to supplement it.

Re: [PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-30 Thread via GitHub
bozhang2820 commented on code in PR #45930: URL: https://github.com/apache/spark/pull/45930#discussion_r1584732080 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -161,6 +165,24 @@ object SQLExecution extends Logging { case e =>

Re: [PR] [SPARK-48050][SS] Log logical plan at query start [spark]

2024-04-30 Thread via GitHub
HeartSaVioR commented on PR #46292: URL: https://github.com/apache/spark/pull/46292#issuecomment-2085242190 Thanks! Merging to master.

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-30 Thread via GitHub
HeartSaVioR closed pull request #45977: [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source URL: https://github.com/apache/spark/pull/45977

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-30 Thread via GitHub
HeartSaVioR commented on PR #45977: URL: https://github.com/apache/spark/pull/45977#issuecomment-2085292815 Thanks! Merging to master.

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-30 Thread via GitHub
HeartSaVioR commented on PR #45977: URL: https://github.com/apache/spark/pull/45977#issuecomment-2085292610 The GA only failed with the docker integration test, which isn't related.

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

2024-04-30 Thread via GitHub
uros-db commented on code in PR #46206: URL: https://github.com/apache/spark/pull/46206#discussion_r1584638773 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-48050][SS] Log logical plan at query start [spark]

2024-04-30 Thread via GitHub
HeartSaVioR commented on PR #46292: URL: https://github.com/apache/spark/pull/46292#issuecomment-2085241789 It only failed on docker integration build.

Re: [PR] [SPARK-48003][SQL] Add collation support for hll sketch aggregate [spark]

2024-04-30 Thread via GitHub
cloud-fan closed pull request #46241: [SPARK-48003][SQL] Add collation support for hll sketch aggregate URL: https://github.com/apache/spark/pull/46241

Re: [PR] [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings [spark]

2024-04-30 Thread via GitHub
miland-db commented on PR #45820: URL: https://github.com/apache/spark/pull/45820#issuecomment-2085384201 @cloud-fan please review

Re: [PR] [SPARK-47545][CONNECT] Dataset `observe` support for the Scala client [spark]

2024-04-30 Thread via GitHub
hvanhovell commented on PR #45701: URL: https://github.com/apache/spark/pull/45701#issuecomment-2085537023 @xupefei there is a genuine test failure. Can you check what is going on?

Re: [PR] [SPARK-47679][SQL] Use `HiveConf.getConfVars` or Hive conf names directly [spark]

2024-04-30 Thread via GitHub
dom93dd commented on PR #45804: URL: https://github.com/apache/spark/pull/45804#issuecomment-2085178244 This change seems to break compatibility with the apache.iceberg.hive dependency: E.g. here:

Re: [PR] [SPARK-48003][SQL] Add collation support for hll sketch aggregate [spark]

2024-04-30 Thread via GitHub
cloud-fan commented on PR #46241: URL: https://github.com/apache/spark/pull/46241#issuecomment-2085259731 thanks, merging to master!

Re: [PR] [SPARK-48059][CORE] Implement the structured log framework on the java side [spark]

2024-04-30 Thread via GitHub
panbingkun commented on PR #46301: URL: https://github.com/apache/spark/pull/46301#issuecomment-2085409857 We currently support three ways to write logs: - logger.{error/warn/info/debug/trace}(string) - logger.{error/warn/info/debug/trace}(string, throwable) -
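The call shapes listed above (message only, message plus throwable, and message plus structured context) can be illustrated with a small stand-in that renders each record as one JSON line. This is a toy sketch of the general structured-logging pattern, not Spark's actual logging framework API.

```python
import json
import traceback

def structured_log(level, message, error=None, **context):
    """Render one log record as a JSON line.

    Covers the three shapes under discussion: message only,
    message + throwable, and message + key/value context that
    downstream tools can query instead of regex-parsing text.
    """
    record = {"level": level.upper(), "msg": message}
    if context:
        record["context"] = context
    if error is not None:
        record["exception"] = "".join(
            traceback.format_exception_only(type(error), error)
        ).strip()
    return json.dumps(record, sort_keys=True)
```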

Re: [PR] [SPARK-48056][CONNECT][PYTHON] Re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received [spark]

2024-04-30 Thread via GitHub
nija-at commented on PR #46297: URL: https://github.com/apache/spark/pull/46297#issuecomment-2085591295 @juliuszsompolski - this will be followed up in a separate PR

Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

2024-04-30 Thread via GitHub
davidm-db commented on code in PR #46206: URL: https://github.com/apache/spark/pull/46206#discussion_r1584647718 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-48050][SS] Log logical plan at query start [spark]

2024-04-30 Thread via GitHub
HeartSaVioR closed pull request #46292: [SPARK-48050][SS] Log logical plan at query start URL: https://github.com/apache/spark/pull/46292

Re: [PR] [SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file [spark]

2024-04-30 Thread via GitHub
nchammas commented on code in PR #44920: URL: https://github.com/apache/spark/pull/44920#discussion_r1585016463 ## python/pyspark/errors/error_classes.py: ## @@ -15,1160 +15,15 @@ # limitations under the License. # -# NOTE: Automatically sort this file via -# - cd

Re: [PR] [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46305: URL: https://github.com/apache/spark/pull/46305#issuecomment-2085999669 Hi, @viirya . Could you review this PR, please?

Re: [PR] [SPARK-47977] DateTimeUtils.timestampDiff and DateTimeUtils.timestampAdd should not throw INTERNAL_ERROR exception [spark]

2024-04-30 Thread via GitHub
vitaliili-db commented on PR #46210: URL: https://github.com/apache/spark/pull/46210#issuecomment-2086003294 @cloud-fan please review

Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46273: URL: https://github.com/apache/spark/pull/46273#issuecomment-2086892603 Thank you for review, @mridulm .

Re: [PR] [SPARK-47788][SS][TESTS][FOLLOWUP] Make `StreamingQueryHashPartitionVerifySuite` independent from SparkConf change [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun closed pull request #46303: [SPARK-47788][SS][TESTS][FOLLOWUP] Make `StreamingQueryHashPartitionVerifySuite` independent from SparkConf change URL: https://github.com/apache/spark/pull/46303

[PR] [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun opened a new pull request, #46305: URL: https://github.com/apache/spark/pull/46305 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change?

Re: [PR] [SPARK-48059][CORE] Implement the structured log framework on the java side [spark]

2024-04-30 Thread via GitHub
panbingkun commented on PR #46301: URL: https://github.com/apache/spark/pull/46301#issuecomment-2086553908 > @panbingkun Thanks for bringing this up. I check the current Spark java code: > > ``` > find . -name "*.java"|xargs grep "logInfo\|logWarn\|logError"|grep -v target|grep

[PR] [SPARK-48060][SS][TESTS] Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun opened a new pull request, #46304: URL: https://github.com/apache/spark/pull/46304 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change?

Re: [PR] [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1585236044 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala: ## @@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends

Re: [PR] [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46305: URL: https://github.com/apache/spark/pull/46305#issuecomment-2086257656 Thank you, @viirya ! Since this is irrelevant to CI and I did manual verification, I'll merge this~ Merged to master for Apache Spark 4.0.0.

Re: [PR] [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun closed pull request #46305: [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` URL: https://github.com/apache/spark/pull/46305

Re: [PR] [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46304: URL: https://github.com/apache/spark/pull/46304#issuecomment-2086879868 Could you review this test PR about generating golden files, too, @viirya ?

Re: [PR] [SPARK-48040][CONNECT][WIP] Spark connect supports scheduler pool [spark]

2024-04-30 Thread via GitHub
hvanhovell commented on PR #46278: URL: https://github.com/apache/spark/pull/46278#issuecomment-2086174926 I am not 100% sure we should expose this as a client side conf. A client shouldn't have to set these things. Can't we just make the connect server use a specific scheduler pool?
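The server-side alternative suggested above — pinning the Connect server to a specific scheduler pool instead of exposing a client conf — can be sketched with Spark's FAIR scheduler allocation file. This is a minimal illustration, not the PR's implementation; the pool name `connect` and its weights are assumptions:

```shell
# Define a dedicated FAIR scheduler pool for the Connect server.
# Pool name and weights below are illustrative assumptions.
cat > /tmp/fairscheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="connect">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
EOF

# The server would then be launched against it (not executed here):
#   ./sbin/start-connect-server.sh \
#     --conf spark.scheduler.mode=FAIR \
#     --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml
```

Note that an allocation file only defines pools; jobs still land in the `default` pool unless the `spark.scheduler.pool` local property is set on their thread, which is exactly the server-vs-client question this thread is debating.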

[PR] [SPARK-48062] add pyspark test for SimpleDataSourceStreamingReader [spark]

2024-04-30 Thread via GitHub
chaoqin-li1123 opened a new pull request, #46306: URL: https://github.com/apache/spark/pull/46306 ### What changes were proposed in this pull request? Add pyspark test for SimpleDataSourceStreamingReader. ### Why are the changes needed? To make sure

[PR] [SPARK-48035] Fix try_add/try_multiply being semantically equal to add/multiply [spark]

2024-04-30 Thread via GitHub
db-scnakandala opened a new pull request, #46307: URL: https://github.com/apache/spark/pull/46307 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

Re: [PR] [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition` [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun closed pull request #46299: [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition` URL: https://github.com/apache/spark/pull/46299

Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on code in PR #46273: URL: https://github.com/apache/spark/pull/46273#discussion_r1585445992 ## sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala: ## @@ -85,8 +86,10 @@ class AdaptiveQueryExecSuite

Re: [PR] [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46304: URL: https://github.com/apache/spark/pull/46304#issuecomment-2086924377 Thank you, @viirya ! Merged to master for Apache Spark 4.0.0.

Re: [PR] [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun closed pull request #46304: [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly URL: https://github.com/apache/spark/pull/46304

Re: [PR] [SPARK-48059][CORE] Implement the structured log framework on the java side [spark]

2024-04-30 Thread via GitHub
gengliangwang commented on PR #46301: URL: https://github.com/apache/spark/pull/46301#issuecomment-2086393195 @panbingkun Thanks for bringing this up. I check the current Spark java code: ``` find . -name "*.java"|xargs grep "logInfo\|logWarn\|logError"|grep -v target|grep -v test
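The audit command quoted above is truncated in the archive. Here is a runnable reconstruction demonstrated on a tiny synthetic tree; the `wc -l` tail and the demo files are assumptions, not part of the original message:

```shell
# Build a toy source tree: one real source file, one generated file under target/.
mkdir -p /tmp/spark-demo/src /tmp/spark-demo/target
cat > /tmp/spark-demo/src/Foo.java <<'EOF'
class Foo {
  void f() { logInfo("hello"); }
  void g() { logError("boom"); }
}
EOF
cat > /tmp/spark-demo/target/Gen.java <<'EOF'
class Gen { void g() { logWarning("generated"); } }
EOF

# Count logging call sites, excluding build output and tests
# (GNU grep; the \| alternation is a GNU BRE extension).
cd /tmp/spark-demo
find . -name "*.java" | xargs grep "logInfo\|logWarn\|logError" \
  | grep -v target | grep -v test | wc -l   # prints 2
```

The `grep -v target` step drops the generated `Gen.java` hit, leaving only the two call sites in `src/Foo.java`.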

[PR] [SPARK-47788][SS][TESTS][FOLLOWUP] Make `StreamingQueryHashPartitionVerifySuite` independent from SparkConf change [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun opened a new pull request, #46303: URL: https://github.com/apache/spark/pull/46303 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change?

Re: [PR] [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1585233714 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala: ## @@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends

Re: [PR] [SPARK-47578][CORE] Spark Core: Migrate logWarning with variables to structured logging framework [spark]

2024-04-30 Thread via GitHub
gengliangwang commented on code in PR #46309: URL: https://github.com/apache/spark/pull/46309#discussion_r1585528937 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -211,10 +212,13 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( // This

Re: [PR] [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1585529534 ## connector/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala: ## @@ -87,7 +87,7 @@ object AvroReadBenchmark extends

Re: [PR] [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on code in PR #46266: URL: https://github.com/apache/spark/pull/46266#discussion_r1585536429 ## sql/core/benchmarks/DataSourceReadBenchmark-jdk21-results.txt: ## @@ -2,430 +2,430 @@ SQL Single Numeric Column Scan

Re: [PR] [SPARK-47578][CORE] Spark Core: Migrate logWarning with variables to structured logging framework [spark]

2024-04-30 Thread via GitHub
gengliangwang commented on code in PR #46309: URL: https://github.com/apache/spark/pull/46309#discussion_r1585543510 ## core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala: ## @@ -155,10 +155,12 @@ private[r] class RBackendHandler(server: RBackend) args)

Re: [PR] [SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by default [spark]

2024-04-30 Thread via GitHub
dongjoon-hyun commented on PR #46308: URL: https://github.com/apache/spark/pull/46308#issuecomment-2087477405 Thank you, @huaxingao !

Re: [PR] [SPARK-47578][CORE] Spark Core: Migrate logWarning with variables to structured logging framework [spark]

2024-04-30 Thread via GitHub
gengliangwang commented on code in PR #46309: URL: https://github.com/apache/spark/pull/46309#discussion_r1585591097 ## core/src/main/scala/org/apache/spark/executor/Executor.scala: ## @@ -638,10 +638,12 @@ private[spark] class Executor( val freedMemory =
