[jira] [Assigned] (SPARK-49667) Spark expressions that use StringSearch do not behave properly with CS_AI collators

2024-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49667:
---

Assignee: Vladan Vasić

> Spark expressions that use StringSearch do not behave properly with CS_AI 
> collators
> ---
>
> Key: SPARK-49667
> URL: https://issues.apache.org/jira/browse/SPARK-49667
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.2
>Reporter: Vladan Vasić
>Assignee: Vladan Vasić
>Priority: Major
>  Labels: pull-request-available
>
> Expressions that are implemented using StringSearch are not behaving 
> correctly when used with CS_AI collators.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49667) Spark expressions that use StringSearch do not behave properly with CS_AI collators

2024-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49667.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48121
[https://github.com/apache/spark/pull/48121]

> Spark expressions that use StringSearch do not behave properly with CS_AI 
> collators
> ---
>
> Key: SPARK-49667
> URL: https://issues.apache.org/jira/browse/SPARK-49667
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.2
>Reporter: Vladan Vasić
>Assignee: Vladan Vasić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Expressions that are implemented using StringSearch are not behaving 
> correctly when used with CS_AI collators.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49667][SQL] Disallowed CS_AI collators with expressions that use StringSearch

2024-09-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a060c236d314 [SPARK-49667][SQL] Disallowed CS_AI collators with 
expressions that use StringSearch
a060c236d314 is described below

commit a060c236d314bd2facc73ad26926b59401e5f7aa
Author: Vladan Vasić 
AuthorDate: Thu Sep 19 14:25:53 2024 +0200

[SPARK-49667][SQL] Disallowed CS_AI collators with expressions that use 
StringSearch

### What changes were proposed in this pull request?

In this PR, I propose to disallow `CS_AI` collated strings in expressions 
that use `StringSearch` in their implementation. These expressions are `trim`, 
`startswith`, `endswith`, `locate`, `instr`, `str_to_map`, `contains`, 
`replace`, `split_part` and `substring_index`.

Currently, these expressions accept all possible collations; however, they do 
not work properly with `CS_AI` collators, because ICU's `StringSearch` class, 
which is used to implement these expressions, has no support for `CS_AI` search. 
As a result, the expressions misbehave when used with `CS_AI` collators 
(e.g. `startswith('hOtEl' collate unicode_ai, 'Hotel' collate unicode_ai)` 
currently returns `true`).
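
A minimal SQL sketch of the behavior change (the exact error raised for the now-disallowed case is an assumption here, not quoted from the PR):

```sql
-- Before this change: the CS_AI collation was accepted, but the result was wrong.
SELECT startswith('hOtEl' COLLATE UNICODE_AI, 'Hotel' COLLATE UNICODE_AI);
-- returned true, even though UNICODE_AI is case-sensitive

-- After this change: StringSearch-based expressions reject CS_AI collations,
-- so the query above fails analysis instead of returning an incorrect result.
```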

### Why are the changes needed?

Proposed changes are necessary in order to achieve correct behavior of the 
expressions mentioned above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This patch was tested by adding a test in the `CollationSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48121 from 
vladanvasi-db/vladanvasi-db/cs-ai-collations-expressions-disablement.

Authored-by: Vladan Vasić 
    Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/CollationFactory.java  |  12 +
 .../sql/internal/types/AbstractStringType.scala|   9 +
 .../org/apache/spark/sql/types/StringType.scala|   3 +
 .../catalyst/expressions/complexTypeCreator.scala  |   4 +-
 .../catalyst/expressions/stringExpressions.scala   |  33 +-
 .../sql-tests/analyzer-results/collations.sql.out  | 336 +++
 .../test/resources/sql-tests/inputs/collations.sql |  14 +
 .../resources/sql-tests/results/collations.sql.out | 364 +
 .../spark/sql/CollationSQLExpressionsSuite.scala   |  24 ++
 .../sql/CollationStringExpressionsSuite.scala  | 251 ++
 10 files changed, 1041 insertions(+), 9 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 87558971042e..d5dbca7eb89b 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -921,6 +921,18 @@ public final class CollationFactory {
 return Collation.CollationSpec.collationNameToId(collationName);
   }
 
+  /**
+   * Returns whether the ICU collation is Case Sensitive and Accent Insensitive (CS_AI)
+   * for the given collation id.
+   * This method is used in expressions which do not support CS_AI collations.
+   */
+  public static boolean isCaseSensitiveAndAccentInsensitive(int collationId) {
+    return Collation.CollationSpecICU.fromCollationId(collationId).caseSensitivity ==
+      Collation.CollationSpecICU.CaseSensitivity.CS &&
+      Collation.CollationSpecICU.fromCollationId(collationId).accentSensitivity ==
+      Collation.CollationSpecICU.AccentSensitivity.AI;
+  }
+
   public static void assertValidProvider(String provider) throws 
SparkException {
 if (!SUPPORTED_PROVIDERS.contains(provider.toLowerCase())) {
   Map params = Map.of(
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
index 05d1701eff74..dc4ee013fd18 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
@@ -51,3 +51,12 @@ case object StringTypeBinaryLcase extends AbstractStringType {
 case object StringTypeAnyCollation extends AbstractStringType {
   override private[sql] def acceptsType(other: DataType): Boolean = 
other.isInstanceOf[StringType]
 }
+
+/**
+ * Use StringTypeNonCSAICollation for expressions supporting all possible collation types except
+ * CS_AI collation types.
+ */
+case object StringTypeNonCSAICollation extends AbstractStringType {
+  override private[sql] def acceptsType(other: DataType): 

(spark) branch master updated: [SPARK-48280][SQL][FOLLOW-UP] Add expressions that are built via expressionBuilder to Expression Walker

2024-09-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ac34f1de92c6 [SPARK-48280][SQL][FOLLOW-UP] Add expressions that are 
built via expressionBuilder to Expression Walker
ac34f1de92c6 is described below

commit ac34f1de92c6f5cb53d799f00e550a0a204d9eb2
Author: Mihailo Milosevic 
AuthorDate: Thu Sep 19 11:56:10 2024 +0200

[SPARK-48280][SQL][FOLLOW-UP] Add expressions that are built via 
expressionBuilder to Expression Walker

### What changes were proposed in this pull request?
This PR adds new expressions to the Expression Walker. It also improves the 
descriptions of methods in the suite.

### Why are the changes needed?
It was noticed while debugging that startsWith, endsWith and contains are 
not tested by this suite, even though these expressions represent the core of 
collation testing.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test only.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48162 from mihailom-db/expressionwalkerfollowup.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/CollationExpressionWalkerSuite.scala | 148 +
 1 file changed, 121 insertions(+), 27 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
index 2342722c0bb1..1d23774a5169 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql
 import java.sql.Timestamp
 
 import org.apache.spark.{SparkFunSuite, SparkRuntimeException}
+import org.apache.spark.sql.catalyst.analysis.ExpressionBuilder
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.variant.ParseJson
 import org.apache.spark.sql.internal.SqlApiConf
@@ -46,7 +47,7 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
*
* @param inputEntry - List of all input entries that need to be generated
* @param collationType - Flag defining collation type to use
-   * @return
+   * @return - List of data generated for expression instance creation
*/
   def generateData(
   inputEntry: Seq[Any],
@@ -54,23 +55,11 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
 inputEntry.map(generateSingleEntry(_, collationType))
   }
 
-  /**
-   * Helper function to generate single entry of data as a string.
-   * @param inputEntry - Single input entry that requires generation
-   * @param collationType - Flag defining collation type to use
-   * @return
-   */
-  def generateDataAsStrings(
-  inputEntry: Seq[AbstractDataType],
-  collationType: CollationType): Seq[Any] = {
-inputEntry.map(generateInputAsString(_, collationType))
-  }
-
   /**
* Helper function to generate single entry of data.
* @param inputEntry - Single input entry that requires generation
* @param collationType - Flag defining collation type to use
-   * @return
+   * @return - Single input entry data
*/
   def generateSingleEntry(
   inputEntry: Any,
@@ -100,7 +89,7 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
*
* @param inputType- Single input literal type that requires generation
* @param collationType - Flag defining collation type to use
-   * @return
+   * @return - Literal/Expression containing expression ready for evaluation
*/
   def generateLiterals(
   inputType: AbstractDataType,
@@ -116,6 +105,7 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
   }
   case BooleanType => Literal(true)
   case _: DatetimeType => Literal(Timestamp.valueOf("2009-07-30 12:58:59"))
+  case DecimalType => Literal((new Decimal).set(5))
   case _: DecimalType => Literal((new Decimal).set(5))
   case _: DoubleType => Literal(5.0)
   case IntegerType | NumericType | IntegralType => Literal(5)
@@ -158,11 +148,15 @@ class CollationExpressionWalkerSuite extends 
SparkFunSuite with SharedSparkSessi
   case MapType =>
 val key = generateLiterals(StringTypeAnyCollation, collationType)
 val value = generateLiterals(StringTypeAnyCollation, collationType)
-Literal.create(Map(key -> value))
+CreateMap(Seq(key, value))
   case MapType(keyType, valueType, _) =>
 val key = generateLiterals(keyType, collationType)
 val value = generateLiterals(valueType, colla

[jira] [Assigned] (SPARK-48782) Add support for loading stored procedures in catalogs

2024-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48782:
---

Assignee: Anton Okolnychyi

> Add support for loading stored procedures in catalogs
> -
>
> Key: SPARK-48782
> URL: https://issues.apache.org/jira/browse/SPARK-48782
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>  Labels: pull-request-available
>
> Add support for loading stored procedures in catalogs by providing parser 
> extensions as well as new logical and physical plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48782][SQL] Add support for executing procedures in catalogs

2024-09-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 492d1b14c0d1 [SPARK-48782][SQL] Add support for executing procedures 
in catalogs
492d1b14c0d1 is described below

commit 492d1b14c0d19fa89b9ce9c0e48fc0e4c120b70c
Author: Anton Okolnychyi 
AuthorDate: Thu Sep 19 11:09:40 2024 +0200

[SPARK-48782][SQL] Add support for executing procedures in catalogs

### What changes were proposed in this pull request?

This PR adds support for executing procedures in catalogs.

### Why are the changes needed?

These changes are needed per the [discussed and 
voted](https://lists.apache.org/thread/w586jr53fxwk4pt9m94b413xyjr1v25m) SPIP 
tracked in [SPARK-44167](https://issues.apache.org/jira/browse/SPARK-44167).

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds CALL commands.
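
As a rough illustration (the catalog and procedure names below are hypothetical; the exact argument syntax is best checked against the `ProcedureSuite` added by this PR):

```sql
-- Invoke a stored procedure exposed by a V2 catalog.
CALL cat.sys.refresh_table('db.tbl');
```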

### How was this patch tested?

This PR comes with tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47943 from aokolnychyi/spark-48782.

Authored-by: Anton Okolnychyi 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |   6 +
 docs/sql-ref-ansi-compliance.md|   1 +
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   1 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   5 +
 .../catalog/procedures/ProcedureParameter.java |   5 +
 .../catalog/procedures/UnboundProcedure.java   |   6 +
 .../spark/sql/catalyst/analysis/Analyzer.scala |  65 +-
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |   1 +
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   8 +
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  16 +
 .../spark/sql/catalyst/analysis/package.scala  |   6 +-
 .../sql/catalyst/analysis/v2ResolutionPlans.scala  |  17 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |  22 +
 .../plans/logical/ExecutableDuringAnalysis.scala   |  28 +
 .../plans/logical/FunctionBuilderBase.scala|  36 +-
 .../sql/catalyst/plans/logical/MultiResult.scala   |  30 +
 .../sql/catalyst/plans/logical/v2Commands.scala|  67 ++-
 .../sql/catalyst/rules/RuleIdCollection.scala  |   1 +
 .../spark/sql/catalyst/trees/TreePatterns.scala|   1 +
 .../sql/connector/catalog/CatalogV2Implicits.scala |   7 +
 .../spark/sql/errors/QueryCompilationErrors.scala  |   7 +
 .../sql/connector/catalog/InMemoryCatalog.scala|  19 +-
 .../sql/catalyst/analysis/InvokeProcedures.scala   |  71 +++
 .../spark/sql/execution/MultiResultExec.scala  |  36 ++
 .../spark/sql/execution/SparkStrategies.scala  |   2 +
 .../spark/sql/execution/command/commands.scala |  11 +-
 .../datasources/v2/DataSourceV2Strategy.scala  |   6 +-
 .../datasources/v2/ExplainOnlySparkPlan.scala  |  38 ++
 .../sql/internal/BaseSessionStateBuilder.scala |   3 +-
 .../sql-tests/results/ansi/keywords.sql.out|   2 +
 .../resources/sql-tests/results/keywords.sql.out   |   1 +
 .../spark/sql/connector/ProcedureSuite.scala   | 654 +
 .../ThriftServerWithSparkContextSuite.scala|   2 +-
 .../spark/sql/hive/HiveSessionStateBuilder.scala   |   3 +-
 34 files changed, 1162 insertions(+), 22 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 6463cc2c12da..72985de6631f 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -1456,6 +1456,12 @@
 ],
 "sqlState" : "2203G"
   },
+  "FAILED_TO_LOAD_ROUTINE" : {
+"message" : [
+  "Failed to load routine ."
+],
+"sqlState" : "38000"
+  },
   "FAILED_TO_PARSE_TOO_COMPLEX" : {
 "message" : [
   "The statement, including potential SQL functions and referenced views, 
was too complex to parse.",
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index fff6906457f7..12dff1e325c4 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -426,6 +426,7 @@ Below is a list of all the keywords in Spark SQL.
 |BY|non-reserved|non-reserved|reserved|
 |BYTE|non-reserved|non-reserved|non-reserved|
 |CACHE|non-reserved|non-reserved|non-reserved|
+|CALL|reserved|non-reserved|reserved|
 |CALLED|non-reserved|non-reserved|non-reserved|
 |CASCADE|non-reserved|non-reserved|non-reserved|
 |CASE|reserved|non-reserved|reserved|
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index e

[jira] [Resolved] (SPARK-48782) Add support for loading stored procedures in catalogs

2024-09-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48782.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47943
[https://github.com/apache/spark/pull/47943]

> Add support for loading stored procedures in catalogs
> -
>
> Key: SPARK-48782
> URL: https://issues.apache.org/jira/browse/SPARK-48782
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for loading stored procedures in catalogs by providing parser 
> extensions as well as new logical and physical plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Spark 4.0.0-preview2 (RC1)

2024-09-18 Thread Wenchen Fan
+1

On Wed, Sep 18, 2024 at 1:21 AM John Zhuge  wrote:

> +1 non-binding
>
> John Zhuge
>
>
> On Mon, Sep 16, 2024 at 11:07 PM Xinrong Meng  wrote:
>
>> +1
>>
>> Thank you @Dongjoon Hyun  !
>>
>> On Tue, Sep 17, 2024 at 11:31 AM huaxin gao 
>> wrote:
>>
>>> +1
>>>
>>> On Mon, Sep 16, 2024 at 6:20 PM L. C. Hsieh  wrote:
>>>
 +1

 On Mon, Sep 16, 2024 at 5:56 PM Dongjoon Hyun 
 wrote:
 >
 > +1
 >
 > Dongjoon
 >
 > On Mon, Sep 16, 2024 at 10:57 AM Holden Karau 
 wrote:
 >>
 >> +1
 >>
 >> Twitter: https://twitter.com/holdenkarau
 >> Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9
 >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
 >> Pronouns: she/her
 >>
 >>
 >> On Mon, Sep 16, 2024 at 10:55 AM Zhou Jiang 
 wrote:
 >>>
 >>> + 1
 >>> Sent from my iPhone
 >>>
 >>> On Sep 16, 2024, at 01:04, Dongjoon Hyun 
 wrote:
 >>>
 >>> 
 >>>
 >>> Please vote on releasing the following candidate as Apache Spark
 version 4.0.0-preview2.
 >>>
 >>> The vote is open until September 20th 1AM (PDT) and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >>>
 >>> [ ] +1 Release this package as Apache Spark 4.0.0-preview2
 >>> [ ] -1 Do not release this package because ...
 >>>
 >>> To learn more about Apache Spark, please see
 https://spark.apache.org/
 >>>
 >>> The tag to be voted on is v4.0.0-preview2-rc1 (commit
 f0d465e09b8d89d5e56ec21f4bd7e3ecbeeb318a)
 >>> https://github.com/apache/spark/tree/v4.0.0-preview2-rc1
 >>>
 >>> The release files, including signatures, digests, etc. can be found
 at:
 >>>
 https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-bin/
 >>>
 >>> Signatures used for Spark RCs can be found in this file:
 >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
 >>>
 >>> The staging repository for this release can be found at:
 >>>
 https://repository.apache.org/content/repositories/orgapachespark-1468/
 >>>
 >>> The documentation corresponding to this release can be found at:
 >>>
 https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-docs/
 >>>
 >>> The list of bug fixes going into 4.0.0-preview2 can be found at the
 following URL:
 >>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
 >>>
 >>> This release is using the release script of the tag
 v4.0.0-preview2-rc1.
 >>>
 >>> FAQ
 >>>
 >>> =
 >>> How can I help test this release?
 >>> =
 >>>
 >>> If you are a Spark user, you can help us test this release by taking
 >>> an existing Spark workload and running on this release candidate,
 then
 >>> reporting any regressions.
 >>>
 >>> If you're working in PySpark you can set up a virtual env and
 install
 >>> the current RC and see if anything important breaks, in the
 Java/Scala
 >>> you can add the staging repository to your projects resolvers and
 test
 >>> with the RC (make sure to clean up the artifact cache before/after
 so
 >>> you don't end up building with a out of date RC going forward).

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




[jira] [Resolved] (SPARK-49611) Introduce TVF all_collations()

2024-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49611.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48087
[https://github.com/apache/spark/pull/48087]

> Introduce TVF all_collations()
> --
>
> Key: SPARK-49611
> URL: https://issues.apache.org/jira/browse/SPARK-49611
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49611][SQL] Introduce TVF `collations()` & remove the `SHOW COLLATIONS` command

2024-09-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2113f109b8d7 [SPARK-49611][SQL] Introduce TVF `collations()` & remove 
the `SHOW COLLATIONS` command
2113f109b8d7 is described below

commit 2113f109b8d73cb8deb404664f25bd51308ca809
Author: panbingkun 
AuthorDate: Mon Sep 16 16:33:44 2024 +0800

[SPARK-49611][SQL] Introduce TVF `collations()` & remove the `SHOW 
COLLATIONS` command

### What changes were proposed in this pull request?
This PR aims to:
- introduce `TVF` `collations()`.
- remove the `SHOW COLLATIONS` command.

### Why are the changes needed?
Based on cloud-fan's suggestion: 
https://github.com/apache/spark/pull/47364#issuecomment-2345183501
I believe that after this change we can build many things on top of it, such as 
`filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, e.g.:
```sql
SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%';
```
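
Since `collations()` is an ordinary table-valued function, it composes with regular SQL as well; a hedged sketch that assumes only the `LANGUAGE` column mentioned above:

```sql
SELECT LANGUAGE, count(*) AS num_collations FROM collations() GROUP BY LANGUAGE;
```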

### Does this PR introduce _any_ user-facing change?
Yes, it provides a new TVF `collations()` for end-users.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48087 from panbingkun/SPARK-49611.

Lead-authored-by: panbingkun 
Co-authored-by: panbingkun 
    Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-ansi-compliance.md|  1 -
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |  1 -
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  2 -
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../sql/catalyst/catalog/SessionCatalog.scala  | 15 +---
 .../sql/catalyst/expressions/generators.scala  | 44 +++-
 .../resources/ansi-sql-2016-reserved-keywords.txt  |  1 -
 .../spark/sql/execution/SparkSqlParser.scala   | 12 
 .../execution/command/ShowCollationsCommand.scala  | 62 -
 .../sql-tests/results/ansi/keywords.sql.out|  2 -
 .../resources/sql-tests/results/keywords.sql.out   |  1 -
 .../org/apache/spark/sql/CollationSuite.scala  | 79 +++---
 .../ThriftServerWithSparkContextSuite.scala|  2 +-
 13 files changed, 101 insertions(+), 122 deletions(-)

diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 7987e5eb6012..fff6906457f7 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -442,7 +442,6 @@ Below is a list of all the keywords in Spark SQL.
 |CODEGEN|non-reserved|non-reserved|non-reserved|
 |COLLATE|reserved|non-reserved|reserved|
 |COLLATION|reserved|non-reserved|reserved|
-|COLLATIONS|reserved|non-reserved|reserved|
 |COLLECTION|non-reserved|non-reserved|non-reserved|
 |COLUMN|reserved|non-reserved|reserved|
 |COLUMNS|non-reserved|non-reserved|non-reserved|
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index c82ee57a2517..e704f9f58b96 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
@@ -162,7 +162,6 @@ CLUSTERED: 'CLUSTERED';
 CODEGEN: 'CODEGEN';
 COLLATE: 'COLLATE';
 COLLATION: 'COLLATION';
-COLLATIONS: 'COLLATIONS';
 COLLECTION: 'COLLECTION';
 COLUMN: 'COLUMN';
 COLUMNS: 'COLUMNS';
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 1840b6887841..f13dde773496 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -268,7 +268,6 @@ statement
 | SHOW PARTITIONS identifierReference partitionSpec?   
#showPartitions
 | SHOW identifier? FUNCTIONS ((FROM | IN) ns=identifierReference)?
 (LIKE? (legacy=multipartIdentifier | pattern=stringLit))?  
#showFunctions
-| SHOW COLLATIONS (LIKE? pattern=stringLit)?   
#showCollations
 | SHOW CREATE TABLE identifierReference (AS SERDE)?
#showCreateTable
 | SHOW CURRENT namespace   
#showCurrentNamespace
 | SHOW CATALOGS (LIKE? pattern=stringLit)?
#showCatalogs
@@ -1868,7 +1867,6 @@ nonReserved
 | CODEGEN
 | COLLATE
 | COLLATION
-| COLLATIONS
 | COLLECTION
 | COLUMN
 | COLUMNS
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.sc

[jira] [Resolved] (SPARK-49646) fix subquery decorrelation for union / set operations when parentOuterReferences has references not covered in collectedChildOuterReferences

2024-09-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49646.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48109
[https://github.com/apache/spark/pull/48109]

> fix subquery decorrelation for union / set operations when 
> parentOuterReferences has references not covered in 
> collectedChildOuterReferences
> 
>
> Key: SPARK-49646
> URL: https://issues.apache.org/jira/browse/SPARK-49646
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Avery Qi
>Assignee: Avery Qi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark currently cannot handle queries like:
> ```
> create table IF NOT EXISTS t(t1 INT,t2 int) using json;
> CREATE TABLE IF NOT EXISTS a (a1 INT) using json;
> select 1
> from t as t_outer
> left join
>    lateral(
>    select b1,b2
>    from
>    (
>    select
>    a.a1 as b1,
>    1 as b2
>    from a
>    union
>    select t_outer.t1 as b1,
>   null as b2
>    ) as t_inner
>    where (t_inner.b1 < t_outer.t2  or t_inner.b1 is null) and  t_inner.b1 
> = t_outer.t1
>    order by t_inner.b1,t_inner.b2 desc limit 1
>    ) as lateral_table
> ```
> And the stack error trace is:
> org.apache.spark.SparkException:   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:97)  at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:101)  at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:447)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.mapChildren(basicLogicalOperators.scala:87)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$5(DecorrelateInnerQuery.scala:453)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)  
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)  
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)  at 
> scala.collection.TraversableLike.map(TraversableLike.scala:286)  at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:279)  at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108)  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:744)  
> at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:451)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate.mapChildren(basicLogicalOperators.scala:1470)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Filter.mapChildren(basicLogicalOperators.scala:344)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
> ...
>  
> {color:#172b4d}See

[jira] [Assigned] (SPARK-49646) fix subquery decorrelation for union / set operations when parentOuterReferences has references not covered in collectedChildOuterReferences

2024-09-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49646:
---

Assignee: Avery Qi

> fix subquery decorrelation for union / set operations when 
> parentOuterReferences has references not covered in 
> collectedChildOuterReferences
> 
>
> Key: SPARK-49646
> URL: https://issues.apache.org/jira/browse/SPARK-49646
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Avery Qi
>Assignee: Avery Qi
>Priority: Major
>  Labels: pull-request-available
>
> Spark currently cannot handle queries like:
> ```
> create table IF NOT EXISTS t(t1 INT,t2 int) using json;
> CREATE TABLE IF NOT EXISTS a (a1 INT) using json;
> select 1
> from t as t_outer
> left join
>    lateral(
>    select b1,b2
>    from
>    (
>    select
>    a.a1 as b1,
>    1 as b2
>    from a
>    union
>    select t_outer.t1 as b1,
>   null as b2
>    ) as t_inner
>    where (t_inner.b1 < t_outer.t2  or t_inner.b1 is null) and  t_inner.b1 
> = t_outer.t1
>    order by t_inner.b1,t_inner.b2 desc limit 1
>    ) as lateral_table
> ```
> And the stack error trace is:
> org.apache.spark.SparkException:   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:97)  at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:101)  at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:447)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.mapChildren(basicLogicalOperators.scala:87)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$5(DecorrelateInnerQuery.scala:453)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)  
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)  
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)  at 
> scala.collection.TraversableLike.map(TraversableLike.scala:286)  at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:279)  at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108)  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:744)  
> at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:451)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate.mapChildren(basicLogicalOperators.scala:1470)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1307)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Filter.mapChildren(basicLogicalOperators.scala:344)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.rewriteDomainJoins(DecorrelateInnerQuery.scala:463)
>   at 
> org.apache.spark.sql.catalyst.optimizer.DecorrelateInnerQuery$.$anonfun$rewriteDomainJoins$7(DecorrelateInnerQuery.scala:463)
> ...
>  
> {color:#172b4d}See this investigation doc for more context: {color}
> [https://docs.google.com/document/d/1HtBDPKVD6pgGntTXdPVX27xH7PdcKTYNyQJLnwr7T-U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49646][SQL] fix subquery decorrelation for union/set operations when parentOuterReferences has references not covered in collectedChildOuterReferences

2024-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 738db079c0b6 [SPARK-49646][SQL] fix subquery decorrelation for 
union/set operations when parentOuterReferences has references not covered in 
collectedChildOuterReferences
738db079c0b6 is described below

commit 738db079c0b65e8305b7a1349923ee017316f691
Author: Avery Qi 
AuthorDate: Mon Sep 16 11:37:37 2024 +0800

[SPARK-49646][SQL] fix subquery decorrelation for union/set operations when 
parentOuterReferences has references not covered in 
collectedChildOuterReferences

### What changes were proposed in this pull request?
Fixes a bug hit when a union/set operation appears under a limit/aggregation 
whose filter predicates cannot be pulled up directly in a lateral join, e.g.:
```
create table IF NOT EXISTS t(t1 INT,t2 int) using json;
CREATE TABLE IF NOT EXISTS a (a1 INT) using json;

select 1
from t as t_outer
left join
   lateral(
   select b1,b2
   from
   (
   select
   a.a1 as b1,
   1 as b2
   from a
   union
   select t_outer.t1 as b1,
  null as b2
   ) as t_inner
   where (t_inner.b1 < t_outer.t2  or t_inner.b1 is null) and  
t_inner.b1 = t_outer.t1
   order by t_inner.b1,t_inner.b2 desc limit 1
   ) as lateral_table
```

### Why are the changes needed?
In general, Spark cannot handle this query because:
1. The decorrelation logic tries to rewrite the limit operator into a Window 
aggregation and pull up correlated predicates, and the Union operator is rewritten 
to have a DomainJoin with outer references within its children.
2. When rewriting the DomainJoin into a real join execution, we need an attribute 
reference map based on the pulled-up correlated predicates to rewrite the outer 
references in the DomainJoin. However, each child of a Union/SetOp operator uses 
different attribute references even when they refer to the same column of the 
outer table, so we need the Union/SetOp output and its children's output to map 
between these references.
3. Combined with aggregation and filters with inequality comparisons, more outer 
references remain within the children of the Union operator, and these references 
are not covered by the Union/SetOp output, which leaves us without the information 
needed to map the different attribute references within the children of the 
Union/SetOp operator.

More context -> please read this short investigation doc(I've changed the 
link and it's now public):

https://docs.google.com/document/d/1_pJIi_8GuLHOXabLEgRy2e7OHw-OIBnWbwGwSkwIcxg/edit?usp=sharing

### Does this PR introduce _any_ user-facing change?

Yes, the bug is fixed and the above query is now handled without error.

### How was this patch tested?

Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48109 from averyqi-db/averyqi-db/SPARK-49646.

Authored-by: Avery Qi 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  2 +-
 .../analyzer-results/join-lateral.sql.out  | 47 ++
 .../resources/sql-tests/inputs/join-lateral.sql| 21 ++
 .../sql-tests/results/join-lateral.sql.out | 27 +
 4 files changed, 96 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
index 424f4b96271d..6c0d7189862d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
@@ -1064,7 +1064,7 @@ object DecorrelateInnerQuery extends PredicateHelper {
         // Project, they could get added at the beginning or the end of the output columns
         // depending on the child plan.
         // The inner expressions for the domain are the values of newOuterReferenceMap.
-        val domainProjections = collectedChildOuterReferences.map(newOuterReferenceMap(_))
+        val domainProjections = newOuterReferences.map(newOuterReferenceMap(_))
         val newChild = Project(child.output ++ domainProjections, decorrelatedChild)
         (newChild, newJoinCond, newOuterReferenceMap)
       }
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/join-lateral.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/join-lateral.sql.out
index e81ee769f57d..5bf893605423 10064

[jira] [Resolved] (SPARK-48824) Add SQL syntax in create/replace table to create an identity column

2024-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48824.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47614
[https://github.com/apache/spark/pull/47614]

> Add SQL syntax in create/replace table to create an identity column
> ---
>
> Key: SPARK-48824
> URL: https://issues.apache.org/jira/browse/SPARK-48824
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Carmen Kwan
>Assignee: Carmen Kwan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add SQL support for creating identity columns. Identity Column syntax should 
> be flexible such that users can specify 
>  * whether identity values are always generated by the system
>  * (optionally) the starting value of the column 
>  * (optionally) the increment/step of the column
> The SQL syntax support should also allow flexible ordering of the increment 
> and starting values, as both variants are used in the wild by other systems 
> (e.g. 
> [PostgreSQL|https://www.postgresql.org/docs/current/sql-createsequence.html] 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/CREATE-SEQUENCE.html#GUID-E9C78A8C-615A-4757-B2A8-5E6EFB130571]).
>  That is, we should allow both
> {code:java}
> START WITH  INCREMENT BY {code}
> and 
> {code:java}
> INCREMENT BY  START WITH {code}
> .
> For example, we should be able to define
> {code:java}
> CREATE TABLE default.example (
>   id LONG GENERATED ALWAYS AS IDENTITY,
>   id2 LONG GENERATED BY DEFAULT START WITH 0 INCREMENT BY -10,
>   id3 LONG GENERATED ALWAYS AS IDENTITY INCREMENT BY 2 START WITH -8,
>   value LONG
> )
> {code}
> This will enable defining identity columns in Spark SQL for data sources that 
> support it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48824) Add SQL syntax in create/replace table to create an identity column

2024-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48824:
---

Assignee: Carmen Kwan

> Add SQL syntax in create/replace table to create an identity column
> ---
>
> Key: SPARK-48824
> URL: https://issues.apache.org/jira/browse/SPARK-48824
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Carmen Kwan
>Assignee: Carmen Kwan
>Priority: Major
>  Labels: pull-request-available
>
> Add SQL support for creating identity columns. Identity Column syntax should 
> be flexible such that users can specify 
>  * whether identity values are always generated by the system
>  * (optionally) the starting value of the column 
>  * (optionally) the increment/step of the column
> The SQL syntax support should also allow flexible ordering of the increment 
> and starting values, as both variants are used in the wild by other systems 
> (e.g. 
> [PostgreSQL|https://www.postgresql.org/docs/current/sql-createsequence.html] 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/CREATE-SEQUENCE.html#GUID-E9C78A8C-615A-4757-B2A8-5E6EFB130571]).
>  That is, we should allow both
> {code:java}
> START WITH  INCREMENT BY {code}
> and 
> {code:java}
> INCREMENT BY  START WITH {code}
> .
> For example, we should be able to define
> {code:java}
> CREATE TABLE default.example (
>   id LONG GENERATED ALWAYS AS IDENTITY,
>   id2 LONG GENERATED BY DEFAULT START WITH 0 INCREMENT BY -10,
>   id3 LONG GENERATED ALWAYS AS IDENTITY INCREMENT BY 2 START WITH -8,
>   value LONG
> )
> {code}
> This will enable defining identity columns in Spark SQL for data sources that 
> support it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48824][SQL] Add Identity Column SQL syntax

2024-09-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 931ab065df39 [SPARK-48824][SQL] Add Identity Column SQL syntax
931ab065df39 is described below

commit 931ab065df3952487028316ebd49c2895d947bf2
Author: zhipeng.mao 
AuthorDate: Sun Sep 15 13:35:00 2024 +0800

[SPARK-48824][SQL] Add Identity Column SQL syntax

### What changes were proposed in this pull request?

Add SQL support for creating identity columns. Users can specify a column 
`GENERATED ALWAYS AS IDENTITY(identityColumnSpec)`, where identity values are 
**always** generated by the system, or `GENERATED BY DEFAULT AS 
IDENTITY(identityColumnSpec)`, where users can specify the identity values.

Users can optionally specify the starting value of the column (default = 1) 
and the increment/step of the column (default = 1). Also we allow both
`START WITH  INCREMENT BY `
and
`INCREMENT BY  START WITH `

It allows flexible ordering of the increment and starting values, as both 
variants are used in the wild by other systems (e.g. 
[PostgreSQL](https://www.postgresql.org/docs/current/sql-createsequence.html) 
[Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/CREATE-SEQUENCE.html#GUID-E9C78A8C-615A-4757-B2A8-5E6EFB130571)).

For example, we can define

```
CREATE TABLE default.example (
  id LONG GENERATED ALWAYS AS IDENTITY,
  id1 LONG GENERATED ALWAYS AS IDENTITY(),
  id2 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0),
  id3 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2),
  id4 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0 INCREMENT BY -10),
  id5 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2 START WITH -8),
  value LONG
)
```
This will enable defining identity columns in Spark SQL for data sources 
that support it.
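
As a hedged usage sketch (whether the write succeeds depends on the target data source actually supporting identity columns):

```sql
-- Only `value` is supplied; the identity columns of the example table above are
-- generated by the system. GENERATED BY DEFAULT columns (id2, id4) could instead
-- be given explicit values by the writer.
INSERT INTO default.example (value) VALUES (42);
```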

To be more specific this PR

- Adds parser support for GENERATED { BY DEFAULT | ALWAYS } AS IDENTITY in 
create/replace table statements. Identity column specifications are temporarily 
stored in the field's metadata, and are then parsed/verified in 
DataSourceV2Strategy and used to instantiate a v2 `Column`.
- Adds TableCatalog::capabilities() and 
TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS. These will be 
used to determine whether to allow specifying identity columns or whether to 
throw an exception.

### Why are the changes needed?

A SQL API is needed to create Identity Columns.

### Does this PR introduce _any_ user-facing change?

It allows the aforementioned SQL syntax to create identity columns in a 
table.

### How was this patch tested?

Positive and negative unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47614 from zhipengmao-db/zhipengmao-db/SPARK-48824-id-syntax.

Authored-by: zhipeng.mao 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |  24 +++
 docs/sql-ref-ansi-compliance.md|   2 +
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   2 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  21 +-
 .../sql/connector/catalog/IdentityColumnSpec.java  |  88 +
 .../spark/sql/errors/QueryParsingErrors.scala  |  19 ++
 .../apache/spark/sql/connector/catalog/Column.java |  24 ++-
 .../connector/catalog/TableCatalogCapability.java  |  20 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |  66 ++-
 .../catalyst/plans/logical/ColumnDefinition.scala  |  68 +--
 .../spark/sql/catalyst/util/IdentityColumn.scala   |  78 
 .../sql/connector/catalog/CatalogV2Util.scala  |  47 +++--
 .../spark/sql/internal/connector/ColumnImpl.scala  |   3 +-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 213 -
 .../connector/catalog/InMemoryTableCatalog.scala   |   3 +-
 .../execution/datasources/DataSourceStrategy.scala |   7 +-
 .../datasources/v2/DataSourceV2Strategy.scala  |   5 +-
 .../sql-tests/results/ansi/keywords.sql.out|   2 +
 .../resources/sql-tests/results/keywords.sql.out   |   2 +
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  58 ++
 .../spark/sql/execution/command/DDLSuite.scala |  11 ++
 .../ThriftServerWithSparkContextSuite.scala|   2 +-
 22 files changed, 724 insertions(+), 41 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index a6d8550716b9..38472f44fb59 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -1589,6 +1589,30 @@
 ],
 "sqlState

[jira] [Resolved] (SPARK-49556) SELECT operator

2024-09-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49556.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48047
[https://github.com/apache/spark/pull/48047]

> SELECT operator
> ---
>
> Key: SPARK-49556
> URL: https://issues.apache.org/jira/browse/SPARK-49556
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49556][SQL] Add SQL pipe syntax for the SELECT operator

2024-09-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 017b0ea71e03 [SPARK-49556][SQL] Add SQL pipe syntax for the SELECT 
operator
017b0ea71e03 is described below

commit 017b0ea71e03339336b5d199ecad4f50961e4948
Author: Daniel Tenedorio 
AuthorDate: Sat Sep 14 12:16:35 2024 +0800

[SPARK-49556][SQL] Add SQL pipe syntax for the SELECT operator

### What changes were proposed in this pull request?

This PR adds SQL pipe syntax support for the SELECT operator.

For example:

```
CREATE TABLE t(x INT, y STRING) USING CSV;
INSERT INTO t VALUES (0, 'abc'), (1, 'def');

TABLE t
|> SELECT x, y

0   abc
1   def

TABLE t
|> SELECT x, y
|> SELECT x + LENGTH(y) AS z

3
4

(SELECT * FROM t UNION ALL SELECT * FROM t)
|> SELECT x + LENGTH(y) AS result

3
3
4
4

TABLE t
|> SELECT sum(x) AS result

Error: aggregate functions are not allowed in the pipe operator |> SELECT 
clause; please use the |> AGGREGATE clause instead
```

### Why are the changes needed?

The SQL pipe operator syntax will let users compose queries in a more 
flexible fashion.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds a few unit test cases, but mostly relies on golden file test 
coverage. I did this to make sure the answers are correct as this feature is 
implemented and also so we can look at the analyzer output plans to ensure they 
look right as well.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48047 from dtenedor/pipe-select.

Authored-by: Daniel Tenedorio 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |   6 +
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   1 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   5 +
 .../sql/catalyst/expressions/PipeSelect.scala  |  47 +++
 .../spark/sql/catalyst/parser/AstBuilder.scala |  58 +++-
 .../spark/sql/catalyst/trees/TreePatterns.scala|   1 +
 .../spark/sql/errors/QueryCompilationErrors.scala  |   8 +
 .../org/apache/spark/sql/internal/SQLConf.scala|   9 +
 .../analyzer-results/pipe-operators.sql.out| 318 +
 .../resources/sql-tests/inputs/pipe-operators.sql  | 102 +++
 .../sql-tests/results/pipe-operators.sql.out   | 308 
 .../spark/sql/execution/SparkSqlParserSuite.scala  |  19 +-
 .../thriftserver/ThriftServerQueryTestSuite.scala  |   3 +-
 13 files changed, 876 insertions(+), 9 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 0a9dcd52ea83..a6d8550716b9 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -3754,6 +3754,12 @@
 ],
 "sqlState" : "42K03"
   },
+  "PIPE_OPERATOR_SELECT_CONTAINS_AGGREGATE_FUNCTION" : {
+"message" : [
+  "Aggregate function  is not allowed when using the pipe operator 
|> SELECT clause; please use the pipe operator |> AGGREGATE clause instead"
+],
+"sqlState" : "0A000"
+  },
   "PIVOT_VALUE_DATA_TYPE_MISMATCH" : {
 "message" : [
   "Invalid pivot value '': value data type  does not 
match pivot column data type ."
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index 9ea213f3bf4a..96a58b99debe 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
@@ -506,6 +506,7 @@ TILDE: '~';
 AMPERSAND: '&';
 PIPE: '|';
 CONCAT_PIPE: '||';
+OPERATOR_PIPE: '|>';
 HAT: '^';
 COLON: ':';
 DOUBLE_COLON: '::';
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 73d5cb55295a..3ea408ca4270 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -613,6 +613,7 @@ queryTerm
 operator=INTERSECT setQuantifier? right=queryTerm  
 

[jira] [Resolved] (SPARK-49488) MySQL dialect supports pushdown datetime functions.

2024-09-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49488.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47951
[https://github.com/apache/spark/pull/47951]

> MySQL dialect supports pushdown datetime functions.
> ---
>
> Key: SPARK-49488
> URL: https://issues.apache.org/jira/browse/SPARK-49488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (5533c81e3453 -> 9fc58aa4c075)

2024-09-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5533c81e3453 [SPARK-48355][SQL] Support for CASE statement
 add 9fc58aa4c075 [SPARK-49488][SQL] MySQL dialect supports pushdown 
datetime functions

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  | 86 +-
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  |  8 +-
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   | 23 +-
 3 files changed, 114 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-49591) Distinguish logical and physical types in variant spec

2024-09-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49591:
---

Assignee: David Cashman

> Distinguish logical and physical types in variant spec
> --
>
> Key: SPARK-49591
> URL: https://issues.apache.org/jira/browse/SPARK-49591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Assignee: David Cashman
>Priority: Major
>  Labels: pull-request-available
>
> We should clarify in the spec what types are considered to be logically 
> equivalent for the purposes of expression behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49591) Distinguish logical and physical types in variant spec

2024-09-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49591.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48064
[https://github.com/apache/spark/pull/48064]

> Distinguish logical and physical types in variant spec
> --
>
> Key: SPARK-49591
> URL: https://issues.apache.org/jira/browse/SPARK-49591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Assignee: David Cashman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We should clarify in the spec what types are considered to be logically 
> equivalent for the purposes of expression behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49591][SQL] Add Logical Type column to variant readme

2024-09-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c5c880e690c3 [SPARK-49591][SQL] Add Logical Type column to variant 
readme
c5c880e690c3 is described below

commit c5c880e690c38b2bb597b7a38f20b32e2e2d272c
Author: cashmand 
AuthorDate: Thu Sep 12 22:35:57 2024 +0800

[SPARK-49591][SQL] Add Logical Type column to variant readme

### What changes were proposed in this pull request?

Add a concept of logical type to the variant README.md, distinct from the 
physical encoding of a value. In particular, decimal and integer values are 
considered to be members of a single "Exact Numeric" type.
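
For illustration, a minimal sketch of the "Exact Numeric" behaviour described above,
assuming an active SparkSession named `spark` on a build with variant support (the
expected output in the comment follows the description and is not re-verified here):

    // parse_json stores 1.10 physically as a decimal, but logically it is the same
    // "Exact Numeric" value as 1.1, so the string cast is expected to drop the trailing zero.
    spark.sql("SELECT CAST(parse_json('1.10') AS STRING) AS s").show()
    // expected: 1.1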

### Why are the changes needed?

This is intended to describe and justify the existing Spark behaviour for 
Variant (e.g. stripping trailing zeros for decimal to string casts), not change 
it. (Although the SchemaOfVariant expression does not strictly follow this 
right now for numeric types, and should be updated to match it.) The motivation 
for introducing a single numeric type that encompasses integer and decimal 
values is to allow more flexibility in storage (particularly once shredding is 
introduced), and provide a [...]

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is a documentation change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48064 from cashmand/cashmand/SPARK-49591.

Authored-by: cashmand 
Signed-off-by: Wenchen Fan 
---
 common/variant/README.md | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/common/variant/README.md b/common/variant/README.md
index a66d708da75b..4ed7c16f5b6e 100644
--- a/common/variant/README.md
+++ b/common/variant/README.md
@@ -333,27 +333,27 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object   | `2` | A collection of (string-key, variant-value) pairs |
 | Array    | `3` | An ordered sequence of variant values |
 
-| Primitive Type  | Type ID | Equivalent Parquet Type   | Binary format |
-|-----------------|---------|---------------------------|---------------|
-| null            | `0`     | any                       | none |
-| boolean (True)  | `1`     | BOOLEAN                   | none |
-| boolean (False) | `2`     | BOOLEAN                   | none |
-| int8            | `3`     | INT(8, signed)            | 1 byte |
-| int16           | `4`     | INT(16, signed)           | 2 byte little-endian |
-| int32           | `5`     | INT(32, signed)           | 4 byte little-endian |
-| int64           | `6`     | INT(64, signed)           | 8 byte little-endian |
-| double          | `7`     | DOUBLE                    | IEEE little-endian |
-| decimal4        | `8`     | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal8        | `9`     | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal16       | `10`    | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| date            | `11`    | DATE                      | 4 byte lit

(spark) branch master updated (c5fd509ad3c0 -> d72e8f9e1263)

2024-09-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c5fd509ad3c0 [SPARK-49085][CONNECT][BUILD][FOLLOWUP] Remove the 
erroneous `type` definition for `spark-protobuf` from 
`sql/connect/server/pom.xml`
 add d72e8f9e1263 [SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations 
performance

No new revisions were added by this update.

Summary of changes:
 .../catalyst/analysis/DeduplicateRelations.scala   |  14 +-
 .../approved-plans-v2_7/q22a.sf100/explain.txt | 154 +-
 .../approved-plans-v2_7/q22a.sf100/simplified.txt  |   2 +-
 .../approved-plans-v2_7/q22a/explain.txt   | 154 +-
 .../approved-plans-v2_7/q22a/simplified.txt|   2 +-
 .../approved-plans-v2_7/q67a.sf100/explain.txt | 326 ++---
 .../approved-plans-v2_7/q67a.sf100/simplified.txt  |   2 +-
 .../approved-plans-v2_7/q67a/explain.txt   | 326 ++---
 .../approved-plans-v2_7/q67a/simplified.txt|   2 +-
 9 files changed, 493 insertions(+), 489 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [VOTE] Document and Feature Preview via GitHub Pages

2024-09-11 Thread Wenchen Fan
+1

On Wed, Sep 11, 2024 at 5:15 PM Martin Grund 
wrote:

> +1
>
> On Wed, Sep 11, 2024 at 9:39 AM Kent Yao  wrote:
>
>> Hi all,
>>
>> Following the discussion[1], I'd like to start the vote for 'Document and
>> Feature Preview via GitHub Pages'
>>
>>
>> Please vote for the next 72 hours:(excluding next weekend)
>>
>>  [ ] +1: Accept the proposal
>>  [ ] +0
>>  [ ]- 1: I don’t think this is a good idea because …
>>
>>
>>
>> Bests,
>> Kent Yao
>>
>> [1] https://lists.apache.org/thread/xojcdlw77pht9bs4mt4087ynq6k9sbqq
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Assigned] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects

2024-09-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49443:
---

Assignee: Harsh Motwani

> Implement to_variant_object expression and make schema_of_variant expressions 
> print OBJECT for Variant Objects
> --
>
> Key: SPARK-49443
> URL: https://issues.apache.org/jira/browse/SPARK-49443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Harsh Motwani
>Assignee: Harsh Motwani
>Priority: Major
>  Labels: pull-request-available
>
> Cast from structs to variant objects should not be legal since variant 
> objects are unordered bags of key-value pairs while structs are ordered sets 
> of elements of fixed types. Therefore, casts between structs and variant 
> objects do not behave like casts between structs. Example (produced by Serge 
> Rielau):
> {code:java}
> scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show()
> ++
> |named_struct(c, 1, b, 2)|
> ++
> |{1, 2}|
> ++
> Passing a struct into VARIANT loses the position
> scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as 
> struct)").show()
> +-+
> |CAST(named_struct(c, 1, b, 2) AS VARIANT)|
> +-+
> |{2, 1}|
> +-+
> {code}
> Casts from maps to variant objects should also not be legal since they 
> represent completely orthogonal data types. Maps can represent a variable 
> number of key value pairs based on just a key and value type in the schema 
> but in objects, the schema (produced by schema_of_variant expressions) will 
> have a type corresponding to each value in the object. Objects can have 
> values of different types while maps cannot and objects can only have string 
> keys while maps can also have complex keys.
> We should therefore prohibit the existing behavior of allowing explicit casts 
> from structs and maps to variants as the variant spec currently only supports 
> an object type which is remotely compatible with structs and maps. We should 
> introduce a new expression that converts schemas containing structs and maps 
> to variants. We will call it `to_variant_object`.
> Also, schema_of_variant and schema_of_variant_agg expressions currently print 
> STRUCT when Variant Objects are observed. We should also correct that to 
> OBJECT.
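
A rough usage sketch of the proposed expression (hypothetical until the change lands;
assumes an active SparkSession named `spark`):

    // Instead of an explicit cast from a struct or map to variant, use the new expression:
    spark.sql("SELECT to_variant_object(named_struct('c', 1, 'b', '2'))").show()
    // schema_of_variant on the result is expected to print OBJECT<...> rather than STRUCT<...>:
    spark.sql("SELECT schema_of_variant(to_variant_object(named_struct('c', 1, 'b', '2')))").show()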



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49501) Catalog createTable API is double-escaping paths

2024-09-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49501.
-
Fix Version/s: 3.5.4
   4.0.0
   Resolution: Fixed

> Catalog createTable API is double-escaping paths
> 
>
> Key: SPARK-49501
> URL: https://issues.apache.org/jira/browse/SPARK-49501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Christos Stavrakakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.4, 4.0.0
>
>
> Creating an external table using {{spark.catalog.createTable}} results in 
> incorrect escaping of special chars in paths.
> Consider the following code:
> {{spark.catalog.createTable("testTable", source = "parquet",
> schema = new StructType().add("id", "int"), description = "", options =
> Map("path" -> "/tmp/test table"))}}
> The above call creates a table that is stored in {{/tmp/test%20table}}
> instead of {{/tmp/test table}}. Note that this behaviour is different
> from the SQL API, e.g. {{create table testTable(id int) using parquet
> location '/tmp/test table'}}
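
A self-contained repro sketch of the report above (assumes an active SparkSession named
`spark`; the DESCRIBE step used to check the location is an illustration, not part of the
original report):

    import org.apache.spark.sql.types.StructType

    spark.catalog.createTable(
      "testTable",
      source = "parquet",
      schema = new StructType().add("id", "int"),
      description = "",
      options = Map("path" -> "/tmp/test table"))

    // Before the fix, the resolved location came back URI-encoded as /tmp/test%20table:
    spark.sql("DESCRIBE TABLE EXTENDED testTable")
      .filter("col_name = 'Location'")
      .show(false)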



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (532aaafec4f1 -> dc3333bcc599)

2024-09-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 532aaafec4f1 [SPARK-49006] Implement purging for 
OperatorStateMetadataV2 and StateSchemaV3 files
 add dc3333bcc599 [SPARK-49501][SQL] Fix double-escaping of table location

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/internal/CatalogImpl.scala | 10 --
 .../scala/org/apache/spark/sql/internal/CatalogSuite.scala |  3 ++-
 2 files changed, 6 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: Apache Spark 4.0.0-preview2 (?)

2024-09-08 Thread Wenchen Fan
+1, thanks Dongjoon!

On Mon, Sep 9, 2024 at 9:44 AM Xinrong Meng  wrote:

> +1
>
> Thank you @Dongjoon Hyun  !
>
> On Sat, Sep 7, 2024 at 8:05 PM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Sat, Sep 7, 2024 at 9:04 AM huaxin gao  wrote:
>>
>>> +1
>>>
>>> On Fri, Sep 6, 2024 at 1:12 PM L. C. Hsieh  wrote:
>>>
 +1

 Thanks Dongjoon.

 On Fri, Sep 6, 2024 at 12:18 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, All.
 >
 > Since the Apache Spark 4.0.0-preview1 tag was created in May, it's
 been over 3 months.
 >
 > https://github.com/apache/spark/releases/tag/v4.0.0-preview1
 (2024-05-28)
 >
 > Almost 1k commits including improvements, refactoring, and bug fixes
 landed at `master` branch.
 >
 > $ git log --oneline master a78ef73..HEAD | wc -l
 >  965
 >
 > According to the progress on SPARK-44111 and related issues, I
 believe we had better release `Preview2` this month in order to get more
 feedback on the recent progress.
 >
 > - https://issues.apache.org/jira/browse/SPARK-44111
 >   (Prepare Apache Spark 4.0.0)
 >
 > WDYT? I'm also volunteering as the release manager of Apache Spark
 4.0.0-preview2.
 >
 > Dongjoon.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




(spark) branch branch-3.5 updated: [SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable should not be changed by falling back to v1 command

2024-09-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 3f22ef172173 [SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable 
should not be changed by falling back to v1 command
3f22ef172173 is described below

commit 3f22ef1721738ebacba8a27854ea8f24e0c6e5b9
Author: Wenchen Fan 
AuthorDate: Mon Sep 9 10:45:14 2024 +0800

[SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable should not be 
changed by falling back to v1 command

This is a followup of https://github.com/apache/spark/pull/47772 . The 
behavior of SaveAsTable should not be changed by switching from the v1 to the v2 command. 
This is similar to https://github.com/apache/spark/pull/47995. For the case of 
`DelegatingCatalogExtension`, we need it to go to V1 commands to be consistent 
with previous behavior.

Behavior regression.

No

UT

No

Closes #48019 from amaliujia/regress_v2.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 37b39b41d07cf8f39dd54cc18342e4d7b8bc71a3)
Signed-off-by: Wenchen Fan 
---
 .../main/scala/org/apache/spark/sql/DataFrameWriter.scala |  6 --
 .../DataSourceV2DataFrameSessionCatalogSuite.scala| 12 ++--
 .../spark/sql/connector/TestV2SessionCatalogBase.scala| 15 +--
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 84f02c723136..8c945ef0dbcb 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -565,8 +565,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
 import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
 val session = df.sparkSession
-val canUseV2 = lookupV2Provider().isDefined ||
-  
df.sparkSession.sessionState.conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).isDefined
+val canUseV2 = lookupV2Provider().isDefined || 
(df.sparkSession.sessionState.conf.getConf(
+SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).isDefined &&
+
!df.sparkSession.sessionState.catalogManager.catalog(CatalogManager.SESSION_CATALOG_NAME)
+  .isInstanceOf[DelegatingCatalogExtension])
 
 session.sessionState.sqlParser.parseMultipartIdentifier(tableName) match {
   case nameParts @ NonSessionCatalogAndIdentifier(catalog, ident) =>
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
index 79fbabbeacaa..9dd20c906535 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
@@ -55,8 +55,7 @@ class DataSourceV2DataFrameSessionCatalogSuite
 "and a same-name temp view exist") {
 withTable("same_name") {
   withTempView("same_name") {
-val format = spark.sessionState.conf.defaultDataSourceName
-sql(s"CREATE TABLE same_name(id LONG) USING $format")
+sql(s"CREATE TABLE same_name(id LONG) USING $v2Format")
 spark.range(10).createTempView("same_name")
 
spark.range(20).write.format(v2Format).mode(SaveMode.Append).saveAsTable("same_name")
 checkAnswer(spark.table("same_name"), spark.range(10).toDF())
@@ -88,6 +87,15 @@ class DataSourceV2DataFrameSessionCatalogSuite
   assert(tableInfo.properties().get("provider") === v2Format)
 }
   }
+
+  test("SPARK-49246: saveAsTable with v1 format") {
+withTable("t") {
+  sql("CREATE TABLE t(c INT) USING csv")
+  val df = spark.range(10).toDF()
+  df.write.mode(SaveMode.Overwrite).format("csv").saveAsTable("t")
+  verifyTable("t", df)
+}
+  }
 }
 
 class InMemoryTableSessionCatalog extends 
TestV2SessionCatalogBase[InMemoryTable] {
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala
index 9144fb939045..bd13123d587f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala
@@ -22,8 +22,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 
 import scala.c

(spark) branch master updated: [SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable should not be changed by falling back to v1 command

2024-09-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 37b39b41d07c [SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable 
should not be changed by falling back to v1 command
37b39b41d07c is described below

commit 37b39b41d07cf8f39dd54cc18342e4d7b8bc71a3
Author: Wenchen Fan 
AuthorDate: Mon Sep 9 10:45:14 2024 +0800

[SPARK-49246][SQL][FOLLOW-UP] The behavior of SaveAsTable should not be 
changed by falling back to v1 command

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/47772 . The 
behavior of SaveAsTable should not be changed by switching from the v1 to the v2 command. 
This is similar to https://github.com/apache/spark/pull/47995. For the case of 
`DelegatingCatalogExtension`, we need it to go to V1 commands to be consistent 
with previous behavior.
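
A minimal sketch of the behaviour being preserved, mirroring the new test added below
(assumes an active SparkSession named `spark` with `spark_catalog` overridden by a
`DelegatingCatalogExtension`; in that setup the write should keep taking the V1
saveAsTable path):

    import org.apache.spark.sql.SaveMode

    spark.sql("CREATE TABLE t(c INT) USING csv")
    val df = spark.range(10).toDF()
    df.write.mode(SaveMode.Overwrite).format("csv").saveAsTable("t")
    // expected: table `t` now holds the 10 rows written above, same as before the v2 refactor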

### Why are the changes needed?

Behavior regression.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48019 from amaliujia/regress_v2.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/internal/DataFrameWriterImpl.scala   |  6 --
 .../test/scala/org/apache/spark/sql/CollationSuite.scala  |  3 ---
 .../DataSourceV2DataFrameSessionCatalogSuite.scala| 12 ++--
 .../spark/sql/connector/TestV2SessionCatalogBase.scala| 15 +--
 4 files changed, 19 insertions(+), 17 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala
index 7248a2d3f056..f0eef9ae1cbb 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala
@@ -426,8 +426,10 @@ final class DataFrameWriterImpl[T] private[sql](ds: 
Dataset[T]) extends DataFram
 import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
 val session = df.sparkSession
-val canUseV2 = lookupV2Provider().isDefined ||
-  
df.sparkSession.sessionState.conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).isDefined
+val canUseV2 = lookupV2Provider().isDefined || 
(df.sparkSession.sessionState.conf.getConf(
+SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).isDefined &&
+
!df.sparkSession.sessionState.catalogManager.catalog(CatalogManager.SESSION_CATALOG_NAME)
+  .isInstanceOf[DelegatingCatalogExtension])
 
 session.sessionState.sqlParser.parseMultipartIdentifier(tableName) match {
   case nameParts @ NonSessionCatalogAndIdentifier(catalog, ident) =>
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
index 5e7feec149c9..da8aad16f55d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -34,7 +34,6 @@ import 
org.apache.spark.sql.execution.aggregate.{HashAggregateExec, ObjectHashAg
 import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.internal.{SqlApiConf, SQLConf}
-import org.apache.spark.sql.internal.SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION
 import org.apache.spark.sql.types.{ArrayType, MapType, StringType, 
StructField, StructType}
 
 class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
@@ -158,7 +157,6 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   }
 
   test("disable bucketing on collated string column") {
-spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
 def createTable(bucketColumns: String*): Unit = {
   val tableName = "test_partition_tbl"
   withTable(tableName) {
@@ -760,7 +758,6 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   }
 
   test("disable partition on collated string column") {
-spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
 def createTable(partitionColumns: String*): Unit = {
   val tableName = "test_partition_tbl"
   withTable(tableName) {
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSessionCatalogSuite.scala
index 7bbb6485c273..fe078c5ae441 100644
--- 
a/sql

[jira] [Resolved] (SPARK-49383) Support Transpose DataFrame API

2024-09-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49383.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47884
[https://github.com/apache/spark/pull/47884]

> Support Transpose DataFrame API
> ---
>
> Key: SPARK-49383
> URL: https://issues.apache.org/jira/browse/SPARK-49383
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support Transpose as Scala/Python DataFrame API in both Spark Connect and 
> Classic Spark.
> Transposing data is a crucial operation in data analysis, enabling the 
> transformation of rows into columns. This operation is widely used in tools 
> like pandas and numpy, allowing for more flexible data manipulation and 
> visualization.
> While Apache Spark supports unpivot and pivot operations, it currently lacks 
> a built-in transpose function. Implementing a transpose operation in Spark 
> would enhance its data processing capabilities, aligning it with the 
> functionalities available in pandas and numpy, and further empowering users 
> in their data analysis workflows.
> Please see 
> [https://docs.google.com/document/d/1QSmG81qQ-muab0UOeqgDAELqF7fJTH8GnxCJF4Ir-kA/edit]
>  for a detailed design.
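
A hypothetical usage sketch (the method name follows the proposal; the exact signature is
not verified here; assumes an active SparkSession named `spark` on a build that includes
the feature):

    import spark.implicits._

    val df = Seq(("a", 1, 2), ("b", 3, 4)).toDF("id", "x", "y")
    // Rows become columns; by default the first column supplies the new column names.
    df.transpose().show()
    // expected shape: a key column plus one column per original row ("a", "b")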



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49396) PLAN_VALIDATION_FAILED_RULE_IN_BATCH in SimplifyConditionals rule

2024-09-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49396:
---

Assignee: Avery Qi

> PLAN_VALIDATION_FAILED_RULE_IN_BATCH in SimplifyConditionals rule
> -
>
> Key: SPARK-49396
> URL: https://issues.apache.org/jira/browse/SPARK-49396
> Project: Spark
>  Issue Type: Task
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Avery Qi
>Assignee: Avery Qi
>Priority: Major
>  Labels: pull-request-available
>
> *_{{SimplifyConditionals}}_* has a simplification rule where, if any of the 
> (not first) branches in _*{{CaseWhen}}*_ is *_{{TrueLiteral}}_*, we remove 
> all the remaining branches including the *_{{elseValue}}_*.
> {code:java}
> case CaseWhen(branches, _) if branches.exists(_._1 == TrueLiteral) =>
>   // a branch with a true condition eliminates all following branches,
>   // these branches can be pruned away
>   val (h, t) = branches.span(_._1 != TrueLiteral)
>   CaseWhen(h :+ t.head, None)
> {code}
> Now, the nullability check of *_{{CaseWhen}}_* reports the expression as {{nullable}} if 
> (1) any of the branches, including the *_{{elseValue}}_*, is {{nullable}}, or 
> (2) the *_{{elseValue}}_* is {{None}}.
> 
> The above simplification sets the *_{{elseValue}}_* to {{None}}. Combined with the 
> nullability check, this makes the *_{{CaseWhen}}_* switch from {{non-nullable}} to 
> {{nullable}}.
> This ticket aims to fix this issue and retain the nullability of the 
> expression by replacing the *_elseValue_* with the value of the TrueLiteral 
> branch.
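
An illustrative sketch of the two rewrites in plain SQL terms (not the optimizer API;
assumes an active SparkSession named `spark`):

    // Original:      CASE WHEN id > 5 THEN 'big' WHEN true THEN 'small' ELSE 'other' END
    // Old rewrite:   CASE WHEN id > 5 THEN 'big' WHEN true THEN 'small' END
    //                (elseValue dropped => the expression is reported as nullable)
    // Proposed fix:  CASE WHEN id > 5 THEN 'big' ELSE 'small' END
    //                (nullability preserved, since no branch value is nullable)
    spark.sql("SELECT CASE WHEN id > 5 THEN 'big' WHEN true THEN 'small' ELSE 'other' END AS c FROM range(10)").show()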



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49152][SQL][FOLLOWUP] DelegatingCatalogExtension should also use V1 commands

2024-09-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f7cfeb534d92 [SPARK-49152][SQL][FOLLOWUP] DelegatingCatalogExtension 
should also use V1 commands
f7cfeb534d92 is described below

commit f7cfeb534d9285df381d147e01de47ec439c082e
Author: Wenchen Fan 
AuthorDate: Thu Sep 5 21:02:20 2024 +0800

[SPARK-49152][SQL][FOLLOWUP] DelegatingCatalogExtension should also use V1 
commands

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/47660 . If users 
override `spark_catalog` with
`DelegatingCatalogExtension`, we should still use v1 commands as 
`DelegatingCatalogExtension` forwards requests to HMS and there are still 
behavior differences between v1 and v2 commands targeting HMS.

This PR also forces the use of v1 commands for certain commands that do not 
have a v2 version.

### Why are the changes needed?

Avoid introducing behavior changes to Spark plugins that implement 
`DelegatingCatalogExtension` to override `spark_catalog`.
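
A minimal sketch of the plugin shape this follow-up protects (the class below is
illustrative; the config key is the standard catalog plugin setting):

    import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension

    // Forwards every call to the built-in session catalog (and hence to the HMS client).
    class MyCatalog extends DelegatingCatalogExtension

    // spark.sql.catalog.spark_catalog=com.example.MyCatalog
    // With this override in place, commands without a v2 equivalent (REPAIR TABLE,
    // LOAD DATA, RECOVER PARTITIONS, ALTER TABLE ... SET SERDE) keep resolving to
    // their V1 implementations.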

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test case

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47995 from amaliujia/fix_catalog_v2.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Rui Wang 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 41 --
 .../DataSourceV2SQLSessionCatalogSuite.scala   |  8 +
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 16 ++---
 3 files changed, 51 insertions(+), 14 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
index d569f1ed484c..02ad2e79a564 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.util.{quoteIfNeeded, toPrettySQL, 
ResolveDefaultColumns => DefaultCols}
 import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns._
-import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin, 
CatalogV2Util, LookupCatalog, SupportsNamespaces, V1Table}
+import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin, 
CatalogV2Util, DelegatingCatalogExtension, LookupCatalog, SupportsNamespaces, 
V1Table}
 import org.apache.spark.sql.connector.expressions.Transform
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.execution.command._
@@ -284,10 +284,20 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
 case AnalyzeColumn(ResolvedV1TableOrViewIdentifier(ident), columnNames, 
allColumns) =>
   AnalyzeColumnCommand(ident, columnNames, allColumns)
 
-case RepairTable(ResolvedV1TableIdentifier(ident), addPartitions, 
dropPartitions) =>
+// V2 catalog doesn't support REPAIR TABLE yet, we must use v1 command 
here.
+case RepairTable(
+ResolvedV1TableIdentifierInSessionCatalog(ident),
+addPartitions,
+dropPartitions) =>
   RepairTableCommand(ident, addPartitions, dropPartitions)
 
-case LoadData(ResolvedV1TableIdentifier(ident), path, isLocal, 
isOverwrite, partition) =>
+// V2 catalog doesn't support LOAD DATA yet, we must use v1 command here.
+case LoadData(
+ResolvedV1TableIdentifierInSessionCatalog(ident),
+path,
+isLocal,
+isOverwrite,
+partition) =>
   LoadDataCommand(
 ident,
 path,
@@ -336,7 +346,8 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
   }
   ShowColumnsCommand(db, v1TableName, output)
 
-case RecoverPartitions(ResolvedV1TableIdentifier(ident)) =>
+// V2 catalog doesn't support RECOVER PARTITIONS yet, we must use v1 
command here.
+case RecoverPartitions(ResolvedV1TableIdentifierInSessionCatalog(ident)) =>
   RepairTableCommand(
 ident,
 enableAddPartitions = true,
@@ -364,8 +375,9 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
 purge,
 retainData = false)
 
+// V2 catalog doesn't support setting serde properties yet, we must use v1 
command here.
 case SetTableSerDeProperties(
-ResolvedV1TableIdentifier(ident),
+ResolvedV1TableIdentifierInSe

(spark) branch master updated: [SPARK-48348][SPARK-48376][SQL] Introduce `LEAVE` and `ITERATE` statements

2024-09-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9676b1c48cba [SPARK-48348][SPARK-48376][SQL] Introduce `LEAVE` and 
`ITERATE` statements
9676b1c48cba is described below

commit 9676b1c48cba47825ff3dd48e609fa3f0b046c02
Author: David Milicevic 
AuthorDate: Thu Sep 5 20:59:16 2024 +0800

[SPARK-48348][SPARK-48376][SQL] Introduce `LEAVE` and `ITERATE` statements

### What changes were proposed in this pull request?
This PR proposes introduction of `LEAVE` and `ITERATE` statement types to 
SQL Scripting language:
- `LEAVE` statement can be used in loops, as well as in `BEGIN ... END` 
compound blocks.
- `ITERATE` statement can be used only in loops.

This PR introduces:
- Logical operators for both statement types.
- Execution nodes for both statement types.
- Interpreter changes required to build execution plans that support new 
statement types.
- New error if statements are not used properly.
- Minor changes required to support new keywords.

### Why are the changes needed?
Adding support for new statement types to SQL Scripting language.
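
A syntax-only sketch of the two statements (illustrative: the PR itself notes execution is
not yet wired up for end users, and the script shape loosely follows the SQL ref spec
linked from the Jira):

    // <condition> is a placeholder; the surrounding DECLARE/SET statements are omitted.
    val script =
      """BEGIN
        |  lbl: WHILE <condition> DO
        |    ITERATE lbl;   -- like `continue`: jump to the next iteration of the labeled loop
        |    LEAVE lbl;     -- like `break`: exit the labeled loop (also usable on BEGIN...END labels)
        |  END WHILE;
        |END""".stripMargin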

### Does this PR introduce _any_ user-facing change?
This PR introduces new statement types that will be available to users. 
However, the script execution logic hasn't been implemented yet, so the new 
statements are not yet accessible to users.

### How was this patch tested?
Tests are introduced to all test suites related to SQL scripting.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47973 from davidm-db/sql_scripting_leave_iterate.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |  18 +++
 docs/sql-ref-ansi-compliance.md|   2 +
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   2 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  14 ++
 .../spark/sql/catalyst/parser/AstBuilder.scala |  52 +-
 .../parser/SqlScriptingLogicalOperators.scala  |  18 +++
 .../spark/sql/errors/SqlScriptingErrors.scala  |  23 +++
 .../catalyst/parser/SqlScriptingParserSuite.scala  | 177 +
 .../sql/scripting/SqlScriptingExecutionNode.scala  | 104 +++-
 .../sql/scripting/SqlScriptingInterpreter.scala|  17 +-
 .../sql-tests/results/ansi/keywords.sql.out|   2 +
 .../resources/sql-tests/results/keywords.sql.out   |   2 +
 .../scripting/SqlScriptingExecutionNodeSuite.scala | 103 +++-
 .../scripting/SqlScriptingInterpreterSuite.scala   | 143 +
 .../ThriftServerWithSparkContextSuite.scala|   2 +-
 15 files changed, 664 insertions(+), 15 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 96105c967225..b42aae1311f4 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -2495,6 +2495,24 @@
 ],
 "sqlState" : "F"
   },
+  "INVALID_LABEL_USAGE" : {
+"message" : [
+  "The usage of the label  is invalid."
+],
+"subClass" : {
+  "DOES_NOT_EXIST" : {
+"message" : [
+  "Label was used in the  statement, but the label does 
not belong to any surrounding block."
+]
+  },
+  "ITERATE_IN_COMPOUND" : {
+"message" : [
+  "ITERATE statement cannot be used with a label that belongs to a 
compound (BEGIN...END) body."
+]
+  }
+},
+"sqlState" : "42K0L"
+  },
   "INVALID_LAMBDA_FUNCTION_CALL" : {
 "message" : [
   "Invalid lambda function call."
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index f5e1ddfd3c57..0ac19e2ae943 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -556,6 +556,7 @@ Below is a list of all the keywords in Spark SQL.
 |INVOKER|non-reserved|non-reserved|non-reserved|
 |IS|reserved|non-reserved|reserved|
 |ITEMS|non-reserved|non-reserved|non-reserved|
+|ITERATE|non-reserved|non-reserved|non-reserved|
 |JOIN|reserved|strict-non-reserved|reserved|
 |KEYS|non-reserved|non-reserved|non-reserved|
 |LANGUAGE|non-reserved|non-reserved|reserved|
@@ -563,6 +564,7 @@ Below is a list of all the keywords in Spark SQL.
 |LATERAL|reserved|strict-non-reserved|reserved|
 |LAZY|non-reserved|non-reserved|non-reserved|
 |LEADING|reserved|non-reserved|reserved|
+|LEAVE|non-reserved|non-reserved|non-reserved|
 |LEFT|reserved|strict-non-reserved|reserved|
 |LIKE|no

[jira] [Assigned] (SPARK-48348) [M0] Support for LEAVE statement

2024-09-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48348:
---

Assignee: David Milicevic

> [M0] Support for LEAVE statement
> 
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as BREAK in other languages.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] release Spark 3.5.3?

2024-09-01 Thread Wenchen Fan
Thanks for the support!

@Yuming Wang  I looked into the tickets you mentioned. I
think the first one is not an issue; the second one is a bad error message
and not a blocker.

@Haejoon Lee  There is a newly
reported regression of 3.5.2, please wait for the fix before starting the
RC: https://github.com/apache/spark/pull/47325#issuecomment-2321342164

On Mon, Sep 2, 2024 at 11:33 AM Haejoon Lee 
wrote:

> +1, and I'd like to volunteer as the release manager for Apache Spark
> 3.5.3 if we don't have one yet
>
> On Sun, Sep 1, 2024 at 11:23 PM Xiao Li  wrote:
>
>> +1
>>
>>> Yuming Wang  wrote on Fri, Aug 30, 2024 at 02:34:
>>
>>> +1, Could we include two additional issues:
>>> https://issues.apache.org/jira/browse/SPARK-49472
>>> https://issues.apache.org/jira/browse/SPARK-49349
>>>
>>> On Wed, Aug 28, 2024 at 7:01 PM Wenchen Fan  wrote:
>>>
>>>> Hi all,
>>>>
>>>> It's unfortunate that we missed merging a fix of a correctness bug in
>>>> Spark 3.5: https://github.com/apache/spark/pull/43938. I just
>>>> re-submitted it: https://github.com/apache/spark/pull/47905
>>>>
>>>> In addition to this correctness bug fix, around 40 fixes have been
>>>> merged to branch 3.5 after 3.5.2 was released. Shall we do a 3.5.3 release
>>>> now?
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>


[jira] [Assigned] (SPARK-49451) Allow duplicate keys in parse_json.

2024-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49451:
---

Assignee: Chenhao Li

> Allow duplicate keys in parse_json.
> ---
>
> Key: SPARK-49451
> URL: https://issues.apache.org/jira/browse/SPARK-49451
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49451) Allow duplicate keys in parse_json.

2024-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49451.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47920
[https://github.com/apache/spark/pull/47920]

> Allow duplicate keys in parse_json.
> ---
>
> Key: SPARK-49451
> URL: https://issues.apache.org/jira/browse/SPARK-49451
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49451] Allow duplicate keys in parse_json

2024-09-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8879df5fc12b [SPARK-49451] Allow duplicate keys in parse_json
8879df5fc12b is described below

commit 8879df5fc12b4c80df8382a55fddec87520b9ee8
Author: Chenhao Li 
AuthorDate: Mon Sep 2 14:25:15 2024 +0800

[SPARK-49451] Allow duplicate keys in parse_json

### What changes were proposed in this pull request?

Before the change, `parse_json` will throw an error if there are duplicate 
keys in an input JSON object. After the change, `parse_json` will keep the last 
field with the same key. It doesn't affect other variant building expressions 
(creating a variant from struct/map/variant) because it is legal for them to 
contain duplicate keys.

The change is guarded by a flag and disabled by default.
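
A behaviour sketch, assuming an active SparkSession named `spark`; the guarding flag is
the new SQLConf entry added in this patch (its exact name is not spelled out here, so
enabling it is only indicated in the comment):

    // with the new flag enabled:  the last occurrence wins => {"a":2}
    // with it disabled (default): parse_json keeps rejecting duplicate keys with an error
    spark.sql("""SELECT parse_json('{"a": 1, "a": 2}') AS v""").show()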

### Why are the changes needed?

To make data migration simpler. Users won't need to change their data 
if it contains duplicate keys. The behavior is inspired by 
https://docs.aws.amazon.com/redshift/latest/dg/super-configurations.html#parsing-options-super
 (reject duplicate keys or keep the last occurrence).

### Does this PR introduce _any_ user-facing change?

Yes, as described in the first section.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47920 from chenhao-db/allow_duplicate_keys.

Authored-by: Chenhao Li 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/types/variant/VariantBuilder.java | 87 +-
 .../variant/VariantExpressionEvalUtils.scala   | 11 ++-
 .../expressions/variant/variantExpressions.scala   |  9 ++-
 .../spark/sql/catalyst/json/JacksonParser.scala|  4 +-
 .../org/apache/spark/sql/internal/SQLConf.scala| 10 +++
 .../variant/VariantExpressionEvalUtilsSuite.scala  | 20 +++--
 .../function_is_variant_null.explain   |  2 +-
 .../explain-results/function_parse_json.explain|  2 +-
 .../function_schema_of_variant.explain |  2 +-
 .../function_schema_of_variant_agg.explain |  2 +-
 .../function_try_parse_json.explain|  2 +-
 .../function_try_variant_get.explain   |  2 +-
 .../explain-results/function_variant_get.explain   |  2 +-
 .../apache/spark/sql/VariantEndToEndSuite.scala|  9 ++-
 14 files changed, 124 insertions(+), 40 deletions(-)

diff --git 
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
 
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
index f5e5f729459f..375d69034fd3 100644
--- 
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
+++ 
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
@@ -26,10 +26,7 @@ import java.io.IOException;
 import java.math.BigDecimal;
 import java.math.BigInteger;
 import java.nio.charset.StandardCharsets;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.HashMap;
+import java.util.*;
 
 import com.fasterxml.jackson.core.JsonFactory;
 import com.fasterxml.jackson.core.JsonParser;
@@ -43,24 +40,29 @@ import static org.apache.spark.types.variant.VariantUtil.*;
  * Build variant value and metadata by parsing JSON values.
  */
 public class VariantBuilder {
+  public VariantBuilder(boolean allowDuplicateKeys) {
+this.allowDuplicateKeys = allowDuplicateKeys;
+  }
+
   /**
* Parse a JSON string as a Variant value.
* @throws VariantSizeLimitException if the resulting variant value or 
metadata would exceed
* the SIZE_LIMIT (for example, this could be a maximum of 16 MiB).
* @throws IOException if any JSON parsing error happens.
*/
-  public static Variant parseJson(String json) throws IOException {
+  public static Variant parseJson(String json, boolean allowDuplicateKeys) 
throws IOException {
 try (JsonParser parser = new JsonFactory().createParser(json)) {
   parser.nextToken();
-  return parseJson(parser);
+  return parseJson(parser, allowDuplicateKeys);
 }
   }
 
   /**
-   * Similar {@link #parseJson(String)}, but takes a JSON parser instead of 
string input.
+   * Similar {@link #parseJson(String, boolean)}, but takes a JSON parser 
instead of string input.
*/
-  public static Variant parseJson(JsonParser parser) throws IOException {
-VariantBuilder builder = new VariantBuilder();
+  public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys)
+  throws IOException {
+VariantBuilder builder = new VariantBuilder(allowDuplicateKeys);
 builder.buildJson(parser);
 return builder.result();
   }
@@ -274,23 +276,63 @@ public class Vari

[jira] [Resolved] (SPARK-49480) NullPointerException from SparkThrowableHelper.isInternalError method

2024-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49480.
-
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

> NullPointerException from SparkThrowableHelper.isInternalError method
> -
>
> Key: SPARK-49480
> URL: https://issues.apache.org/jira/browse/SPARK-49480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Xi Chen
>Assignee: Xi Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> The SparkThrowableHelper.isInternalError method doesn't handle null input, 
> and it could lead to NullPointerException.
> Example stacktrace from our environment with Spark 3.5.1:
> {code:java}
>  Caused by: java.lang.NullPointerException: Cannot invoke 
> "String.startsWith(String)" because "errorClass" is null
>     at 
> org.apache.spark.SparkThrowableHelper$.isInternalError(SparkThrowableHelper.scala:64)
>     at 
> org.apache.spark.SparkThrowableHelper.isInternalError(SparkThrowableHelper.scala)
>     at org.apache.spark.SparkThrowable.isInternalError(SparkThrowable.java:50)
>     at 
> org.apache.spark.SparkException.isInternalError(SparkException.scala:27)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>     at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at 
> com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:688)
>     at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:772)
>     ... 30 more  {code}
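
A minimal trigger sketch (assuming Spark 3.5.x on the classpath):

    import org.apache.spark.SparkException

    val e = new SparkException("some failure")   // constructed without an error class
    e.getErrorClass                              // null
    e.isInternalError                            // threw NullPointerException before the fix;
                                                 // returns false once null is handled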



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49480) NullPointerException from SparkThrowableHelper.isInternalError method

2024-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49480:
---

Assignee: Xi Chen

> NullPointerException from SparkThrowableHelper.isInternalError method
> -
>
> Key: SPARK-49480
> URL: https://issues.apache.org/jira/browse/SPARK-49480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Xi Chen
>Assignee: Xi Chen
>Priority: Major
>  Labels: pull-request-available
>
> The SparkThrowableHelper.isInternalError method doesn't handle null input, 
> and it could lead to NullPointerException.
> Example stacktrace from our environment with Spark 3.5.1:
> {code:java}
>  Caused by: java.lang.NullPointerException: Cannot invoke 
> "String.startsWith(String)" because "errorClass" is null
>     at 
> org.apache.spark.SparkThrowableHelper$.isInternalError(SparkThrowableHelper.scala:64)
>     at 
> org.apache.spark.SparkThrowableHelper.isInternalError(SparkThrowableHelper.scala)
>     at org.apache.spark.SparkThrowable.isInternalError(SparkThrowable.java:50)
>     at 
> org.apache.spark.SparkException.isInternalError(SparkException.scala:27)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>     at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at 
> com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:688)
>     at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:772)
>     ... 30 more  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49480][CORE] Fix NullPointerException from `SparkThrowableHelper.isInternalError`

2024-09-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d5c3437b [SPARK-49480][CORE] Fix NullPointerException from 
`SparkThrowableHelper.isInternalError`
d5c3437b is described below

commit d5c3437b29ae7d2253b4c4a05f4b87952eb9
Author: Xi Chen 
AuthorDate: Mon Sep 2 13:58:18 2024 +0800

[SPARK-49480][CORE] Fix NullPointerException from 
`SparkThrowableHelper.isInternalError`

### What changes were proposed in this pull request?

Handle null input for `SparkThrowableHelper.isInternalError` method.

### Why are the changes needed?

The `SparkThrowableHelper.isInternalError` method doesn't handle null 
input, and it could lead to NullPointerException. It happens when 
`isInternalError` is invoked on a `SparkException` without an `errorClass`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add 2 assertions to current test cases to cover this issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47946 from jshmchenxi/SPARK-49480/null-pointer-is-internal-error.

Authored-by: Xi Chen 
Signed-off-by: Wenchen Fan 
(cherry picked from commit cef3c86e046edb43beb487e92f0542b5d8354be4)
Signed-off-by: Wenchen Fan 
---
 common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala | 2 +-
 core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala  | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala 
b/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
index 0f329b5655b3..2331a8e67b28 100644
--- a/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
+++ b/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
@@ -61,7 +61,7 @@ private[spark] object SparkThrowableHelper {
   }
 
   def isInternalError(errorClass: String): Boolean = {
-errorClass.startsWith("INTERNAL_ERROR")
+errorClass != null && errorClass.startsWith("INTERNAL_ERROR")
   }
 
   def getMessage(e: SparkThrowable with Throwable, format: 
ErrorMessageFormat.Value): String = {
diff --git a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala 
b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
index 299bcea3f9e2..a5f5eb21c68b 100644
--- a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
@@ -416,6 +416,7 @@ class SparkThrowableSuite extends SparkFunSuite {
 } catch {
   case e: SparkThrowable =>
 assert(e.getErrorClass == null)
+assert(!e.isInternalError)
 assert(e.getSqlState == null)
   case _: Throwable =>
 // Should not end up here
@@ -432,6 +433,7 @@ class SparkThrowableSuite extends SparkFunSuite {
 } catch {
   case e: SparkThrowable =>
 assert(e.getErrorClass == "CANNOT_PARSE_DECIMAL")
+assert(!e.isInternalError)
 assert(e.getSqlState == "22018")
   case _: Throwable =>
 // Should not end up here


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-49480][CORE] Fix NullPointerException from `SparkThrowableHelper.isInternalError`

2024-09-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cef3c86e046e [SPARK-49480][CORE] Fix NullPointerException from 
`SparkThrowableHelper.isInternalError`
cef3c86e046e is described below

commit cef3c86e046edb43beb487e92f0542b5d8354be4
Author: Xi Chen 
AuthorDate: Mon Sep 2 13:58:18 2024 +0800

[SPARK-49480][CORE] Fix NullPointerException from 
`SparkThrowableHelper.isInternalError`

### What changes were proposed in this pull request?

Handle null input for `SparkThrowableHelper.isInternalError` method.

### Why are the changes needed?

The `SparkThrowableHelper.isInternalError` method doesn't handle null 
input, and it could lead to NullPointerException. It happens when 
`isInternalError` is invoked on a `SparkException` without an `errorClass`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add 2 assertions to current test cases to cover this issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47946 from jshmchenxi/SPARK-49480/null-pointer-is-internal-error.

Authored-by: Xi Chen 
Signed-off-by: Wenchen Fan 
---
 common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala | 2 +-
 core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala  | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala 
b/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
index db5eff72e124..428c9d2a4935 100644
--- a/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
+++ b/common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala
@@ -74,7 +74,7 @@ private[spark] object SparkThrowableHelper {
   }
 
   def isInternalError(errorClass: String): Boolean = {
-errorClass.startsWith("INTERNAL_ERROR")
+errorClass != null && errorClass.startsWith("INTERNAL_ERROR")
   }
 
   def getMessage(e: SparkThrowable with Throwable, format: 
ErrorMessageFormat.Value): String = {
diff --git a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala 
b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
index 0c22edbe984c..d99589c171c3 100644
--- a/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala
@@ -259,6 +259,7 @@ class SparkThrowableSuite extends SparkFunSuite {
 } catch {
   case e: SparkThrowable =>
 assert(e.getErrorClass == null)
+assert(!e.isInternalError)
 assert(e.getSqlState == null)
   case _: Throwable =>
 // Should not end up here
@@ -275,6 +276,7 @@ class SparkThrowableSuite extends SparkFunSuite {
 } catch {
   case e: SparkThrowable =>
 assert(e.getErrorClass == "CANNOT_PARSE_DECIMAL")
+assert(!e.isInternalError)
 assert(e.getSqlState == "22018")
   case _: Throwable =>
 // Should not end up here


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results

2024-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46037:
---

Assignee: mcdull_zhang

> When Left Join build Left, ShuffledHashJoinExec may result in incorrect 
> results
> ---
>
> Key: SPARK-46037
> URL: https://issues.apache.org/jira/browse/SPARK-46037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: mcdull_zhang
>Assignee: mcdull_zhang
>Priority: Blocker
>  Labels: correctness, pull-request-available
>
> When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may 
> have incorrect results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results

2024-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46037.
-
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47905
[https://github.com/apache/spark/pull/47905]

> When Left Join build Left, ShuffledHashJoinExec may result in incorrect 
> results
> ---
>
> Key: SPARK-46037
> URL: https://issues.apache.org/jira/browse/SPARK-46037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: mcdull_zhang
>Assignee: mcdull_zhang
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may 
> have incorrect results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-46037][SQL] Correctness fix for Shuffled Hash Join build left without codegen

2024-08-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2ad11b632e07 [SPARK-46037][SQL] Correctness fix for Shuffled Hash Join 
build left without codegen
2ad11b632e07 is described below

commit 2ad11b632e072f47b84793a6cbaeb06c984b0e35
Author: Wenchen Fan 
AuthorDate: Thu Aug 29 12:27:41 2024 +0800

[SPARK-46037][SQL] Correctness fix for Shuffled Hash Join build left 
without codegen

This is a re-submission of https://github.com/apache/spark/pull/43938 to 
fix a join correctness bug caused by https://github.com/apache/spark/pull/41398. 
Credit goes to mcdull-zhang

correctness fix

Yes, the query result will be corrected.

new test

no

Closes #47905 from cloud-fan/join.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit af5e0a267e5a37adc25bdf9c78b6fe207ef7bfb5)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/joins/HashJoin.scala   |  5 ++---
 .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 --
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
index 7c48baf99ef8..07f7915416c1 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
@@ -138,9 +138,8 @@ trait HashJoin extends JoinCodegenSupport {
 UnsafeProjection.create(streamedBoundKeys)
 
   @transient protected[this] lazy val boundCondition = if 
(condition.isDefined) {
-if (joinType == FullOuter && buildSide == BuildLeft) {
-  // Put join left side before right side. This is to be consistent with
-  // `ShuffledHashJoinExec.fullOuterJoin`.
+if ((joinType == FullOuter || joinType == LeftOuter) && buildSide == 
BuildLeft) {
+  // Put join left side before right side.
   Predicate.create(condition.get, buildPlan.output ++ 
streamedPlan.output).eval _
 } else {
   Predicate.create(condition.get, streamedPlan.output ++ 
buildPlan.output).eval _
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
index 4f78833abdb9..a4a3d76db313 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
@@ -26,10 +26,11 @@ import org.apache.spark.sql.catalyst.plans.logical.{Join, 
JoinHint}
 import org.apache.spark.sql.execution.{SparkPlan, SparkPlanTest}
 import org.apache.spark.sql.execution.exchange.EnsureRequirements
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestData}
 import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}
 
-class OuterJoinSuite extends SparkPlanTest with SharedSparkSession {
+class OuterJoinSuite extends SparkPlanTest with SharedSparkSession with 
SQLTestData {
+  setupTestData()
 
   private val EnsureRequirements = new EnsureRequirements()
 
@@ -325,4 +326,21 @@ class OuterJoinSuite extends SparkPlanTest with 
SharedSparkSession {
   (null, null, 7, 7.0)
 )
   )
+
+  testWithWholeStageCodegenOnAndOff(
+"SPARK-46037: ShuffledHashJoin build left with left outer join, codegen 
off") { _ =>
+def join(hint: String): DataFrame = {
+  sql(
+s"""
+  |SELECT /*+ $hint */ *
+  |FROM testData t1
+  |LEFT OUTER JOIN
+  |testData2 t2
+  |ON key = a AND concat(value, b) = '12'
+  |""".stripMargin)
+}
+val df1 = join("SHUFFLE_HASH(t1)")
+val df2 = join("SHUFFLE_MERGE(t1)")
+checkAnswer(df1, identity, df2.collect().toSeq)
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46037][SQL] Correctness fix for Shuffled Hash Join build left without codegen

2024-08-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new af5e0a267e5a [SPARK-46037][SQL] Correctness fix for Shuffled Hash Join 
build left without codegen
af5e0a267e5a is described below

commit af5e0a267e5a37adc25bdf9c78b6fe207ef7bfb5
Author: Wenchen Fan 
AuthorDate: Thu Aug 29 12:27:41 2024 +0800

[SPARK-46037][SQL] Correctness fix for Shuffled Hash Join build left 
without codegen

### What changes were proposed in this pull request?

This is a re-submission of https://github.com/apache/spark/pull/43938 to 
fix a join correctness bug caused by https://github.com/apache/spark/pull/41398. 
Credit goes to mcdull-zhang

### Why are the changes needed?

correctness fix

### Does this PR introduce _any_ user-facing change?

Yes, the query result will be corrected.

### How was this patch tested?

new test
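
The shape the new test covers, roughly (mirroring the added regression test; the 
`spark` session and the `testData`/`testData2` test tables are assumed):

```scala
// With whole-stage codegen off, a shuffled hash join that builds the LEFT side of a
// LEFT OUTER JOIN carrying an extra non-equi condition could return incorrect results.
// After the fix this returns the same rows as the SHUFFLE_MERGE(t1) plan.
val df = spark.sql(
  """
    |SELECT /*+ SHUFFLE_HASH(t1) */ *
    |FROM testData t1
    |LEFT OUTER JOIN testData2 t2
    |ON key = a AND concat(value, b) = '12'
    |""".stripMargin)
```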

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47905 from cloud-fan/join.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/joins/HashJoin.scala   |  5 ++---
 .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 --
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
index 3ae76a1db22b..5d59a48d544a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
@@ -138,9 +138,8 @@ trait HashJoin extends JoinCodegenSupport {
 UnsafeProjection.create(streamedBoundKeys)
 
   @transient protected[this] lazy val boundCondition = if 
(condition.isDefined) {
-if (joinType == FullOuter && buildSide == BuildLeft) {
-  // Put join left side before right side. This is to be consistent with
-  // `ShuffledHashJoinExec.fullOuterJoin`.
+if ((joinType == FullOuter || joinType == LeftOuter) && buildSide == 
BuildLeft) {
+  // Put join left side before right side.
   Predicate.create(condition.get, buildPlan.output ++ 
streamedPlan.output).eval _
 } else {
   Predicate.create(condition.get, streamedPlan.output ++ 
buildPlan.output).eval _
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
index e4ea88067c7c..7ba93ee13e18 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
@@ -26,11 +26,12 @@ import org.apache.spark.sql.catalyst.plans.logical.{Join, 
JoinHint}
 import org.apache.spark.sql.execution.{SparkPlan, SparkPlanTest}
 import org.apache.spark.sql.execution.exchange.EnsureRequirements
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestData}
 import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}
 
-class OuterJoinSuite extends SparkPlanTest with SharedSparkSession {
+class OuterJoinSuite extends SparkPlanTest with SharedSparkSession with 
SQLTestData {
   import testImplicits.toRichColumn
+  setupTestData()
 
   private val EnsureRequirements = new EnsureRequirements()
 
@@ -326,4 +327,21 @@ class OuterJoinSuite extends SparkPlanTest with 
SharedSparkSession {
   (null, null, 7, 7.0)
 )
   )
+
+  testWithWholeStageCodegenOnAndOff(
+"SPARK-46037: ShuffledHashJoin build left with left outer join, codegen 
off") { _ =>
+def join(hint: String): DataFrame = {
+  sql(
+s"""
+  |SELECT /*+ $hint */ *
+  |FROM testData t1
+  |LEFT OUTER JOIN
+  |testData2 t2
+  |ON key = a AND concat(value, b) = '12'
+  |""".stripMargin)
+}
+val df1 = join("SHUFFLE_HASH(t1)")
+val df2 = join("SHUFFLE_MERGE(t1)")
+checkAnswer(df1, identity, df2.collect().toSeq)
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[DISCUSS] release Spark 3.5.3?

2024-08-28 Thread Wenchen Fan
Hi all,

It's unfortunate that we missed merging a fix of a correctness bug in Spark
3.5: https://github.com/apache/spark/pull/43938. I just re-submitted it:
https://github.com/apache/spark/pull/47905

In addition to this correctness bug fix, around 40 fixes have been merged
to branch 3.5 after 3.5.2 was released. Shall we do a 3.5.3 release now?

Thanks,
Wenchen


[jira] [Created] (SPARK-49393) fail by default in deprecated catalog plugin APIs

2024-08-26 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49393:
---

 Summary: fail by default in deprecated catalog plugin APIs
 Key: SPARK-49393
 URL: https://issues.apache.org/jira/browse/SPARK-49393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

2024-08-26 Thread Wenchen Fan
+1. The analyzer rule order issue has bitten me multiple times and it's
very hard to make your analyzer rule bug-free if it interacts with other
rules.

On Wed, Aug 21, 2024 at 2:49 AM Reynold Xin 
wrote:

> +1 on this too
>
> When I implemented "group by all", I introduced at least two subtle bugs
> that many reviewers weren't able to catch and those two bugs would not have
> been possible to introduce if we had a single pass analyzer. Single pass
> can make the whole framework more robust.
>
>
>
>
>
>
> On Tue, Aug 20, 2024 at 7:23 PM Xiao Li  wrote:
>
>> This sounds like a good idea!
>>
>> The Analyzer is complex. The changes in the new Analyzer should not
>> affect the existing one. The users could add the QO rules and rely on the
>> existing structures and patterns of the logical plan trees generated by the
>> current one.  The new Analyzer needs to generate the same logical plan
>> trees as the current one.
>>
>> Cheers,
>>
>> Xiao
>>
>> Vladimir Golubev  于2024年8月14日周三 11:27写道:
>>
>>> - I think we can rely on the current tests. One possibility would be to
>>> dual-run both Analyzer implementations if `Utils.isTesting` and compare the
>>> (normalized) logical plans
>>> - We can implement the analyzer functionality by milestones (Milestone
>>> 0: Project, Filter, UnresolvedInlineTable, Milestone 2: Support main
>>> datasources, ...). Running both analyzers in mixed mode may lead to
>>> unexpected logical plan problems, because that would introduce a completely
>>> different chain of transformations
>>>
>>> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell 
>>> wrote:
>>>
 +1(000) on this!

 This should massively reduce allocations done in the analyzer, and it
 is much more efficient. I also can't count the times that I had to increase
 the number of iterations. This sounds like a no-brainer to me.

 I do have two questions:

- How do we ensure that we don't accidentally break the analyzer?
Are existing tests enough?
- How can we introduce this incrementally? Can we run the analyzer
in mixed mode (both single pass rules and the existing tree traversal
rules) for a while?

 Cheers,
 Herman

 On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev 
 wrote:

> Hello All,
>
> I recently did some research in the Catalyst Analyzer area to check if
> it’s possible to make it single-pass instead of fixed-point. Despite the
> flexibility of the current fixed-point approach (new functionality - new
> rule), it has some drawbacks. The dependencies between the rules are
> unobvious, so it’s hard to introduce changes without having the full
> knowledge. By modifying one rule, the whole chain of transformations can
> change in an unobvious way. Since we can hit the maximum number of
> iterations, there’s no guarantee that the plan is going to be resolved. 
> And
> from a performance perspective the Analyzer can be quite slow for large
> logical plans and wide tables.
>
> The idea is to resolve the tree in a single-pass post-order traversal.
> This approach can be beneficial for the following reasons:
>
> - Code can rely on child nodes being resolved
> - Name visibility can be controlled by a scope, which would be maintained
>   during the traversal
> - Random in-tree lookups are disallowed during the traversal, so the
>   resolution algorithm would be linear with respect to the parsed tree size
> - Analyzer would deterministically produce resolved logical plan or
>   a descriptive error
>
>
> The prototype shows that it's possible to do a bottom-up Analyzer
> implementation and reuse a large chunk of node-processing code from rule
> bodies. I benchmarked the two cases where the current Analyzer struggles
> the most - wide nested views and large logical plans - and got 7-10x
> performance speedups.
>
> Is this something that would be interesting for the Spark community?
> What would be the best way to proceed with this idea?
>
> Regards,
>
> Vladimir Golubev.
>
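
For readers following the thread, a purely illustrative Scala sketch of the 
single-pass, post-order idea described in the proposal above; it is not the 
prototype, and `Scope` plus the naive attribute lookup are hypothetical stand-ins:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Names visible at a node: exactly the outputs of its already-resolved children.
case class Scope(attributes: Seq[Attribute])

// Naive name resolution against the scope; a real analyzer also handles qualifiers,
// ambiguity, functions, subqueries, etc.
def resolveExpression(e: Expression, scope: Scope): Expression = e.transformUp {
  case u: UnresolvedAttribute =>
    scope.attributes.find(_.name == u.name).getOrElse(u)
}

// One post-order pass: children first, then this node, no fixed-point iteration.
def resolvePlan(plan: LogicalPlan): LogicalPlan = {
  val children = plan.children.map(resolvePlan)
  val scope = Scope(children.flatMap(_.output))
  plan.withNewChildren(children).mapExpressions(resolveExpression(_, scope))
}
```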



[jira] [Updated] (SPARK-49359) allow StagedTableCatalog implementations to fall back to non-atomic write

2024-08-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49359:

Fix Version/s: 3.5.3

> allow StagedTableCatalog implementations to fall back to non-atomic write
> -
>
> Key: SPARK-49359
> URL: https://issues.apache.org/jira/browse/SPARK-49359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49359][SQL] Allow StagedTableCatalog implementations to fall back to non-atomic write

2024-08-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 5eca9530309c [SPARK-49359][SQL] Allow StagedTableCatalog 
implementations to fall back to non-atomic write
5eca9530309c is described below

commit 5eca9530309c681a0e522d010f380c61a1d3df50
Author: Wenchen Fan 
AuthorDate: Fri Aug 23 10:49:39 2024 +0900

[SPARK-49359][SQL] Allow StagedTableCatalog implementations to fall back to 
non-atomic write

### What changes were proposed in this pull request?

This PR allows `StagedTableCatalog#create/replaceTable` to return null and 
Spark will fall back to normal non-atomic write.

### Why are the changes needed?

Extending an interface is a static decision, but sometimes implementations need 
more flexibility at runtime. For example, a catalog may only support atomic CTAS 
for certain table formats, and we shouldn't force it to implement atomic writes 
for all other formats.
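
A rough sketch of what the new contract enables (not code from this patch; the 
format check and helper below are hypothetical), shown as a fragment of a 
`StagingTableCatalog` implementation:

```scala
// Returning null from stageCreate opts out of atomic CTAS for this table; Spark then
// falls back to the non-atomic write path and calls loadTable(ident) afterwards.
override def stageCreate(
    ident: Identifier,
    columns: Array[Column],
    partitions: Array[Transform],
    properties: java.util.Map[String, String]): StagedTable = {
  if (!supportsAtomicCreate(properties)) {   // hypothetical per-format check
    null
  } else {
    stageCreateAtomically(ident, columns, partitions, properties)  // hypothetical helper
  }
}
```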

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47848 from cloud-fan/stage.

Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 .../sql/connector/catalog/StagingTableCatalog.java |  9 ++-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |  7 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 72 ++
 3 files changed, 83 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/StagingTableCatalog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/StagingTableCatalog.java
index 4337a7c61520..3094b0cf1bbd 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/StagingTableCatalog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/StagingTableCatalog.java
@@ -80,7 +80,8 @@ public interface StagingTableCatalog extends TableCatalog {
* @param columns the column of the new table
* @param partitions transforms to use for partitioning data in the table
* @param properties a string map of table properties
-   * @return metadata for the new table
+   * @return metadata for the new table. This can be null if the catalog does 
not support atomic
+   * creation for this table. Spark will call {@link 
#loadTable(Identifier)} later.
* @throws TableAlreadyExistsException If a table or view already exists for 
the identifier
* @throws UnsupportedOperationException If a requested partition transform 
is not supported
* @throws NoSuchNamespaceException If the identifier namespace does not 
exist (optional)
@@ -128,7 +129,8 @@ public interface StagingTableCatalog extends TableCatalog {
* @param columns the columns of the new table
* @param partitions transforms to use for partitioning data in the table
* @param properties a string map of table properties
-   * @return metadata for the new table
+   * @return metadata for the new table. This can be null if the catalog does 
not support atomic
+   * creation for this table. Spark will call {@link 
#loadTable(Identifier)} later.
* @throws UnsupportedOperationException If a requested partition transform 
is not supported
* @throws NoSuchNamespaceException If the identifier namespace does not 
exist (optional)
* @throws NoSuchTableException If the table does not exist
@@ -176,7 +178,8 @@ public interface StagingTableCatalog extends TableCatalog {
* @param columns the columns of the new table
* @param partitions transforms to use for partitioning data in the table
* @param properties a string map of table properties
-   * @return metadata for the new table
+   * @return metadata for the new table. This can be null if the catalog does 
not support atomic
+   * creation for this table. Spark will call {@link 
#loadTable(Identifier)} later.
* @throws UnsupportedOperationException If a requested partition transform 
is not supported
* @throws NoSuchNamespaceException If the identifier namespace does not 
exist (optional)
*/
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
index 89c879beda82..c99e2bba2e96 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
@@ -116,9 +116,10 @@ case class AtomicCreateTableAsSelectExec(
   }
   throw QueryCompilationErrors.tableAlreadyExistsError(ident)
 }
-  

[jira] [Updated] (SPARK-49359) allow StagedTableCatalog implementations to fall back to non-atomic write

2024-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49359:

Summary: allow StagedTableCatalog implementations to fall back to 
non-atomic write  (was: allow FakeStagedTableCatalog implementations to fall 
back to non-atomic write)

> allow StagedTableCatalog implementations to fall back to non-atomic write
> -
>
> Key: SPARK-49359
> URL: https://issues.apache.org/jira/browse/SPARK-49359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49359) allow FakeStagedTableCatalog implementations to fall back to non-atomic write

2024-08-22 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49359:
---

 Summary: allow FakeStagedTableCatalog implementations to fall back 
to non-atomic write
 Key: SPARK-49359
 URL: https://issues.apache.org/jira/browse/SPARK-49359
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0, 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-46444][SQL] V2SessionCatalog#createTable should not load the table

2024-08-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 481bc58bddb6 [SPARK-46444][SQL] V2SessionCatalog#createTable should 
not load the table
481bc58bddb6 is described below

commit 481bc58bddb6b998386b320e61a1b9f0e73c4711
Author: Wenchen Fan 
AuthorDate: Tue Dec 26 15:17:30 2023 +0800

[SPARK-46444][SQL] V2SessionCatalog#createTable should not load the table

CREATE TABLE has a perf regression when we switch to the v2 command 
framework, because `V2SessionCatalog#createTable` does an extra table lookup that 
does not happen in v1. This PR fixes it by allowing `TableCatalog#createTable` 
to return null; Spark will then call `loadTable` to get the new table metadata 
in the case of CTAS. This PR also fixes `alterTable` in the same way.

fix perf regression in v2. The perf of a single command may not matter, but 
in a cluster with many Spark applications, it's important to reduce the RPCs to 
the metastore.
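
A hedged fragment of a hypothetical `TableCatalog` implementation illustrating the 
new contract (the metastore client below is made up):

```scala
override def createTable(
    ident: Identifier,
    columns: Array[Column],
    partitions: Array[Transform],
    properties: java.util.Map[String, String]): Table = {
  metastoreClient.createTable(ident, columns, partitions, properties)  // hypothetical RPC
  // Returning null skips the follow-up lookup; Spark calls loadTable(ident) itself
  // only when it needs the metadata (e.g. CTAS).
  null
}
```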

no

existing tests

No

Closes #44377 from cloud-fan/create-table.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  |   8 +-
 .../datasources/v2/V2SessionCatalog.scala  |   6 +-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |   8 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  13 +-
 .../sql/connector/TestV2SessionCatalogBase.scala   |   5 +-
 .../datasources/v2/V2SessionCatalogSuite.scala | 181 +
 6 files changed, 139 insertions(+), 82 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index 387477d0f191..d1951a7f7fbf 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -208,7 +208,9 @@ public interface TableCatalog extends CatalogPlugin {
* @param columns the columns of the new table.
* @param partitions transforms to use for partitioning data in the table
* @param properties a string map of table properties
-   * @return metadata for the new table
+   * @return metadata for the new table. This can be null if getting the 
metadata for the new table
+   * is expensive. Spark will call {@link #loadTable(Identifier)} if 
needed (e.g. CTAS).
+   *
* @throws TableAlreadyExistsException If a table or view already exists for 
the identifier
* @throws UnsupportedOperationException If a requested partition transform 
is not supported
* @throws NoSuchNamespaceException If the identifier namespace does not 
exist (optional)
@@ -242,7 +244,9 @@ public interface TableCatalog extends CatalogPlugin {
*
* @param ident a table identifier
* @param changes changes to apply to the table
-   * @return updated metadata for the table
+   * @return updated metadata for the table. This can be null if getting the 
metadata for the
+   * updated table is expensive. Spark always discard the returned 
table here.
+   *
* @throws NoSuchTableException If the table doesn't exist or is a view
* @throws IllegalArgumentException If any change is rejected by the 
implementation.
*/
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
index d7ab23cf08dd..a022a01455a0 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
@@ -148,7 +148,7 @@ class V2SessionCatalog(catalog: SessionCatalog)
 throw QueryCompilationErrors.tableAlreadyExistsError(ident)
 }
 
-loadTable(ident)
+null // Return null to save the `loadTable` call for CREATE TABLE without 
AS SELECT.
   }
 
   private def toOptions(properties: Map[String, String]): Map[String, String] 
= {
@@ -189,7 +189,7 @@ class V2SessionCatalog(catalog: SessionCatalog)
 throw QueryCompilationErrors.noSuchTableError(ident)
 }
 
-loadTable(ident)
+null // Return null to save the `loadTable` call for ALTER TABLE.
   }
 
   override def purgeTable(ident: Identifier): Boolean = {
@@ -233,8 +233,6 @@ class V2SessionCatalog(catalog: SessionCatalog)
   throw QueryCompilationErrors.tableAlreadyExistsError(newIdent)
 }
 
-// Load table to make sure the table exists
-loadTable(oldIdent)
 catalog.renameTable(oldIdent.asTableIdentifier, newIdent.asTableIdentifier)
   }
 
diff --git 
a/sql/core/src/main/scala/org/a

[jira] [Resolved] (SPARK-49246) TableCatalog#loadTable should indicate if it's for writing

2024-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49246.
-
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47772
[https://github.com/apache/spark/pull/47772]

> TableCatalog#loadTable should indicate if it's for writing
> --
>
> Key: SPARK-49246
> URL: https://issues.apache.org/jira/browse/SPARK-49246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49246][SQL] TableCatalog#loadTable should indicate if it's for writing

2024-08-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 027a14b3cc05 [SPARK-49246][SQL] TableCatalog#loadTable should indicate 
if it's for writing
027a14b3cc05 is described below

commit 027a14b3cc058effab7f5a9f7e9d633e8179c5bb
Author: Wenchen Fan 
AuthorDate: Wed Aug 21 10:53:31 2024 +0800

[SPARK-49246][SQL] TableCatalog#loadTable should indicate if it's for 
writing

For custom catalogs that have access control, read and write permissions 
can be different. However, Spark currently always calls `TableCatalog#loadTable` 
to look up the table, regardless of whether it's for a read or a write.

This PR adds a variant of `loadTable`: `loadTableForWrite`, in 
`TableCatalog`. All the write commands will call this new method to look up 
tables instead. This new method has a default implementation that just calls 
`loadTable`, so there is no breaking change.

allow more fine-grained access control for custom catalogs.

No

new tests

no

Closes #47772 from cloud-fan/write.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit b6164e6ed7174201057cf8f9ad59f59d6f60f089)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  | 20 +
 .../sql/connector/catalog/TableWritePrivilege.java | 40 ++
 .../spark/sql/catalyst/analysis/Analyzer.scala |  8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   | 27 +++
 .../spark/sql/catalyst/parser/AstBuilder.scala | 57 -
 .../sql/catalyst/plans/logical/v2Commands.scala| 15 
 .../sql/connector/catalog/CatalogV2Util.scala  | 17 +++-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 29 ---
 .../sql/catalyst/parser/PlanParserSuite.scala  | 93 --
 .../org/apache/spark/sql/DataFrameWriter.scala | 18 +++--
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 11 ++-
 .../catalyst/analysis/ResolveSessionCatalog.scala  |  4 +-
 .../sql-tests/analyzer-results/explain-aqe.sql.out |  2 +-
 .../sql-tests/analyzer-results/explain.sql.out |  2 +-
 .../sql-tests/results/explain-aqe.sql.out  |  2 +-
 .../resources/sql-tests/results/explain.sql.out|  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 83 +++
 .../command/AlignAssignmentsSuiteBase.scala|  4 +-
 .../execution/command/PlanResolutionSuite.scala|  6 +-
 19 files changed, 344 insertions(+), 96 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index b990f59bfd90..387477d0f191 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -110,6 +110,26 @@ public interface TableCatalog extends CatalogPlugin {
*/
   Table loadTable(Identifier ident) throws NoSuchTableException;
 
+  /**
+   * Load table metadata by {@link Identifier identifier} from the catalog. 
Spark will write data
+   * into this table later.
+   * 
+   * If the catalog supports views and contains a view for the identifier and 
not a table, this
+   * must throw {@link NoSuchTableException}.
+   *
+   * @param ident a table identifier
+   * @param writePrivileges
+   * @return the table's metadata
+   * @throws NoSuchTableException If the table doesn't exist or is a view
+   *
+   * @since 3.5.3
+   */
+  default Table loadTable(
+  Identifier ident,
+  Set writePrivileges) throws NoSuchTableException {
+return loadTable(ident);
+  }
+
   /**
* Load table metadata of a specific version by {@link Identifier 
identifier} from the catalog.
* 
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
new file mode 100644
index ..ca2d4ba9e7b4
--- /dev/null
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to i

(spark) branch master updated: [SPARK-49246][SQL] TableCatalog#loadTable should indicate if it's for writing

2024-08-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b6164e6ed717 [SPARK-49246][SQL] TableCatalog#loadTable should indicate 
if it's for writing
b6164e6ed717 is described below

commit b6164e6ed7174201057cf8f9ad59f59d6f60f089
Author: Wenchen Fan 
AuthorDate: Wed Aug 21 10:53:31 2024 +0800

[SPARK-49246][SQL] TableCatalog#loadTable should indicate if it's for 
writing

### What changes were proposed in this pull request?

For custom catalogs that have access control, read and write permissions 
can be different. However, Spark currently always calls `TableCatalog#loadTable` 
to look up the table, regardless of whether it's for a read or a write.

This PR adds a variant of `loadTable`: `loadTableForWrite`, in 
`TableCatalog`. All the write commands will call this new method to look up 
tables instead. This new method has a default implementation that just calls 
`loadTable`, so there is no breaking change.
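
A rough sketch (not from this patch) of how a catalog with access control might use 
the new overload; `checkPrivileges` and `currentUser` are hypothetical:

```scala
override def loadTable(
    ident: Identifier,
    writePrivileges: java.util.Set[TableWritePrivilege]): Table = {
  // Spark passes the requested write privileges (e.g. INSERT) when the lookup is for
  // a write command, so the catalog can reject the query before planning the write.
  checkPrivileges(currentUser(), ident, writePrivileges)
  loadTable(ident)  // the default implementation simply does this without the check
}
```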

### Why are the changes needed?

allow more fine-grained access control for custom catalogs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47772 from cloud-fan/write.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  | 20 +
 .../sql/connector/catalog/TableWritePrivilege.java | 40 ++
 .../spark/sql/catalyst/analysis/Analyzer.scala |  8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   | 27 +++
 .../spark/sql/catalyst/parser/AstBuilder.scala | 56 -
 .../sql/catalyst/plans/logical/v2Commands.scala| 15 
 .../sql/connector/catalog/CatalogV2Util.scala  | 17 +++-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 29 ---
 .../sql/catalyst/parser/PlanParserSuite.scala  | 93 --
 .../org/apache/spark/sql/DataFrameWriter.scala | 18 +++--
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 11 ++-
 .../org/apache/spark/sql/MergeIntoWriter.scala |  3 +-
 .../catalyst/analysis/ResolveSessionCatalog.scala  |  4 +-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |  8 +-
 .../sql-tests/analyzer-results/explain-aqe.sql.out |  2 +-
 .../sql-tests/analyzer-results/explain.sql.out |  2 +-
 .../sql-tests/results/explain-aqe.sql.out  |  2 +-
 .../resources/sql-tests/results/explain.sql.out|  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 84 +++
 .../command/AlignAssignmentsSuiteBase.scala|  4 +-
 .../execution/command/PlanResolutionSuite.scala|  6 +-
 21 files changed, 351 insertions(+), 100 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index facfc0d774e8..ad4fe743218f 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -110,6 +110,26 @@ public interface TableCatalog extends CatalogPlugin {
*/
   Table loadTable(Identifier ident) throws NoSuchTableException;
 
+  /**
+   * Load table metadata by {@link Identifier identifier} from the catalog. 
Spark will write data
+   * into this table later.
+   * 
+   * If the catalog supports views and contains a view for the identifier and 
not a table, this
+   * must throw {@link NoSuchTableException}.
+   *
+   * @param ident a table identifier
+   * @param writePrivileges
+   * @return the table's metadata
+   * @throws NoSuchTableException If the table doesn't exist or is a view
+   *
+   * @since 3.5.3
+   */
+  default Table loadTable(
+  Identifier ident,
+  Set writePrivileges) throws NoSuchTableException {
+return loadTable(ident);
+  }
+
   /**
* Load table metadata of a specific version by {@link Identifier 
identifier} from the catalog.
* 
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
new file mode 100644
index ..ca2d4ba9e7b4
--- /dev/null
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableWritePrivilege.java
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this fil

(spark) branch master updated: [SPARK-45891][FOLLOW-UP][VARIANT] Address post-merge comments

2024-08-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 469ebcf4fe41 [SPARK-45891][FOLLOW-UP][VARIANT] Address post-merge 
comments
469ebcf4fe41 is described below

commit 469ebcf4fe41e4c951f4c061ffc31a2dd5a89fc3
Author: Harsh Motwani 
AuthorDate: Sat Aug 17 10:28:45 2024 +0800

[SPARK-45891][FOLLOW-UP][VARIANT] Address post-merge comments

### What changes were proposed in this pull request?

The minor post-merge comments from 
https://github.com/apache/spark/pull/47473 have been addressed

### Why are the changes needed?

Improved code style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing VariantExpressionSuite passes

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47792 from harshmotw-db/harshmotw-db/PR_fix.

Authored-by: Harsh Motwani 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/util/DayTimeIntervalUtils.java| 155 ++---
 1 file changed, 76 insertions(+), 79 deletions(-)

diff --git 
a/common/utils/src/main/scala/org/apache/spark/util/DayTimeIntervalUtils.java 
b/common/utils/src/main/scala/org/apache/spark/util/DayTimeIntervalUtils.java
index ce86ee152393..00d5f0cd8ed2 100644
--- 
a/common/utils/src/main/scala/org/apache/spark/util/DayTimeIntervalUtils.java
+++ 
b/common/utils/src/main/scala/org/apache/spark/util/DayTimeIntervalUtils.java
@@ -17,31 +17,31 @@
 
 package org.apache.spark.util;
 
-import org.apache.spark.SparkException;
-
 import java.math.BigDecimal;
 import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.spark.SparkException;
 
 // Replicating code from SparkIntervalUtils so code in the 'common' space can 
work with
 // year-month intervals.
 public class DayTimeIntervalUtils {
-  private static byte DAY = 0;
-  private static byte HOUR = 1;
-  private static byte MINUTE = 2;
-  private static byte SECOND = 3;
-  private static long HOURS_PER_DAY = 24;
-  private static long MINUTES_PER_HOUR = 60;
-  private static long SECONDS_PER_MINUTE = 60;
-  private static long MILLIS_PER_SECOND = 1000;
-  private static long MICROS_PER_MILLIS = 1000;
-  private static long MICROS_PER_SECOND = MICROS_PER_MILLIS * 
MILLIS_PER_SECOND;
-  private static long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * 
MICROS_PER_SECOND;
-  private static long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE;
-  private static long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR;
-  private static long MAX_DAY = Long.MAX_VALUE / MICROS_PER_DAY;
-  private static long MAX_HOUR = Long.MAX_VALUE / MICROS_PER_HOUR;
-  private static long MAX_MINUTE = Long.MAX_VALUE / MICROS_PER_MINUTE;
-  private static long MAX_SECOND = Long.MAX_VALUE / MICROS_PER_SECOND;
+  private static final byte DAY = 0;
+  private static final byte HOUR = 1;
+  private static final byte MINUTE = 2;
+  private static final byte SECOND = 3;
+  private static final long HOURS_PER_DAY = 24;
+  private static final long MINUTES_PER_HOUR = 60;
+  private static final long SECONDS_PER_MINUTE = 60;
+  private static final long MILLIS_PER_SECOND = 1000;
+  private static final long MICROS_PER_MILLIS = 1000;
+  private static final long MICROS_PER_SECOND = MICROS_PER_MILLIS * 
MILLIS_PER_SECOND;
+  private static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * 
MICROS_PER_SECOND;
+  private static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * 
MICROS_PER_MINUTE;
+  private static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR;
+  private static final long MAX_HOUR = Long.MAX_VALUE / MICROS_PER_HOUR;
+  private static final long MAX_MINUTE = Long.MAX_VALUE / MICROS_PER_MINUTE;
+  private static final long MAX_SECOND = Long.MAX_VALUE / MICROS_PER_SECOND;
 
   public static String fieldToString(byte field) throws SparkException {
 if (field == DAY) {
@@ -65,70 +65,67 @@ public class DayTimeIntervalUtils {
   throws SparkException {
 String sign = "";
 long rest = micros;
-try {
-  String from = fieldToString(startField).toUpperCase();
-  String to = fieldToString(endField).toUpperCase();
-  String prefix = "INTERVAL '";
-  String postfix = startField == endField ? "' " + from : "' " + from + " 
TO " + to;
-  if (micros < 0) {
-if (micros == Long.MIN_VALUE) {
-  // Especial handling of minimum `Long` value because negate op 
overflows `Long`.
-  // seconds = 106751991 * (24 * 60 * 60) + 4 * 60 * 60 + 54 = 
9223372036854
-  // microseconds = -922337203685400L-775808 == Long.MinValue
-  String baseStr = "-106751991 04:00:54.775808

(spark) branch master updated: [SPARK-49250][SQL] Improve error message for nested UnresolvedWindowExpression in CheckAnalysis

2024-08-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 291647daf483 [SPARK-49250][SQL] Improve error message for nested 
UnresolvedWindowExpression in CheckAnalysis
291647daf483 is described below

commit 291647daf48347057fec0a734cc550555ad64bd3
Author: Vladimir Golubev 
AuthorDate: Sat Aug 17 00:46:12 2024 +0800

[SPARK-49250][SQL] Improve error message for nested 
UnresolvedWindowExpression in CheckAnalysis

### What changes were proposed in this pull request?

When `CheckAnalysis` encounters `UnresolvedWindowExpression` in `Project` 
or `Aggregate`, it throws 
`QueryCompilationErrors.windowSpecificationNotDefinedError`:
- https://github.com/apache/spark/commit/4718d59c6c4
- https://github.com/apache/spark/commit/0b48d3f61b7

However, consider the following query:

`SELECT (SUM(col1) OVER(unspecified_window) / 1) FROM VALUES (1)`

Here `UnresolvedWindowExpression` is wrapped inside the division expression. 
And `CheckAnalysis` throws a different unrelated 
`org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] 
Invalid call to dataType on unresolved object SQLSTATE: XX000` exception 
earlier from this 
[case](https://github.com/apache/spark/blob/0b48d3f61b726209e96b0b967530534b5ad9101d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L308).

The solution to improve this would be to match `UnresolvedWindowExpression` 
early.
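
Concretely (mirroring the added test), the nested case now fails with the intended 
error:

```scala
// Before: [INTERNAL_ERROR] Invalid call to dataType on unresolved object.
// After: AnalysisException with error class MISSING_WINDOW_SPECIFICATION.
spark.sql("SELECT (SUM(col1) OVER(unspecified_window) / 1) FROM VALUES (1)")
```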

### Why are the changes needed?

To improve error message for incorrect window usage.

### Does this PR introduce _any_ user-facing change?

Yes, the error message is better now

### How was this patch tested?

Added a unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47775 from 
vladimirg-db/vladimirg-db/better-error-message-for-unresolved-window-expression-in-check-analysis.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  5 +
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 22 ++
 2 files changed, 27 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 594573f13cf0..aa5a2cd95074 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -305,6 +305,11 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   throw 
QueryCompilationErrors.invalidStarUsageError(operator.nodeName, Seq(s))
 }
 
+  // Should be before `e.checkInputDataTypes()` to produce the correct 
error for unknown
+  // window expressions nested inside other expressions
+  case UnresolvedWindowExpression(_, WindowSpecReference(windowName)) 
=>
+throw 
QueryCompilationErrors.windowSpecificationNotDefinedError(windowName)
+
   case e: Expression if e.checkInputDataTypes().isFailure =>
 e.checkInputDataTypes() match {
   case checkRes: TypeCheckResult.DataTypeMismatch =>
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index b5f1707c7a26..55313c8ac2f8 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -4887,6 +4887,28 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
   assert(relations.head.options == Map("key1" -> "1", "key2" -> "2"))
 }
   }
+
+  test(
+"SPARK-49250: CheckAnalysis for UnresolvedWindowExpression must produce " +
+"MISSING_WINDOW_SPECIFICATION error"
+  ) {
+for (sqlText <- Seq(
+  "SELECT SUM(col1) OVER(unspecified_window) FROM VALUES (1)",
+  "SELECT SUM(col1) OVER(unspecified_window) FROM VALUES (1) GROUP BY 
col1",
+  "SELECT (SUM(col1) OVER(unspecified_window) / 1) FROM VALUES (1)"
+)) {
+  checkError(
+exception = intercept[AnalysisException](
+  sql(sqlText)
+),
+errorClass = "MISSING_WINDOW_SPECIFICATION",
+parameters = Map(
+  "windowName" -> "unspecified_window",
+  "docroot" -> SPARK_DOC_ROOT
+)
+  )
+}
+  }
 }
 
 case class Foo(bar: Option[String])


---

[jira] [Resolved] (SPARK-49250) CheckAnalysis with UnresolvedWindowExpression nested inside other expressions produces confusing error message

2024-08-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49250.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47775
[https://github.com/apache/spark/pull/47775]

> CheckAnalysis with UnresolvedWindowExpression nested inside other expressions 
> produces confusing error message
> --
>
> Key: SPARK-49250
> URL: https://issues.apache.org/jira/browse/SPARK-49250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Consider the following query: `SELECT (SUM(col1) OVER(unspecified_window) / 
> 1) FROM VALUES (1)`. Currently the resolution produces 
> `org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] 
> Invalid call to dataType on unresolved object SQLSTATE: XX000`, which is 
> confusing. We need to change it to `ExtendedAnalysisException: 
> [MISSING_WINDOW_SPECIFICATION]` 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49250) CheckAnalysis with UnresolvedWindowExpression nested inside other expressions produces confusing error message

2024-08-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49250:
---

Assignee: Vladimir Golubev

> CheckAnalysis with UnresolvedWindowExpression nested inside other expressions 
> produces confusing error message
> --
>
> Key: SPARK-49250
> URL: https://issues.apache.org/jira/browse/SPARK-49250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> Consider the following query: `SELECT (SUM(col1) OVER(unspecified_window) / 
> 1) FROM VALUES (1)`. Currently the resolution produces 
> `org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] 
> Invalid call to dataType on unresolved object SQLSTATE: XX000`, which is 
> confusing. We need to change it to `ExtendedAnalysisException: 
> [MISSING_WINDOW_SPECIFICATION]` 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49211) V2 Catalog can also support built-in data sources

2024-08-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49211:

Fix Version/s: 3.5.3

> V2 Catalog can also support built-in data sources
> -
>
> Key: SPARK-49211
> URL: https://issues.apache.org/jira/browse/SPARK-49211
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49211][SQL][3.5] V2 Catalog can also support built-in data sources

2024-08-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 3148cfa10bd7 [SPARK-49211][SQL][3.5] V2 Catalog can also support 
built-in data sources
3148cfa10bd7 is described below

commit 3148cfa10bd781f258afc5dbfb7bdc8bfcec269c
Author: Rui Wang 
AuthorDate: Fri Aug 16 22:33:35 2024 +0800

[SPARK-49211][SQL][3.5] V2 Catalog can also support built-in data sources

### What changes were proposed in this pull request?

V2 Catalog can also support built-in data sources.

### Why are the changes needed?

V2 catalog could still support spark built-in data sources if the V2 
catalog returns v1 table and do not track partitions in catalog. This is 
because we do not need to require V2 catalog to implement every thing to 
support built-in data sources (as that is a big chunk of work).

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT
### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #4 from amaliujia/branch-3.5.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala |   9 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  |  21 +++--
 .../apache/spark/sql/catalyst/identifiers.scala|   4 +
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  10 ++-
 .../execution/datasources/DataSourceStrategy.scala |  19 ++--
 .../spark/sql/StatisticsCollectionTestBase.scala   |   6 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 100 -
 .../spark/sql/execution/command/DDLSuite.scala |   9 +-
 .../execution/command/v1/TruncateTableSuite.scala  |   6 +-
 .../spark/sql/hive/HiveMetastoreCatalog.scala  |  14 +--
 10 files changed, 164 insertions(+), 34 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index bd917fd73e20..fd0a0715b634 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1246,7 +1246,14 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 options: CaseInsensitiveStringMap,
 isStreaming: Boolean): Option[LogicalPlan] = {
   table.map {
-case v1Table: V1Table if CatalogV2Util.isSessionCatalog(catalog) =>
+// To utilize this code path to execute V1 commands, e.g. INSERT,
+// either it must be session catalog, or tracksPartitionsInCatalog
+// must be false so it does not require use catalog to manage 
partitions.
+// Obviously we cannot execute V1Table by V1 code path if the table
+// is not from session catalog and the table still requires its catalog
+// to manage partitions.
+case v1Table: V1Table if CatalogV2Util.isSessionCatalog(catalog)
+  || !v1Table.catalogTable.tracksPartitionsInCatalog =>
   if (isStreaming) {
 if (v1Table.v1Table.tableType == CatalogTableType.VIEW) {
   throw 
QueryCompilationErrors.permanentViewNotSupportedByStreamingReadingAPIError(
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index cba928dfa924..99074b859a7f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@@ -40,6 +40,7 @@ import 
org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Subque
 import org.apache.spark.sql.catalyst.trees.{CurrentOrigin, Origin}
 import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, StringUtils}
 import org.apache.spark.sql.connector.catalog.CatalogManager
+import 
org.apache.spark.sql.connector.catalog.CatalogManager.SESSION_CATALOG_NAME
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.StaticSQLConf.GLOBAL_TEMP_DATABASE
@@ -193,7 +194,7 @@ class SessionCatalog(
 }
   }
 
-  private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
+  private val tableRelationCache: Cache[FullQualifiedTableName, LogicalPlan] = 
{
 var builder = CacheBuilder.newBuilder()
   .maximumSize(cacheSize)
 
@@ -201,33 +202,34 @@ class SessionCatalog(
   builder = builder.expireAfterWrite(cacheTTL, TimeUnit.SECONDS)
 }
 
-builder.build[QualifiedTableName, Logic

[jira] [Updated] (SPARK-49246) TableCatalog#loadTable should indicate if it's for writing

2024-08-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49246:

Summary: TableCatalog#loadTable should indicate if it's for writing  (was: 
loadTable should indicate if it's for writing)

> TableCatalog#loadTable should indicate if it's for writing
> --
>
> Key: SPARK-49246
> URL: https://issues.apache.org/jira/browse/SPARK-49246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49246) loadTable should indicate if it's for writing

2024-08-15 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49246:
---

 Summary: loadTable should indicate if it's for writing
 Key: SPARK-49246
 URL: https://issues.apache.org/jira/browse/SPARK-49246
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49211) V2 Catalog can also support built-in data sources

2024-08-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49211.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47723
[https://github.com/apache/spark/pull/47723]

> V2 Catalog can also support built-in data sources
> -
>
> Key: SPARK-49211
> URL: https://issues.apache.org/jira/browse/SPARK-49211
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (0219d60224ad -> b9fbdf010efe)

2024-08-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0219d60224ad [SPARK-48882][SS] Assign names to streaming output mode 
related error classes
 add b9fbdf010efe [SPARK-49211][SQL] V2 Catalog can also support built-in 
data sources

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala |   9 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  |  21 +++--
 .../apache/spark/sql/catalyst/identifiers.scala|   4 +
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  10 ++-
 .../execution/datasources/DataSourceStrategy.scala |  19 ++--
 .../datasources/v2/V2SessionCatalog.scala  |   5 +-
 .../spark/sql/StatisticsCollectionTestBase.scala   |   6 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 100 -
 .../spark/sql/execution/command/DDLSuite.scala |   9 +-
 .../execution/command/v1/TruncateTableSuite.scala  |   6 +-
 .../spark/sql/hive/HiveMetastoreCatalog.scala  |  16 ++--
 11 files changed, 168 insertions(+), 37 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49152][SQL][FOLLOWUP][3.5] table location string should be Hadoop Path string

2024-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 8d05bf2bd017 [SPARK-49152][SQL][FOLLOWUP][3.5] table location string 
should be Hadoop Path string
8d05bf2bd017 is described below

commit 8d05bf2bd0173d291848f4538ff218330229f9cb
Author: Wenchen Fan 
AuthorDate: Thu Aug 15 10:47:18 2024 +0800

[SPARK-49152][SQL][FOLLOWUP][3.5] table location string should be Hadoop 
Path string

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/47660 to undo an 
unintended behavior change. The table location string should be a Hadoop Path 
string instead of a URL string, which escapes all special characters.

### Why are the changes needed?

Reverts the unintentional behavior change.
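
As an aside (illustrative only, not part of this patch), the difference
between the two string forms can be seen directly with Hadoop's Path class:

```
// A URI string percent-escapes special characters such as spaces, while the
// Hadoop Path string form keeps them unescaped.
import java.net.URI
import org.apache.hadoop.fs.Path

val location = new URI("file:/tmp/my%20table")
val uriString  = location.toString             // "file:/tmp/my%20table"
val pathString = new Path(location).toString   // "file:/tmp/my table"
```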

### Does this PR introduce _any_ user-facing change?

No, it's not released yet

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47765 from cloud-fan/fix.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  |  2 +-
 .../spark/sql/connector/catalog/V1Table.scala  |  8 ---
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 11 +++--
 .../datasources/v2/DataSourceV2Strategy.scala  | 16 +++--
 .../sql-tests/results/show-create-table.sql.out|  4 ++--
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 28 ++
 .../command/v2/ShowCreateTableSuite.scala  |  2 +-
 .../apache/spark/sql/internal/CatalogSuite.scala   |  2 +-
 9 files changed, 57 insertions(+), 18 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index 29c2da307a0f..b990f59bfd90 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -46,7 +46,7 @@ public interface TableCatalog extends CatalogPlugin {
 
   /**
* A reserved property to specify the location of the table. The files of 
the table
-   * should be under this location.
+   * should be under this location. The location is a Hadoop Path string.
*/
   String PROP_LOCATION = "location";
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
index da201e816497..8928ba57f06c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
@@ -22,7 +22,7 @@ import java.util
 import scala.collection.JavaConverters._
 import scala.collection.mutable
 
-import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, 
CatalogUtils}
 import 
org.apache.spark.sql.connector.catalog.CatalogV2Implicits.TableIdentifierHelper
 import org.apache.spark.sql.connector.catalog.V1Table.addV2TableProperties
 import org.apache.spark.sql.connector.expressions.{LogicalExpressions, 
Transform}
@@ -38,7 +38,7 @@ private[sql] case class V1Table(v1Table: CatalogTable) 
extends Table {
   lazy val options: Map[String, String] = {
 v1Table.storage.locationUri match {
   case Some(uri) =>
-v1Table.storage.properties + ("path" -> uri.toString)
+v1Table.storage.properties + ("path" -> CatalogUtils.URIToString(uri))
   case _ =>
 v1Table.storage.properties
 }
@@ -81,7 +81,9 @@ private[sql] object V1Table {
 TableCatalog.OPTION_PREFIX + key -> value } ++
   v1Table.provider.map(TableCatalog.PROP_PROVIDER -> _) ++
   v1Table.comment.map(TableCatalog.PROP_COMMENT -> _) ++
-  v1Table.storage.locationUri.map(TableCatalog.PROP_LOCATION -> 
_.toString) ++
+  v1Table.storage.locationUri.map { loc =>
+TableCatalog.PROP_LOCATION -> CatalogUtils.URIToString(loc)
+  } ++
   (if (managed) Some(TableCatalog.PROP_IS_MANAGED_LOCATION -> "true") else 
None) ++
   (if (external) Some(TableCatalog.PROP_EXTERNAL -> "true") else None) ++
   Some(TableCatalog.PROP_OWNER -> v1Table.owner)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.sca

(spark) branch master updated (67d2888c476b -> b9b14f95c725)

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 67d2888c476b [SPARK-49226][SQL] Clean-up UDF code generation
 add b9b14f95c725 [SPARK-48967][SQL] Improve performance and memory 
footprint of "INSERT INTO ... VALUES" Statements

No new revisions were added by this update.

Summary of changes:
 .../catalyst/analysis/ResolveInlineTables.scala| 101 +
 .../spark/sql/catalyst/parser/AstBuilder.scala |   9 +-
 .../EvaluateUnresolvedInlineTable.scala}   |  67 +++---
 .../org/apache/spark/sql/internal/SQLConf.scala|   8 +
 .../analysis/ResolveInlineTablesSuite.scala|  21 +-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala |  36 +++-
 .../sql/InlineTableParsingImprovementsSuite.scala  | 229 +
 7 files changed, 324 insertions(+), 147 deletions(-)
 copy 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/{analysis/ResolveInlineTables.scala
 => util/EvaluateUnresolvedInlineTable.scala} (73%)
 create mode 100644 
sql/core/src/test/scala/org/apache/spark/sql/InlineTableParsingImprovementsSuite.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48967) Improve performance and memory footprint of "INSERT INTO ... VALUES" Statements

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48967:
---

Assignee: Costas Zarifis

> Improve performance and memory footprint of "INSERT INTO ... VALUES" 
> Statements
> ---
>
> Key: SPARK-48967
> URL: https://issues.apache.org/jira/browse/SPARK-48967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.4
>Reporter: Costas Zarifis
>Assignee: Costas Zarifis
>Priority: Major
>  Labels: pull-request-available
>
> Currently, very large "INSERT INTO ... VALUES" statements result in
> disproportionately large parse trees, as each literal must remain in the
> parse tree until it is eventually evaluated into a LocalTable once the
> appropriate analyzer/optimizer rules have been applied.
>  
> This results in increased memory pressure on the driver when such large
> statements are generated, which can lead to OOMs and GC pauses. It also
> results in suboptimal runtime performance, as the time it takes to apply
> analyzer/optimizer rules is typically proportional to the size of the
> parse tree.
>  
> Both issues can be resolved by eagerly evaluating the unresolved inline
> table into a local table from the AST builder, thus short-circuiting the
> evaluation of such statements.
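
A minimal sketch of the kind of statement in question (not from the ticket;
it assumes an active SparkSession named `spark` and a throwaway table name):

```
// Hypothetical example: a generated multi-row INSERT whose literals would
// previously all stay in the parse tree until the optimizer evaluated them.
val rows = (1 to 50000).map(i => s"($i, 'name_$i')").mkString(", ")
spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")
spark.sql(s"INSERT INTO t VALUES $rows")
```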



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48967) Improve performance and memory footprint of "INSERT INTO ... VALUES" Statements

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48967.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47428
[https://github.com/apache/spark/pull/47428]

> Improve performance and memory footprint of "INSERT INTO ... VALUES" 
> Statements
> ---
>
> Key: SPARK-48967
> URL: https://issues.apache.org/jira/browse/SPARK-48967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.4
>Reporter: Costas Zarifis
>Assignee: Costas Zarifis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, very large "INSERT INTO ... VALUES" statements result in
> disproportionately large parse trees, as each literal must remain in the
> parse tree until it is eventually evaluated into a LocalTable once the
> appropriate analyzer/optimizer rules have been applied.
>  
> This results in increased memory pressure on the driver when such large
> statements are generated, which can lead to OOMs and GC pauses. It also
> results in suboptimal runtime performance, as the time it takes to apply
> analyzer/optimizer rules is typically proportional to the size of the
> parse tree.
>  
> Both issues can be resolved by eagerly evaluating the unresolved inline
> table into a local table from the AST builder, thus short-circuiting the
> evaluation of such statements.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49166][SQL] Support OFFSET in correlated subquery

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 56a90add735c [SPARK-49166][SQL] Support OFFSET in correlated subquery
56a90add735c is described below

commit 56a90add735c1bf58b17eb6e1c72f23ff762250c
Author: Avery Qi 
AuthorDate: Wed Aug 14 10:40:09 2024 +0800

[SPARK-49166][SQL] Support OFFSET in correlated subquery

### What changes were proposed in this pull request?

1. Change the limit-handling logic in DecorrelateInnerQuery to also handle an 
Offset operator under a Limit in a correlated subquery.
2. Add offset-handling logic to DecorrelateInnerQuery to handle a standalone 
Offset operator in a correlated subquery.

### Why are the changes needed?

The Offset operator in a correlated subquery was not previously supported.

### Does this PR introduce _any_ user-facing change?

Yes, it allows users to write OFFSET in a correlated subquery.
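
For example (illustrative only; the table and column names below are made
up), a correlated scalar subquery can now skip its top row per outer value:

```
// Hypothetical query: the second-highest order amount per customer, relying
// on OFFSET inside a correlated subquery (decorrelated via a row_number()
// style rewrite).
spark.sql("""
  SELECT c.id,
         (SELECT o.amount
          FROM orders o
          WHERE o.customer_id = c.id
          ORDER BY o.amount DESC
          LIMIT 1 OFFSET 1) AS second_highest_amount
  FROM customers c
""")
```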

### How was this patch tested?

GeneratedSubquerySuite.scala

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47673 from averyqi-db/SPARK-49166.

Lead-authored-by: Avery Qi 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../jdbc/querytest/GeneratedSubquerySuite.scala|   9 +-
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  78 -
 .../org/apache/spark/sql/internal/SQLConf.scala|   2 +-
 .../exists-subquery/exists-orderby-limit.sql.out   |  70 +++--
 .../subquery/in-subquery/in-limit.sql.out  | 329 +++--
 .../subquery/subquery-offset.sql.out   | 271 -
 .../inputs/subquery/in-subquery/in-limit.sql   |  41 +++
 .../sql-tests/inputs/subquery/subquery-offset.sql  |  23 ++
 .../exists-subquery/exists-orderby-limit.sql.out   |  51 ++--
 .../results/subquery/in-subquery/in-limit.sql.out  | 190 ++--
 .../results/subquery/subquery-offset.sql.out   | 152 +-
 11 files changed, 795 insertions(+), 421 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
index b526599482da..b6291639a24e 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
@@ -145,10 +145,6 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
   None
 }
 
-// SPARK-46446: offset operator in correlated subquery is not supported
-// as it creates incorrect results for now.
-val requireNoOffsetInCorrelatedSubquery = correlationConditions.nonEmpty
-
 // For the Limit clause, consider whether the subquery needs to return 1 
row, or whether the
 // operator to be included is a Limit.
 val limitAndOffsetClause = if (requiresExactlyOneRowOutput) {
@@ -156,11 +152,10 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
 } else {
   operatorInSubquery match {
 case lo: LimitAndOffset =>
-  val offsetValue = if (requireNoOffsetInCorrelatedSubquery) 0 else 
lo.offsetValue
-  if (offsetValue == 0 && lo.limitValue == 0) {
+  if (lo.offsetValue == 0 && lo.limitValue == 0) {
 None
   } else {
-Some(LimitAndOffset(lo.limitValue, offsetValue))
+Some(lo)
   }
 case _ => None
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
index 1ebf0c7b39a4..424f4b96271d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
@@ -655,19 +655,62 @@ object DecorrelateInnerQuery extends PredicateHelper {
 val newProject = Project(newProjectList ++ referencesToAdd, 
newChild)
 (newProject, joinCond, outerReferenceMap)
 
+  case Offset(offset, input) =>
+// OFFSET K is decorrelated by skipping top k rows per every 
domain value
+// via a row_number() window function, which is similar to limit 
decorrelation.
+// Limit and Offset situation are handled by limit branch as 
offset is the child
+// of limit in that case. This branch is for the case where 
there's no limit operator
+// above offset.
+   

Re: [DISCUSS] Deprecating SparkR

2024-08-13 Thread Wenchen Fan
+1

On Tue, Aug 13, 2024 at 10:50 PM L. C. Hsieh  wrote:

> +1
>
> On Tue, Aug 13, 2024 at 2:54 AM Dongjoon Hyun 
> wrote:
> >
> > +1
> >
> > Dongjoon
> >
> > On Mon, Aug 12, 2024 at 17:52 Holden Karau 
> wrote:
> >>
> >> +1
> >>
> >> Are the sparklyr folks on this list?
> >>
> >> Twitter: https://twitter.com/holdenkarau
> >> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >> Pronouns: she/her
> >>
> >>
> >> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li  wrote:
> >>>
> >>> +1
> >>>
> >>> Hyukjin Kwon  于2024年8月12日周一 16:18写道:
> 
>  +1
> 
>  On Tue, Aug 13, 2024 at 7:04 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >
> > And just for the record, the stats that I screenshotted in that
> thread I linked to showed the following page views for each sub-section
> under `docs/latest/api/`:
> >
> > - python: 758K
> > - java: 66K
> > - sql: 39K
> > - scala: 35K
> > - r: <1K
> >
> > I don’t recall over what time period those stats were collected for,
> and there are certainly some factors of how the stats are gathered and how
> the various language API docs are accessed that impact those numbers. So
> it’s by no means a solid, objective measure. But I thought it was an
> interesting signal nonetheless.
> >
> >
> > On Aug 12, 2024, at 5:50 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >
> > Not an R user myself, but +1.
> >
> > I first wondered about the future of SparkR after noticing how low
> the visit stats were for the R API docs as compared to Python and Scala. (I
> can’t seem to find those visit stats for the API docs anymore.)
> >
> >
> > On Aug 12, 2024, at 11:47 AM, Shivaram Venkataraman <
> shivaram.venkatara...@gmail.com> wrote:
> >
> > Hi
> >
> > About ten years ago, I created the original SparkR package as part
> of my research at UC Berkeley [SPARK-5654]. After my PhD I started as a
> professor at UW-Madison and my contributions to SparkR have been in the
> background given my availability. I continue to be involved in the
> community and teach a popular course at UW-Madison which uses Apache Spark
> for programming assignments.
> >
> > As the original contributor and author of a research paper on
> SparkR, I also continue to get private emails from users. A common question
> I get is whether one should use SparkR in Apache Spark or the sparklyr
> package (built on top of Apache Spark). You can also see this in
> StackOverflow questions and other blog posts online:
> https://www.google.com/search?q=sparkr+vs+sparklyr . While, I have
> encouraged users to choose the SparkR package as it is maintained by the
> Apache project, the more I looked into sparklyr, the more I was convinced
> that it is a better choice for R users that want to leverage the power of
> Spark:
> >
> > (1) sparklyr is developed by a community of developers who
> understand the R programming language deeply, and as a result is more
> idiomatic. In hindsight, sparklyr’s more idiomatic approach would have been
> a better choice than the Scala-like API we have in SparkR.
> >
> > (2) Contributions to SparkR have decreased slowly. Over the last two
> years, there have been 65 commits on the Spark R codebase (compared to
> ~2200 on the Spark Python code base). In contrast Sparklyr has over 300
> commits in the same period..
> >
> > (3) Previously, using and deploying sparklyr had been cumbersome as
> it needed careful alignment of versions between Apache Spark and sparklyr.
> However, the sparklyr community has implemented a new Spark Connect based
> architecture which eliminates this issue.
> >
> > (4) The sparklyr community has maintained their package on CRAN – it
> takes some effort to do this as the CRAN release process requires passing a
> number of tests. While SparkR was on CRAN initially, we could not maintain
> that given our release process and cadence. This makes sparklyr much more
> accessible to the R community.
> >
> > So it is with a bittersweet feeling that I’m writing this email to
> propose that we deprecate SparkR, and recommend sparklyr as the R language
> binding for Spark. This will reduce complexity of our own codebase, and
> more importantly reduce confusion for users. As the sparklyr package is
> distributed using the same permissive license as Apache Spark, there should
> be no downside for existing SparkR users in adopting it.
> >
> > My proposal is to mark SparkR as deprecated in the upcoming Spark 4
> release, and remove it from Apache Spark with the following major release,
> Spark 5.
> >
> > I’m looking forward to hearing your thoughts and feedback on this
> proposal and I’m happy to create the SPIP ticket for a vote on this
> proposal using this email thread as the justification.
> >
> > Thanks
> > S

[jira] [Assigned] (SPARK-49207) Fix SplitPart one-to-many case mapping (UTF8_LCASE)

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49207:
---

Assignee: Uroš Bojanić

> Fix SplitPart one-to-many case mapping (UTF8_LCASE)
> ---
>
> Key: SPARK-49207
> URL: https://issues.apache.org/jira/browse/SPARK-49207
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Fix the following string expressions to handle one-to-many case mapping 
> properly:
>  * SplitPart
>  * StringSplitSQL
>  
> Examples of incorrect results (under {{UTF8_LCASE}} collation):
> {code:java}
> SplitPart("Ai\u0307B", "İ", 2) // returns: "\u0307B" (incorrect), instead of: 
> "B" (correct)
> SplitPart("AİB", "i\u0307", 1) // returns: "AİB", instead of: "A", "B" 
> (correct){code}
>  
> {code:java}
> StringSplitSQL("Ai\u0307B", "İ") // returns: ["A", "\u0307B"] (incorrect), 
> instead of: ["A", "B"] (correct)
> StringSplitSQL("AİB", "i\u0307") // returns: ["AİB"] (incorrect), instead of: 
> ["A", "B"] (correct){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49207) Fix SplitPart one-to-many case mapping (UTF8_LCASE)

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49207.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47715
[https://github.com/apache/spark/pull/47715]

> Fix SplitPart one-to-many case mapping (UTF8_LCASE)
> ---
>
> Key: SPARK-49207
> URL: https://issues.apache.org/jira/browse/SPARK-49207
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Fix the following string expressions to handle one-to-many case mapping 
> properly:
>  * SplitPart
>  * StringSplitSQL
>  
> Examples of incorrect results (under {{UTF8_LCASE}} collation):
> {code:java}
> SplitPart("Ai\u0307B", "İ", 2) // returns: "\u0307B" (incorrect), instead of: 
> "B" (correct)
> SplitPart("AİB", "i\u0307", 1) // returns: "AİB", instead of: "A", "B" 
> (correct){code}
>  
> {code:java}
> StringSplitSQL("Ai\u0307B", "İ") // returns: ["A", "\u0307B"] (incorrect), 
> instead of: ["A", "B"] (correct)
> StringSplitSQL("AİB", "i\u0307") // returns: ["AİB"] (incorrect), instead of: 
> ["A", "B"] (correct){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49207][SQL] Fix one-to-many case mapping in SplitPart and StringSplitSQL

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d82c69520c9b [SPARK-49207][SQL] Fix one-to-many case mapping in 
SplitPart and StringSplitSQL
d82c69520c9b is described below

commit d82c69520c9b4cc769ee929c874181256e4c96ea
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 13 23:45:22 2024 +0800

[SPARK-49207][SQL] Fix one-to-many case mapping in SplitPart and 
StringSplitSQL

### What changes were proposed in this pull request?
Fix the following string expressions to handle one-to-many case mapping 
properly:

- SplitPart
- StringSplitSQL

Examples of incorrect results (under `UTF8_LCASE` collation):

```
SplitPart("Ai\u0307B", "İ", 2) // returns: "\u0307B" (incorrect), instead 
of: "B" (correct)
SplitPart("AİB", "i\u0307", 1) // returns: "AİB", instead of: "A", "B" 
(correct)

StringSplitSQL("Ai\u0307B", "İ") // returns: ["A", "\u0307B"] (incorrect), 
instead of: ["A", "B"] (correct)
StringSplitSQL("AİB", "i\u0307") // returns: ["AİB"] (incorrect), instead 
of: ["A", "B"] (correct)
```
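
For context (illustrative only, not part of the patch), the one-to-many
mapping can be reproduced with plain JDK string APIs:

```
// Turkish capital İ (U+0130) lowercases to TWO code points: 'i' (U+0069)
// followed by a combining dot above (U+0307), which is the one-to-many case.
val upper = "\u0130"                                  // "İ", one code point
val lower = upper.toLowerCase(java.util.Locale.ROOT)  // "i\u0307"
val lengths = (upper.length, lower.length)            // (1, 2)
```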

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working 
with one-to-many case mapping.

### Does this PR introduce _any_ user-facing change?
Yes, this expression will now work properly with one-to-many case mapping: 
`split_part`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47715 from uros-db/fix-splitpart.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|  40 --
 .../spark/unsafe/types/CollationSupportSuite.java  | 154 ++---
 2 files changed, 161 insertions(+), 33 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index ff608d7ea10e..9f26cc0bac21 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -36,7 +36,6 @@ import java.util.HashSet;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
-import java.util.regex.Pattern;
 
 /**
  * Utility class for collation-aware UTF8String operations.
@@ -1208,16 +1207,35 @@ public class CollationAwareUTF8String {
 
   public static UTF8String[] lowercaseSplitSQL(final UTF8String string, final 
UTF8String delimiter,
   final int limit) {
-  if (delimiter.numBytes() == 0) return new UTF8String[] { string };
-  if (string.numBytes() == 0) return new UTF8String[] { 
UTF8String.EMPTY_UTF8 };
-  Pattern pattern = Pattern.compile(Pattern.quote(delimiter.toString()),
-CollationSupport.lowercaseRegexFlags);
-  String[] splits = pattern.split(string.toString(), limit);
-  UTF8String[] res = new UTF8String[splits.length];
-  for (int i = 0; i < res.length; i++) {
-res[i] = UTF8String.fromString(splits[i]);
+if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+if (string.numBytes() == 0) return new UTF8String[] { 
UTF8String.EMPTY_UTF8 };
+
+List strings = new ArrayList<>();
+UTF8String lowercaseDelimiter = lowerCaseCodePoints(delimiter);
+int startIndex = 0, nextMatch = 0, nextMatchLength;
+while (nextMatch != MATCH_NOT_FOUND) {
+  if (limit > 0 && strings.size() == limit - 1) {
+break;
+  }
+  nextMatch = lowercaseFind(string, lowercaseDelimiter, startIndex);
+  if (nextMatch != MATCH_NOT_FOUND) {
+nextMatchLength = lowercaseMatchLengthFrom(string, lowercaseDelimiter, 
nextMatch);
+strings.add(string.substring(startIndex, nextMatch));
+startIndex = nextMatch + nextMatchLength;
   }
-  return res;
+}
+if (startIndex <= string.numChars()) {
+  strings.add(string.substring(startIndex, string.numChars()));
+}
+if (limit == 0) {
+  // Remove trailing empty strings
+  int i = strings.size() - 1;
+  while (i >= 0 && strings.get(i).numBytes() == 0) {
+strings.remove(i);
+i--;
+  }
+}
+return strings.toArray(new UTF8String[0]);
   }
 
   public static UTF8String[] icuSplit

(spark) branch master updated: [SPARK-49204][SQL] Fix surrogate pair handling in StringReplace

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e77c8fbe5940 [SPARK-49204][SQL] Fix surrogate pair handling in 
StringReplace
e77c8fbe5940 is described below

commit e77c8fbe5940fe990c14806f75312d9471b13ee3
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 13 23:38:05 2024 +0800

[SPARK-49204][SQL] Fix surrogate pair handling in StringReplace

### What changes were proposed in this pull request?
Fix the following string expression to handle surrogate pairs properly:

- StringReplace

The issue has to do with counting surrogate pairs, which are single Unicode 
code points (and single UTF-8 characters), but are represented using 2 
characters in UTF-16 (Java String).

Example of incorrect result (under `UNICODE` collation, but similar issues 
are noted for all ICU collations):

```
StringReplace("😄a", "a", "b") // returns: "😄ab" (incorrect), instead of: 
"😄b" (correct)
```
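
For context (illustrative only, not part of the patch), the char vs. code
point mismatch is visible directly on a plain String:

```
// "😄" (U+1F604) is one Unicode code point but two UTF-16 chars, so
// char-based lengths and offsets overshoot by one per surrogate pair.
val s = "\uD83D\uDE04" + "a"                     // "😄a"
val utf16Chars = s.length                        // 3
val codePoints = s.codePointCount(0, s.length)   // 2
```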

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working 
with surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes, this expression will now work properly with surrogate pairs: `replace`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47710 from uros-db/surrogate-replace.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 100 +++--
 .../spark/unsafe/types/CollationSupportSuite.java  | 228 +
 2 files changed, 181 insertions(+), 147 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index f89090a839ef..ff608d7ea10e 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -300,114 +300,82 @@ public class CollationAwareUTF8String {
 return lowerCaseCodePoints(left).binaryCompare(lowerCaseCodePoints(right));
   }
 
-  /*
+  /**
* Performs string replacement for ICU collations by searching for instances 
of the search
-   * string in the `src` string, with respect to the specified collation, and 
then replacing
+   * string in the `target` string, with respect to the specified collation, 
and then replacing
* them with the replace string. The method returns a new UTF8String with 
all instances of the
* search string replaced using the replace string. Similar to 
UTF8String.findInSet behavior
-   * used for UTF8_BINARY, the method returns the `src` string if the `search` 
string is empty.
+   * used for UTF8_BINARY, the method returns the `target` string if the 
`search` string is empty.
*
-   * @param src the string to be searched in
+   * @param target the string to be searched in
* @param search the string to be searched for
* @param replace the string to be used as replacement
* @param collationId the collation ID to use for string search
* @return the position of the first occurrence of `match` in `set`
*/
-  public static UTF8String replace(final UTF8String src, final UTF8String 
search,
+  public static UTF8String replace(final UTF8String target, final UTF8String 
search,
   final UTF8String replace, final int collationId) {
 // This collation aware implementation is based on existing implementation 
on UTF8String
-if (src.numBytes() == 0 || search.numBytes() == 0) {
-  return src;
-}
-
-StringSearch stringSearch = CollationFactory.getStringSearch(src, search, 
collationId);
-
-// Find the first occurrence of the search string.
-int end = stringSearch.next();
-if (end == StringSearch.DONE) {
-  // Search string was not found, so string is unchanged.
-  return src;
-}
-
-// Initialize byte positions
-int c = 0;
-int byteStart = 0; // position in byte
-int byteEnd = 0; // position in byte
-while (byteEnd < src.numBytes() && c < end) {
-  byteEnd += UTF8String.numBytesForFirstByte(src.getByte(byteEnd));
-  c += 1;
+if (target.numBytes() == 0 || search.numBytes() == 0) {
+  return target;
 }
 
-// At least one match was found. Estimate space needed for result.
-// The 16x multiplier here is chosen to match commons-lang3's 
implementation.
-i

(spark) branch master updated: [SPARK-49204][SQL] Fix surrogate pair handling in StringInstr and StringLocate

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e38e13506b09 [SPARK-49204][SQL] Fix surrogate pair handling in 
StringInstr and StringLocate
e38e13506b09 is described below

commit e38e13506b09b9739c01d9ebfd673ad47878b1d0
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 13 23:35:44 2024 +0800

[SPARK-49204][SQL] Fix surrogate pair handling in StringInstr and 
StringLocate

### What changes were proposed in this pull request?
Fix the following string expressions to handle surrogate pairs properly:

- StringInstr
- StringLocate

The issue has to do with counting surrogate pairs, which are single Unicode 
code points (and single UTF-8 characters), but are represented using 2 
characters in UTF-16 (Java String).

Example of incorrect results (under `UNICODE` collation, but similar issues 
are noted for all ICU collations):

```
StringInstr("😄a", "a") // returns: 3 (incorrect), instead of: 2 (correct)
StringLocate("a", "😄a") // returns: 3 (incorrect), instead of: 2 (correct)
```
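
For context (illustrative only), converting between UTF-16 char indices and
code point indices, which mirrors the offsetByCodePoints/codePointCount
calls used in the patched search logic:

```
// Index conversion on a string containing a surrogate pair.
val target = "\uD83D\uDE04" + "a"                             // "😄a"
val charIndex = target.indexOf('a')                           // 2 (char index)
val codePointIndex = target.codePointCount(0, charIndex)      // 1 (code point index)
val roundTrip = target.offsetByCodePoints(0, codePointIndex)  // 2 again
```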

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working 
with surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes, these expressions will now work properly with surrogate pairs: 
`instr`, `locate`/`position`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47711 from uros-db/surrogate-indexof.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|  25 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 504 +++--
 2 files changed, 386 insertions(+), 143 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index e9b9f15cae2e..f89090a839ef 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -701,11 +701,26 @@ public class CollationAwareUTF8String {
   final int start, final int collationId) {
 if (pattern.numBytes() == 0) return target.indexOfEmpty(start);
 if (target.numBytes() == 0) return MATCH_NOT_FOUND;
-
-StringSearch stringSearch = CollationFactory.getStringSearch(target, 
pattern, collationId);
-stringSearch.setIndex(start);
-
-return stringSearch.next();
+// Initialize the string search with respect to the specified ICU 
collation.
+String targetStr = target.toValidString();
+String patternStr = pattern.toValidString();
+StringSearch stringSearch =
+  CollationFactory.getStringSearch(targetStr, patternStr, collationId);
+stringSearch.setOverlapping(true);
+// Start the search from `start`-th code point (NOT necessarily from the 
`start`-th character).
+int startIndex = targetStr.offsetByCodePoints(0, start);
+stringSearch.setIndex(startIndex);
+// Perform the search and return the next result, starting from the 
specified position.
+int searchIndex = stringSearch.next();
+if (searchIndex == StringSearch.DONE) {
+  return MATCH_NOT_FOUND;
+}
+// Convert the search index from character count to code point count.
+int indexOf = targetStr.codePointCount(0, searchIndex);
+if (indexOf < start) {
+  return MATCH_NOT_FOUND;
+}
+return indexOf;
   }
 
   private static int findIndex(final StringSearch stringSearch, int count) {
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
index 6a3e17b85da8..75b71f86a5bf 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
@@ -858,8 +858,12 @@ public class CollationSupportSuite {
   "Ss Fi Ffi Ff St Σημερινος Ασημενιος İota");
   }
 
-  private void assertStringInstr(String string, String substring, String 
collationName,
-  Integer expected) throws SparkException {
+  /**
+   * Verify the behaviour of the `StringInstr` collation support class.
+   */
+
+  private void assertStringInstr(String string, String substring,
+  String collationName, int expected) th

(spark) branch master updated: [SPARK-49204][SQL] Fix surrogate pair handling in SubstringIndex

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7feea93ce4a3 [SPARK-49204][SQL] Fix surrogate pair handling in 
SubstringIndex
7feea93ce4a3 is described below

commit 7feea93ce4a3f2da6fad5f502e3b4db3a0da8fb2
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 13 23:33:46 2024 +0800

[SPARK-49204][SQL] Fix surrogate pair handling in SubstringIndex

### What changes were proposed in this pull request?
Fix the following string expression to handle surrogate pairs properly:

- SubstringIndex

The issue has to do with counting surrogate pairs, which are single Unicode 
code points (and single UTF-8 characters), but are represented using 2 
characters in UTF-16 (Java String).

Example of incorrect result (under `UNICODE` collation, but similar issues 
are noted for all ICU collations):

```
SubstringIndex("😄a", "a") // returns: "😄a" (incorrect), instead of: "😄" 
(correct)
```

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working 
with surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes, this expression will now work properly with surrogate pairs: 
`substring_index`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47712 from uros-db/surrogate-substr.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 107 +---
 .../spark/unsafe/types/CollationSupportSuite.java  | 192 +++--
 2 files changed, 219 insertions(+), 80 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 91a7cd912ca2..e9b9f15cae2e 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -26,8 +26,6 @@ import com.ibm.icu.util.ULocale;
 import org.apache.spark.unsafe.UTF8StringBuilder;
 import org.apache.spark.unsafe.types.UTF8String;
 
-import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET;
-import static org.apache.spark.unsafe.Platform.copyMemory;
 import static org.apache.spark.unsafe.types.UTF8String.CodePointIteratorType;
 
 import java.text.CharacterIterator;
@@ -710,16 +708,34 @@ public class CollationAwareUTF8String {
 return stringSearch.next();
   }
 
-  private static int find(UTF8String target, UTF8String pattern, int start,
-  int collationId) {
-assert (pattern.numBytes() > 0);
-
-StringSearch stringSearch = CollationFactory.getStringSearch(target, 
pattern, collationId);
-// Set search start position (start from character at start position)
-stringSearch.setIndex(target.bytePosToChar(start));
+  private static int findIndex(final StringSearch stringSearch, int count) {
+assert(count >= 0);
+int index = 0;
+while (count > 0) {
+  int nextIndex = stringSearch.next();
+  if (nextIndex == StringSearch.DONE) {
+return MATCH_NOT_FOUND;
+  } else if (nextIndex == index && index != 0) {
+stringSearch.setIndex(stringSearch.getIndex() + 
stringSearch.getMatchLength());
+  } else {
+count--;
+index = nextIndex;
+  }
+}
+return index;
+  }
 
-// Return either the byte position or -1 if not found
-return target.charPosToByte(stringSearch.next());
+  private static int findIndexReverse(final StringSearch stringSearch, int 
count) {
+assert(count >= 0);
+int index = 0;
+while (count > 0) {
+  index = stringSearch.previous();
+  if (index == StringSearch.DONE) {
+return MATCH_NOT_FOUND;
+  }
+  count--;
+}
+return index + stringSearch.getMatchLength();
   }
 
   public static UTF8String subStringIndex(final UTF8String string, final 
UTF8String delimiter,
@@ -727,63 +743,30 @@ public class CollationAwareUTF8String {
 if (delimiter.numBytes() == 0 || count == 0 || string.numBytes() == 0) {
   return UTF8String.EMPTY_UTF8;
 }
+String str = string.toValidString();
+String delim = delimiter.toValidString();
+StringSearch stringSearch = CollationFactory.getStringSearch(str, delim, 
collationId);
+stringSearch.setOverlapping(true);
 if (count > 0) {
-  int idx = -1;
-  while (count > 0) {
- 

[jira] [Assigned] (SPARK-49204) Handle surrogate pairs properly

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49204:
---

Assignee: Uroš Bojanić

> Handle surrogate pairs properly
> ---
>
> Key: SPARK-49204
> URL: https://issues.apache.org/jira/browse/SPARK-49204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Fix the following string expressions to handle surrogate pairs properly:
>  * StringReplace
>  * StringInstr
>  * StringLocate
>  * SubstringIndex
>  * StringTrim
>  * StringTrimLeft
>  * StringTrimRight
>  
> Examples of incorrect results (under {{ICU}} collations):
> {code:java}
> StringReplace("😄a", "a", "b") // returns: "😄ab" (incorrect), instead of: "😄b" 
> (correct){code}
>  
> {code:java}
> StringInstr("😄a", "a") // returns: 3 (incorrect), instead of: 2 
> (correct){code}
>  
> {code:java}
> StringLocate("a", "😄a") // returns: 3 (incorrect), instead of: 2 
> (correct){code}
>  
> {code:java}
> SubstringIndex("😄a", "a") // returns: "😄a" (incorrect), instead of: "😄" 
> (correct){code}
>  
> {code:java}
> StringTrim("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}
>  
> {code:java}
> StringTrimLeft("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}
>  
> {code:java}
> StringTrimRight("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49204) Handle surrogate pairs properly

2024-08-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49204.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47713
[https://github.com/apache/spark/pull/47713]

> Handle surrogate pairs properly
> ---
>
> Key: SPARK-49204
> URL: https://issues.apache.org/jira/browse/SPARK-49204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Fix the following string expressions to handle surrogate pairs properly:
>  * StringReplace
>  * StringInstr
>  * StringLocate
>  * SubstringIndex
>  * StringTrim
>  * StringTrimLeft
>  * StringTrimRight
>  
> Examples of incorrect results (under {{ICU}} collations):
> {code:java}
> StringReplace("😄a", "a", "b") // returns: "😄ab" (incorrect), instead of: "😄b" 
> (correct){code}
>  
> {code:java}
> StringInstr("😄a", "a") // returns: 3 (incorrect), instead of: 2 
> (correct){code}
>  
> {code:java}
> StringLocate("a", "😄a") // returns: 3 (incorrect), instead of: 2 
> (correct){code}
>  
> {code:java}
> SubstringIndex("😄a", "a") // returns: "😄a" (incorrect), instead of: "😄" 
> (correct){code}
>  
> {code:java}
> StringTrim("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}
>  
> {code:java}
> StringTrimLeft("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}
>  
> {code:java}
> StringTrimRight("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
> (correct){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49204][SQL] Fix surrogate pair handling in StringTrim

2024-08-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 175deba21740 [SPARK-49204][SQL] Fix surrogate pair handling in 
StringTrim
175deba21740 is described below

commit 175deba21740b30f887f31f8c645ea28ffe02dfa
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 13 23:30:37 2024 +0800

[SPARK-49204][SQL] Fix surrogate pair handling in StringTrim

### What changes were proposed in this pull request?
Fix the following string expressions to handle surrogate pairs properly:

- StringTrim
- StringTrimLeft
- StringTrimRight

The issue has to do with counting surrogate pairs, which are single Unicode 
code points (and single UTF-8 characters), but are represented using 2 
characters in UTF-16 (Java String).

Example of incorrect results (under `UNICODE` collation, but similar issues 
are noted for all ICU collations):

```
StringTrim("😄", "😄") // returns: "😄" (incorrect), instead of: "" (correct)
StringTrimLeft("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
(correct)
StringTrimRight("😄", "😄") // returns: "😄" (incorrect), instead of: "" 
(correct)
```
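
For context (illustrative only), the root cause addressed by replacing
`String.valueOf((char) codePoint)` with `new String(Character.toChars(codePoint))`:

```
// Casting a supplementary code point to char truncates it to 16 bits and
// loses the surrogate pair; Character.toChars keeps both chars.
val cp = 0x1F604                                 // code point of "😄"
val truncated = String.valueOf(cp.toChar)        // a single unrelated char
val full = new String(Character.toChars(cp))     // "😄"
```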

### Why are the changes needed?
Currently, some string expressions are giving wrong results when working 
with surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes, these expressions will now work properly with surrogate pairs: 
`trim`/`ltrim`/`rtrim`.

### How was this patch tested?
New tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #47713 from uros-db/surrogate-trim.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|4 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 1005 +++-
 2 files changed, 583 insertions(+), 426 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index b57f172428ac..91a7cd912ca2 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -1080,7 +1080,7 @@ public class CollationAwareUTF8String {
   CodePointIteratorType.CODE_POINT_ITERATOR_MAKE_VALID);
 while (trimIter.hasNext()) {
   int codePoint = trimIter.next();
-  trimChars.putIfAbsent(codePoint, String.valueOf((char) codePoint));
+  trimChars.putIfAbsent(codePoint, new 
String(Character.toChars(codePoint)));
 }
 
 // Iterate over srcString from the left and find the first character that 
is not in trimChars.
@@ -1190,7 +1190,7 @@ public class CollationAwareUTF8String {
   CodePointIteratorType.CODE_POINT_ITERATOR_MAKE_VALID);
 while (trimIter.hasNext()) {
   int codePoint = trimIter.next();
-  trimChars.putIfAbsent(codePoint, String.valueOf((char) codePoint));
+  trimChars.putIfAbsent(codePoint, new 
String(Character.toChars(codePoint)));
 }
 
 // Iterate over srcString from the left and find the first character that 
is not in trimChars.
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
index 4301bf56b6d5..63e32b0500af 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
@@ -1446,28 +1446,28 @@ public class CollationSupportSuite {
 
   }
 
-  private void assertStringTrim(
-  String collation,
-  String sourceString,
-  String trimString,
-  String expectedResultString) throws SparkException {
+  /**
+   * Verify the behaviour of the `StringTrim` collation support class.
+   */
+
+  private void assertStringTrim(String collationName, String sourceString, 
String trimString,
+  String expected) throws SparkException {
 // Prepare the input and expected result.
-int collationId = CollationFactory.collationNameToId(collation);
+int collationId = CollationFactory.collationNameToId(collationName);
 UTF8String src = UTF8String.fromString(sourceString);
 UTF8String trim = UTF8String.fromString(trimString);
-UTF8String resultTrimLeftRight, resultTrimRightLeft;
-String res

Re: Welcome new Apache Spark committers

2024-08-13 Thread Wenchen Fan
Congratulations!

On Tue, Aug 13, 2024 at 6:06 PM Peter Toth  wrote:

> Congratulations!
>
> Gengliang Wang  ezt írta (időpont: 2024. aug. 13., K,
> 6:15):
>
>> Congratulations, everyone!
>>
>> On Mon, Aug 12, 2024 at 7:10 PM Denny Lee  wrote:
>>
>>> Congrats Allison, Martin, and Haejoon!
>>>
>>> On Tue, Aug 13, 2024 at 9:59 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Congrats everyone!

 On Tue, Aug 13, 2024 at 9:21 AM Xiao Li  wrote:

> Congratulations!
>
> Hyukjin Kwon  于2024年8月12日周一 17:19写道:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add three new committers. Please join
>> me in welcoming them to their new role!
>>
>> - Martin Grund
>> - Haejoon Lee
>> - Allison Wang
>>
>> They consistently made contributions to the project and clearly
>> showed their expertise. We are very excited to have them join as 
>> committers!
>>
>>


Re: Welcoming a new PMC member

2024-08-13 Thread Wenchen Fan
Congratulations!

On Tue, Aug 13, 2024 at 4:13 PM Ruifeng Zheng  wrote:

> Congratulations!
>
> On Tue, Aug 13, 2024 at 3:59 PM Martin Grund 
> wrote:
>
>> Congratulations!
>>
>> On Tue, Aug 13, 2024 at 9:37 AM Peter Toth  wrote:
>>
>>> Congratulations!
>>>
>>> Mridul Muralidharan  ezt írta (időpont: 2024. aug.
>>> 13., K, 8:46):
>>>

 Congratulations Kent !

 Regards,
 Mridul

 On Mon, Aug 12, 2024 at 8:46 PM Dongjoon Hyun 
 wrote:

> Congratulations, Kent.
>
> Dongjoon.
>
> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li  wrote:
>
>> Congratulations !
>>
>> Hyukjin Kwon  于2024年8月12日周一 17:20写道:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add a new PMC member, Kent Yao. Join
>>> me in welcoming him to his new role!
>>>
>>>


[jira] [Resolved] (SPARK-49152) V2SessionCatalog should use V2Command

2024-08-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49152.
-
Resolution: Fixed

> V2SessionCatalog should use V2Command
> -
>
> Key: SPARK-49152
> URL: https://issues.apache.org/jira/browse/SPARK-49152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49152) V2SessionCatalog should use V2Command

2024-08-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49152:

Fix Version/s: 4.0.0
   3.5.3

> V2SessionCatalog should use V2Command
> -
>
> Key: SPARK-49152
> URL: https://issues.apache.org/jira/browse/SPARK-49152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49152][SQL] V2SessionCatalog should use V2Command

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d8242194a8c7 [SPARK-49152][SQL] V2SessionCatalog should use V2Command
d8242194a8c7 is described below

commit d8242194a8c799e75ebc962f113f6f6731987c7c
Author: Rui Wang 
AuthorDate: Tue Aug 13 11:06:41 2024 +0800

[SPARK-49152][SQL] V2SessionCatalog should use V2Command

### What changes were proposed in this pull request?

V2SessionCatalog should use V2Command when possible.

### Why are the changes needed?

This is because the session catalog can be overridden, and the overriding catalog should then be driven by v2 commands; otherwise a V1 command would still go to the Hive metastore or the built-in session catalog instead.
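
As a rough illustration of the scenario (not part of this patch), the session catalog is overridden through the `spark.sql.catalog.spark_catalog` configuration; the class name below is a hypothetical custom catalog:

```scala
// Minimal sketch, assuming com.example.MyV2SessionCatalog is a custom
// TableCatalog implementation available on the classpath (hypothetical name).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.spark_catalog", "com.example.MyV2SessionCatalog")
  .getOrCreate()

// With the override in place, DDL against the default catalog should be planned
// as v2 commands so it reaches the custom catalog rather than the built-in
// Hive-backed session catalog.
spark.sql("CREATE TABLE t (id INT) USING parquet")
```
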
### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #47724 from amaliujia/branch-3.5.

Lead-authored-by: Rui Wang 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 69 --
 .../datasources/v2/DataSourceV2Strategy.scala  |  5 +-
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 21 +++
 .../sql/connector/TestV2SessionCatalogBase.scala   | 16 +++--
 .../command/v2/ShowCreateTableSuite.scala  |  2 +-
 .../apache/spark/sql/internal/CatalogSuite.scala   |  2 +-
 7 files changed, 76 insertions(+), 41 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
index d8e19c994c59..8c679c4d57fc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.util.{quoteIfNeeded, toPrettySQL, 
ResolveDefaultColumns => DefaultCols}
 import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns._
-import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogV2Util, 
LookupCatalog, SupportsNamespaces, V1Table}
+import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin, 
CatalogV2Util, LookupCatalog, SupportsNamespaces, V1Table}
 import org.apache.spark.sql.connector.expressions.Transform
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.execution.command._
@@ -66,7 +66,7 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
   throw QueryCompilationErrors.unsupportedTableOperationError(ident, 
"REPLACE COLUMNS")
 
 case a @ AlterColumn(ResolvedTable(catalog, ident, table: V1Table, _), _, 
_, _, _, _, _)
-if isSessionCatalog(catalog) =>
+if supportsV1Command(catalog) =>
   if (a.column.name.length > 1) {
 throw QueryCompilationErrors.unsupportedTableOperationError(
   catalog, ident, "ALTER COLUMN with qualified column")
@@ -117,13 +117,13 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
 case UnsetViewProperties(ResolvedViewIdentifier(ident), keys, ifExists) =>
   AlterTableUnsetPropertiesCommand(ident, keys, ifExists, isView = true)
 
-case DescribeNamespace(DatabaseInSessionCatalog(db), extended, output) if 
conf.useV1Command =>
+case DescribeNamespace(ResolvedV1Database(db), extended, output) if 
conf.useV1Command =>
   DescribeDatabaseCommand(db, extended, output)
 
-case SetNamespaceProperties(DatabaseInSessionCatalog(db), properties) if 
conf.useV1Command =>
+case SetNamespaceProperties(ResolvedV1Database(db), properties) if 
conf.useV1Command =>
   AlterDatabasePropertiesCommand(db, properties)
 
-case SetNamespaceLocation(DatabaseInSessionCatalog(db), location) if 
conf.useV1Command =>
+case SetNamespaceLocation(ResolvedV1Database(db), location) if 
conf.useV1Command =>
   if (StringUtils.isEmpty(location)) {
 throw QueryExecutionErrors.invalidEmptyLocationError(location)
   }
@@ -218,7 +218,7 @@ class ResolveSessionCatalog(val catalogManager: 
CatalogManager)
 case DropTable(ResolvedIdentifier(FakeSystemCatalog, ident), _, _) =>
   DropTempViewCommand(ident)
 
-case DropView(ResolvedV1Identifier(ident), ifExists) =>
+case DropView(ResolvedIdentifierInSessionCatalog(ident), ifExists) =>
   DropTableCommand(id

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Wenchen Fan
+1

On Tue, Aug 13, 2024 at 1:57 AM Holden Karau  wrote:

> +1
>
>
> On Mon, Aug 12, 2024 at 10:17 AM Dongjoon Hyun 
> wrote:
>
>> +1 for the proposals
>> - enhancing the release process to put the docs to `release` directory in
>> order to archive.
>> - uploading old releases via SVN manually to archive.
>>
>> Since deletion is not a scope of this vote, I don't see any risk here.
>> Thank you, Kent.
>>
>> Dongjoon.
>>
>> On 2024/08/12 09:07:47 Kent Yao wrote:
>> > Archive Spark Documentations in Apache Archives
>> >
>> > Hi dev,
>> >
>> > To address the issue of the Spark website repository size
>> > reaching the storage limit for GitHub-hosted runners [1], I suggest
>> > enhancing step [2] in our release process by relocating the
>> > documentation releases from the dev[3] directory to the release
>> > directory[4]. Then it would captured by the Apache Archives
>> > service[5] to create permanent links, which would be alternative
>> > endpoints for our documentation, like
>> >
>> >
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html
>> > for
>> > https://spark.apache.org/docs/3.5.2/index.html
>> >
>> > Note that the previous example still uses the staging repository,
>> > which will become
>> > https://archive.apache.org/dist/spark/docs/3.5.2/index.html.
>> >
>> > For older releases hosted on the Spark website [6], we also need to
>> > upload them via SVN manually.
>> >
>> > After that, when we reach the threshold again, we can delete some of
>> > the old ones on page [6], and update their links on page [7] or use
>> > redirection.
>> >
>> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-49209
>> >
>> > Please vote on the idea of  Archive Spark Documentations in
>> > Apache Archives for the next 72 hours:
>> >
>> > [ ] +1: Accept the proposal
>> > [ ] +0
>> > [ ] -1: I don’t think this is a good idea because …
>> >
>> > Bests,
>> > Kent Yao
>> >
>> > [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
>> > [2]
>> https://spark.apache.org/release-process.html#upload-to-apache-release-directory
>> > [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
>> > [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
>> > [5] https://archive.apache.org/dist/spark/
>> > [6] https://github.com/apache/spark-website/tree/asf-site/site/docs
>> > [7] https://spark.apache.org/documentation.html
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
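
A rough sketch of the relocation step described in the proposal quoted above, assuming svn access to the Spark dist tree (the exact sub-path of the generated docs may differ):

```bash
# Move the reviewed docs from the dev staging area into the release tree;
# archive.apache.org then mirrors the release tree permanently.
SRC="https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs"
DST="https://dist.apache.org/repos/dist/release/spark/docs/3.5.2"
svn move "$SRC" "$DST" -m "Archive Spark 3.5.2 documentation"
# The permanent link then becomes:
#   https://archive.apache.org/dist/spark/docs/3.5.2/index.html
```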


(spark) branch master updated: [SPARK-47741][SQL][TESTS][FOLLOWUP] Make `EXEC IMMEDIATE STACK OVERFLOW` in `ExecuteImmediateEndToEndSuite` check the intercepted exception

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dba12056 [SPARK-47741][SQL][TESTS][FOLLOWUP] Make `EXEC IMMEDIATE 
STACK OVERFLOW` in `ExecuteImmediateEndToEndSuite` check the intercepted 
exception
dba12056 is described below

commit dba1205601044d97c1e662afabe9ab34c5cb
Author: yangjie01 
AuthorDate: Mon Aug 12 22:53:21 2024 +0800

[SPARK-47741][SQL][TESTS][FOLLOWUP] Make `EXEC IMMEDIATE STACK OVERFLOW` in 
`ExecuteImmediateEndToEndSuite` check the intercepted exception

### What changes were proposed in this pull request?
This pr change the test case `EXEC IMMEDIATE STACK OVERFLOW` in 
`ExecuteImmediateEndToEndSuite` to check existing exception.

### Why are the changes needed?
Before this pr:


https://github.com/apache/spark/blob/e3ba74b65b9d534655fbcf40bb2c00b7f5c69418/sql/core/src/test/scala/org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala#L44-L60

The test case intercepts an exception on lines 48 to 50, but the 
`exception` used in `checkError` is not related to the previously intercepted 
exception. Moreover, although the test is looped twice, the two `checkError` 
calls are actually checking the same SQL scenario, which appears to be a coding 
error.
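
In short, the corrected pattern reuses the intercepted exception (a condensed sketch of the change in the diff below; `query` is the existing test variable, and the `context` argument is omitted here):

```scala
// Intercept once and check that same exception, instead of parsing the query
// a second time and checking an unrelated exception.
val e = intercept[ParseException] {
  sql(query).collect()
}
checkError(
  exception = e,
  errorClass = "FAILED_TO_PARSE_TOO_COMPLEX",
  parameters = Map())
```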

### Does this PR introduce _any_ user-facing change?
No, just for test

### How was this patch tested?
- Pass GitHub Action
- locally test

```
build/mvn clean install -pl sql/core -am -Dtest=none \
  -DwildcardSuites=org.apache.spark.sql.execution.ExecuteImmediateEndToEndSuite
```

```
ExecuteImmediateEndToEndSuite:
19:21:45.191 WARN org.apache.spark.util.Utils: Your hostname, 
MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 
172.22.200.238 instead (on interface en0)

19:21:45.193 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if you 
need to bind to another address

19:21:45.330 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable

- SPARK-47033: EXECUTE IMMEDIATE USING does not recognize session variable 
names
- EXEC IMMEDIATE STACK OVERFLOW
19:21:47.722 WARN 
org.apache.spark.sql.execution.ExecuteImmediateEndToEndSuite:

= POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.ExecuteImmediateEndToEndSuite, threads: rpc-boss-3-1 
(daemon=true), shuffle-boss-6-1 (daemon=true) =

Run completed in 4 seconds, 765 milliseconds.
Total number of tests run: 2
Suites: completed 2, aborted 0
Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47718 from LuciferYang/SPARK-47741-FOLLOWUP.

Authored-by: yangjie01 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala
index 6b0f0b5582dc..cae22eda32f8 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/ExecuteImmediateEndToEndSuite.scala
@@ -50,7 +50,7 @@ class ExecuteImmediateEndToEndSuite extends QueryTest with 
SharedSparkSession {
 }
 
 checkError(
-  exception = intercept[ParseException](sql(query).collect()),
+  exception = e,
   errorClass = "FAILED_TO_PARSE_TOO_COMPLEX",
   parameters = Map(),
   context = ExpectedContext(


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-46632) EquivalentExpressions throw IllegalStateException

2024-08-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46632:
---

Assignee: Mingliang Zhu

> EquivalentExpressions throw IllegalStateException
> -
>
> Key: SPARK-46632
> URL: https://issues.apache.org/jira/browse/SPARK-46632
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: zhangzhenhao
>Assignee: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> EquivalentExpressions throw IllegalStateException with some IF expressions
> ```scala
> import org.apache.spark.sql.catalyst.dsl.expressions.DslExpression
> import org.apache.spark.sql.catalyst.expressions.\{EquivalentExpressions, If, 
> Literal}
> import org.apache.spark.sql.functions.col
> val one = Literal(1.0)
> val y = col("y").expr
> val e1 = If(
>   Literal(true),
>   y * one * one + one * one * y,
>   y * one * one + one * one * y
> )
> (new EquivalentExpressions).addExprTree(e1)
> ```
>  
> result is 
> ```
> java.lang.IllegalStateException: Cannot update expression: (1.0 * 1.0) in 
> map: Map(ExpressionEquals(('y * 1.0)) -> ExpressionStats(('y * 1.0))) with 
> use count: -1
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprInMap(EquivalentExpressions.scala:85)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateCommonExprs(EquivalentExpressions.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3$adapted(EquivalentExpressions.scala:201)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:188)
>   ... 49 elided
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46632) EquivalentExpressions throw IllegalStateException

2024-08-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46632.
-
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 46135
[https://github.com/apache/spark/pull/46135]

> EquivalentExpressions throw IllegalStateException
> -
>
> Key: SPARK-46632
> URL: https://issues.apache.org/jira/browse/SPARK-46632
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: zhangzhenhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> EquivalentExpressions throw IllegalStateException with some IF expressions
> ```scala
> import org.apache.spark.sql.catalyst.dsl.expressions.DslExpression
> import org.apache.spark.sql.catalyst.expressions.\{EquivalentExpressions, If, 
> Literal}
> import org.apache.spark.sql.functions.col
> val one = Literal(1.0)
> val y = col("y").expr
> val e1 = If(
>   Literal(true),
>   y * one * one + one * one * y,
>   y * one * one + one * one * y
> )
> (new EquivalentExpressions).addExprTree(e1)
> ```
>  
> result is 
> ```
> java.lang.IllegalStateException: Cannot update expression: (1.0 * 1.0) in 
> map: Map(ExpressionEquals(('y * 1.0)) -> ExpressionStats(('y * 1.0))) with 
> use count: -1
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprInMap(EquivalentExpressions.scala:85)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateCommonExprs(EquivalentExpressions.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3$adapted(EquivalentExpressions.scala:201)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:188)
>   ... 49 elided
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-46632][SQL] Fix subexpression elimination when equivalent ternary expressions have different children

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new deac7807efb4 [SPARK-46632][SQL] Fix subexpression elimination when 
equivalent ternary expressions have different children
deac7807efb4 is described below

commit deac7807efb486dd8f3c8bec39b90cbecb6ae767
Author: zml1206 
AuthorDate: Mon Aug 12 16:30:13 2024 +0800

[SPARK-46632][SQL] Fix subexpression elimination when equivalent ternary 
expressions have different children

Remove the unexpected exception thrown in `EquivalentExpressions.updateExprInMap()`. Equivalent expressions may have different children, so it is legitimate for an expression to be missing from the map while `useCount` is -1.
For example, before this PR the following would throw an IllegalStateException:
```
Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")
  .selectExpr("case when a + b + c>3 then 1 when c + a + b>0 then 2 else 0 end as d")
  .show()
```

Bug fix.

No.

New unit test; before this PR it would throw IllegalStateException: *** with use count: -1

No.

Closes #46135 from zml1206/SPARK-46632.

Authored-by: zml1206 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2fb8dffe40ffd613b4cf7f59843160baba0d078c)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/EquivalentExpressions.scala |  4 
 .../catalyst/expressions/SubexpressionEliminationSuite.scala | 12 
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
index 1a84859cc3a1..5d8d428e27d6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
@@ -79,10 +79,6 @@ class EquivalentExpressions(
 case _ =>
   if (useCount > 0) {
 map.put(wrapper, ExpressionStats(expr)(useCount))
-  } else {
-// Should not happen
-throw new IllegalStateException(
-  s"Cannot update expression: $expr in map: $map with use count: 
$useCount")
   }
   false
   }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
index f369635a3267..e9faeba2411c 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
@@ -494,6 +494,18 @@ class SubexpressionEliminationSuite extends SparkFunSuite 
with ExpressionEvalHel
 checkShortcut(Or(equal, Literal(true)), 1)
 checkShortcut(Not(And(equal, Literal(false))), 1)
   }
+
+  test("Equivalent ternary expressions have different children") {
+val add1 = Add(Add(Literal(1), Literal(2)), Literal(3))
+val add2 = Add(Add(Literal(3), Literal(1)), Literal(2))
+val conditions1 = (GreaterThan(add1, Literal(3)), Literal(1)) ::
+  (GreaterThan(add2, Literal(0)), Literal(2)) :: Nil
+
+val caseWhenExpr1 = CaseWhen(conditions1, Literal(0))
+val equivalence1 = new EquivalentExpressions
+equivalence1.addExprTree(caseWhenExpr1)
+assert(equivalence1.getCommonSubexpressions.size == 1)
+  }
 }
 
 case class CodegenFallbackExpression(child: Expression)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (f33aa0abbcda -> 2fb8dffe40ff)

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f33aa0abbcda [SPARK-48778][SQL][TESTS] Improve collation support 
testing - add unit tests for string expressions
 add 2fb8dffe40ff [SPARK-46632][SQL] Fix subexpression elimination when 
equivalent ternary expressions have different children

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/EquivalentExpressions.scala |  4 
 .../catalyst/expressions/SubexpressionEliminationSuite.scala | 12 
 2 files changed, 12 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 917c45e2af01 [SPARK-48204][INFRA][FOLLOW] fix release scripts for the 
"finalize" step
917c45e2af01 is described below

commit 917c45e2af01b385c2123c90607cf1bf90e82b94
Author: Wenchen Fan 
AuthorDate: Mon Jun 3 12:37:24 2024 +0900

[SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

Necessary fixes to finalize the Spark 4.0 preview release. The major one is that PyPI now requires an API token instead of a username/password for authentication.
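
Concretely, the finalize step now authenticates to PyPI with a token, where the username is literally `__token__` (a sketch mirroring the diff below; the artifact names are illustrative):

```bash
# Upload the PySpark source distribution using an API token instead of a password.
twine upload -u __token__ -p "$PYPI_API_TOKEN" \
  --repository-url https://upload.pypi.org/legacy/ \
  "pyspark-$RELEASE_VERSION.tar.gz" \
  "pyspark-$RELEASE_VERSION.tar.gz.asc"
```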

release

no

manual

no

Closes #46840 from cloud-fan/script.
    
Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 dev/create-release/do-release-docker.sh |  6 +++---
 dev/create-release/release-build.sh | 31 ++-
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/dev/create-release/do-release-docker.sh 
b/dev/create-release/do-release-docker.sh
index 88398bc14dd0..ea3105b3d0a7 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -84,8 +84,8 @@ if [ ! -z "$RELEASE_STEP" ] && [ "$RELEASE_STEP" = "finalize" 
]; then
 error "Exiting."
   fi
 
-  if [ -z "$PYPI_PASSWORD" ]; then
-stty -echo && printf "PyPi password: " && read PYPI_PASSWORD && printf 
'\n' && stty echo
+  if [ -z "$PYPI_API_TOKEN" ]; then
+stty -echo && printf "PyPi API token: " && read PYPI_API_TOKEN && printf 
'\n' && stty echo
   fi
 fi
 
@@ -142,7 +142,7 @@ GIT_NAME=$GIT_NAME
 GIT_EMAIL=$GIT_EMAIL
 GPG_KEY=$GPG_KEY
 ASF_PASSWORD=$ASF_PASSWORD
-PYPI_PASSWORD=$PYPI_PASSWORD
+PYPI_API_TOKEN=$PYPI_API_TOKEN
 GPG_PASSPHRASE=$GPG_PASSPHRASE
 RELEASE_STEP=$RELEASE_STEP
 USER=$USER
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index e0588ae934cd..99841916cf29 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -95,8 +95,8 @@ init_java
 init_maven_sbt
 
 if [[ "$1" == "finalize" ]]; then
-  if [[ -z "$PYPI_PASSWORD" ]]; then
-error 'The environment variable PYPI_PASSWORD is not set. Exiting.'
+  if [[ -z "$PYPI_API_TOKEN" ]]; then
+error 'The environment variable PYPI_API_TOKEN is not set. Exiting.'
   fi
 
   git config --global user.name "$GIT_NAME"
@@ -104,22 +104,27 @@ if [[ "$1" == "finalize" ]]; then
 
   # Create the git tag for the new release
   echo "Creating the git tag for the new release"
-  rm -rf spark
-  git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO"; -b master
-  cd spark
-  git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
-  git push origin "v$RELEASE_VERSION"
-  cd ..
-  rm -rf spark
-  echo "git tag v$RELEASE_VERSION created"
+  if check_for_tag "v$RELEASE_VERSION"; then
+echo "v$RELEASE_VERSION already exists. Skip creating it."
+  else
+rm -rf spark
+git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO"; -b master
+cd spark
+git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
+git push origin "v$RELEASE_VERSION"
+cd ..
+rm -rf spark
+echo "git tag v$RELEASE_VERSION created"
+  fi
 
   # download PySpark binary from the dev directory and upload to PyPi.
   echo "Uploading PySpark to PyPi"
   svn co --depth=empty "$RELEASE_STAGING_LOCATION/$RELEASE_TAG-bin" svn-spark
   cd svn-spark
-  svn update "pyspark-$RELEASE_VERSION.tar.gz"
-  svn update "pyspark-$RELEASE_VERSION.tar.gz.asc"
-  TWINE_USERNAME=spark-upload TWINE_PASSWORD="$PYPI_PASSWORD" twine upload \
+  PYSPARK_VERSION=`echo "$RELEASE_VERSION" |  sed -e "s/-/./" -e 
"s/preview/dev/"`
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz"
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz.asc"
+  twine upload -u __token__  -p $PYPI_API_TOKEN \
 --repository-url https://upload.pypi.org/legacy/ \
 "pyspark-$RELEASE_VERSION.tar.gz" \
 "pyspark-$RELEASE_VERSION.tar.gz.asc"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 4a9dae9b8cdb [SPARK-48204][INFRA][FOLLOW] fix release scripts for the 
"finalize" step
4a9dae9b8cdb is described below

commit 4a9dae9b8cdb822c9a0827639dcabe6224df14e3
Author: Wenchen Fan 
AuthorDate: Mon Jun 3 12:37:24 2024 +0900

[SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

Necessary fixes to finalize the Spark 4.0 preview release. The major one is that PyPI now requires an API token instead of a username/password for authentication.

release

no

manual

no

Closes #46840 from cloud-fan/script.
    
Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 dev/create-release/do-release-docker.sh |  6 +++---
 dev/create-release/release-build.sh | 31 ++-
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/dev/create-release/do-release-docker.sh 
b/dev/create-release/do-release-docker.sh
index 88398bc14dd0..ea3105b3d0a7 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -84,8 +84,8 @@ if [ ! -z "$RELEASE_STEP" ] && [ "$RELEASE_STEP" = "finalize" 
]; then
 error "Exiting."
   fi
 
-  if [ -z "$PYPI_PASSWORD" ]; then
-stty -echo && printf "PyPi password: " && read PYPI_PASSWORD && printf 
'\n' && stty echo
+  if [ -z "$PYPI_API_TOKEN" ]; then
+stty -echo && printf "PyPi API token: " && read PYPI_API_TOKEN && printf 
'\n' && stty echo
   fi
 fi
 
@@ -142,7 +142,7 @@ GIT_NAME=$GIT_NAME
 GIT_EMAIL=$GIT_EMAIL
 GPG_KEY=$GPG_KEY
 ASF_PASSWORD=$ASF_PASSWORD
-PYPI_PASSWORD=$PYPI_PASSWORD
+PYPI_API_TOKEN=$PYPI_API_TOKEN
 GPG_PASSPHRASE=$GPG_PASSPHRASE
 RELEASE_STEP=$RELEASE_STEP
 USER=$USER
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index e0588ae934cd..99841916cf29 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -95,8 +95,8 @@ init_java
 init_maven_sbt
 
 if [[ "$1" == "finalize" ]]; then
-  if [[ -z "$PYPI_PASSWORD" ]]; then
-error 'The environment variable PYPI_PASSWORD is not set. Exiting.'
+  if [[ -z "$PYPI_API_TOKEN" ]]; then
+error 'The environment variable PYPI_API_TOKEN is not set. Exiting.'
   fi
 
   git config --global user.name "$GIT_NAME"
@@ -104,22 +104,27 @@ if [[ "$1" == "finalize" ]]; then
 
   # Create the git tag for the new release
   echo "Creating the git tag for the new release"
-  rm -rf spark
-  git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO"; -b master
-  cd spark
-  git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
-  git push origin "v$RELEASE_VERSION"
-  cd ..
-  rm -rf spark
-  echo "git tag v$RELEASE_VERSION created"
+  if check_for_tag "v$RELEASE_VERSION"; then
+echo "v$RELEASE_VERSION already exists. Skip creating it."
+  else
+rm -rf spark
+git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO"; -b master
+cd spark
+git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
+git push origin "v$RELEASE_VERSION"
+cd ..
+rm -rf spark
+echo "git tag v$RELEASE_VERSION created"
+  fi
 
   # download PySpark binary from the dev directory and upload to PyPi.
   echo "Uploading PySpark to PyPi"
   svn co --depth=empty "$RELEASE_STAGING_LOCATION/$RELEASE_TAG-bin" svn-spark
   cd svn-spark
-  svn update "pyspark-$RELEASE_VERSION.tar.gz"
-  svn update "pyspark-$RELEASE_VERSION.tar.gz.asc"
-  TWINE_USERNAME=spark-upload TWINE_PASSWORD="$PYPI_PASSWORD" twine upload \
+  PYSPARK_VERSION=`echo "$RELEASE_VERSION" |  sed -e "s/-/./" -e 
"s/preview/dev/"`
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz"
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz.asc"
+  twine upload -u __token__  -p $PYPI_API_TOKEN \
 --repository-url https://upload.pypi.org/legacy/ \
 "pyspark-$RELEASE_VERSION.tar.gz" \
 "pyspark-$RELEASE_VERSION.tar.gz.asc"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (436a6d1a1899 -> 2465cb0d35ae)

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 436a6d1a1899 [SPARK-49198][CONNECT] Prune more jars required for Spark 
Connect shell
 add 2465cb0d35ae [SPARK-49152][SQL] V2SessionCatalog should use V2Command

No new revisions were added by this update.

Summary of changes:
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 74 --
 .../datasources/v2/DataSourceV2Strategy.scala  |  5 +-
 .../org/apache/spark/sql/CollationSuite.scala  |  3 +
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 17 ++---
 .../sql/connector/TestV2SessionCatalogBase.scala   | 16 +++--
 .../command/v2/ShowCreateTableSuite.scala  |  2 +-
 .../apache/spark/sql/internal/CatalogSuite.scala   |  2 +-
 8 files changed, 80 insertions(+), 41 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-49188) Internal error on concat_ws called on array of arrays of string

2024-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49188.
-
Fix Version/s: 4.0.0
 Assignee: Nikola Mandic
   Resolution: Fixed

> Internal error on concat_ws called on array of arrays of string
> ---
>
> Key: SPARK-49188
> URL: https://issues.apache.org/jira/browse/SPARK-49188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Applying the following sequence of queries in *ANSI mode* 
> (spark.sql.ansi.enabled=true):
> {code:java}
> create table test_table(dat array<string>) using parquet;
> select concat_ws(',', collect_list(dat)) FROM test_table; {code}
> yields an internal error:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> analysis failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace. SQLSTATE: XX000
> ...
> Caused by: java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.sql.types.AbstractDataType.simpleString()" because 
> "" is null
>   at 
> org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType(DataTypeErrorsBase.scala:55)
>   at 
> org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType$(DataTypeErrorsBase.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Cast$.toSQLType(Cast.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$.typeCheckFailureMessage(Cast.scala:426)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.typeCheckFailureInCast(Cast.scala:501)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.checkInputDataTypes(Cast.scala:525)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:551)
>   at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:550)
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49188][SQL] Internal error on concat_ws called on array of arrays of string

2024-08-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e893b32196af [SPARK-49188][SQL] Internal error on concat_ws called on 
array of arrays of string
e893b32196af is described below

commit e893b32196afbf4b8e620239eb7aa61b38747241
Author: Nikola Mandic 
AuthorDate: Mon Aug 12 11:43:51 2024 +0800

[SPARK-49188][SQL] Internal error on concat_ws called on array of arrays of 
string

### What changes were proposed in this pull request?

Applying the following sequence of queries in **ANSI mode** 
(`spark.sql.ansi.enabled=true`):
```
create table test_table(dat array<string>) using parquet;
select concat_ws(',', collect_list(dat)) FROM test_table;
```
yields an internal error:
```
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
analysis failed with an internal error. You hit a bug in Spark or the Spark 
plugins you use. Please, report this bug to the corresponding communities or 
vendors, and provide the full stack trace. SQLSTATE: XX000
...
Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.spark.sql.types.AbstractDataType.simpleString()" because "" 
is null
  at 
org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType(DataTypeErrorsBase.scala:55)
  at 
org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType$(DataTypeErrorsBase.scala:51)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$.toSQLType(Cast.scala:44)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$.typeCheckFailureMessage(Cast.scala:426)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.typeCheckFailureInCast(Cast.scala:501)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.checkInputDataTypes(Cast.scala:525)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:551)
  at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:550)
...
```
Fix the problematic pattern-matching rule in `AnsiTypeCoercion`.
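
The fix is essentially about not wrapping a possibly-null result back into `Some`; a self-contained sketch of the pitfall with illustrative names (not Spark code):

```scala
// Hypothetical stand-in for the coercion lookup: returns None when no implicit
// cast exists between the two types.
def implicitCastSketch(from: String, to: String): Option[String] =
  if (from == to) Some(from) else None

// Buggy shape: Option.map(...).orNull re-wrapped in Some yields Some(null),
// which a later consumer dereferences and fails with a NullPointerException.
val buggy: Option[String] = Some(implicitCastSketch("array<string>", "string").orNull)

// Fixed shape: propagate the Option, so the caller sees None and can raise a
// proper type-check error instead of an internal error.
val fixed: Option[String] = implicitCastSketch("array<string>", "string")

assert(buggy.isDefined && buggy.get == null) // surprising Some(null)
assert(fixed.isEmpty)
```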

### Why are the changes needed?

Replace the internal error with proper user-facing error.

### Does this PR introduce _any_ user-facing change?

Yes, it removes the internal error user could produce by running queries.

### How was this patch tested?

Added tests to `AnsiTypeCoercionSuite` and `StringFunctionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47691 from nikolamand-db/SPARK-49188.

Authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |  2 +-
 .../catalyst/analysis/AnsiTypeCoercionSuite.scala  | 51 ++
 .../apache/spark/sql/StringFunctionsSuite.scala| 29 
 3 files changed, 81 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
index 9989ca79ed27..17b1c4e249f5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
@@ -198,7 +198,7 @@ object AnsiTypeCoercion extends TypeCoercionBase {
 Some(a.defaultConcreteType)
 
   case (ArrayType(fromType, _), AbstractArrayType(toType)) =>
-Some(implicitCast(fromType, toType).map(ArrayType(_, true)).orNull)
+implicitCast(fromType, toType).map(ArrayType(_, true))
 
   // When the target type is `TypeCollection`, there is another branch to 
find the
   // "closet convertible data type" below.
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
index 38acb56ad1e0..de600d881b62 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.{AbstractArrayType, 
StringTypeAnyCollation}
 import org.apache.spark.sql.types._
 
 class AnsiTypeCoercionSuite extends TypeCoercionSuiteBase {
@@ -1047,4 +1048,54 @@ class AnsiTypeCoercionSuite extends 
TypeCoercionSuiteBase {
 AnsiTypeCoercion.GetDateFieldOperations, operation(ts), 
operation
