(spark) branch branch-3.4 updated: [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 917c45e2af01 [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step
917c45e2af01 is described below

commit 917c45e2af01b385c2123c90607cf1bf90e82b94
Author: Wenchen Fan 
AuthorDate: Mon Jun 3 12:37:24 2024 +0900

[SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

Necessary fixes to finalize the spark 4.0 preview release. The major one is 
that pypi now requires API token instead of username/password for 
authentication.

release

no

manual

no

Closes #46840 from cloud-fan/script.
    
Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 dev/create-release/do-release-docker.sh |  6 +++---
 dev/create-release/release-build.sh | 31 ++-
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh
index 88398bc14dd0..ea3105b3d0a7 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -84,8 +84,8 @@ if [ ! -z "$RELEASE_STEP" ] && [ "$RELEASE_STEP" = "finalize" ]; then
 error "Exiting."
   fi
 
-  if [ -z "$PYPI_PASSWORD" ]; then
-    stty -echo && printf "PyPi password: " && read PYPI_PASSWORD && printf '\n' && stty echo
+  if [ -z "$PYPI_API_TOKEN" ]; then
+    stty -echo && printf "PyPi API token: " && read PYPI_API_TOKEN && printf '\n' && stty echo
   fi
 fi
 
@@ -142,7 +142,7 @@ GIT_NAME=$GIT_NAME
 GIT_EMAIL=$GIT_EMAIL
 GPG_KEY=$GPG_KEY
 ASF_PASSWORD=$ASF_PASSWORD
-PYPI_PASSWORD=$PYPI_PASSWORD
+PYPI_API_TOKEN=$PYPI_API_TOKEN
 GPG_PASSPHRASE=$GPG_PASSPHRASE
 RELEASE_STEP=$RELEASE_STEP
 USER=$USER
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index e0588ae934cd..99841916cf29 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -95,8 +95,8 @@ init_java
 init_maven_sbt
 
 if [[ "$1" == "finalize" ]]; then
-  if [[ -z "$PYPI_PASSWORD" ]]; then
-error 'The environment variable PYPI_PASSWORD is not set. Exiting.'
+  if [[ -z "$PYPI_API_TOKEN" ]]; then
+error 'The environment variable PYPI_API_TOKEN is not set. Exiting.'
   fi
 
   git config --global user.name "$GIT_NAME"
@@ -104,22 +104,27 @@ if [[ "$1" == "finalize" ]]; then
 
   # Create the git tag for the new release
   echo "Creating the git tag for the new release"
-  rm -rf spark
-  git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b master
-  cd spark
-  git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
-  git push origin "v$RELEASE_VERSION"
-  cd ..
-  rm -rf spark
-  echo "git tag v$RELEASE_VERSION created"
+  if check_for_tag "v$RELEASE_VERSION"; then
+    echo "v$RELEASE_VERSION already exists. Skip creating it."
+  else
+    rm -rf spark
+    git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b master
+    cd spark
+    git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
+    git push origin "v$RELEASE_VERSION"
+    cd ..
+    rm -rf spark
+    echo "git tag v$RELEASE_VERSION created"
+  fi
 
   # download PySpark binary from the dev directory and upload to PyPi.
   echo "Uploading PySpark to PyPi"
   svn co --depth=empty "$RELEASE_STAGING_LOCATION/$RELEASE_TAG-bin" svn-spark
   cd svn-spark
-  svn update "pyspark-$RELEASE_VERSION.tar.gz"
-  svn update "pyspark-$RELEASE_VERSION.tar.gz.asc"
-  TWINE_USERNAME=spark-upload TWINE_PASSWORD="$PYPI_PASSWORD" twine upload \
+  PYSPARK_VERSION=`echo "$RELEASE_VERSION" | sed -e "s/-/./" -e "s/preview/dev/"`
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz"
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz.asc"
+  twine upload -u __token__  -p $PYPI_API_TOKEN \
 --repository-url https://upload.pypi.org/legacy/ \
 "pyspark-$RELEASE_VERSION.tar.gz" \
 "pyspark-$RELEASE_VERSION.tar.gz.asc"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 4a9dae9b8cdb [SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step
4a9dae9b8cdb is described below

commit 4a9dae9b8cdb822c9a0827639dcabe6224df14e3
Author: Wenchen Fan 
AuthorDate: Mon Jun 3 12:37:24 2024 +0900

[SPARK-48204][INFRA][FOLLOW] fix release scripts for the "finalize" step

Necessary fixes to finalize the spark 4.0 preview release. The major one is 
that pypi now requires API token instead of username/password for 
authentication.

release

no

manual

no

Closes #46840 from cloud-fan/script.
    
Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 dev/create-release/do-release-docker.sh |  6 +++---
 dev/create-release/release-build.sh | 31 ++-
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh
index 88398bc14dd0..ea3105b3d0a7 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -84,8 +84,8 @@ if [ ! -z "$RELEASE_STEP" ] && [ "$RELEASE_STEP" = "finalize" ]; then
 error "Exiting."
   fi
 
-  if [ -z "$PYPI_PASSWORD" ]; then
-    stty -echo && printf "PyPi password: " && read PYPI_PASSWORD && printf '\n' && stty echo
+  if [ -z "$PYPI_API_TOKEN" ]; then
+    stty -echo && printf "PyPi API token: " && read PYPI_API_TOKEN && printf '\n' && stty echo
   fi
 fi
 
@@ -142,7 +142,7 @@ GIT_NAME=$GIT_NAME
 GIT_EMAIL=$GIT_EMAIL
 GPG_KEY=$GPG_KEY
 ASF_PASSWORD=$ASF_PASSWORD
-PYPI_PASSWORD=$PYPI_PASSWORD
+PYPI_API_TOKEN=$PYPI_API_TOKEN
 GPG_PASSPHRASE=$GPG_PASSPHRASE
 RELEASE_STEP=$RELEASE_STEP
 USER=$USER
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index e0588ae934cd..99841916cf29 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -95,8 +95,8 @@ init_java
 init_maven_sbt
 
 if [[ "$1" == "finalize" ]]; then
-  if [[ -z "$PYPI_PASSWORD" ]]; then
-error 'The environment variable PYPI_PASSWORD is not set. Exiting.'
+  if [[ -z "$PYPI_API_TOKEN" ]]; then
+error 'The environment variable PYPI_API_TOKEN is not set. Exiting.'
   fi
 
   git config --global user.name "$GIT_NAME"
@@ -104,22 +104,27 @@ if [[ "$1" == "finalize" ]]; then
 
   # Create the git tag for the new release
   echo "Creating the git tag for the new release"
-  rm -rf spark
-  git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b master
-  cd spark
-  git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
-  git push origin "v$RELEASE_VERSION"
-  cd ..
-  rm -rf spark
-  echo "git tag v$RELEASE_VERSION created"
+  if check_for_tag "v$RELEASE_VERSION"; then
+    echo "v$RELEASE_VERSION already exists. Skip creating it."
+  else
+    rm -rf spark
+    git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b master
+    cd spark
+    git tag "v$RELEASE_VERSION" "$RELEASE_TAG"
+    git push origin "v$RELEASE_VERSION"
+    cd ..
+    rm -rf spark
+    echo "git tag v$RELEASE_VERSION created"
+  fi
 
   # download PySpark binary from the dev directory and upload to PyPi.
   echo "Uploading PySpark to PyPi"
   svn co --depth=empty "$RELEASE_STAGING_LOCATION/$RELEASE_TAG-bin" svn-spark
   cd svn-spark
-  svn update "pyspark-$RELEASE_VERSION.tar.gz"
-  svn update "pyspark-$RELEASE_VERSION.tar.gz.asc"
-  TWINE_USERNAME=spark-upload TWINE_PASSWORD="$PYPI_PASSWORD" twine upload \
+  PYSPARK_VERSION=`echo "$RELEASE_VERSION" | sed -e "s/-/./" -e "s/preview/dev/"`
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz"
+  svn update "pyspark-$PYSPARK_VERSION.tar.gz.asc"
+  twine upload -u __token__  -p $PYPI_API_TOKEN \
 --repository-url https://upload.pypi.org/legacy/ \
 "pyspark-$RELEASE_VERSION.tar.gz" \
 "pyspark-$RELEASE_VERSION.tar.gz.asc"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (436a6d1a1899 -> 2465cb0d35ae)

2024-08-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


 from 436a6d1a1899 [SPARK-49198][CONNECT] Prune more jars required for Spark Connect shell
 add 2465cb0d35ae [SPARK-49152][SQL] V2SessionCatalog should use V2Command

No new revisions were added by this update.

Summary of changes:
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 74 --
 .../datasources/v2/DataSourceV2Strategy.scala  |  5 +-
 .../org/apache/spark/sql/CollationSuite.scala  |  3 +
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 17 ++---
 .../sql/connector/TestV2SessionCatalogBase.scala   | 16 +++--
 .../command/v2/ShowCreateTableSuite.scala  |  2 +-
 .../apache/spark/sql/internal/CatalogSuite.scala   |  2 +-
 8 files changed, 80 insertions(+), 41 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-49188) Internal error on concat_ws called on array of arrays of string

2024-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49188.
-
Fix Version/s: 4.0.0
 Assignee: Nikola Mandic
   Resolution: Fixed

> Internal error on concat_ws called on array of arrays of string
> ---
>
> Key: SPARK-49188
> URL: https://issues.apache.org/jira/browse/SPARK-49188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Applying the following sequence of queries in *ANSI mode* 
> (spark.sql.ansi.enabled=true):
> {code:java}
> create table test_table(dat array<string>) using parquet;
> select concat_ws(',', collect_list(dat)) FROM test_table; {code}
> yields an internal error:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> analysis failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace. SQLSTATE: XX000
> ...
> Caused by: java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.sql.types.AbstractDataType.simpleString()" because 
> "" is null
>   at 
> org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType(DataTypeErrorsBase.scala:55)
>   at 
> org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType$(DataTypeErrorsBase.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Cast$.toSQLType(Cast.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$.typeCheckFailureMessage(Cast.scala:426)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.typeCheckFailureInCast(Cast.scala:501)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.checkInputDataTypes(Cast.scala:525)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:551)
>   at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:550)
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49188][SQL] Internal error on concat_ws called on array of arrays of string

2024-08-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e893b32196af [SPARK-49188][SQL] Internal error on concat_ws called on array of arrays of string
e893b32196af is described below

commit e893b32196afbf4b8e620239eb7aa61b38747241
Author: Nikola Mandic 
AuthorDate: Mon Aug 12 11:43:51 2024 +0800

[SPARK-49188][SQL] Internal error on concat_ws called on array of arrays of 
string

### What changes were proposed in this pull request?

Applying the following sequence of queries in **ANSI mode** 
(`spark.sql.ansi.enabled=true`):
```
create table test_table(dat array<string>) using parquet;
select concat_ws(',', collect_list(dat)) FROM test_table;
```
yields an internal error:
```
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
analysis failed with an internal error. You hit a bug in Spark or the Spark 
plugins you use. Please, report this bug to the corresponding communities or 
vendors, and provide the full stack trace. SQLSTATE: XX000
...
Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.spark.sql.types.AbstractDataType.simpleString()" because "" 
is null
  at 
org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType(DataTypeErrorsBase.scala:55)
  at 
org.apache.spark.sql.errors.DataTypeErrorsBase.toSQLType$(DataTypeErrorsBase.scala:51)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$.toSQLType(Cast.scala:44)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$.typeCheckFailureMessage(Cast.scala:426)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.typeCheckFailureInCast(Cast.scala:501)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.checkInputDataTypes(Cast.scala:525)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:551)
  at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:550)
...
```
Fix the problematic pattern-matching rule in `AnsiTypeCoercion`.
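
The problematic shape was `Some(implicitCast(...).map(...).orNull)`: when the inner cast fails, it yields `Some(null)` instead of `None`, and the caller later dereferences the null type. A minimal sketch of the difference using plain Options rather than the actual Spark code path:
```scala
// Stand-in for implicitCast(...) finding no legal coercion.
val failedCast: Option[String] = None

// Old shape: a failed cast is wrapped as Some(null), which looks like success
// to the caller and later surfaces as the NullPointerException above.
val buggy: Option[String] = Some(failedCast.map(_.toUpperCase).orNull)
// New shape: the failure stays None, so a proper type-check error is reported.
val fixed: Option[String] = failedCast.map(_.toUpperCase)

assert(buggy == Some(null))
assert(fixed == None)
```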

### Why are the changes needed?

Replace the internal error with proper user-facing error.

### Does this PR introduce _any_ user-facing change?

Yes, it removes the internal error user could produce by running queries.

### How was this patch tested?

Added tests to `AnsiTypeCoercionSuite` and `StringFunctionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47691 from nikolamand-db/SPARK-49188.

Authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |  2 +-
 .../catalyst/analysis/AnsiTypeCoercionSuite.scala  | 51 ++
 .../apache/spark/sql/StringFunctionsSuite.scala| 29 
 3 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
index 9989ca79ed27..17b1c4e249f5 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala
@@ -198,7 +198,7 @@ object AnsiTypeCoercion extends TypeCoercionBase {
 Some(a.defaultConcreteType)
 
   case (ArrayType(fromType, _), AbstractArrayType(toType)) =>
-      Some(implicitCast(fromType, toType).map(ArrayType(_, true)).orNull)
+      implicitCast(fromType, toType).map(ArrayType(_, true))
 
   // When the target type is `TypeCollection`, there is another branch to find the
   // "closet convertible data type" below.
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
index 38acb56ad1e0..de600d881b62 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercionSuite.scala
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.{AbstractArrayType, StringTypeAnyCollation}
 import org.apache.spark.sql.types._
 
 class AnsiTypeCoercionSuite extends TypeCoercionSuiteBase {
@@ -1047,4 +1048,54 @@ class AnsiTypeCoercionSuite extends TypeCoercionSuiteBase {
 AnsiTypeCoercion.GetDateFieldOperations, operation(ts), 
operation

[jira] [Resolved] (SPARK-49183) V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION

2024-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49183.
-
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47684
[https://github.com/apache/spark/pull/47684]

> V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION
> 
>
> Key: SPARK-49183
> URL: https://issues.apache.org/jira/browse/SPARK-49183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49183) V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION

2024-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49183:
---

Assignee: Wenchen Fan

> V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION
> 
>
> Key: SPARK-49183
> URL: https://issues.apache.org/jira/browse/SPARK-49183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-49183][SQL] V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION

2024-08-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 204dd814e39e [SPARK-49183][SQL] V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION
204dd814e39e is described below

commit 204dd814e39efecd83623069351fc0c46fa38474
Author: Wenchen Fan 
AuthorDate: Mon Aug 12 11:19:47 2024 +0800

[SPARK-49183][SQL] V2SessionCatalog.createTable should respect 
PROP_IS_MANAGED_LOCATION

Even if the table definition has a location, the table should be a managed 
table if `PROP_IS_MANAGED_LOCATION` is specified, in 
`V2SessionCatalog.createTable`.

It's a bug fix. A custom `spark_catalog` may generate custom location for 
managed table and delegate the actual table creation to `V2SessionCatalog`. The 
table should still be a managed table if `PROP_IS_MANAGED_LOCATION` is 
specified.

Yes, now users who use custom `spark_catalog` that generates custom 
location for managed table, can correctly create managed tables.

a new test

no

Closes #47684 from cloud-fan/table.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit ed04e160940a62fe3f5e82bda502941c7aa75a29)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  |  3 +-
 .../datasources/v2/ShowCreateTableExec.scala   |  2 +-
 .../datasources/v2/V2SessionCatalog.scala  |  4 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 34 +-
 4 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index d99e7e14b011..29c2da307a0f 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -52,7 +52,8 @@ public interface TableCatalog extends CatalogPlugin {
 
   /**
    * A reserved property to indicate that the table location is managed, not user-specified.
-   * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause.
+   * If this property is "true", it means it's a managed table even if it has a location. As an
+   * example, SHOW CREATE TABLE will not generate the LOCATION clause.
*/
   String PROP_IS_MANAGED_LOCATION = "is_managed_location";
 
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
index 6fa51ed63bd4..64f76f59c286 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
@@ -120,7 +120,7 @@ case class ShowCreateTableExec(
   private def showTableLocation(table: Table, builder: StringBuilder): Unit = {
     val isManagedOption = Option(table.properties.get(TableCatalog.PROP_IS_MANAGED_LOCATION))
     // Only generate LOCATION clause if it's not managed.
-    if (isManagedOption.forall(_.equalsIgnoreCase("false"))) {
+    if (isManagedOption.isEmpty || !isManagedOption.get.equalsIgnoreCase("true")) {
       Option(table.properties.get(TableCatalog.PROP_LOCATION))
         .map("LOCATION '" + escapeSingleQuotedString(_) + "'\n")
         .foreach(builder.append)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
index a7062a9a596c..d7ab23cf08dd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
@@ -121,7 +121,9 @@ class V2SessionCatalog(catalog: SessionCatalog)
     val storage = DataSource.buildStorageFormatFromOptions(toOptions(tableProperties.toMap))
       .copy(locationUri = location.map(CatalogUtils.stringToURI))
     val isExternal = properties.containsKey(TableCatalog.PROP_EXTERNAL)
-    val tableType = if (isExternal || location.isDefined) {
+    val isManagedLocation = Option(properties.get(TableCatalog.PROP_IS_MANAGED_LOCATION))
+      .exists(_.equalsIgnoreCase("true"))
+    val tableType = if (isExternal || (location.isDefined && !isManagedLocation)) {
       CatalogTableType.EXTERNAL
     } else {
       CatalogTableType.MANAGED
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataS

(spark) branch master updated: [SPARK-49183][SQL] V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION

2024-08-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ed04e160940a [SPARK-49183][SQL] V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION
ed04e160940a is described below

commit ed04e160940a62fe3f5e82bda502941c7aa75a29
Author: Wenchen Fan 
AuthorDate: Mon Aug 12 11:19:47 2024 +0800

[SPARK-49183][SQL] V2SessionCatalog.createTable should respect 
PROP_IS_MANAGED_LOCATION

### What changes were proposed in this pull request?

Even if the table definition has a location, the table should be a managed 
table if `PROP_IS_MANAGED_LOCATION` is specified, in 
`V2SessionCatalog.createTable`.

### Why are the changes needed?

It's a bug fix. A custom `spark_catalog` may generate custom location for 
managed table and delegate the actual table creation to `V2SessionCatalog`. The 
table should still be a managed table if `PROP_IS_MANAGED_LOCATION` is 
specified.
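
Concretely, the table-type decision becomes: external only if PROP_EXTERNAL is set, or if a location is present that the catalog did not mark as managed. A small sketch of that rule with illustrative names, not the actual V2SessionCatalog code:
```scala
// Illustrative decision rule only; names and types are simplified.
def tableType(isExternal: Boolean, hasLocation: Boolean, isManagedLocation: Boolean): String =
  if (isExternal || (hasLocation && !isManagedLocation)) "EXTERNAL" else "MANAGED"

// A custom spark_catalog that generates a location and marks it as managed
// now gets a MANAGED table ...
assert(tableType(isExternal = false, hasLocation = true, isManagedLocation = true) == "MANAGED")
// ... while a user-specified LOCATION without the property stays EXTERNAL.
assert(tableType(isExternal = false, hasLocation = true, isManagedLocation = false) == "EXTERNAL")
```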

### Does this PR introduce _any_ user-facing change?

Yes, now users who use custom `spark_catalog` that generates custom 
location for managed table, can correctly create managed tables.

### How was this patch tested?

a new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47684 from cloud-fan/table.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/TableCatalog.java  |  3 +-
 .../datasources/v2/ShowCreateTableExec.scala   |  2 +-
 .../datasources/v2/V2SessionCatalog.scala  |  4 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 34 +-
 4 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
index 74700789dde0..37b0589ab511 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -52,7 +52,8 @@ public interface TableCatalog extends CatalogPlugin {
 
   /**
    * A reserved property to indicate that the table location is managed, not user-specified.
-   * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause.
+   * If this property is "true", it means it's a managed table even if it has a location. As an
+   * example, SHOW CREATE TABLE will not generate the LOCATION clause.
*/
   String PROP_IS_MANAGED_LOCATION = "is_managed_location";
 
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
index 102214e36c91..37339a34af3d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala
@@ -120,7 +120,7 @@ case class ShowCreateTableExec(
   private def showTableLocation(table: Table, builder: StringBuilder): Unit = {
     val isManagedOption = Option(table.properties.get(TableCatalog.PROP_IS_MANAGED_LOCATION))
     // Only generate LOCATION clause if it's not managed.
-    if (isManagedOption.forall(_.equalsIgnoreCase("false"))) {
+    if (isManagedOption.isEmpty || !isManagedOption.get.equalsIgnoreCase("true")) {
       Option(table.properties.get(TableCatalog.PROP_LOCATION))
         .map("LOCATION '" + escapeSingleQuotedString(_) + "'\n")
         .foreach(builder.append)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
index bc1e2c92faa8..c12e507ecd35 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala
@@ -170,7 +170,9 @@ class V2SessionCatalog(catalog: SessionCatalog)
     val storage = DataSource.buildStorageFormatFromOptions(toOptions(tableProperties))
       .copy(locationUri = location.map(CatalogUtils.stringToURI))
     val isExternal = properties.containsKey(TableCatalog.PROP_EXTERNAL)
-    val tableType = if (isExternal || location.isDefined) {
+    val isManagedLocation = Option(properties.get(TableCatalog.PROP_IS_MANAGED_LOCATION))
+      .exists(_.equalsIgnoreCase("true"))
+    val tableType = if (isExternal || (location.isDefined && !isManagedLocation)) {

[jira] [Assigned] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48937:
---

Assignee: Uroš Bojanić

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for *StringToMap* built-in string function in Spark 
> ({*}str_to_map{*}). First confirm what is the expected behaviour for this 
> function when given collated strings, and then move on to implementation and 
> testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementation of similar functions within other 
> open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48937.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47621
[https://github.com/apache/spark/pull/47621]

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for *StringToMap* built-in string function in Spark 
> ({*}str_to_map{*}). First confirm what is the expected behaviour for this 
> function when given collated strings, and then move on to implementation and 
> testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementation of similar functions within other 
> open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48937][SQL] Add collation support for StringToMap string expressions

2024-08-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d8aff6e04941 [SPARK-48937][SQL] Add collation support for StringToMap string expressions
d8aff6e04941 is described below

commit d8aff6e0494198c48962fbe6e6b93edff2855067
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Fri Aug 9 22:00:10 2024 +0800

[SPARK-48937][SQL] Add collation support for StringToMap string expressions

### What changes were proposed in this pull request?
Add collation awareness for `StringToMap` string expression.

### Why are the changes needed?
`StringToMap` should be collation aware when splitting strings on specified 
delimiters.

### Does this PR introduce _any_ user-facing change?
Yes, `StringToMap` is now collation aware.

### How was this patch tested?
New unit tests and e2e sql tests for `str_to_map`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47621 from uros-db/fix-str-to-map.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 57 ++
 .../spark/sql/catalyst/util/CollationSupport.java  | 30 ++--
 .../expressions/codegen/CodeGenerator.scala|  3 +-
 .../catalyst/expressions/complexTypeCreator.scala  | 13 +++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 40 +--
 5 files changed, 96 insertions(+), 47 deletions(-)

diff --git a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 501d173fc485..b57f172428ac 100644
--- a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -32,10 +32,13 @@ import static org.apache.spark.unsafe.types.UTF8String.CodePointIteratorType;
 
 import java.text.CharacterIterator;
 import java.text.StringCharacterIterator;
+import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Iterator;
+import java.util.List;
 import java.util.Map;
+import java.util.regex.Pattern;
 
 /**
  * Utility class for collation-aware UTF8String operations.
@@ -1226,6 +1229,60 @@ public class CollationAwareUTF8String {
 return UTF8String.fromString(src.substring(0, charIndex));
   }
 
+  public static UTF8String[] splitSQL(final UTF8String input, final UTF8String delim,
+      final int limit, final int collationId) {
+    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+      return input.split(delim, limit);
+    } else if (CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
+      return lowercaseSplitSQL(input, delim, limit);
+    } else {
+      return icuSplitSQL(input, delim, limit, collationId);
+    }
+  }
+
+  public static UTF8String[] lowercaseSplitSQL(final UTF8String string, final UTF8String delimiter,
+      final int limit) {
+    if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+    if (string.numBytes() == 0) return new UTF8String[] { UTF8String.EMPTY_UTF8 };
+    Pattern pattern = Pattern.compile(Pattern.quote(delimiter.toString()),
+      CollationSupport.lowercaseRegexFlags);
+    String[] splits = pattern.split(string.toString(), limit);
+    UTF8String[] res = new UTF8String[splits.length];
+    for (int i = 0; i < res.length; i++) {
+      res[i] = UTF8String.fromString(splits[i]);
+    }
+    return res;
+  }
+
+  public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+    if (string.numBytes() == 0) return new UTF8String[] { UTF8String.EMPTY_UTF8 };
+    List<UTF8String> strings = new ArrayList<>();
+    String target = string.toString(), pattern = delimiter.toString();
+    StringSearch stringSearch = CollationFactory.getStringSearch(target, pattern, collationId);
+    int start = 0, end;
+    while ((end = stringSearch.next()) != StringSearch.DONE) {
+      if (limit > 0 && strings.size() == limit - 1) {
+        break;
+      }
+      strings.add(UTF8String.fromString(target.substring(start, end)));
+      start = end + stringSearch.getMatchLength();
+    }
+    if (start <= target.length()) {
+      strings.add(UTF8String.fromString(target.substring(start)));
+    }
+    if (limit == 0) {
+      // Remove trailing empty strings
+      int i = strings.size() - 1;
+      while (i >

Re: [VOTE] Release Spark 3.5.2 (RC5)

2024-08-09 Thread Wenchen Fan
+1

On Fri, Aug 9, 2024 at 6:04 PM Peter Toth  wrote:

> +1
>
> huaxin gao  ezt írta (időpont: 2024. aug. 8., Cs,
> 21:19):
>
>> +1
>>
>> On Thu, Aug 8, 2024 at 11:41 AM L. C. Hsieh  wrote:
>>
>>> Then,
>>>
>>> +1 again
>>>
>>> On Thu, Aug 8, 2024 at 11:38 AM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > I'm resending my vote.
>>> >
>>> > Dongjoon.
>>> >
>>> > On 2024/08/06 16:06:00 Kent Yao wrote:
>>> > > Hi dev,
>>> > >
>>> > > Please vote on releasing the following candidate as Apache Spark
>>> version 3.5.2.
>>> > >
>>> > > The vote is open until Aug 9, 17:00:00 UTC, and passes if a majority
>>> +1
>>> > > PMC votes are cast, with a minimum of 3 +1 votes.
>>> > >
>>> > > [ ] +1 Release this package as Apache Spark 3.5.2
>>> > > [ ] -1 Do not release this package because ...
>>> > >
>>> > > To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>> > >
>>> > > The tag to be voted on is v3.5.2-rc5 (commit
>>> > > bb7846dd487f259994fdc69e18e03382e3f64f42):
>>> > > https://github.com/apache/spark/tree/v3.5.2-rc5
>>> > >
>>> > > The release files, including signatures, digests, etc. can be found
>>> at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/
>>> > >
>>> > > Signatures used for Spark RCs can be found in this file:
>>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> > >
>>> > > The staging repository for this release can be found at:
>>> > >
>>> https://repository.apache.org/content/repositories/orgapachespark-1462/
>>> > >
>>> > > The documentation corresponding to this release can be found at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
>>> > >
>>> > > The list of bug fixes going into 3.5.2 can be found at the following
>>> URL:
>>> > > https://issues.apache.org/jira/projects/SPARK/versions/12353980
>>> > >
>>> > > FAQ
>>> > >
>>> > > =
>>> > > How can I help test this release?
>>> > > =
>>> > >
>>> > > If you are a Spark user, you can help us test this release by taking
>>> > > an existing Spark workload and running on this release candidate,
>>> then
>>> > > reporting any regressions.
>>> > >
>>> > > If you're working in PySpark you can set up a virtual env and install
>>> > > the current RC via "pip install
>>> > >
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/pyspark-3.5.2.tar.gz
>>> "
>>> > > and see if anything important breaks.
>>> > > In the Java/Scala, you can add the staging repository to your
>>> projects
>>> > > resolvers and test
>>> > > with the RC (make sure to clean up the artifact cache before/after so
>>> > > you don't end up building with an out of date RC going forward).
>>> > >
>>> > > ===
>>> > > What should happen to JIRA tickets still targeting 3.5.2?
>>> > > ===
>>> > >
>>> > > The current list of open tickets targeted at 3.5.2 can be found at:
>>> > > https://issues.apache.org/jira/projects/SPARK and search for
>>> > > "Target Version/s" = 3.5.2
>>> > >
>>> > > Committers should look at those and triage. Extremely important bug
>>> > > fixes, documentation, and API tweaks that impact compatibility should
>>> > > be worked on immediately. Everything else please retarget to an
>>> > > appropriate release.
>>> > >
>>> > > ==
>>> > > But my bug isn't fixed?
>>> > > ==
>>> > >
>>> > > In order to make timely releases, we will typically not hold the
>>> > > release unless the bug in question is a regression from the previous
>>> > > release. That being said, if there is something which is a regression
>>> > > that has not been correctly targeted please ping me or a committer to
>>> > > help target the issue.
>>> > >
>>> > > Thanks,
>>> > > Kent Yao
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


[jira] [Resolved] (SPARK-49163) Attempt to create table based on broken parquet partition data results in internal error

2024-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49163.
-
Resolution: Fixed

Issue resolved by pull request 47668
[https://github.com/apache/spark/pull/47668]

> Attempt to create table based on broken parquet partition data results in 
> internal error
> 
>
> Key: SPARK-49163
> URL: https://issues.apache.org/jira/browse/SPARK-49163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Create an example parquet table with partitions and insert data in Spark:
> {code:java}
> create table t(col1 string, col2 string, col3 string) using parquet location 
> 'some/path/parquet-test' partitioned by (col1, col2);
> insert into t (col1, col2, col3) values ('a', 'b', 'c');{code}
> Go into the parquet-test path in the filesystem and try to copy parquet data 
> file from path col1=a/col2=b directory into col1=a. After that, try to create 
> new table based on parquet data in Spark:
> {code:java}
> create table broken_table using parquet location 'some/path/parquet-test'; 
> {code}
> This query errors with internal error. Stack trace excerpts:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Eagerly executed command 
> failed. You hit a bug in Spark or the Spark plugins you use. Please, report 
> this bug to the corresponding communities or vendors, and provide the full 
> stack trace. SQLSTATE: XX000
> ...
> Caused by: java.lang.AssertionError: assertion failed: Conflicting partition 
> column names detected:        Partition column name list #0: col1
>         Partition column name list #1: col1, col2For partitioned table 
> directories, data files should only live in leaf directories.
> And directories at the same level should have the same partition column name.
> Please check the following directories for unexpected files or inconsistent 
> partition column names:        file:some/path/parquet-test/col1=a
>         file:some/path/parquet-test/col1=a/col2=b
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:391)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49163) Attempt to create table based on broken parquet partition data results in internal error

2024-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49163:
---

Assignee: Nikola Mandic

> Attempt to create table based on broken parquet partition data results in 
> internal error
> 
>
> Key: SPARK-49163
> URL: https://issues.apache.org/jira/browse/SPARK-49163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Create an example parquet table with partitions and insert data in Spark:
> {code:java}
> create table t(col1 string, col2 string, col3 string) using parquet location 
> 'some/path/parquet-test' partitioned by (col1, col2);
> insert into t (col1, col2, col3) values ('a', 'b', 'c');{code}
> Go into the parquet-test path in the filesystem and try to copy parquet data 
> file from path col1=a/col2=b directory into col1=a. After that, try to create 
> new table based on parquet data in Spark:
> {code:java}
> create table broken_table using parquet location 'some/path/parquet-test'; 
> {code}
> This query errors with internal error. Stack trace excerpts:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Eagerly executed command 
> failed. You hit a bug in Spark or the Spark plugins you use. Please, report 
> this bug to the corresponding communities or vendors, and provide the full 
> stack trace. SQLSTATE: XX000
> ...
> Caused by: java.lang.AssertionError: assertion failed: Conflicting partition 
> column names detected:        Partition column name list #0: col1
>         Partition column name list #1: col1, col2For partitioned table 
> directories, data files should only live in leaf directories.
> And directories at the same level should have the same partition column name.
> Please check the following directories for unexpected files or inconsistent 
> partition column names:        file:some/path/parquet-test/col1=a
>         file:some/path/parquet-test/col1=a/col2=b
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:391)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49163][SQL] Attempt to create table based on broken parquet partition data results should return user-facing error

2024-08-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fd3069ab113c [SPARK-49163][SQL] Attempt to create table based on broken parquet partition data results should return user-facing error
fd3069ab113c is described below

commit fd3069ab113c12a6dfa338abfd8c99acd707dfbd
Author: Nikola Mandic 
AuthorDate: Fri Aug 9 21:50:48 2024 +0800

[SPARK-49163][SQL] Attempt to create table based on broken parquet 
partition data results should return user-facing error

### What changes were proposed in this pull request?

Create an example parquet table with partitions and insert data in Spark:
```
create table t(col1 string, col2 string, col3 string) using parquet location 'some/path/parquet-test' partitioned by (col1, col2);
insert into t (col1, col2, col3) values ('a', 'b', 'c');
```
Go into the `parquet-test` path in the filesystem and try to copy parquet 
data file from path `col1=a/col2=b` directory into `col1=a`. After that, try to 
create new table based on parquet data in Spark:
```
create table broken_table using parquet location 'some/path/parquet-test';
```
This query errors with internal error. Stack trace excerpts:
```
org.apache.spark.SparkException: [INTERNAL_ERROR] Eagerly executed command 
failed. You hit a bug in Spark or the Spark plugins you use. Please, report 
this bug to the corresponding communities or vendors, and provide the full 
stack trace. SQLSTATE: XX000
...
Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
        Partition column name list #0: col1
        Partition column name list #1: col1, col2
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:
        file:some/path/parquet-test/col1=a
        file:some/path/parquet-test/col1=a/col2=b
  at scala.Predef$.assert(Predef.scala:279)
  at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:391)
...
```
Fix this by changing internal error to user-facing error.

### Why are the changes needed?

Replace internal error with user-facing one for valid sequence of Spark SQL 
operations.

### Does this PR introduce _any_ user-facing change?

Yes, it presents the user with regular error instead of internal error.

### How was this patch tested?

Added checks to `ParquetPartitionDiscoverySuite` which simulate the 
described scenario by manually breaking parquet table in the filesystem.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47668 from nikolamand-db/SPARK-49163.

    Authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |  11 +++
 .../spark/sql/errors/QueryExecutionErrors.scala|  12 +++
 .../execution/datasources/PartitioningUtils.scala  |  20 ++--
 .../sql/execution/datasources/FileIndexSuite.scala |   2 +-
 .../parquet/ParquetPartitionDiscoverySuite.scala   | 110 ++---
 5 files changed, 108 insertions(+), 47 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json b/common/utils/src/main/resources/error/error-conditions.json
index 4766c7790915..3512fe34e92a 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -625,6 +625,17 @@
 ],
 "sqlState" : "4"
   },
+  "CONFLICTING_PARTITION_COLUMN_NAMES" : {
+"message" : [
+  "Conflicting partition column names detected:",
+  "",
+  "For partitioned table directories, data files should only live in leaf 
directories.",
+  "And directories at the same level should have the same partition column 
name.",
+  "Please check the following directories for unexpected files or 
inconsistent partition column names:",
+  ""
+],
+"sqlState" : "KD009"
+  },
   "CONNECT" : {
 "message" : [
   "Generic Spark Connect error."
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 9bfb81ad821b..eb25387af5a7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.

[jira] [Created] (SPARK-49183) V2SessionCatalog.createTable should respect PROP_IS_MANAGED_LOCATION

2024-08-09 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49183:
---

 Summary: V2SessionCatalog.createTable should respect 
PROP_IS_MANAGED_LOCATION
 Key: SPARK-49183
 URL: https://issues.apache.org/jira/browse/SPARK-49183
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Wenchen Fan
It makes sense to me to only keep the doc files for the latest
maintenance release. i.e. remove the docs for 3.5.0 and only keep 3.5.1.

On Thu, Aug 8, 2024 at 8:06 PM Kent Yao  wrote:

> Hi dev,
>
> The current size of the spark-website repository is approximately 16GB,
> exceeding the storage limit of GitHub-hosted runners.  The GitHub actions
> have been failing recently in the actions/checkout step caused by
> 'No space left on device' errors.
>
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  73G   58G   16G  80% /
> tmpfs64M 0   64M   0% /dev
> tmpfs   7.9G 0  7.9G   0% /sys/fs/cgroup
> shm  64M 0   64M   0% /dev/shm
> /dev/root73G   58G   16G  80% /__w
> tmpfs   1.6G  1.2M  1.6G   1% /run/docker.sock
> tmpfs   7.9G 0  7.9G   0% /proc/acpi
> tmpfs   7.9G 0  7.9G   0% /proc/scsi
> tmpfs   7.9G 0  7.9G   0% /sys/firmware
>
>
> The documentation for each version contributes the most volume. Since
> version
>  3.5.0, the documentation size has grown 3-4 times larger than the
> size of 3.4.x,
>  with more than 1GB.
>
>
> 9.9M ./0.6.0
>  10M ./0.6.1
>  10M ./0.6.2
>  15M ./0.7.0
>  16M ./0.7.2
>  16M ./0.7.3
>  20M ./0.8.0
>  20M ./0.8.1
>  38M ./0.9.0
>  38M ./0.9.1
>  38M ./0.9.2
>  36M ./1.0.0
>  38M ./1.0.1
>  38M ./1.0.2
>  48M ./1.1.0
>  48M ./1.1.1
>  73M ./1.2.0
>  73M ./1.2.1
>  74M ./1.2.2
>  69M ./1.3.0
>  73M ./1.3.1
>  68M ./1.4.0
>  70M ./1.4.1
>  80M ./1.5.0
>  78M ./1.5.1
>  78M ./1.5.2
>  87M ./1.6.0
>  87M ./1.6.1
>  87M ./1.6.2
>  86M ./1.6.3
> 117M ./2.0.0
> 119M ./2.0.0-preview
> 118M ./2.0.1
> 118M ./2.0.2
> 121M ./2.1.0
> 121M ./2.1.1
> 122M ./2.1.2
> 122M ./2.1.3
> 130M ./2.2.0
> 131M ./2.2.1
> 132M ./2.2.2
> 131M ./2.2.3
> 141M ./2.3.0
> 141M ./2.3.1
> 141M ./2.3.2
> 142M ./2.3.3
> 142M ./2.3.4
> 145M ./2.4.0
> 146M ./2.4.1
> 145M ./2.4.2
> 144M ./2.4.3
> 145M ./2.4.4
> 143M ./2.4.5
> 143M ./2.4.6
> 143M ./2.4.7
> 143M ./2.4.8
> 197M ./3.0.0
> 185M ./3.0.0-preview
> 197M ./3.0.0-preview2
> 198M ./3.0.1
> 198M ./3.0.2
> 205M ./3.0.3
> 239M ./3.1.1
> 239M ./3.1.2
> 239M ./3.1.3
> 840M ./3.2.0
> 842M ./3.2.1
> 282M ./3.2.2
> 244M ./3.2.3
> 282M ./3.2.4
> 295M ./3.3.0
> 297M ./3.3.1
> 297M ./3.3.2
> 297M ./3.3.3
> 297M ./3.3.4
> 314M ./3.4.0
> 314M ./3.4.1
> 328M ./3.4.2
> 324M ./3.4.3
> 1.1G ./3.5.0
> 1.2G ./3.5.1
> 1.1G ./4.0.0-preview1
>
> I'm concerned about publishing the documentation for version 3.5.2
> to the asf-site. So, I have merged PR[2] to eliminate this potential
> blocker.
>
> Considering that the problem still exists, should we temporarily archive
> some of the outdated version documents? For example, only keep
> the latest version for each feature release in the asf-site branch. Or,
> Do you have any other suggestions?
>
>
> Bests,
> Kent Yao
>
>
> [1]
> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
> [2] https://github.com/apache/spark-website/pull/543
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-49138) Fix CollationTypeCasts of several expressions

2024-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49138.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47649
[https://github.com/apache/spark/pull/47649]

> Fix CollationTypeCasts of several expressions
> -
>
> Key: SPARK-49138
> URL: https://issues.apache.org/jira/browse/SPARK-49138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-49138][SQL] Fix CollationTypeCasts of several expressions

2024-08-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 01a6991a338c [SPARK-49138][SQL] Fix CollationTypeCasts of several 
expressions
01a6991a338c is described below

commit 01a6991a338cc3f0eac61e97470ef9dbdb170971
Author: Mihailo Milosevic 
AuthorDate: Thu Aug 8 21:55:12 2024 +0800

[SPARK-49138][SQL] Fix CollationTypeCasts of several expressions

### What changes were proposed in this pull request?
Fix for CreateMap and ArrayAppend expressions.

### Why are the changes needed?
While adding TypeCoercion rules for collations, these two expressions were
missed, so this PR adds them to the rules so that they follow the expected behaviour.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added testcases in CollationSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.
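
For illustration, here is an editor's sketch (not part of the original patch) of
the kind of query these added rules now coerce; it assumes an active `spark`
session with collations enabled:

```scala
// Editor's sketch: CreateMap keys and ArrayAppend elements are coerced to a
// single collation instead of failing with mixed collations.
val df = spark.sql(
  """SELECT
    |  map('a' COLLATE UNICODE, 1, 'b', 2) AS m,
    |  array_append(array('a', 'b'), 'c' COLLATE UNICODE) AS arr
    |""".stripMargin)
df.printSchema()  // the map key type and array element type carry the UNICODE collation
```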

Closes #47649 from mihailom-db/fixtypecoercion.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CollationTypeCasts.scala |  9 -
 .../org/apache/spark/sql/CollationSuite.scala  | 38 +-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
index 276062ce211d..9c7b5aaecd78 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
@@ -83,10 +83,17 @@ object CollationTypeCasts extends TypeCoercionRule {
   val Seq(newInput, newDefault) = collateToSingleType(Seq(input, default))
   framelessOffsetWindow.withNewChildren(Seq(newInput, offset, newDefault))
 
+case mapCreate : CreateMap if mapCreate.children.size % 2 == 0 =>
+  // We only take in mapCreate if it has even number of children, as 
otherwise it should fail
+  // with wrong number of arguments
+  val newKeys = collateToSingleType(mapCreate.keys)
+  val newValues = collateToSingleType(mapCreate.values)
+  mapCreate.withNewChildren(newKeys.zip(newValues).flatMap(pair => 
Seq(pair._1, pair._2)))
+
 case otherExpr @ (
   _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: 
Greatest | _: Least |
   _: Coalesce | _: ArrayContains | _: ArrayExcept | _: ConcatWs | _: Mask 
| _: StringReplace |
-  _: StringTranslate | _: StringTrim | _: StringTrimLeft | _: 
StringTrimRight |
+  _: StringTranslate | _: StringTrim | _: StringTrimLeft | _: 
StringTrimRight | _: ArrayAppend |
   _: ArrayIntersect | _: ArrayPosition | _: ArrayRemove | _: ArrayUnion | 
_: ArraysOverlap |
   _: Contains | _: EndsWith | _: EqualNullSafe | _: EqualTo | _: FindInSet 
| _: GreaterThan |
   _: GreaterThanOrEqual | _: LessThan | _: LessThanOrEqual | _: StartsWith 
| _: StringInstr |
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
index e9e3432195a4..dd678ac48c68 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -34,7 +34,7 @@ import 
org.apache.spark.sql.execution.aggregate.{HashAggregateExec, ObjectHashAg
 import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.internal.{SqlApiConf, SQLConf}
-import org.apache.spark.sql.types.{MapType, StringType, StructField, 
StructType}
+import org.apache.spark.sql.types.{ArrayType, MapType, StringType, 
StructField, StructType}
 
 class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
@@ -579,6 +579,42 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("SPARK-49138: ArrayAppend and CreateMap coercion testing") {
+val df_array_append = sql(s"SELECT array_append(array('a', 'b'), 'c' 
COLLATE UNICODE)")
+// array_append expression
+checkAnswer(df_array_append, Seq(Row(Seq("a", "b", "c"
+assert(df_array_append.schema.head.dataType == 
ArrayType(StringType("UNICODE"), true))
+
+// make sure we fail this query even when collations are in
+checkError(
+  exception = intercept[AnalysisException] {
+sql("select map('a' COLLATE UTF8_LCASE, 'b', 'c')")
+

(spark) branch master updated: [SPARK-47410][SQL][FOLLOWUP] Update StringType interface for collation support

2024-08-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 61d8c3b2b45f [SPARK-47410][SQL][FOLLOWUP] Update StringType interface 
for collation support
61d8c3b2b45f is described below

commit 61d8c3b2b45f6fc454690d575ac449a30003ac23
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Thu Aug 8 21:50:42 2024 +0800

[SPARK-47410][SQL][FOLLOWUP] Update StringType interface for collation 
support

### What changes were proposed in this pull request?
Following up on SPARK-47410, add `supportsLowercaseEquality` to 
`StringType`.

### Why are the changes needed?
Make the `StringType` API complete.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.
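
As a quick illustration (editor's sketch, not from the patch), the completed
`StringType` API can be exercised like this, using the built-in collation names:

```scala
import org.apache.spark.sql.types.StringType

val lcase = StringType("UTF8_LCASE")
val binary = StringType("UTF8_BINARY")
assert(lcase.supportsLowercaseEquality)   // UTF8_LCASE compares strings after lowercasing
assert(binary.supportsBinaryEquality)     // UTF8_BINARY compares raw bytes
assert(!binary.supportsLowercaseEquality)
```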

Closes #47622 from uros-db/update-string-type.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
index 424f35135fb4..383c58210d8b 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
@@ -39,6 +39,9 @@ class StringType private(val collationId: Int) extends 
AtomicType with Serializa
   def supportsBinaryEquality: Boolean =
 CollationFactory.fetchCollation(collationId).supportsBinaryEquality
 
+  def supportsLowercaseEquality: Boolean =
+CollationFactory.fetchCollation(collationId).supportsLowercaseEquality
+
   def isUTF8BinaryCollation: Boolean =
 collationId == CollationFactory.UTF8_BINARY_COLLATION_ID
 





[jira] [Assigned] (SPARK-49139) Enable collations by default for Spark 4.0.0 Preview 2

2024-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49139:
---

Assignee: Uroš Bojanić

> Enable collations by default for Spark 4.0.0 Preview 2
> --
>
> Key: SPARK-49139
> URL: https://issues.apache.org/jira/browse/SPARK-49139
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-49139) Enable collations by default for Spark 4.0.0 Preview 2

2024-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49139.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47650
[https://github.com/apache/spark/pull/47650]

> Enable collations by default for Spark 4.0.0 Preview 2
> --
>
> Key: SPARK-49139
> URL: https://issues.apache.org/jira/browse/SPARK-49139
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-49125) Allow writing CSV with duplicated column names

2024-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49125.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47633
[https://github.com/apache/spark/pull/47633]

> Allow writing CSV with duplicated column names
> --
>
> Key: SPARK-49125
> URL: https://issues.apache.org/jira/browse/SPARK-49125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated (a713a66e1c4f -> 43d881a94f2e)

2024-08-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a713a66e1c4f [SPARK-49142][CONNECT][PYTHON] Lower Spark Connect client 
log level to debug
 add 43d881a94f2e [SPARK-49139][SQL] Enable collations by default

No new revisions were added by this update.

Summary of changes:
 .../expressions/collationExpressions.scala | 14 --
 .../spark/sql/catalyst/parser/AstBuilder.scala |  3 ---
 .../spark/sql/errors/QueryCompilationErrors.scala  |  6 -
 .../org/apache/spark/sql/internal/SQLConf.scala| 10 ---
 .../spark/sql/execution/SparkSqlParser.scala   |  3 ---
 .../spark/sql/execution/datasources/rules.scala| 22 +--
 .../sql/internal/BaseSessionStateBuilder.scala |  1 -
 .../sql/errors/QueryCompilationErrorsSuite.scala   | 31 --
 .../apache/spark/sql/internal/SQLConfSuite.scala   |  8 --
 .../spark/sql/hive/HiveSessionStateBuilder.scala   |  1 -
 10 files changed, 1 insertion(+), 98 deletions(-)





[jira] [Assigned] (SPARK-49125) Allow writing CSV with duplicated column names

2024-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49125:
---

Assignee: Wenchen Fan

> Allow writing CSV with duplicated column names
> --
>
> Key: SPARK-49125
> URL: https://issues.apache.org/jira/browse/SPARK-49125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







(spark) branch master updated: [SPARK-49125][SQL] Allow duplicated column names in CSV writing

2024-08-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d0d98c15182d [SPARK-49125][SQL] Allow duplicated column names in CSV 
writing
d0d98c15182d is described below

commit d0d98c15182d56b29fc9e6570a4b5c7cff35eb83
Author: Wenchen Fan 
AuthorDate: Thu Aug 8 10:12:52 2024 +0800

[SPARK-49125][SQL] Allow duplicated column names in CSV writing

### What changes were proposed in this pull request?

In file source writing, we disallow duplicated column names in the input
query for all formats, because most formats don't handle duplicate columns
well. However, this is overly restrictive, since some formats do allow
duplicated column names, and CSV is one of them.

This PR extends the `FileFormat` API to indicate whether a format allows
duplicated column names, and only performs the duplicated-name check for
formats that don't allow it.

### Why are the changes needed?

Enable more use cases for the CSV data source.

### Does this PR introduce _any_ user-facing change?

Yes, now users can write to CSV with duplicated column names.

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no
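
For illustration, an editor's sketch (not from the patch) of a write that the
relaxed check now accepts; the output path is hypothetical:

```scala
// Two output columns share the name `col`; CSV now permits this on write.
val df = spark.range(3).selectExpr("id AS col", "id * 2 AS col")
df.write.mode("overwrite").csv("/tmp/csv_dup_cols")
// Formats that keep the default allowDuplicatedColumnNames = false still fail the check.
```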

Closes #47633 from cloud-fan/csv.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/FileFormat.scala |  5 +
 .../datasources/InsertIntoHadoopFsRelationCommand.scala  |  9 +
 .../sql/execution/datasources/csv/CSVFileFormat.scala|  1 +
 .../spark/sql/execution/datasources/v2/FileWrite.scala   |  8 +---
 .../sql/execution/datasources/v2/csv/CSVWrite.scala  |  3 +++
 .../spark/sql/execution/datasources/csv/CSVSuite.scala   | 16 
 .../spark/sql/test/DataFrameReaderWriterSuite.scala  |  1 -
 7 files changed, 35 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
index 36c59950fe20..7788f3287ac4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
@@ -188,6 +188,11 @@ trait FileFormat {
*/
   def supportFieldName(name: String): Boolean = true
 
+  /**
+   * Returns whether this format allows duplicated column names in the input 
query during writing.
+   */
+  def allowDuplicatedColumnNames: Boolean = false
+
   /**
* All fields the file format's _metadata struct defines.
*
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
index fe6ec094812e..aed129c7dccc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
@@ -79,10 +79,11 @@ case class InsertIntoHadoopFsRelationCommand(
   staticPartitions.size)
 
   override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
-// Most formats don't do well with duplicate columns, so lets not allow 
that
-SchemaUtils.checkColumnNameDuplication(
-  outputColumnNames,
-  sparkSession.sessionState.conf.caseSensitiveAnalysis)
+if (!fileFormat.allowDuplicatedColumnNames) {
+  SchemaUtils.checkColumnNameDuplication(
+outputColumnNames,
+sparkSession.sessionState.conf.caseSensitiveAnalysis)
+}
 
 val hadoopConf = 
sparkSession.sessionState.newHadoopConfWithOptions(options)
 val fs = outputPath.getFileSystem(hadoopConf)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
index 931a3610507f..da29642a41cc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
@@ -157,4 +157,5 @@ class CSVFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
 case _ => false
   }
 
+  override def allowDuplicatedColumnNames: Boolean = true
 }
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWrite.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWrite.scala
index 52f44e33ea11..cdcf6f21fd00 100644
--- 
a/sql/core/sr

(spark) branch master updated (435a01a7fc2d -> 8a8f65d2a2ec)

2024-08-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 435a01a7fc2d [SPARK-49115][DOCS] Add docs for state metadata source 
for operators using schema format v2 such as transformWithState
 add 8a8f65d2a2ec [SPARK-48843][FOLLOWUP] Adding more tests that would have 
failed had the fix ap…

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_connect_basic.py   | 6 ++
 .../src/test/scala/org/apache/spark/sql/ParametersSuite.scala| 9 +
 2 files changed, 15 insertions(+)





(spark) branch master updated (ce4f185a98be -> 7769ef11f367)

2024-08-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ce4f185a98be [SPARK-49134][INFRA] Support retry for deploying 
artifacts to Nexus staging repository
 add 7769ef11f367 [SPARK-49082][SQL] Widening type promotions in 
`AvroDeserializer`

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  9 ++
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 33 ++
 2 files changed, 42 insertions(+)





(spark) branch master updated: [SPARK-48989][SQL][FOLLOWUP] Fix SubstringIndex codegen

2024-08-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c3985ac256e4 [SPARK-48989][SQL][FOLLOWUP] Fix SubstringIndex codegen
c3985ac256e4 is described below

commit c3985ac256e43d8a53850c3586aca7f5cef448e3
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 6 23:39:11 2024 +0800

[SPARK-48989][SQL][FOLLOWUP] Fix SubstringIndex codegen

### What changes were proposed in this pull request?
Fix genCode for `SubstringIndex`.

### Why are the changes needed?
GenCode was not fully correct in the original PR.

### Does this PR introduce _any_ user-facing change?
Yes, genCode now works properly for all collations.

### How was this patch tested?
Additional tests for other collations.

### Was this patch authored or co-authored using generative AI tooling?
No.
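
For illustration, an editor's sketch (not from the patch) of a query that goes
through the ICU code path whose generated-code argument order is fixed here:

```scala
spark.sql("SELECT substring_index('www.apache.org' COLLATE UNICODE_CI, '.', 2)").show()
// Expected result: www.apache
```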

Closes #47610 from uros-db/followup-48989.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/util/CollationSupport.java | 5 ++---
 .../test/scala/org/apache/spark/sql/StringFunctionsSuite.scala   | 9 +++--
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
index b6ff7abe5c1a..f160661af389 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
@@ -477,7 +477,7 @@ public final class CollationSupport {
   } else if (collation.supportsLowercaseEquality) {
 return String.format(expr + "Lowercase(%s, %s, %s)", string, 
delimiter, count);
   } else {
-return String.format(expr + "ICU(%s, %s, %d, %s)", string, delimiter, 
count, collationId);
+return String.format(expr + "ICU(%s, %s, %s, %d)", string, delimiter, 
count, collationId);
   }
 }
 public static UTF8String execBinary(final UTF8String string, final 
UTF8String delimiter,
@@ -490,8 +490,7 @@ public final class CollationSupport {
 }
 public static UTF8String execICU(final UTF8String string, final UTF8String 
delimiter,
 final int count, final int collationId) {
-  return CollationAwareUTF8String.subStringIndex(string, delimiter, count,
-collationId);
+  return CollationAwareUTF8String.subStringIndex(string, delimiter, count, 
collationId);
 }
   }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
index 37becda0286e..1cfde5f0f81d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
@@ -425,12 +425,17 @@ class StringFunctionsSuite extends QueryTest with 
SharedSparkSession {
   Row("www.apache")
 )
 
+// TODO SPARK-48779 Move E2E SQL tests with column input to collations.sql 
golden file.
 val testTable = "test_substring_index"
 withTable(testTable) {
   sql(s"CREATE TABLE $testTable (num int) USING parquet")
   sql(s"INSERT INTO $testTable VALUES (1), (2), (3), (NULL)")
-  val query = s"SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) as sub_str 
FROM $testTable"
-  checkAnswer(sql(query), Seq(Row(1, "a"), Row(2, "a_a"), Row(3, "a_a_a"), 
Row(null, null)))
+  Seq("UTF8_BINARY", "UTF8_LCASE", "UNICODE", 
"UNICODE_CI").foreach(collation =>
+withSQLConf(SQLConf.DEFAULT_COLLATION.key -> collation) {
+  val query = s"SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) as 
sub_str FROM $testTable"
+  checkAnswer(sql(query), Seq(Row(1, "a"), Row(2, "a_a"), Row(3, 
"a_a_a"), Row(null, null)))
+}
+  )
 }
   }
 





[jira] [Created] (SPARK-49125) Allow writing CSV with duplicated column names

2024-08-06 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49125:
---

 Summary: Allow writing CSV with duplicated column names
 Key: SPARK-49125
 URL: https://issues.apache.org/jira/browse/SPARK-49125
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-48911) Improve collation support testing for various expressions

2024-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48911.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47372
[https://github.com/apache/spark/pull/47372]

> Improve collation support testing for various expressions
> -
>
> Key: SPARK-48911
> URL: https://issues.apache.org/jira/browse/SPARK-48911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-48911][SQL][TESTS] Improve collation support testing for various expressions

2024-08-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bb584de1442e [SPARK-48911][SQL][TESTS] Improve collation support 
testing for various expressions
bb584de1442e is described below

commit bb584de1442e34f844a3577b8356dccdadaa0c74
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 6 22:15:18 2024 +0800

[SPARK-48911][SQL][TESTS] Improve collation support testing for various 
expressions

### What changes were proposed in this pull request?
Add more end-to-end SQL tests for various Spark expressions that take collated
strings.

### Why are the changes needed?
Expand collation testing coverage for various expressions.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New collation-related e2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47372 from uros-db/add-tests.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 821 +
 1 file changed, 821 insertions(+)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 2473a9228194..bf478d3e0d21 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -2319,6 +2319,827 @@ class CollationSQLExpressionsSuite
 )
   }
 
+  test("min_by supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT min_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 
20) AS tab(x, y);"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswer(
+sql(query),
+Seq(
+  Row("a")
+)
+  )
+  // check result row data type
+  val dataType = StringType(collation)
+  assert(sql(query).schema.head.dataType == dataType)
+}
+  }
+
+  test("max_by supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT max_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 
20) AS tab(x, y);"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswer(
+sql(query),
+Seq(
+  Row("b")
+)
+  )
+  // check result row data type
+  val dataType = StringType(collation)
+  assert(sql(query).schema.head.dataType == dataType)
+}
+  }
+
+  test("array supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT array('a', 'b', 'c');"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswer(
+sql(query),
+Seq(
+  Row(Seq("a", "b", "c"))
+)
+  )
+  // check result row data type
+  val dataType = ArrayType(StringType(collation), false)
+  assert(sql(query).schema.head.dataType == dataType)
+}
+  }
+
+  test("array_agg supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT array_agg(col) FROM VALUES ('a'), ('b'), ('c') AS 
tab(col);"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswer(
+sql(query),
+Seq(
+  Row(Seq("a", "b", "c"))
+)
+  )
+  // check result row data type
+  val dataType = ArrayType(StringType(collation), false)
+  assert(sql(query).schema.head.dataType == dataType)
+}
+  }
+
+  test("array_contains supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT array_contains(array('a', 'b', 'c'), 'b');"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswer(
+sql(query),
+Seq(
+  Row(true)
+)
+  )
+  // check result row data type
+  val dataType = BooleanType
+  assert(sql(query).schema.head.dataType == dataType)
+}
+  }
+
+  test("arrays_overlap supports collation") {
+val collation = "UNICODE"
+val query = s"SELECT arrays_overlap(array('a', 'b', 'c'), array('c', 'd', 
'e'));"
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collation) {
+  checkAnswe

(spark) branch branch-3.5 updated: [SPARK-49099][SQL] CatalogManager.setCurrentNamespace should respect custom session catalog

2024-08-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f2e260192542 [SPARK-49099][SQL] CatalogManager.setCurrentNamespace 
should respect custom session catalog
f2e260192542 is described below

commit f2e2601925425a65f9a20a0aa03966d8c81f6466
Author: Rui Wang 
AuthorDate: Tue Aug 6 16:50:26 2024 +0800

[SPARK-49099][SQL] CatalogManager.setCurrentNamespace should respect custom 
session catalog

Refactor CatalogManager.setCurrentNamespace so that it unifies the handling of
`SupportsNamespaces` and SessionCatalog, and can thus respect a custom session
catalog.

This is a bug.

CatalogManager.setCurrentNamespace should respect a custom session catalog.
Without this PR, a custom session catalog won't work, as Spark checks for the
`spark_catalog` name to decide whether the catalog is the session catalog and
then directly calls V1SessionCatalog.

No

UT

No
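
For illustration, an editor's sketch of how a custom session catalog is plugged
in (the class name is hypothetical); with this fix, switching the current
namespace is resolved through it rather than always through V1SessionCatalog:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.spark_catalog", "com.example.MyCustomSessionCatalog")
  .getOrCreate()
spark.sql("USE my_namespace")  // CatalogManager.setCurrentNamespace consults the custom catalog
```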

Closes #47592 from amaliujia/refactor_setCurrentNamespace.

Lead-authored-by: Rui Wang 
Co-authored-by: Wenchen Fan 
Co-authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 18 +--
 .../sql/catalyst/catalog/SessionCatalog.scala  |  6 -
 .../sql/connector/catalog/CatalogManager.scala | 27 --
 .../catalyst/analysis/LookupFunctionsSuite.scala   |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  9 +++-
 .../org/apache/spark/sql/test/SQLTestUtils.scala   |  5 ++--
 .../org/apache/spark/sql/hive/test/TestHive.scala  |  1 +
 7 files changed, 54 insertions(+), 14 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 4388740a409d..bd917fd73e20 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -23,6 +23,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 
 import scala.collection.mutable
 import scala.collection.mutable.ArrayBuffer
+import scala.jdk.CollectionConverters._
 import scala.util.{Failure, Random, Success, Try}
 
 import org.apache.spark.sql.AnalysisException
@@ -78,7 +79,7 @@ object SimpleAnalyzer extends Analyzer(
   override def resolver: Resolver = caseSensitiveResolution
 }
 
-object FakeV2SessionCatalog extends TableCatalog with FunctionCatalog {
+object FakeV2SessionCatalog extends TableCatalog with FunctionCatalog with 
SupportsNamespaces {
   private def fail() = throw new UnsupportedOperationException
   override def listTables(namespace: Array[String]): Array[Identifier] = fail()
   override def loadTable(ident: Identifier): Table = {
@@ -92,10 +93,23 @@ object FakeV2SessionCatalog extends TableCatalog with 
FunctionCatalog {
   override def alterTable(ident: Identifier, changes: TableChange*): Table = 
fail()
   override def dropTable(ident: Identifier): Boolean = fail()
   override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = 
fail()
-  override def initialize(name: String, options: CaseInsensitiveStringMap): 
Unit = fail()
+  override def initialize(name: String, options: CaseInsensitiveStringMap): 
Unit = {}
   override def name(): String = CatalogManager.SESSION_CATALOG_NAME
   override def listFunctions(namespace: Array[String]): Array[Identifier] = 
fail()
   override def loadFunction(ident: Identifier): UnboundFunction = fail()
+  override def listNamespaces(): Array[Array[String]] = fail()
+  override def listNamespaces(namespace: Array[String]): Array[Array[String]] 
= fail()
+  override def loadNamespaceMetadata(namespace: Array[String]): 
util.Map[String, String] = {
+if (namespace.length == 1) {
+  mutable.HashMap[String, String]().asJava
+} else {
+  throw new NoSuchNamespaceException(namespace)
+}
+  }
+  override def createNamespace(
+namespace: Array[String], metadata: util.Map[String, String]): Unit = 
fail()
+  override def alterNamespace(namespace: Array[String], changes: 
NamespaceChange*): Unit = fail()
+  override def dropNamespace(namespace: Array[String], cascade: Boolean): 
Boolean = fail()
 }
 
 /**
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index 392c911ddb8e..cba928dfa924 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@@ -330,12 +330,16 @@ class SessionCatalog(
   def getCurrentDatabase: String = synchronized { currentDb }
 
 

(spark) branch master updated: [SPARK-48978][SQL] Implement ASCII fast path in collation support for UTF8_LCASE

2024-08-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new babe7a05a6e4 [SPARK-48978][SQL] Implement ASCII fast path in collation 
support for UTF8_LCASE
babe7a05a6e4 is described below

commit babe7a05a6e459233511d1fd4b0dd73d012d83e6
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Aug 6 17:07:39 2024 +0800

[SPARK-48978][SQL] Implement ASCII fast path in collation support for 
UTF8_LCASE

### What changes were proposed in this pull request?
Optimize collation support implementation for ASCII inputs across string 
functions.

### Why are the changes needed?
Improve collation support performance for ASCII strings.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
Yes.
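
For illustration, an editor's sketch of the fast-path idea only (not the actual
implementation, which lives in CollationAwareUTF8String and uses ICU on the
slow path); `lowercaseContainsSlow` is a stand-in placeholder:

```scala
def isAllAscii(s: String): Boolean = s.forall(_ < 0x80)

// Placeholder for the full code-point aware matching Spark uses for non-ASCII input.
def lowercaseContainsSlow(target: String, pattern: String): Boolean =
  target.toLowerCase(java.util.Locale.ROOT).contains(pattern.toLowerCase(java.util.Locale.ROOT))

def lowercaseContains(target: String, pattern: String): Boolean =
  if (isAllAscii(target) && isAllAscii(pattern)) {
    target.toLowerCase.contains(pattern.toLowerCase)  // fast ASCII path, no ICU involved
  } else {
    lowercaseContainsSlow(target, pattern)            // non-ASCII: defer to the full logic
  }
```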

Closes #47326 from uros-db/ascii-fast-paths.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 66 +-
 .../spark/sql/catalyst/util/CollationSupport.java  |  8 +--
 .../org/apache/spark/unsafe/types/UTF8String.java  | 25 +++-
 3 files changed, 90 insertions(+), 9 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 5b005f152c51..501d173fc485 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -59,7 +59,7 @@ public class CollationAwareUTF8String {
* @param startPos the start position for searching (in the target string)
* @return whether the target string starts with the specified prefix in 
UTF8_LCASE
*/
-  public static boolean lowercaseMatchFrom(
+  private static boolean lowercaseMatchFrom(
   final UTF8String target,
   final UTF8String lowercasePattern,
   int startPos) {
@@ -161,7 +161,7 @@ public class CollationAwareUTF8String {
* @param endPos the end position for searching (in the target string)
* @return whether the target string ends with the specified suffix in 
lowercase
*/
-  public static boolean lowercaseMatchUntil(
+  private static boolean lowercaseMatchUntil(
   final UTF8String target,
   final UTF8String lowercasePattern,
   int endPos) {
@@ -619,6 +619,58 @@ public class CollationAwareUTF8String {
 return 0;
   }
 
+  /**
+   * Checks whether the target string contains the pattern string, with 
respect to the UTF8_LCASE
+   * collation. This method generally works with respect to code-point based 
comparison logic.
+   *
+   * @param target the string to be searched in
+   * @param pattern the string to be searched for
+   * @return whether the target string contains the pattern string
+   */
+  public static boolean lowercaseContains(final UTF8String target, final 
UTF8String pattern) {
+// Fast path for ASCII-only strings.
+if (target.isFullAscii() && pattern.isFullAscii()) {
+  return target.toLowerCase().contains(pattern.toLowerCase());
+}
+// Slow path for non-ASCII strings.
+return CollationAwareUTF8String.lowercaseIndexOfSlow(target, pattern, 0) 
>= 0;
+  }
+
+  /**
+   * Checks whether the target string starts with the pattern string, with 
respect to the UTF8_LCASE
+   * collation. This method generally works with respect to code-point based 
comparison logic.
+   *
+   * @param target the string to be searched in
+   * @param pattern the string to be searched for
+   * @return whether the target string starts with the pattern string
+   */
+  public static boolean lowercaseStartsWith(final UTF8String target, final 
UTF8String pattern) {
+// Fast path for ASCII-only strings.
+if (target.isFullAscii() && pattern.isFullAscii()) {
+  return target.toLowerCase().startsWith(pattern.toLowerCase());
+}
+// Slow path for non-ASCII strings.
+return CollationAwareUTF8String.lowercaseMatchFrom(target, 
lowerCaseCodePointsSlow(pattern), 0);
+  }
+
+  /**
+   * Checks whether the target string ends with the pattern string, with 
respect to the UTF8_LCASE
+   * collation. This method generally works with respect to code-point based 
comparison logic.
+   *
+   * @param target the string to be searched in
+   * @param pattern the string to be searched for
+   * @return whether the target string ends with the pattern string
+   */
+  public static boolean lowercaseEndsWith(final UTF8String target, final 
UTF8String p

[jira] [Resolved] (SPARK-48978) Optimize collation support for ASCII strings (all collations)

2024-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48978.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47326
[https://github.com/apache/spark/pull/47326]

> Optimize collation support for ASCII strings (all collations)
> -
>
> Key: SPARK-48978
> URL: https://issues.apache.org/jira/browse/SPARK-48978
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-49109) Rename leftover BinaryLcase to Lcase

2024-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49109:
---

Assignee: Mihailo Milosevic

> Rename leftover BinaryLcase to Lcase
> 
>
> Key: SPARK-49109
> URL: https://issues.apache.org/jira/browse/SPARK-49109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>







[jira] [Resolved] (SPARK-49109) Rename leftover BinaryLcase to Lcase

2024-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49109.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47602
[https://github.com/apache/spark/pull/47602]

> Rename leftover BinaryLcase to Lcase
> 
>
> Key: SPARK-49109
> URL: https://issues.apache.org/jira/browse/SPARK-49109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-49109][SQL] Rename leftover BinaryLcase to Lcase

2024-08-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b477753450f3 [SPARK-49109][SQL] Rename leftover BinaryLcase to Lcase
b477753450f3 is described below

commit b477753450f360391749847efd7dbe15ca61a038
Author: Mihailo Milosevic 
AuthorDate: Tue Aug 6 15:10:31 2024 +0800

[SPARK-49109][SQL] Rename leftover BinaryLcase to Lcase

### What changes were proposed in this pull request?
Renaming of all leftover binaryLcase to Lcase.

### Why are the changes needed?
The code should follow proper naming.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No need for additional testing; it is a variable naming change.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47602 from mihailom-db/SPARK-49109.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/unsafe/types/CollationFactorySuite.scala |  6 ++--
 .../sql/internal/types/AbstractStringType.scala|  2 +-
 .../org/apache/spark/sql/types/StringType.scala|  2 +-
 .../spark/sql/CollationExpressionWalkerSuite.scala | 36 +++---
 4 files changed, 23 insertions(+), 23 deletions(-)

diff --git 
a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
 
b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
index 23dae47f6ff2..321d1ccd700f 100644
--- 
a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
+++ 
b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
@@ -41,9 +41,9 @@ class CollationFactorySuite extends AnyFunSuite with Matchers 
{ // scalastyle:ig
 assert(utf8Binary.supportsBinaryEquality)
 
 assert(UTF8_LCASE_COLLATION_ID == 1)
-val utf8BinaryLcase = fetchCollation(UTF8_LCASE_COLLATION_ID)
-assert(utf8BinaryLcase.collationName == "UTF8_LCASE")
-assert(!utf8BinaryLcase.supportsBinaryEquality)
+val utf8Lcase = fetchCollation(UTF8_LCASE_COLLATION_ID)
+assert(utf8Lcase.collationName == "UTF8_LCASE")
+assert(!utf8Lcase.supportsBinaryEquality)
 
 assert(UNICODE_COLLATION_ID == (1 << 29))
 val unicode = fetchCollation(UNICODE_COLLATION_ID)
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
index 0828c2d6fc10..05d1701eff74 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/internal/types/AbstractStringType.scala
@@ -42,7 +42,7 @@ case object StringTypeBinary extends AbstractStringType {
 case object StringTypeBinaryLcase extends AbstractStringType {
   override private[sql] def acceptsType(other: DataType): Boolean =
 other.isInstanceOf[StringType] && 
(other.asInstanceOf[StringType].supportsBinaryEquality ||
-  other.asInstanceOf[StringType].isUTF8BinaryLcaseCollation)
+  other.asInstanceOf[StringType].isUTF8LcaseCollation)
 }
 
 /**
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
index 6ec55db008c7..424f35135fb4 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
@@ -42,7 +42,7 @@ class StringType private(val collationId: Int) extends 
AtomicType with Serializa
   def isUTF8BinaryCollation: Boolean =
 collationId == CollationFactory.UTF8_BINARY_COLLATION_ID
 
-  def isUTF8BinaryLcaseCollation: Boolean =
+  def isUTF8LcaseCollation: Boolean =
 collationId == CollationFactory.UTF8_LCASE_COLLATION_ID
 
   /**
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
index b88c67e56c9b..4bc6ab0f6e0f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationExpressionWalkerSuite.scala
@@ -39,7 +39,7 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
 
   case object Utf8Binary extends CollationType
 
-  case object Utf8BinaryLcase extends CollationType
+  case object Utf8Lcase extends CollationType
 
   /**
* Helper function to generate all necessary parameters
@@ -111,7 +111,7 @@ class CollationExpressionWalkerSuite extends SparkFunSuite 
with SharedSparkSessi
   case BinaryType => collationType match {
 case Utf8Binary =>
 

[jira] [Resolved] (SPARK-49018) Fix approx_count_distinct not working correctly with collation

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49018.
-
Resolution: Fixed

Issue resolved by pull request 47503
[https://github.com/apache/spark/pull/47503]

> Fix approx_count_distinct not working correctly with collation
> --
>
> Key: SPARK-49018
> URL: https://issues.apache.org/jira/browse/SPARK-49018
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Viktor Lučić
>Assignee: Viktor Lučić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When running in spark-shell:
> {code:java}
> create table t(col string collate utf8_lcase)
> insert into t values 'a', 'a', 'A'
> select approx_count_distinct(col) from t {code}
> we get 2 as an answer, but it should be 1.






(spark) branch master updated: [SPARK-49018][SQL] Fix approx_count_distinct not working correctly with collation

2024-08-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e8227a8e2442 [SPARK-49018][SQL] Fix approx_count_distinct not working 
correctly with collation
e8227a8e2442 is described below

commit e8227a8e24422f9bcabd1b18c87cc4f3b78a72b3
Author: viktorluc-db 
AuthorDate: Mon Aug 5 20:55:49 2024 +0800

[SPARK-49018][SQL] Fix approx_count_distinct not working correctly with 
collation

### What changes were proposed in this pull request?
Fix for approx_count_distinct not working correctly with collated strings.

### Why are the changes needed?
approx_count_distinct was not working with any collation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New test added to CollationSQLExpressionSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.
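
For illustration, an editor's sketch of the behaviour the fix targets, following
the repro from the JIRA ticket (table name is arbitrary):

```scala
spark.sql("CREATE TABLE t (col STRING COLLATE UTF8_LCASE) USING parquet")
spark.sql("INSERT INTO t VALUES ('a'), ('a'), ('A')")
// With collation keys feeding the HLL++ hash, 'a' and 'A' count as one distinct value.
spark.sql("SELECT approx_count_distinct(col) FROM t").show()  // expected: 1 (previously 2)
```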

Closes #47503 from viktorluc-db/bugfix.

Authored-by: viktorluc-db 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/HyperLogLogPlusPlusHelper.scala  |  3 ++
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 39 ++
 2 files changed, 42 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
index 6471a746f2ed..fc947386487a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.XxHash64Function
 import 
org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers.{DOUBLE_NORMALIZER,
 FLOAT_NORMALIZER}
 import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
 
 // A helper class for HyperLogLogPlusPlus.
 class HyperLogLogPlusPlusHelper(relativeSD: Double) extends Serializable {
@@ -93,6 +94,8 @@ class HyperLogLogPlusPlusHelper(relativeSD: Double) extends 
Serializable {
 val value = dataType match {
   case FloatType => FLOAT_NORMALIZER.apply(_value)
   case DoubleType => DOUBLE_NORMALIZER.apply(_value)
+  case st: StringType if !st.supportsBinaryEquality =>
+CollationFactory.getCollationKeyBytes(_value.asInstanceOf[UTF8String], 
st.collationId)
   case _ => _value
 }
 // Create the hashed value 'x'.
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index e31411ea212f..2473a9228194 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -2319,6 +2319,45 @@ class CollationSQLExpressionsSuite
 )
   }
 
+  test("Support HyperLogLogPlusPlus expression with collation") {
+case class HyperLogLogPlusPlusTestCase(
+  collation: String,
+  input: Seq[String],
+  output: Seq[Row]
+)
+
+val testCases = Seq(
+  HyperLogLogPlusPlusTestCase("utf8_binary", Seq("a", "a", "A", "z", "zz", 
"ZZ", "w", "AA",
+"aA", "Aa", "aa"), Seq(Row(10))),
+  HyperLogLogPlusPlusTestCase("utf8_lcase", Seq("a", "a", "A", "z", "zz", 
"ZZ", "w", "AA",
+"aA", "Aa", "aa"), Seq(Row(5))),
+  HyperLogLogPlusPlusTestCase("UNICODE", Seq("a", "a", "A", "z", "zz", 
"ZZ", "w", "AA",
+"aA", "Aa", "aa"), Seq(Row(10))),
+  HyperLogLogPlusPlusTestCase("UNICODE_CI", Seq("a", "a", "A", "z", "zz", 
"ZZ", "w", "AA",
+"aA", "Aa", "aa"), Seq(Row(5)))
+)
+
+testCases.foreach( t => {
+  // Using explicit collate clause
+  val query =
+s"""
+   |SELECT approx_count_distinct(col) FROM VALUES
+   |${t.input.map(s => s"('${s}' collate ${t.collation})").mkString(", 
") } tab(col)
+   |""".stripMargin
+  checkAnswer(sql(query), t.output)
+
+  // Using default collation
+  withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+val query =
+  s"""
+ |SELECT approx_count_distinct(col) FROM VALUES
+ |${t.input.map(s => s"('${s}')").mkString(", ") } tab(col)
+ |""".stripMargin
+checkAnswer(sql(query), t.output)
+  }
+})
+  }
+
   // TODO: Add more tests for other SQL expressions
 
 }





[jira] [Assigned] (SPARK-49018) Fix approx_count_distinct not working correctly with collation

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49018:
---

Assignee: Viktor Lučić

> Fix approx_count_distinct not working correctly with collation
> --
>
> Key: SPARK-49018
> URL: https://issues.apache.org/jira/browse/SPARK-49018
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Viktor Lučić
>Assignee: Viktor Lučić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When running in spark-shell:
> {code:java}
> create table t(col string collate utf8_lcase)
> insert into t values 'a', 'a', 'A'
> select approx_count_distinct(col) from t {code}
> we get 2 as an answer, but it should be 1.






(spark) branch master updated: [SPARK-48338][SQL] Improve exceptions thrown from parser/interpreter

2024-08-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e5b6b5ff6f54 [SPARK-48338][SQL] Improve exceptions thrown from 
parser/interpreter
e5b6b5ff6f54 is described below

commit e5b6b5ff6f54b1a0ec59c983ac313dafcc50462b
Author: Dušan Tišma 
AuthorDate: Mon Aug 5 20:18:11 2024 +0800

[SPARK-48338][SQL] Improve exceptions thrown from parser/interpreter

### What changes were proposed in this pull request?
Introduced a new class `SqlScriptingException`, which is thrown during SQL
script parsing/interpreting and carries the line number on which the error
occurred.

### Why are the changes needed?
Users should know which line of their script caused an error.

### Does this PR introduce _any_ user-facing change?
Users will now see a line number in some of their error messages. The
format of the new error messages is:
`{LINE:N} errorMessage`
where N is the line number of the error and errorMessage is the error
message as it existed before this change.

### How was this patch tested?
No new tests required, existing tests are updated.

### Was this patch authored or co-authored using generative AI tooling?
No
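
For illustration, an editor's sketch of the message format only (simplified
types; the real class lives in org.apache.spark.sql.exceptions and is built
from an error class and Spark's Origin):

```scala
// Simplified stand-ins, for illustration only.
case class ScriptOrigin(line: Option[Int] = None)

class SqlScriptingExceptionSketch(origin: ScriptOrigin, message: String)
  extends Exception(
    // Prepend "{LINE:N} " only when the parser recorded a line for the failing statement.
    origin.line.map(l => s"{LINE:$l} $message").getOrElse(message))

// new SqlScriptingExceptionSketch(ScriptOrigin(Some(3)), "LABELS_MISMATCH: ...").getMessage
// => "{LINE:3} LABELS_MISMATCH: ..."
```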

Closes #47553 from dusantism-db/dusantism-db/sql-script-exception.

Authored-by: Dušan Tišma 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/parser/AstBuilder.scala | 13 --
 .../spark/sql/errors/SqlScriptingErrors.scala  | 32 +-
 .../sql/exceptions/SqlScriptingException.scala | 49 ++
 .../catalyst/parser/SqlScriptingParserSuite.scala  |  9 ++--
 4 files changed, 85 insertions(+), 18 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index 64d232b55b6b..21c2179661f7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -159,10 +159,12 @@ class AstBuilder extends DataTypeAstBuilder
   case Some(c: CreateVariable) =>
 if (allowVarDeclare) {
   throw SqlScriptingErrors.variableDeclarationOnlyAtBeginning(
+c.origin,
 toSQLId(c.name.asInstanceOf[UnresolvedIdentifier].nameParts),
 c.origin.line.get.toString)
 } else {
   throw SqlScriptingErrors.variableDeclarationNotAllowedInScope(
+c.origin,
 toSQLId(c.name.asInstanceOf[UnresolvedIdentifier].nameParts),
 c.origin.line.get.toString)
 }
@@ -181,10 +183,15 @@ class AstBuilder extends DataTypeAstBuilder
 if bl.multipartIdentifier().getText.nonEmpty &&
   bl.multipartIdentifier().getText.toLowerCase(Locale.ROOT) !=
   el.multipartIdentifier().getText.toLowerCase(Locale.ROOT) =>
-throw SqlScriptingErrors.labelsMismatch(
-  bl.multipartIdentifier().getText, el.multipartIdentifier().getText)
+withOrigin(bl) {
+  throw SqlScriptingErrors.labelsMismatch(
+CurrentOrigin.get, bl.multipartIdentifier().getText, 
el.multipartIdentifier().getText)
+}
   case (None, Some(el: EndLabelContext)) =>
-throw 
SqlScriptingErrors.endLabelWithoutBeginLabel(el.multipartIdentifier().getText)
+withOrigin(el) {
+  throw SqlScriptingErrors.endLabelWithoutBeginLabel(
+CurrentOrigin.get, el.multipartIdentifier().getText)
+}
   case _ =>
 }
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/SqlScriptingErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/SqlScriptingErrors.scala
index 8959911dbd8f..dcb6f3077dcc 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/SqlScriptingErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/SqlScriptingErrors.scala
@@ -17,40 +17,50 @@
 
 package org.apache.spark.sql.errors
 
-import org.apache.spark.SparkException
+import org.apache.spark.sql.catalyst.trees.Origin
+import org.apache.spark.sql.exceptions.SqlScriptingException
 
 /**
  * Object for grouping error messages thrown during parsing/interpreting phase
  * of the SQL Scripting Language interpreter.
  */
-private[sql] object SqlScriptingErrors extends QueryErrorsBase {
+private[sql] object SqlScriptingErrors {
 
-  def labelsMismatch(beginLabel: String, endLabel: String): Throwable = {
-new SparkException(
+  def labelsMismatch(origin: Origin, beginLabel: String, endLabel: String): 
Throwable = {
+new SqlScriptingException(
+  origin = origin,
   errorClass = "LABELS_MISMATCH",
   cause = null,

[jira] [Assigned] (SPARK-49063) Fix Between with ScalarSubqueries

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49063:
---

Assignee: Mihailo Milosevic

> Fix Between with ScalarSubqueries
> -
>
> Key: SPARK-49063
> URL: https://issues.apache.org/jira/browse/SPARK-49063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> https://issues.apache.org/jira/browse/SPARK-46366 introduced a regression 
> where queries like:
> {code:java}
> SELECT *
> FROM t1
> WHERE (select max(t2.t2c)
>from t2 where t1.t1b = t2.t2b
>   ) between 1 and 20; {code}
> would fail in an optimizer rule.






[jira] [Resolved] (SPARK-49063) Fix Between with ScalarSubqueries

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49063.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47581
[https://github.com/apache/spark/pull/47581]

> Fix Between with ScalarSubqueries
> -
>
> Key: SPARK-49063
> URL: https://issues.apache.org/jira/browse/SPARK-49063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://issues.apache.org/jira/browse/SPARK-46366 introduced a regression 
> where queries like:
> {code:java}
> SELECT *
> FROM t1
> WHERE (select max(t2.t2c)
>from t2 where t1.t1b = t2.t2b
>   ) between 1 and 20; {code}
> would fail in an optimizer rule.






(spark) branch master updated: [SPARK-49063][SQL] Fix Between with ScalarSubqueries

2024-08-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6bf6088e7719 [SPARK-49063][SQL] Fix Between with ScalarSubqueries
6bf6088e7719 is described below

commit 6bf6088e7719eccf050910d6ae495b369afb25be
Author: Mihailo Milosevic 
AuthorDate: Mon Aug 5 16:15:04 2024 +0800

[SPARK-49063][SQL] Fix Between with ScalarSubqueries

### What changes were proposed in this pull request?
Fix for between with ScalarSubqueries.

### Why are the changes needed?
There is a regression introduced by a previous PR, 
https://github.com/apache/spark/pull/44299. This needs to be addressed, as the 
between operator was completely broken with resolved ScalarSubqueries.
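A quick repro sketch, assuming a Spark session named `spark` and tables
`t1`/`t2` with the columns from the linked JIRA; the legacy flag name comes
from the `SQLConf` entry added in this patch:

```
// Hedged repro sketch, not part of the patch itself.
spark.sql(
  """SELECT *
    |FROM t1
    |WHERE (SELECT max(t2.t2c) FROM t2 WHERE t1.t1b = t2.t2b)
    |  BETWEEN 1 AND 20""".stripMargin)

// Opt back into the pre-fix BETWEEN rewrite if ever needed.
spark.conf.set("spark.sql.legacy.duplicateBetweenInput", "true")
```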

### Does this PR introduce _any_ user-facing change?
No, the bug has not been released yet.

### How was this patch tested?
Tests added to golden file.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47581 from mihailom-db/fixbetween.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/optimizer/subquery.scala|  5 ++-
 .../spark/sql/catalyst/parser/AstBuilder.scala | 10 --
 .../org/apache/spark/sql/internal/SQLConf.scala| 12 +++
 .../scalar-subquery-predicate.sql.out  | 38 ++
 .../scalar-subquery/scalar-subquery-predicate.sql  | 11 +++
 .../scalar-subquery-predicate.sql.out  | 24 ++
 6 files changed, 97 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
index c97e0aedf8c6..1239a5dde130 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
@@ -801,7 +801,10 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] with AliasHelpe
 if (Utils.isTesting) {
   assert(mayHaveCountBug.isDefined)
 }
-if (resultWithZeroTups.isEmpty) {
+if (!SQLConf.get.legacyDuplicateBetweenInput && 
currentChild.output.contains(origOutput)) {
+  // If we had multiple of the same scalar subqueries they will 
resolve to the same aliases.
+  currentChild
+} else if (resultWithZeroTups.isEmpty) {
   // CASE 1: Subquery guaranteed not to have the COUNT bug because it 
evaluates to NULL
   // with zero tuples.
   planWithoutCountBug
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index baeda8dd80be..64d232b55b6b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -2187,8 +2187,14 @@ class AstBuilder extends DataTypeAstBuilder
 // Create the predicate.
 ctx.kind.getType match {
   case SqlBaseParser.BETWEEN =>
-invertIfNotDefined(UnresolvedFunction(
-  "between", Seq(e, expression(ctx.lower), expression(ctx.upper)), 
isDistinct = false))
+if (!SQLConf.get.legacyDuplicateBetweenInput) {
+  invertIfNotDefined(UnresolvedFunction(
+"between", Seq(e, expression(ctx.lower), expression(ctx.upper)), 
isDistinct = false))
+} else {
+  invertIfNotDefined(And(
+GreaterThanOrEqual(e, expression(ctx.lower)),
+LessThanOrEqual(e, expression(ctx.upper
+}
   case SqlBaseParser.IN if ctx.query != null =>
 invertIfNotDefined(InSubquery(getValueExpressions(e), 
ListQuery(plan(ctx.query
   case SqlBaseParser.IN =>
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 6041833e248f..4202f2453c92 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -4614,6 +4614,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val LEGACY_DUPLICATE_BETWEEN_INPUT =
+buildConf("spark.sql.legacy.duplicateBetweenInput")
+  .internal()
+  .doc("When true, we use legacy between implementation. This is a flag 
that fixes a " +
+"problem introduced by a between optimization, see ticket 
SPARK-49063.")
+  .version("4.0.0")
+  .booleanConf
+  .crea

[jira] [Assigned] (SPARK-48346) [M0] Support for IF ELSE statement

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48346:
---

Assignee: David Milicevic

> [M0] Support for IF ELSE statement
> --
>
> Key: SPARK-48346
> URL: https://issues.apache.org/jira/browse/SPARK-48346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for IF ELSE statements to SQL scripting parser & interpreter:
>  * IF
>  * IF / ELSE
>  * IF / ELSE IF / ELSE
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Resolved] (SPARK-48346) [M0] Support for IF ELSE statement

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48346.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47442
[https://github.com/apache/spark/pull/47442]

> [M0] Support for IF ELSE statement
> --
>
> Key: SPARK-48346
> URL: https://issues.apache.org/jira/browse/SPARK-48346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for IF ELSE statements to SQL scripting parser & interpreter:
>  * IF
>  * IF / ELSE
>  * IF / ELSE IF / ELSE
>  
> For more details, design doc can be found in parent Jira item.






(spark) branch master updated: [SPARK-48346][SQL] Support for IF ELSE statements in SQL scripts

2024-08-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f99291ac864d [SPARK-48346][SQL] Support for IF ELSE statements in SQL 
scripts
f99291ac864d is described below

commit f99291ac864d8902aa131fa93d46725a47d88263
Author: David Milicevic 
AuthorDate: Mon Aug 5 15:31:59 2024 +0800

[SPARK-48346][SQL] Support for IF ELSE statements in SQL scripts

### What changes were proposed in this pull request?
This PR proposes the introduction of the IF/ELSE statement to the SQL 
scripting language.
To evaluate conditions in IF or ELSE IF clauses, the introduction of a boolean 
statement evaluator is required as well.

Changes summary:
- Grammar/parser changes:
  - `ifElseStatement` grammar rule
  - `visitIfElseStatement` rule visitor
  - `IfElseStatement` logical operator
- `IfElseStatementExec` execution node:
  - Internal states - `Condition` and `Body`
  - Iterator implementation - iterate over conditions until the one that 
evaluates to `true` is found
  - Use `StatementBooleanEvaluator` implementation to evaluate conditions
- `DataFrameEvaluator`:
  - Implementation of `StatementBooleanEvaluator`
  - Evaluates a result to `true` if it is a single row, single column of 
boolean type with value `true`
- `SqlScriptingInterpreter` - add logic to transform `IfElseStatement` to 
`IfElseStatementExec`

### Why are the changes needed?
We are gradually introducing SQL Scripting to Spark, and IF/ELSE is one of 
the basic control flow constructs in the SQL language. For more details, check 
[JIRA item](https://issues.apache.org/jira/browse/SPARK-48346).
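As a surface-syntax sketch only (how such a script is executed is outside this
commit), a script using the new grammar could look like the following; the
semicolon placement follows the usual SQL/PSM style and is an assumption here:

```
// Hedged sketch: only the script text, held in a Scala value for reference.
val script =
  """BEGIN
    |  IF 1 = 1 THEN
    |    SELECT 'then branch';
    |  ELSE IF 2 = 2 THEN
    |    SELECT 'else if branch';
    |  ELSE
    |    SELECT 'else branch';
    |  END IF;
    |END""".stripMargin
```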

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New tests are introduced in all three scripting test suites: 
`SqlScriptingParserSuite`, `SqlScriptingExecutionNodeSuite` and 
`SqlScriptingInterpreterSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47442 from davidm-db/sql_scripting_if_else.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   7 +
 .../spark/sql/catalyst/parser/AstBuilder.scala |  15 +-
 .../parser/SqlScriptingLogicalOperators.scala  |  16 ++
 .../catalyst/parser/SqlScriptingParserSuite.scala  | 178 ++
 .../sql/scripting/SqlScriptingExecutionNode.scala  | 105 +++
 .../sql/scripting/SqlScriptingInterpreter.scala|  28 ++-
 .../scripting/SqlScriptingExecutionNodeSuite.scala | 201 ++---
 .../scripting/SqlScriptingInterpreterSuite.scala   | 150 ++-
 8 files changed, 663 insertions(+), 37 deletions(-)

diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index c7aa56cf920a..841c94d7868f 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -63,6 +63,7 @@ compoundStatement
 : statement
 | setStatementWithOptionalVarKeyword
 | beginEndCompoundBlock
+| ifElseStatement
 ;
 
 setStatementWithOptionalVarKeyword
@@ -71,6 +72,12 @@ setStatementWithOptionalVarKeyword
 LEFT_PAREN query RIGHT_PAREN
#setVariableWithOptionalKeyword
 ;
 
+ifElseStatement
+: IF booleanExpression THEN conditionalBodies+=compoundBody
+(ELSE IF booleanExpression THEN conditionalBodies+=compoundBody)*
+(ELSE elseBody=compoundBody)? END IF
+;
+
 singleStatement
 : (statement|setResetStatement) SEMICOLON* EOF
 ;
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index feb3ef4e7155..baeda8dd80be 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -205,10 +205,23 @@ class AstBuilder extends DataTypeAstBuilder
 .map { s =>
   SingleStatement(parsedPlan = visit(s).asInstanceOf[LogicalPlan])
 }.getOrElse {
-  
visit(ctx.beginEndCompoundBlock()).asInstanceOf[CompoundPlanStatement]
+  visitChildren(ctx).asInstanceOf[CompoundPlanStatement]
 }
 }
 
+  override def visitIfElseStatement(ctx: IfElseStatementContext): 
IfElseStatement = {
+IfElseStatement(
+  conditions = ctx.booleanExpression().asScala.toList.map(boolExpr => 
withOrigin(boolExpr) {
+SingleStatement(
+  Project(
+Seq(Alias(expr

[jira] [Assigned] (SPARK-49057) Do not block the AQE loop when submitting query stages

2024-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49057:
---

Assignee: Wenchen Fan

> Do not block the AQE loop when submitting query stages
> --
>
> Key: SPARK-49057
> URL: https://issues.apache.org/jira/browse/SPARK-49057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-49057) Do not block the AQE loop when submitting query stages

2024-08-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49057.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47533
[https://github.com/apache/spark/pull/47533]

> Do not block the AQE loop when submitting query stages
> --
>
> Key: SPARK-49057
> URL: https://issues.apache.org/jira/browse/SPARK-49057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated (94f88723eed3 -> f01eafd79f3b)

2024-08-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 94f88723eed3 [SPARK-49078][SQL] Support show columns syntax in v2 table
 add f01eafd79f3b [SPARK-49057][SQL] Do not block the AQE loop when 
submitting query stages

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/internal/StaticSQLConf.scala  |  10 ++
 .../sql/execution/adaptive/QueryStageExec.scala|   4 +-
 .../execution/exchange/BroadcastExchangeExec.scala |  40 ++--
 .../spark/sql/execution/exchange/Exchange.scala|  13 ---
 .../execution/exchange/ShuffleExchangeExec.scala   |  91 +-
 .../spark/sql/SparkSessionExtensionSuite.scala |   4 +-
 .../adaptive/AdaptiveQueryExecSuite.scala  | 105 +++--
 7 files changed, 149 insertions(+), 118 deletions(-)





[jira] [Updated] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-49000:

Fix Version/s: 3.5.2

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is a table with any columns.






(spark) branch branch-3.5 updated: [SPARK-49000][SQL][3.5] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 0008bd1df41a [SPARK-49000][SQL][3.5] Fix "select count(distinct 1) 
from t" where t is empty table by expanding RewriteDistinctAggregates
0008bd1df41a is described below

commit 0008bd1df41aabb155af6f38f4fc491b06d9f314
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Fri Aug 2 22:28:07 2024 +0800

[SPARK-49000][SQL][3.5] Fix "select count(distinct 1) from t" where t is 
empty table by expanding RewriteDistinctAggregates

### What changes were proposed in this pull request?
Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:

```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], 
output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], 
output=[count#6L])
  +- HashAggregate(keys=[], functions=[], output=[])
 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
+- HashAggregate(keys=[], functions=[], output=[])
   +- FileScan parquet spark_catalog.default.t[] Batched: 
false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```

The problem happens when the `HashAggregate(keys=[], functions=[], output=[])` 
node yields one row to the `partial_count` node, which then captures that row. 
This four-node structure is constructed by 
`AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we're adding `Expand` node which will force non-empty 
grouping expressions in `HashAggregateExec` nodes. This will in turn enable 
streaming zero rows to parent `partial_count` node, yielding correct final 
result.

### Why are the changes needed?
Aggregation with DISTINCT literal gives wrong results. For example, when 
running on empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.
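A condensed repro sketch of the two queries above, assuming a Spark session
named `spark`; the table definition is illustrative, not part of the patch:

```
// Hedged repro: any empty table `t` exhibits the bug before this fix.
spark.sql("CREATE TABLE t (c INT) USING parquet")
spark.sql("SELECT count(distinct 1) FROM t").show()  // 0 with this fix (was 1)
spark.sql("SELECT count(1) FROM t").show()           // 0, for reference
```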

### Does this PR introduce _any_ user-facing change?
Yes, this fixes a critical bug in Spark.

### How was this patch tested?
New e2e SQL tests for aggregates with DISTINCT literals.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47566 from uros-db/SPARK-49000-3.5.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  16 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 111 +
 2 files changed, 124 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..801bd2693af4 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,17 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  distinctAggs: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any distinct AggregateExpressions with filter, we need to 
rewrite the query.
+// Also, if there are no grouping expressions and all distinct aggregate 
expressions are
+// foldable, we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1). 
Without this case,
+// non-grouping aggregation queries with distinct aggregate expressions 
will be incorrectly
+// handled by the aggregation strategy, causing wrong results when working 
with empty tables.
+distinctAggs.exists(_.filter.isDefined) || (groupingExpressions.isEmpty &&
+  distinctAggs.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -204,8 +215,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // We need at least two distinct aggregates or the single distinct 
aggregate group exists filter
 // clause for this ru

(spark) branch master updated: [SPARK-49093][SQL] GROUP BY with MapType nested inside complex type

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aca0d24c8724 [SPARK-49093][SQL] GROUP BY with MapType nested inside 
complex type
aca0d24c8724 is described below

commit aca0d24c8724f1de4e65ad6c663b8e2255dbccd0
Author: Nebojsa Savic 
AuthorDate: Fri Aug 2 14:28:20 2024 +0800

[SPARK-49093][SQL] GROUP BY with MapType nested inside complex type

### What changes were proposed in this pull request?
Currently we support GROUP BY on columns of MapType, but we don't support 
scenarios where the column type contains a MapType nested inside some complex 
type (e.g. an array of maps); this PR addresses that issue.
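For illustration, a query of the kind this change enables, assuming a Spark
session named `spark`; the data is made up:

```
// Hedged example: grouping by a column of type array<map<string,int>>.
spark.sql(
  """SELECT c, count(*)
    |FROM VALUES (array(map('a', 1))), (array(map('a', 1))) AS t(c)
    |GROUP BY c""".stripMargin).show()
```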

### Why are the changes needed?
We are extending support for MapType columns in the GROUP BY clause.

### Does this PR introduce _any_ user-facing change?
Customers will be able to use a MapType nested in a complex type as part of 
the GROUP BY clause.

### How was this patch tested?
Added tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47331 from nebojsa-db/SC-170296.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../InsertMapSortInGroupingExpressions.scala   |  53 +++-
 .../sql-tests/analyzer-results/group-by.sql.out| 130 +++
 .../test/resources/sql-tests/inputs/group-by.sql   |  64 +
 .../resources/sql-tests/results/group-by.sql.out   | 143 +
 4 files changed, 386 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
index 3d883fa9d9ae..51f6ca374909 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
@@ -17,11 +17,12 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.MapSort
+import org.apache.spark.sql.catalyst.expressions.{ArrayTransform, 
CreateNamedStruct, Expression, GetStructField, If, IsNull, LambdaFunction, 
Literal, MapFromArrays, MapKeys, MapSort, MapValues, NamedLambdaVariable}
 import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.trees.TreePattern.AGGREGATE
-import org.apache.spark.sql.types.MapType
+import org.apache.spark.sql.types.{ArrayType, MapType, StructType}
+import org.apache.spark.util.ArrayImplicits.SparkArrayOps
 
 /**
  * Adds MapSort to group expressions containing map columns, as the key/value 
paris need to be
@@ -34,12 +35,56 @@ object InsertMapSortInGroupingExpressions extends 
Rule[LogicalPlan] {
 _.containsPattern(AGGREGATE), ruleId) {
 case a @ Aggregate(groupingExpr, _, _) =>
   val newGrouping = groupingExpr.map { expr =>
-if (!expr.isInstanceOf[MapSort] && 
expr.dataType.isInstanceOf[MapType]) {
-  MapSort(expr)
+if (!expr.exists(_.isInstanceOf[MapSort])
+  && expr.dataType.existsRecursively(_.isInstanceOf[MapType])) {
+  insertMapSortRecursively(expr)
 } else {
   expr
 }
   }
   a.copy(groupingExpressions = newGrouping)
   }
+
+  /*
+  Inserts MapSort recursively taking into account when
+  it is nested inside a struct or array.
+   */
+  private def insertMapSortRecursively(e: Expression): Expression = {
+e.dataType match {
+  case m: MapType =>
+// Check if value type of MapType contains MapType (possibly nested)
+// and special handle this case.
+val mapSortExpr = if 
(m.valueType.existsRecursively(_.isInstanceOf[MapType])) {
+  MapFromArrays(MapKeys(e), insertMapSortRecursively(MapValues(e)))
+} else {
+  e
+}
+
+MapSort(mapSortExpr)
+
+  case StructType(fields)
+if 
fields.exists(_.dataType.existsRecursively(_.isInstanceOf[MapType])) =>
+val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, 
i) =>
+  Seq(Literal(f.name), insertMapSortRecursively(
+GetStructField(e, i, Some(f.name
+}.toImmutableArraySeq)
+if (struct.valExprs.forall(_.isInstanceOf[GetStructField])) {
+  // No field needs MapSort processing, just return the original 
expression.
+  e
+} else if (e.nullable) {
+  If(IsNull(e), Literal(null, struct.dataType), struct)
+} else {
+  struct
+}
+
+  case ArrayType(et, containsNull) if 
et.e

[jira] [Assigned] (SPARK-49065) Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49065:
---

Assignee: Sumeet Varma

> Rebasing in legacy formatters/parsers must support non JVM default time zones
> -
>
> Key: SPARK-49065
> URL: https://issues.apache.org/jira/browse/SPARK-49065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Sumeet Varma
>Assignee: Sumeet Varma
>Priority: Major
>  Labels: pull-request-available
>
> Currently, rebasing timestamps defaults to the JVM timezone and so it 
> produces incorrect results when the explicitly overridden timezone in the 
> TimestampFormatter library is not the same as the JVM timezone.
> To fix it, explicitly pass the overridden timezone parameter to 
> {{rebaseJulianToGregorianMicros}} and {{rebaseGregorianToJulianMicros}}
>  






[jira] [Resolved] (SPARK-49065) Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49065.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47541
[https://github.com/apache/spark/pull/47541]

> Rebasing in legacy formatters/parsers must support non JVM default time zones
> -
>
> Key: SPARK-49065
> URL: https://issues.apache.org/jira/browse/SPARK-49065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Sumeet Varma
>Assignee: Sumeet Varma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> Currently, rebasing timestamps defaults to the JVM timezone and so it 
> produces incorrect results when the explicitly overridden timezone in the 
> TimestampFormatter library is not the same as the JVM timezone.
> To fix it, explicitly pass the overridden timezone parameter to 
> {{rebaseJulianToGregorianMicros}} and {{rebaseGregorianToJulianMicros}}
>  






(spark) branch branch-3.5 updated: [SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a1e7fb18d0c4 [SPARK-49065][SQL] Rebasing in legacy formatters/parsers 
must support non JVM default time zones
a1e7fb18d0c4 is described below

commit a1e7fb18d0c4c4d51e2a5f5b050f528a045960d8
Author: Sumeet Varma 
AuthorDate: Thu Aug 1 22:29:08 2024 +0800

[SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non 
JVM default time zones

### What changes were proposed in this pull request?

Explicitly pass the overridden timezone parameter to 
`rebaseJulianToGregorianMicros` and `rebaseGregorianToJulianMicros`.

### Why are the changes needed?

Currently, rebasing timestamps defaults to the JVM timezone and so it produces 
incorrect results when the explicitly overridden timezone in the 
TimestampFormatter library is not the same as the JVM timezone.
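A minimal sketch of the new API shape; it assumes the `DateTimeUtils` object
mixes in these helpers (an assumption here), while the overloads themselves
appear in the diff below:

```
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.util.DateTimeUtils

val ts = Timestamp.valueOf("1500-01-01 00:00:00")
// Existing path: rebasing uses the JVM default time zone.
val microsDefaultZone = DateTimeUtils.fromJavaTimestamp(ts)
// Overload added in this patch: rebasing uses the explicitly supplied zone.
val microsUtc = DateTimeUtils.fromJavaTimestamp("UTC", ts)
```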

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT to capture this scenario

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47541 from sumeet-db/rebase_time_zone.

Authored-by: Sumeet Varma 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 63c08632f2d6e559bc6a8396e646389e5402c757)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/util/SparkDateTimeUtils.scala |  6 
 .../sql/catalyst/util/TimestampFormatter.scala | 12 
 .../catalyst/util/TimestampFormatterSuite.scala| 34 ++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 698e7b37a9ef..980eee9390d0 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -237,6 +237,9 @@ trait SparkDateTimeUtils {
   def toJavaTimestamp(micros: Long): Timestamp =
 toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(micros))
 
+  def toJavaTimestamp(timeZoneId: String, micros: Long): Timestamp =
+toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(timeZoneId, micros))
+
   /**
* Converts microseconds since the epoch to an instance of 
`java.sql.Timestamp`.
*
@@ -273,6 +276,9 @@ trait SparkDateTimeUtils {
   def fromJavaTimestamp(t: Timestamp): Long =
 rebaseJulianToGregorianMicros(fromJavaTimestampNoRebase(t))
 
+  def fromJavaTimestamp(timeZoneId: String, t: Timestamp): Long =
+rebaseJulianToGregorianMicros(timeZoneId, fromJavaTimestampNoRebase(t))
+
   /**
* Converts an instance of `java.sql.Timestamp` to the number of 
microseconds since
* 1970-01-01T00:00:00.00Z.
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 0866cee9334c..07b32af5c85e 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -429,11 +429,11 @@ class LegacyFastTimestampFormatter(
 val micros = cal.getMicros()
 cal.set(Calendar.MILLISECOND, 0)
 val julianMicros = Math.addExact(millisToMicros(cal.getTimeInMillis), 
micros)
-rebaseJulianToGregorianMicros(julianMicros)
+rebaseJulianToGregorianMicros(TimeZone.getTimeZone(zoneId), julianMicros)
   }
 
   override def format(timestamp: Long): String = {
-val julianMicros = rebaseGregorianToJulianMicros(timestamp)
+val julianMicros = 
rebaseGregorianToJulianMicros(TimeZone.getTimeZone(zoneId), timestamp)
 cal.setTimeInMillis(Math.floorDiv(julianMicros, MICROS_PER_SECOND) * 
MILLIS_PER_SECOND)
 cal.setMicros(Math.floorMod(julianMicros, MICROS_PER_SECOND))
 fastDateFormat.format(cal)
@@ -443,7 +443,7 @@ class LegacyFastTimestampFormatter(
 if (ts.getNanos == 0) {
   fastDateFormat.format(ts)
 } else {
-  format(fromJavaTimestamp(ts))
+  format(fromJavaTimestamp(zoneId.getId, ts))
 }
   }
 
@@ -467,7 +467,7 @@ class LegacySimpleTimestampFormatter(
   }
 
   override def parse(s: String): Long = {
-fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
+fromJavaTimestamp(zoneId.getId, new Timestamp(sdf.parse(s).getTime))
   }
 
   override def parseOptional(s: String): Option[Long] = {
@@ -475,12 +475,12 @@ class LegacySimpleTimestampFormatter(
 if (date == null) {
   None
 } else {
-  Some(fromJavaTimestamp(new Timestamp(date.getTime)))
+  Some(fromJavaTimestamp(zoneId.getId, new Timestamp(date.getTime

(spark) branch master updated: [SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 63c08632f2d6 [SPARK-49065][SQL] Rebasing in legacy formatters/parsers 
must support non JVM default time zones
63c08632f2d6 is described below

commit 63c08632f2d6e559bc6a8396e646389e5402c757
Author: Sumeet Varma 
AuthorDate: Thu Aug 1 22:29:08 2024 +0800

[SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non 
JVM default time zones

### What changes were proposed in this pull request?

Explicitly pass the overridden timezone parameter to 
`rebaseJulianToGregorianMicros` and `rebaseGregorianToJulianMicros`.

### Why are the changes needed?

Currently, rebasing timestamps defaults to the JVM timezone and so it produces 
incorrect results when the explicitly overridden timezone in the 
TimestampFormatter library is not the same as the JVM timezone.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT to capture this scenario

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47541 from sumeet-db/rebase_time_zone.

Authored-by: Sumeet Varma 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/util/SparkDateTimeUtils.scala |  6 
 .../sql/catalyst/util/TimestampFormatter.scala | 12 
 .../catalyst/util/TimestampFormatterSuite.scala| 34 ++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 0447d813e26a..a6592ad51c65 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -251,6 +251,9 @@ trait SparkDateTimeUtils {
   def toJavaTimestamp(micros: Long): Timestamp =
 toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(micros))
 
+  def toJavaTimestamp(timeZoneId: String, micros: Long): Timestamp =
+toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(timeZoneId, micros))
+
   /**
* Converts microseconds since the epoch to an instance of 
`java.sql.Timestamp`.
*
@@ -287,6 +290,9 @@ trait SparkDateTimeUtils {
   def fromJavaTimestamp(t: Timestamp): Long =
 rebaseJulianToGregorianMicros(fromJavaTimestampNoRebase(t))
 
+  def fromJavaTimestamp(timeZoneId: String, t: Timestamp): Long =
+rebaseJulianToGregorianMicros(timeZoneId, fromJavaTimestampNoRebase(t))
+
   /**
* Converts an instance of `java.sql.Timestamp` to the number of 
microseconds since
* 1970-01-01T00:00:00.00Z.
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 9f57f8375c54..79d627b493fd 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -435,11 +435,11 @@ class LegacyFastTimestampFormatter(
 val micros = cal.getMicros()
 cal.set(Calendar.MILLISECOND, 0)
 val julianMicros = Math.addExact(millisToMicros(cal.getTimeInMillis), 
micros)
-rebaseJulianToGregorianMicros(julianMicros)
+rebaseJulianToGregorianMicros(TimeZone.getTimeZone(zoneId), julianMicros)
   }
 
   override def format(timestamp: Long): String = {
-val julianMicros = rebaseGregorianToJulianMicros(timestamp)
+val julianMicros = 
rebaseGregorianToJulianMicros(TimeZone.getTimeZone(zoneId), timestamp)
 cal.setTimeInMillis(Math.floorDiv(julianMicros, MICROS_PER_SECOND) * 
MILLIS_PER_SECOND)
 cal.setMicros(Math.floorMod(julianMicros, MICROS_PER_SECOND))
 fastDateFormat.format(cal)
@@ -449,7 +449,7 @@ class LegacyFastTimestampFormatter(
 if (ts.getNanos == 0) {
   fastDateFormat.format(ts)
 } else {
-  format(fromJavaTimestamp(ts))
+  format(fromJavaTimestamp(zoneId.getId, ts))
 }
   }
 
@@ -473,7 +473,7 @@ class LegacySimpleTimestampFormatter(
   }
 
   override def parse(s: String): Long = {
-fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
+fromJavaTimestamp(zoneId.getId, new Timestamp(sdf.parse(s).getTime))
   }
 
   override def parseOptional(s: String): Option[Long] = {
@@ -481,12 +481,12 @@ class LegacySimpleTimestampFormatter(
 if (date == null) {
   None
 } else {
-  Some(fromJavaTimestamp(new Timestamp(date.getTime)))
+  Some(fromJavaTimestamp(zoneId.getId, new Timestamp(date.getTime)))
 }
   }
 
   override def format(us: Long): String = {
-sdf.format(toJavaTimestamp(us))
+sdf.format

[jira] [Assigned] (SPARK-49074) Fix cached Variant with column size greater than 128KB or individual variant larger than 2kb

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49074:
---

Assignee: Richard Chen

> Fix cached Variant with column size greater than 128KB or individual variant 
> larger than 2kb
> 
>
> Key: SPARK-49074
> URL: https://issues.apache.org/jira/browse/SPARK-49074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
> overridden, so we use the default size of 2kb for the `actualSize`. We should 
> define `actualSize` so the cached variant column can correctly be written to 
> the byte buffer






(spark) branch master updated: [SPARK-49074][SQL] Fix variant with `df.cache()`

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bf3ad7e94afb [SPARK-49074][SQL] Fix variant with `df.cache()`
bf3ad7e94afb is described below

commit bf3ad7e94afb5416d995dc22344566899ee7c4b0
Author: Richard Chen 
AuthorDate: Thu Aug 1 08:48:19 2024 +0800

[SPARK-49074][SQL] Fix variant with `df.cache()`

### What changes were proposed in this pull request?

Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
overridden, so we use the default size of 2kb for the `actualSize`. We should 
define `actualSize` so the cached variant column can correctly be written to 
the byte buffer.

Currently, if the avg per-variant size is greater than 2KB and the total 
column size is greater than 128KB (the default initial buffer size), an 
exception will be (incorrectly) thrown.
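A condensed repro sketch along the lines of the unit tests added below,
assuming a Spark session named `spark`; the key count is only illustrative:

```
// Hedged repro: per-row variant size above 2KB and total column size above
// the 128KB initial cache buffer used to fail before this fix.
val jsonStr = (0 until 1024).map(i => s"\"k$i\": \"test\"").mkString("{", ", ", "}")
val df = spark.sql(s"SELECT parse_json('$jsonStr') AS v FROM range(0, 1000)")
df.cache()
df.count()  // materializes the cache
```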

### Why are the changes needed?

To fix caching larger variants (via `df.cache()`), such as the ones included 
in the UTs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UT

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47559 from richardc-db/fix_variant_cache.

Authored-by: Richard Chen 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/columnar/ColumnType.scala  |  6 +++
 .../scala/org/apache/spark/sql/VariantSuite.scala  | 53 ++
 2 files changed, 59 insertions(+)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
index 5cc3a3d83d4c..60695a6c5d49 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
@@ -829,6 +829,12 @@ private[columnar] object VARIANT
   /** Chosen to match the default size set in `VariantType`. */
   override def defaultSize: Int = 2048
 
+  override def actualSize(row: InternalRow, ordinal: Int): Int = {
+val v = getField(row, ordinal)
+// 4 bytes each for the integers representing the 'value' and 'metadata' 
lengths.
+8 + v.getValue().length + v.getMetadata().length
+  }
+
   override def getField(row: InternalRow, ordinal: Int): VariantVal = 
row.getVariant(ordinal)
 
   override def setField(row: InternalRow, ordinal: Int, value: VariantVal): 
Unit =
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
index ce2643f9e239..0c8b0b501951 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
@@ -652,6 +652,21 @@ class VariantSuite extends QueryTest with 
SharedSparkSession with ExpressionEval
 checkAnswer(df, expected.collect())
   }
 
+  test("variant with many keys in a cached row-based df") {
+// The initial size of the buffer backing a cached dataframe column is 
128KB.
+// See `ColumnBuilder`.
+val numKeys = 128 * 1024
+var keyIterator = (0 until numKeys).iterator
+val entries = Array.fill(numKeys)(s"""\"${keyIterator.next()}\": 
\"test\"""")
+val jsonStr = s"{${entries.mkString(", ")}}"
+val query = s"""select parse_json('${jsonStr}') v from range(0, 10)"""
+val df = spark.sql(query)
+df.cache()
+
+val expected = spark.sql(query)
+checkAnswer(df, expected.collect())
+  }
+
   test("struct of variant in a cached row-based df") {
 val query = """select named_struct(
   'v', parse_json(format_string('{\"a\": %s}', id)),
@@ -680,6 +695,21 @@ class VariantSuite extends QueryTest with 
SharedSparkSession with ExpressionEval
 checkAnswer(df, expected.collect())
   }
 
+  test("array variant with many keys in a cached row-based df") {
+// The initial size of the buffer backing a cached dataframe column is 
128KB.
+// See `ColumnBuilder`.
+val numKeys = 128 * 1024
+var keyIterator = (0 until numKeys).iterator
+val entries = Array.fill(numKeys)(s"""\"${keyIterator.next()}\": 
\"test\"""")
+val jsonStr = s"{${entries.mkString(", ")}}"
+val query = s"""select array(parse_json('${jsonStr}')) v from range(0, 
10)"""
+val df = spark.sql(query)
+df.cache()
+
+val expected = spark.sql(query)
+checkAnswer(df, expected.collect())
+  }
+

[jira] [Resolved] (SPARK-49074) Fix cached Variant with column size greater than 128KB or individual variant larger than 2kb

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49074.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47559
[https://github.com/apache/spark/pull/47559]

> Fix cached Variant with column size greater than 128KB or individual variant 
> larger than 2kb
> 
>
> Key: SPARK-49074
> URL: https://issues.apache.org/jira/browse/SPARK-49074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
> overridden, so we use the default size of 2kb for the `actualSize`. We should 
> define `actualSize` so the cached variant column can correctly be written to 
> the byte buffer






[jira] [Resolved] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49000.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47525
[https://github.com/apache/spark/pull/47525]

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is a table with any columns.






[jira] [Assigned] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49000:
---

Assignee: Uroš Bojanić

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: pull-request-available
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is a table with any columns.






(spark) branch branch-3.5 updated: [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c6df89079886 [SPARK-49000][SQL] Fix "select count(distinct 1) from t" 
where t is empty table by expanding RewriteDistinctAggregates
c6df89079886 is described below

commit c6df890798862d0863afbfff8fca0ee4df70354f
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 22:37:42 2024 +0800

[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty 
table by expanding RewriteDistinctAggregates

Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:
```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], 
output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], 
output=[count#6L])
  +- HashAggregate(keys=[], functions=[], output=[])
 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
+- HashAggregate(keys=[], functions=[], output=[])
   +- FileScan parquet spark_catalog.default.t[] Batched: 
false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```
The problem happens when the `HashAggregate(keys=[], functions=[], output=[])` 
node yields one row to the `partial_count` node, which then captures that row. 
This four-node structure is constructed by 
`AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we're adding `Expand` node which will force non-empty 
grouping expressions in `HashAggregateExec` nodes. This will in turn enable 
streaming zero rows to parent `partial_count` node, yielding correct final 
result.

Aggregation with DISTINCT literal gives wrong results. For example, when 
running on empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.

Yes, this fixes a critical bug in Spark.

New e2e SQL tests for aggregates with DISTINCT literals.

No.

Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach.

Lead-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Co-authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit dfa21332f20fff4aa6052ffa556d206497c066cf)
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  13 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 114 +
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..e91493188873 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,15 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  aggregateExpressions: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any AggregateExpressions with filter, we need to rewrite 
the query.
+// Also, if there are no grouping expressions and all aggregate 
expressions are foldable,
+// we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1).
+aggregateExpressions.exists(_.filter.isDefined) || 
(groupingExpressions.isEmpty &&
+  
aggregateExpressions.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -205,7 +214,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // clause for this rule because aggregation strategy can handle a single 
distinct aggregate
 // group without filter clause.
 // This check can produce false-positives, e.g., SUM(DISTINCT a) & 
COUNT(DISTINCT a).
-distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefined)
+distinctAggs.size > 1 || mustRewrite(distinctAggs, a.groupingExpressions)
   }
 
   def apply(plan: LogicalPlan): LogicalP

(spark) branch master updated: [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dfa21332f20f [SPARK-49000][SQL] Fix "select count(distinct 1) from t" 
where t is empty table by expanding RewriteDistinctAggregates
dfa21332f20f is described below

commit dfa21332f20fff4aa6052ffa556d206497c066cf
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 22:37:42 2024 +0800

[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty 
table by expanding RewriteDistinctAggregates

### What changes were proposed in this pull request?
Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:
```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], 
output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], 
output=[count#6L])
  +- HashAggregate(keys=[], functions=[], output=[])
 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
+- HashAggregate(keys=[], functions=[], output=[])
   +- FileScan parquet spark_catalog.default.t[] Batched: 
false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```
The problem happens when the `HashAggregate(keys=[], functions=[], output=[])` 
node yields one row to the `partial_count` node, which then counts that single row. This 
four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node, which forces non-empty 
grouping expressions in the `HashAggregateExec` nodes. This in turn lets zero rows 
stream to the parent `partial_count` node, yielding the correct final 
result.

### Why are the changes needed?
Aggregation with DISTINCT literal gives wrong results. For example, when 
running on empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.
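
For illustration, a minimal spark-shell-style reproduction sketch (the local-mode setup and the table name `t` are assumptions for this example, not taken from the PR; on builds without this fix the first query returns 1):
```
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: table name and local-mode setup are made up for the example.
val spark = SparkSession.builder().master("local[*]").appName("distinct-literal-repro").getOrCreate()

// An empty table; no rows are ever inserted.
spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) USING parquet")

// Expected result is 0 on an empty table; builds without this fix return 1 here.
spark.sql("SELECT count(DISTINCT 1) FROM t").show()

// Reference query from the description: already returns the correct 0.
spark.sql("SELECT count(1) FROM t").show()
```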

### Does this PR introduce _any_ user-facing change?
Yes, this fixes a critical bug in Spark.

### How was this patch tested?
New e2e SQL tests for aggregates with DISTINCT literals.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach.

Lead-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Co-authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  13 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 114 +
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..e91493188873 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,15 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  aggregateExpressions: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any AggregateExpressions with filter, we need to rewrite 
the query.
+// Also, if there are no grouping expressions and all aggregate 
expressions are foldable,
+// we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1).
+aggregateExpressions.exists(_.filter.isDefined) || 
(groupingExpressions.isEmpty &&
+  
aggregateExpressions.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -205,7 +214,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // clause for this rule because aggregation strategy can handle a single 
distinct aggregate
 // group without filter clause.
 // This check can produce false-positives, e.g., SUM(DISTINCT a) & 
COUNT(DISTINCT a).
-distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefin

[jira] [Assigned] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48977:
---

Assignee: Uroš Bojanić

> Optimize collation support for string search (UTF8_LCASE collation)
> ---
>
> Key: SPARK-48977
> URL: https://issues.apache.org/jira/browse/SPARK-48977
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48977.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47444
[https://github.com/apache/spark/pull/47444]

> Optimize collation support for string search (UTF8_LCASE collation)
> ---
>
> Key: SPARK-48977
> URL: https://issues.apache.org/jira/browse/SPARK-48977
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48977][SQL] Optimize string searching under UTF8_LCASE collation

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0e07873d368f [SPARK-48977][SQL] Optimize string searching under 
UTF8_LCASE collation
0e07873d368f is described below

commit 0e07873d368fa17dfff11e28fd45531f1f388864
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 17:16:17 2024 +0800

[SPARK-48977][SQL] Optimize string searching under UTF8_LCASE collation

### What changes were proposed in this pull request?
Modify string search under UTF8_LCASE collation to use the UTF8String 
code-point iterator, reducing the algorithmic complexity by one order.

### Why are the changes needed?
Optimize implementation for `contains`, `startsWith`, `endsWith`, `locate` 
expressions.
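
As a rough illustration of the complexity reduction, a simplified sketch on plain Java strings follows; the object and method names are made up, and the real implementation works on UTF8String code-point iterators and additionally buffers one-to-many case mappings (for example U+0130):
```
// Simplified sketch, not the actual CollationAwareUTF8String code.
object LcaseMatchSketch {
  // Old shape: re-lowercase a fresh substring for every candidate length -> O(n^2) per start position.
  def matchLengthQuadratic(target: String, lowercasePattern: String, startPos: Int): Int = {
    var len = 0
    while (len <= target.length - startPos) {
      if (target.substring(startPos, startPos + len).toLowerCase == lowercasePattern) return len
      len += 1
    }
    -1 // match not found
  }

  // New shape: advance over target and pattern once, lowercasing one code point at a time -> O(n).
  def matchLengthLinear(target: String, lowercasePattern: String, startPos: Int): Int = {
    val t = target.codePoints().iterator()
    val p = lowercasePattern.codePoints().iterator()
    var skipped = 0
    while (skipped < startPos) {
      if (!t.hasNext) return -1
      t.nextInt(); skipped += 1
    }
    var matched = 0
    while (t.hasNext && p.hasNext) {
      if (Character.toLowerCase(t.nextInt()) != p.nextInt()) return -1
      matched += 1
    }
    if (p.hasNext) -1 else matched
  }

  def main(args: Array[String]): Unit = {
    println(matchLengthQuadratic("AbCdef", "abc", 0)) // 3
    println(matchLengthLinear("AbCdef", "abc", 0))    // 3
  }
}
```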

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47444 from uros-db/optimize-search.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 83 +++---
 1 file changed, 73 insertions(+), 10 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 430d1fb89832..5b005f152c51 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -85,13 +85,44 @@ public class CollationAwareUTF8String {
   final UTF8String lowercasePattern,
   int startPos) {
 assert startPos >= 0;
-for (int len = 0; len <= target.numChars() - startPos; ++len) {
-  if (lowerCaseCodePoints(target.substring(startPos, startPos + len))
-  .equals(lowercasePattern)) {
-return len;
+// Use code point iterators for efficient string search.
+Iterator targetIterator = target.codePointIterator();
+Iterator patternIterator = lowercasePattern.codePointIterator();
+// Skip to startPos in the target string.
+for (int i = 0; i < startPos; ++i) {
+  if (targetIterator.hasNext()) {
+targetIterator.next();
+  } else {
+return MATCH_NOT_FOUND;
   }
 }
-return MATCH_NOT_FOUND;
+// Compare the characters in the target and pattern strings.
+int matchLength = 0, codePointBuffer = -1, targetCodePoint, 
patternCodePoint;
+while (targetIterator.hasNext() && patternIterator.hasNext()) {
+  if (codePointBuffer != -1) {
+targetCodePoint = codePointBuffer;
+codePointBuffer = -1;
+  } else {
+// Use buffered lowercase code point iteration to handle one-to-many 
case mappings.
+targetCodePoint = getLowercaseCodePoint(targetIterator.next());
+if (targetCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
+  targetCodePoint = CODE_POINT_LOWERCASE_I;
+  codePointBuffer = CODE_POINT_COMBINING_DOT;
+}
+++matchLength;
+  }
+  patternCodePoint = patternIterator.next();
+  if (targetCodePoint != patternCodePoint) {
+return MATCH_NOT_FOUND;
+  }
+}
+// If the pattern string has more characters, or the match is found at the 
middle of a
+// character that maps to multiple characters in lowercase, then match is 
not found.
+if (patternIterator.hasNext() || codePointBuffer != -1) {
+  return MATCH_NOT_FOUND;
+}
+// If all characters are equal, return the length of the match in the 
target string.
+return matchLength;
   }
 
   /**
@@ -155,13 +186,45 @@ public class CollationAwareUTF8String {
   final UTF8String target,
   final UTF8String lowercasePattern,
   int endPos) {
-assert endPos <= target.numChars();
-for (int len = 0; len <= endPos; ++len) {
-  if (lowerCaseCodePoints(target.substring(endPos - len, 
endPos)).equals(lowercasePattern)) {
-return len;
+assert endPos >= 0;
+// Use code point iterators for efficient string search.
+Iterator targetIterator = target.reverseCodePointIterator();
+Iterator patternIterator = 
lowercasePattern.reverseCodePointIterator();
+// Skip to startPos in the target string.
+for (int i = endPos; i < target.numChars(); ++i) {
+  if (targetIterator.hasNext()) {
+targetIterator.next();
+  } else {
+return MATCH_NOT_FOUND;
   }
 }
-return MATCH_NOT_FOUND;
+// Compare the characters in the target and pattern strings.
+

[jira] [Assigned] (SPARK-48725) Use lowerCaseCodePoints in string functions for UTF8_LCASE

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48725:
---

Assignee: Uroš Bojanić

> Use lowerCaseCodePoints in string functions for UTF8_LCASE
> --
>
> Key: SPARK-48725
> URL: https://issues.apache.org/jira/browse/SPARK-48725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48725) Use lowerCaseCodePoints in string functions for UTF8_LCASE

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48725.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47132
[https://github.com/apache/spark/pull/47132]

> Use lowerCaseCodePoints in string functions for UTF8_LCASE
> --
>
> Key: SPARK-48725
> URL: https://issues.apache.org/jira/browse/SPARK-48725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48725][SQL] Integrate CollationAwareUTF8String.lowerCaseCodePoints into string expressions

2024-07-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 98c365f6e652 [SPARK-48725][SQL] Integrate 
CollationAwareUTF8String.lowerCaseCodePoints into string expressions
98c365f6e652 is described below

commit 98c365f6e652f740e51c3f185ebd252918ae824a
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 13:13:49 2024 +0800

[SPARK-48725][SQL] Integrate CollationAwareUTF8String.lowerCaseCodePoints 
into string expressions

### What changes were proposed in this pull request?
Use `CollationAwareUTF8String.lowerCaseCodePoints` logic to properly 
lowercase strings according to UTF8_LCASE collation, instead of relying on 
`UTF8String.toLowerCase()` method calls.

### Why are the changes needed?
Avoid correctness issues with respect to code-point logic in UTF8_LCASE 
(arising when Java performs string-level lowercasing) and ensure consistent 
results.
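
A small, self-contained JVM example of the class of mismatch being avoided (this is generic Java/Scala behavior, not Spark code; U+0130 is just one illustrative code point):
```
// Shows why String-level lowercasing and per-code-point lowercasing can disagree.
object LowercaseMismatch {
  def main(args: Array[String]): Unit = {
    val capitalIWithDot = "\u0130" // 'İ' (Latin capital letter I with dot above)
    // Java's String.toLowerCase maps U+0130 to TWO code points: 'i' + U+0307 (combining dot above).
    val stringLowered = capitalIWithDot.toLowerCase
    // Per-code-point lowercasing keeps a single code point.
    val codePointLowered = new String(Character.toChars(Character.toLowerCase(capitalIWithDot.codePointAt(0))))
    println(stringLowered.codePoints().count())    // 2
    println(codePointLowered.codePoints().count()) // 1
    println(stringLowered == codePointLowered)     // false
  }
}
```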

### Does this PR introduce _any_ user-facing change?
Yes, collation aware string function implementations will now rely on 
`CollationAwareUTF8String` string lowercasing for UTF8_LCASE collation, instead 
of `UTF8String` logic (which resorts to Java's implementation).

### How was this patch tested?
Existing tests, with some new cases in `CollationSupportSuite`.
6 expressions are affected by this change: `contains`, `instr`, 
`find_in_set`, `replace`, `locate`, `substr_index`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47132 from uros-db/lcase-cp.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
    Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|  27 ++-
 .../spark/sql/catalyst/util/CollationFactory.java  |  35 ++--
 .../spark/unsafe/types/CollationSupportSuite.java  | 224 +
 3 files changed, 256 insertions(+), 30 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index b9868ca665a6..430d1fb89832 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -51,9 +51,8 @@ public class CollationAwareUTF8String {
   /**
* Returns whether the target string starts with the specified prefix, 
starting from the
* specified position (0-based index referring to character position in 
UTF8String), with respect
-   * to the UTF8_LCASE collation. The method assumes that the prefix is 
already lowercased
-   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
-   * same prefix string.
+   * to the UTF8_LCASE collation. The method assumes that the prefix is 
already lowercased prior
+   * to method call to avoid the overhead of lowercasing the same prefix 
string multiple times.
*
* @param target the string to be searched in
* @param lowercasePattern the string to be searched for
@@ -87,7 +86,8 @@ public class CollationAwareUTF8String {
   int startPos) {
 assert startPos >= 0;
 for (int len = 0; len <= target.numChars() - startPos; ++len) {
-  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+  if (lowerCaseCodePoints(target.substring(startPos, startPos + len))
+  .equals(lowercasePattern)) {
 return len;
   }
 }
@@ -123,8 +123,7 @@ public class CollationAwareUTF8String {
* Returns whether the target string ends with the specified suffix, ending 
at the specified
* position (0-based index referring to character position in UTF8String), 
with respect to the
* UTF8_LCASE collation. The method assumes that the suffix is already 
lowercased prior
-   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
-   * suffix string.
+   * to method call to avoid the overhead of lowercasing the same suffix 
string multiple times.
*
* @param target the string to be searched in
* @param lowercasePattern the string to be searched for
@@ -158,7 +157,7 @@ public class CollationAwareUTF8String {
   int endPos) {
 assert endPos <= target.numChars();
 for (int len = 0; len <= endPos; ++len) {
-  if (target.substring(endPos - len, 
endPos).toLowerCase().equals(lowercasePattern)) {
+  if (lowerCaseCodePoints(target.substring(endPos - len, 
endPos)).equals(lowercasePattern)) {
 return len;
   }
 }
@@ -191,10 +190,9 @@ public class CollationAwareUTF8String {
   }
 
   /**
-   * Lowercase U

[jira] [Resolved] (SPARK-49003) Fix calculating hash value of collated strings

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49003.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47502
[https://github.com/apache/spark/pull/47502]

> Fix calculating hash value of collated strings
> --
>
> Key: SPARK-49003
> URL: https://issues.apache.org/jira/browse/SPARK-49003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49003) Fix calculating hash value of collated strings

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49003:
---

Assignee: Marko Ilic

> Fix calculating hash value of collated strings
> --
>
> Key: SPARK-49003
> URL: https://issues.apache.org/jira/browse/SPARK-49003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49003][SQL] Fix interpreted code path hashing to be collation aware

2024-07-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5929412e65cd [SPARK-49003][SQL] Fix interpreted code path hashing to 
be collation aware
5929412e65cd is described below

commit 5929412e65cd82e72f74b21772d8c7ed04217399
Author: Marko 
AuthorDate: Wed Jul 31 12:26:35 2024 +0800

[SPARK-49003][SQL] Fix interpreted code path hashing to be collation aware

### What changes were proposed in this pull request?
Changed the hash function to be collation-aware. This change only affects the 
interpreted code path; codegen hashing was already collation-aware.

### Why are the changes needed?
We were getting the wrong hash for collated strings because our hash 
function only used the binary representation of the string. It didn't take 
collation into account.
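
To make the invariant concrete, a hedged spark-shell-style sketch (it assumes a build with collation support and this fix; the column names are made up, and the exact hash values are irrelevant, only that equal collated values hash alike):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collated-hash").getOrCreate()
import spark.implicits._

// 'abc' and 'ABC' compare equal under UTF8_LCASE, so their hashes must also agree.
val df = Seq(("abc", "ABC")).toDF("a", "b")
  .selectExpr("collate(a, 'UTF8_LCASE') AS a_lcase", "collate(b, 'UTF8_LCASE') AS b_lcase")

df.selectExpr(
  "a_lcase = b_lcase AS equal",                 // expected: true
  "hash(a_lcase) = hash(b_lcase) AS same_hash"  // expected: true once hashing is collation aware
).show()
```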

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added tests to `HashExpressionsSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47502 from ilicmarkodb/ilicmarkodb/fix_string_hash.

Lead-authored-by: Marko 
Co-authored-by: Marko Ilic 
Co-authored-by: Marko Ilić 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/hash.scala  | 12 --
 .../expressions/HashExpressionsSuite.scala | 26 +-
 .../spark/sql/CollationExpressionWalkerSuite.scala |  8 ---
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 12 +-
 4 files changed, 46 insertions(+), 12 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
index fa342f641509..3a667f370428 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions.Cast._
 import org.apache.spark.sql.catalyst.expressions.codegen._
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
-import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.catalyst.util.{ArrayData, CollationFactory, 
MapData}
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
@@ -565,7 +565,15 @@ abstract class InterpretedHashFunction {
   case a: Array[Byte] =>
 hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
   case s: UTF8String =>
-hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
+val st = dataType.asInstanceOf[StringType]
+if (st.supportsBinaryEquality) {
+  hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
+} else {
+  val stringHash = CollationFactory
+.fetchCollation(st.collationId)
+.hashFunction.applyAsLong(s)
+  hashLong(stringHash, seed)
+}
 
   case array: ArrayData =>
 val elementType = dataType match {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
index 474c27c0de9a..6f3890cafd2a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.encoders.{ExamplePointUDT, 
ExpressionEncoder}
 import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection
-import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, 
GenericArrayData, IntervalUtils}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
CollationFactory, DateTimeUtils, GenericArrayData, IntervalUtils}
 import org.apache.spark.sql.types.{ArrayType, StructType, _}
 import org.apache.spark.unsafe.types.UTF8String
 import org.apache.spark.util.ArrayImplicits._
@@ -620,6 +620,30 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkHiveHashForDecimal("123456.123456789012345678901234567890", 38, 31, 
1728235666)
   }
 
+  for (collation <- Seq("UTF8_LCASE", "UNICODE_CI", "UTF8_BINARY")) {
+test(s"hash check for collated $collation string

[jira] [Created] (SPARK-49057) Do not block the AQE loop when submitting query stages

2024-07-30 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49057:
---

 Summary: Do not block the AQE loop when submitting query stages
 Key: SPARK-49057
 URL: https://issues.apache.org/jira/browse/SPARK-49057
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48989.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47481
[https://github.com/apache/spark/pull/47481]

> WholeStageCodeGen error resulting in NumberFormatException when calling 
> SUBSTRING_INDEX
> ---
>
> Key: SPARK-48989
> URL: https://issues.apache.org/jira/browse/SPARK-48989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: This was tested from the {{spark-shell}}, in local mode. 
>  All Spark versions were run with default settings.
> Spark 4.0 SNAPSHOT:  Exception.
> Spark 4.0 Preview:  Exception.
> Spark 3.5.1:  Success.
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> One seems to run into a {{NumberFormatException}}, possibly from an error in 
> WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus:
> {code:scala}
> // Create integer table with one null.
> sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
> ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")
> // Exercise substring-index.
> sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
> PARQUET.`/tmp/mytable` ").show()
> {code}
> On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
> exception:
> {code}
> java.lang.NumberFormatException: For input string: "columnartorow_value_0"
>   at 
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
>   at java.base/java.lang.Integer.parseInt(Integer.java:668)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
>   at scala.collection.immutable.List.map(List.scala:247)
>   at scala.collection.immutable.List.map(List.scala:79)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
> {code}
> The same query seems to run alright on Spark 3.5.x:
> {code}
> ++-+
> | num| subs|
> ++-+
> |   1|a|
> |   2|  a_a|
> |   3|a_a_a|
> |NULL| NULL|
> ++-+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48989][SQL] Fix error result of throwing an exception of `SUBSTRING_INDEX` built-in function with Codegen

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f9d63158bb98 [SPARK-48989][SQL] Fix error result of throwing an 
exception of `SUBSTRING_INDEX` built-in function with Codegen
f9d63158bb98 is described below

commit f9d63158bb98dd35fea7dd3df34baf1624ea4497
Author: Wei Guo 
AuthorDate: Tue Jul 30 12:39:27 2024 +0800

[SPARK-48989][SQL] Fix error result of throwing an exception of 
`SUBSTRING_INDEX` built-in function with Codegen

### What changes were proposed in this pull request?

This PR fixes the `SUBSTRING_INDEX` built-in function throwing an exception 
under codegen, a regression from the earlier work that made it support 
collated strings.

### Why are the changes needed?

Fix a bug.  Currently, this function cannot be used because it throws an 
exception.

![image](https://github.com/user-attachments/assets/32aaddb8-b261-47a2-8eeb-d66ed79a7db2)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA and add related test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47481 from wayneguow/SPARK-48989.

Authored-by: Wei Guo 
Signed-off-by: Wenchen Fan 
---
 .../java/org/apache/spark/sql/catalyst/util/CollationSupport.java | 8 
 .../apache/spark/sql/catalyst/expressions/stringExpressions.scala | 2 +-
 .../spark/sql/catalyst/expressions/StringExpressionsSuite.scala   | 2 ++
 .../test/scala/org/apache/spark/sql/StringFunctionsSuite.scala| 8 
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
index 453423ddbc33..672708bec82b 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
@@ -469,15 +469,15 @@ public final class CollationSupport {
   }
 }
 public static String genCode(final String string, final String delimiter,
-final int count, final int collationId) {
+final String count, final int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
   String expr = "CollationSupport.SubstringIndex.exec";
   if (collation.supportsBinaryEquality) {
-return String.format(expr + "Binary(%s, %s, %d)", string, delimiter, 
count);
+return String.format(expr + "Binary(%s, %s, %s)", string, delimiter, 
count);
   } else if (collation.supportsLowercaseEquality) {
-return String.format(expr + "Lowercase(%s, %s, %d)", string, 
delimiter, count);
+return String.format(expr + "Lowercase(%s, %s, %s)", string, 
delimiter, count);
   } else {
-return String.format(expr + "ICU(%s, %s, %d, %d)", string, delimiter, 
count, collationId);
+return String.format(expr + "ICU(%s, %s, %d, %s)", string, delimiter, 
count, collationId);
   }
 }
 public static UTF8String execBinary(final UTF8String string, final 
UTF8String delimiter,
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
index 08f373a86ae3..95531c29a9af 100755
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
@@ -1657,7 +1657,7 @@ case class SubstringIndex(strExpr: Expression, delimExpr: 
Expression, countExpr:
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 defineCodeGen(ctx, ev, (str, delim, count) =>
-  CollationSupport.SubstringIndex.genCode(str, delim, 
Integer.parseInt(count, 10), collationId))
+  CollationSupport.SubstringIndex.genCode(str, delim, count, collationId))
   }
 
   override protected def withNewChildrenInternal(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
index 847c783e1936..7210979f0846 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
@@ -356,6 +356,8 @@ class StringExpressionsSuite extends SparkFunSuit

(spark) branch master updated: [SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to replace TryEval in TryUrlDecode

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 80c44329fec4 [SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to 
replace TryEval in TryUrlDecode
80c44329fec4 is described below

commit 80c44329fec47f7eaa0f25b3dff96e2ee06c80c9
Author: wforget <643348...@qq.com>
AuthorDate: Tue Jul 30 11:25:39 2024 +0800

[SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to replace TryEval in 
TryUrlDecode

### What changes were proposed in this pull request?

Add a `failOnError` argument to `UrlDecode` so that `TryUrlDecode` no longer 
needs to wrap it in `TryEval`.
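
For context, a usage-level sketch of the two functions' contract, which this refactor keeps unchanged (spark-shell style; the malformed input '%ZZ' is just an example):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("url-decode").getOrCreate()

// Well-formed input: both functions decode it.
spark.sql("SELECT try_url_decode('https%3A%2F%2Fspark.apache.org') AS decoded").show(false)

// Malformed input: try_url_decode returns NULL (failOnError = false under the hood) ...
spark.sql("SELECT try_url_decode('%ZZ') AS decoded").show()

// ... while url_decode (failOnError = true) raises an error instead:
// spark.sql("SELECT url_decode('%ZZ')").show()
```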

### Why are the changes needed?

Address https://github.com/apache/spark/pull/47294#discussion_r1681150787

> I'm not a big fan of TryEval as it catches all the exceptions, including 
the ones not from UrlDecode, but from its input expressions.
>
> We should add a boolean flag in UrlDecode to control the null-on-error 
behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47514 from wForget/try_url_decode.

Authored-by: wforget <643348...@qq.com>
Signed-off-by: Wenchen Fan 
---
 .../explain-results/function_try_url_decode.explain |  2 +-
 .../explain-results/function_url_decode.explain |  2 +-
 .../spark/sql/catalyst/expressions/urlExpressions.scala | 17 ++---
 .../sql-tests/analyzer-results/url-functions.sql.out|  8 
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git 
a/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
 
b/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
index 74b360a6b5f3..d1d6d3b1476b 100644
--- 
a/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
+++ 
b/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
@@ -1,2 +1,2 @@
-Project [tryeval(static_invoke(UrlCodec.decode(g#0, UTF-8))) AS 
try_url_decode(g)#0]
+Project [static_invoke(UrlCodec.decode(g#0, UTF-8, false)) AS 
try_url_decode(g)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
 
b/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
index 6111cc1374fb..ef203ac49a32 100644
--- 
a/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
+++ 
b/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(UrlCodec.decode(g#0, UTF-8)) AS url_decode(g)#0]
+Project [static_invoke(UrlCodec.decode(g#0, UTF-8, true)) AS url_decode(g)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
index c2b999f30161..84531d6b1ad6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.types.StringTypeAnyCollation
-import org.apache.spark.sql.types.{AbstractDataType, DataType}
+import org.apache.spark.sql.types.{AbstractDataType, BooleanType, DataType}
 import org.apache.spark.unsafe.types.UTF8String
 
 // scalastyle:off line.size.limit
@@ -86,16 +86,18 @@ case class UrlEncode(child: Expression)
   since = "3.4.0",
   group = "url_funcs")
 // scalastyle:on line.size.limit
-case class UrlDecode(child: Expression)
+case class UrlDecode(child: Expression, failOnError: Boolean = true)
   extends RuntimeReplaceable with UnaryLike[Expression] with 
ImplicitCastInputTypes {
 
+  def this(child: Expression) = this(child, true)
+
   override lazy val replacement: Expression =
 StaticInvoke(
   UrlCodec.getClass,
   SQLConf.get.defaultStringType,
   "decode",
-  Seq(child, Literal("UTF-8")),
-  Seq(StringTypeAnyCollation, StringTypeAnyCollation))
+  Seq(child, Literal("UTF-8"), Literal(failOnError)),
+  Seq(StringTypeAnyCollation, StringTypeAnyCollation, BooleanType))
 
   override protected def

(spark) branch master updated: [SPARK-48900] Add `reason` field for all internal calls for job/stage cancellation

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 308669fc3019 [SPARK-48900] Add `reason` field for all internal calls 
for job/stage cancellation
308669fc3019 is described below

commit 308669fc301916837bacb7c3ec1ecef93190c094
Author: Mingkang Li 
AuthorDate: Tue Jul 30 09:02:27 2024 +0800

[SPARK-48900] Add `reason` field for all internal calls for job/stage 
cancellation

### What changes were proposed in this pull request?

The changes can be grouped into two categories:

- **Add `reason` field for all internal calls for job/stage cancellation**
  - Cancel because of exceptions:
- org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
- org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala
- 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
   - Cancel by user (Web UI):
  - org/apache/spark/ui/jobs/JobsTab.scala
   - Cancel when streaming terminates/query ends:
 - org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 - 
org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala

_(Developers familiar with these components: would appreciate it if you 
could provide suggestions for more helpful cancellation reason strings! :) )_

- **API Change for `JobWaiter`**
Currently, the `.cancel()` function in `JobWaiter` does not allow 
specifying a reason for cancellation. This limitation prevents us from 
reporting cancellation reasons in AQE, such as cleanup or canceling unnecessary 
jobs after query replanning. To address this, we should add a `reason: String` 
parameter to all relevant APIs along the chain.

(screenshot: https://github.com/user-attachments/assets/239cbcd6-8d78-446a-98d0-456e5e837494)
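
For readers who only consume the public API, a hedged sketch of the user-facing shape introduced by the first PR in this series; the exact SparkContext overloads are assumed from that PR's description, and the group/tag names below are made up:
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cancel-with-reason").getOrCreate()
val sc = spark.sparkContext

sc.setJobGroup("nightly-etl", "example job group", interruptOnCancel = true)
// Jobs submitted under this group or tag can now be cancelled with an explanatory reason string:
sc.cancelJobGroup("nightly-etl", "cancelled by operator: downstream table is being rebuilt")
sc.cancelJobsWithTag("ad-hoc", "cancelled: superseded after AQE replanning")
```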

### Why are the changes needed?

Today it is difficult to determine why a job, stage, or job group was 
canceled. We should leverage existing Spark functionality to provide a reason 
string explaining the cancellation cause, and should add new APIs to let us 
provide this reason when canceling job groups. For more context, please read 
[this JIRA ticket](https://issues.apache.org/jira/browse/SPARK-48900).

This feature can be implemented in two PRs:
1. [Modify the current SparkContext and its downstream APIs to add the 
reason string, such as cancelJobGroup and 
cancelJobsWithTag](https://github.com/apache/spark/pull/47361)
2. Add reasons for all internal calls to these methods.

**Note: This is the second of the two PRs to implement this new feature**

### Does this PR introduce _any_ user-facing change?

It adds reasons for jobs and stages cancelled internally, providing users 
with clearer explanations for cancellations by Spark, such as exceptions, end 
of streaming, AQE query replanning, etc.

### How was this patch tested?

Tests for the API changes on the `JobWaiter` chain are added in 
`JobCancellationSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47374 from mingkangli-db/reason_internal_calls.

Authored-by: Mingkang Li 
Signed-off-by: Wenchen Fan 
---
 .../connect/execution/ExecuteThreadRunner.scala|  4 ++-
 .../main/scala/org/apache/spark/FutureAction.scala | 15 ++---
 .../org/apache/spark/scheduler/JobWaiter.scala | 17 +++---
 .../scala/org/apache/spark/ui/jobs/JobsTab.scala   |  2 +-
 .../org/apache/spark/JobCancellationSuite.scala| 37 +-
 project/MimaExcludes.scala |  3 ++
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  2 +-
 .../sql/execution/adaptive/QueryStageExec.scala| 13 
 .../execution/exchange/BroadcastExchangeExec.scala | 11 ---
 .../execution/exchange/ShuffleExchangeExec.scala   |  6 ++--
 .../execution/streaming/MicroBatchExecution.scala  |  6 ++--
 .../streaming/continuous/ContinuousExecution.scala |  3 +-
 .../sql/execution/BroadcastExchangeSuite.scala |  3 +-
 .../SparkExecuteStatementOperation.scala   |  5 +--
 14 files changed, 94 insertions(+), 33 deletions(-)

diff --git 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
index 4ef4f632204b..1a38f237ee09 100644
--- 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
+++ 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
@@ -115,7 +115,9 @@ private[connect] class ExecuteThreadRunner(executeHolder: 
ExecuteHolder)

[jira] [Resolved] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48910.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47484
[https://github.com/apache/spark/pull/47484]

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48910:
---

Assignee: Vladimir Golubev

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db036537ab55 [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear 
searches in PreprocessTableCreation
db036537ab55 is described below

commit db036537ab5593b2520742b3b1a1028bb0fcc7fa
Author: Vladimir Golubev 
AuthorDate: Mon Jul 29 22:07:02 2024 +0800

[SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in 
PreprocessTableCreation

### What changes were proposed in this pull request?

Use `HashSet`/`HashMap` instead of doing linear searches over the `Seq`. With 
thousands of partitions this significantly improves performance.

### Why are the changes needed?

To avoid the O(n*m) passes in the `PreprocessTableCreation`
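
The shape of the change, shown on a toy model outside Spark (the `Col` case class and column counts are illustrative, not the actual `PreprocessTableCreation` types):
```
import scala.collection.mutable.{HashMap, HashSet}

object PartitionLookupSketch {
  case class Col(name: String)

  def main(args: Array[String]): Unit = {
    val output: Seq[Col] = (1 to 5000).map(i => Col(s"c$i"))
    val partitionColumnNames: Seq[String] = (4001 to 5000).map(i => s"c$i")

    // Before: one linear scan of `output` per partition column -> O(n * m).
    val partitionAttrsSlow = partitionColumnNames.map(p => output.find(_.name == p).get)

    // After: index `output` once, then constant-time lookups -> O(n + m).
    val outputByName = HashMap(output.map(o => o.name -> o): _*)
    val partitionAttrsFast = partitionColumnNames.map(outputByName)

    // Same idea for membership tests: a HashSet instead of Seq.contains.
    val partitionAttrsSet = HashSet(partitionAttrsFast: _*)
    val newOutput = output.filterNot(partitionAttrsSet.contains) ++ partitionAttrsFast

    assert(partitionAttrsSlow == partitionAttrsFast)
    println(newOutput.size) // 5000: same columns, partition columns moved to the end
  }
}
```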

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47484 from 
vladimirg-db/vladimirg-db/get-rid-of-linear-searches-preprocess-table-creation.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/rules.scala| 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
index 9bc0793650ac..9279862c9196 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources
 
 import java.util.Locale
 
+import scala.collection.mutable.{HashMap, HashSet}
 import scala.jdk.CollectionConverters._
 
 import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
@@ -248,10 +249,14 @@ case class PreprocessTableCreation(catalog: 
SessionCatalog) extends Rule[Logical
 DDLUtils.checkTableColumns(tableDesc.copy(schema = 
analyzedQuery.schema))
 
 val output = analyzedQuery.output
+
+val outputByName = HashMap(output.map(o => o.name -> o): _*)
 val partitionAttrs = normalizedTable.partitionColumnNames.map { 
partCol =>
-  output.find(_.name == partCol).get
+  outputByName(partCol)
 }
-val newOutput = output.filterNot(partitionAttrs.contains) ++ 
partitionAttrs
+val partitionAttrsSet = HashSet(partitionAttrs: _*)
+val newOutput = output.filterNot(partitionAttrsSet.contains) ++ 
partitionAttrs
+
 val reorderedQuery = if (newOutput == output) {
   analyzedQuery
 } else {
@@ -263,12 +268,14 @@ case class PreprocessTableCreation(catalog: 
SessionCatalog) extends Rule[Logical
 DDLUtils.checkTableColumns(tableDesc)
 val normalizedTable = normalizeCatalogTable(tableDesc.schema, 
tableDesc)
 
+val normalizedSchemaByName = HashMap(normalizedTable.schema.map(s => 
s.name -> s): _*)
 val partitionSchema = normalizedTable.partitionColumnNames.map { 
partCol =>
-  normalizedTable.schema.find(_.name == partCol).get
+  normalizedSchemaByName(partCol)
 }
-
-val reorderedSchema =
-  
StructType(normalizedTable.schema.filterNot(partitionSchema.contains) ++ 
partitionSchema)
+val partitionSchemaSet = HashSet(partitionSchema: _*)
+val reorderedSchema = StructType(
+  normalizedTable.schema.filterNot(partitionSchemaSet.contains) ++ 
partitionSchema
+)
 
 c.copy(tableDesc = normalizedTable.copy(schema = reorderedSchema))
   }
@@ -360,8 +367,9 @@ case class PreprocessTableCreation(catalog: SessionCatalog) 
extends Rule[Logical
 messageParameters = Map.empty)
 }
 
+val normalizedPartitionColsSet = HashSet(normalizedPartitionCols: _*)
 schema
-  .filter(f => normalizedPartitionCols.contains(f.name))
+  .filter(f => normalizedPartitionColsSet.contains(f.name))
   .foreach { field =>
 if (!PartitioningUtils.canPartitionOn(field.dataType)) {
   throw 
QueryCompilationErrors.invalidPartitionColumnDataTypeError(field)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [VOTE] Release Spark 3.5.2 (RC4)

2024-07-29 Thread Wenchen Fan
+1

On Sat, Jul 27, 2024 at 10:03 AM Dongjoon Hyun 
wrote:

> +1
>
> Thank you, Kent.
>
> Dongjoon.
>
> On Fri, Jul 26, 2024 at 6:37 AM Kent Yao  wrote:
>
>> Hi dev,
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.2.
>>
>> The vote is open until Jul 29, 14:00:00 UTC, and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.5.2-rc4 (commit
>> 1edbddfadeb46581134fa477d35399ddc63b7163):
>> https://github.com/apache/spark/tree/v3.5.2-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1460/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-docs/
>>
>> The list of bug fixes going into 3.5.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353980
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/pyspark-3.5.2.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.5.2?
>> ===
>>
>> The current list of open tickets targeted at 3.5.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 3.5.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Thanks,
>> Kent Yao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Assigned] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45787:
---

Assignee: Jiaheng Tang  (was: Terry Kim)

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support Catalog.listColumns() for clustering columns so that it `
> org.apache.spark.sql.catalog.Column` contains clustering info (e.g., 
> isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45787.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47451
[https://github.com/apache/spark/pull/47451]

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support Catalog.listColumns() for clustering columns so that it `
> org.apache.spark.sql.catalog.Column` contains clustering info (e.g., 
> isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45787:
---

Assignee: Terry Kim

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
>
> Support Catalog.listColumns() for clustering columns so that it `
> org.apache.spark.sql.catalog.Column` contains clustering info (e.g., 
> isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (f3b819ec642c -> e73ede7939f2)

2024-07-25 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f3b819ec642c [SPARK-48503][SQL] Allow grouping on expressions in 
scalar subqueries, if they are bound to outer rows
 add e73ede7939f2 [SPARK-45787][SQL] Support Catalog.listColumns for 
clustering columns

No new revisions were added by this update.

Summary of changes:
 R/pkg/tests/fulltests/test_sparkSQL.R  |  3 +-
 .../org/apache/spark/sql/catalog/interface.scala   | 18 --
 python/pyspark/sql/catalog.py  |  2 ++
 python/pyspark/sql/connect/catalog.py  |  1 +
 python/pyspark/sql/tests/test_catalog.py   |  4 +++
 .../catalyst/catalog/ExternalCatalogSuite.scala|  9 +++--
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  5 ++-
 .../org/apache/spark/sql/catalog/interface.scala   | 17 +++--
 .../apache/spark/sql/internal/CatalogImpl.scala| 15 +---
 .../apache/spark/sql/internal/CatalogSuite.scala   | 41 ++
 10 files changed, 87 insertions(+), 28 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows

2024-07-25 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f3b819ec642c [SPARK-48503][SQL] Allow grouping on expressions in 
scalar subqueries, if they are bound to outer rows
f3b819ec642c is described below

commit f3b819ec642c22327e84ff0c08d255a23bb51611
Author: Andrey Gubichev 
AuthorDate: Fri Jul 26 09:37:28 2024 +0800

[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if 
they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in https://github.com/apache/spark/pull/46839, 
allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = 
T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. 
This guarantees that for every outer row, there is exactly one value of 
cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns could be bound to outer 
expressions, thus forbidding such subqueries.
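
As a concrete sketch of the newly allowed shape (the tiny test views below are
hypothetical, and the outer FROM clause is spelled out for completeness):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scalar-subquery-group-by").getOrCreate()

// Hypothetical one-row test data.
spark.sql("CREATE OR REPLACE TEMP VIEW t_outer AS SELECT DATE'2024-07-01' AS date")
spark.sql("CREATE OR REPLACE TEMP VIEW t_inner AS SELECT TIMESTAMP'2024-07-01 10:00:00' AS x")

// The grouping expression CAST(t_inner.x AS DATE) is pinned to the outer row by
// the equality predicate, so each outer row sees at most one group.
spark.sql("""
  SELECT *,
         (SELECT count(*)
          FROM t_inner
          WHERE CAST(t_inner.x AS DATE) = t_outer.date
          GROUP BY CAST(t_inner.x AS DATE)) AS cnt
  FROM t_outer
""").show()
```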

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 29 ++---
 .../spark/sql/catalyst/expressions/subquery.scala  | 73 --
 .../scalar-subquery-group-by.sql.out   | 53 
 .../scalar-subquery/scalar-subquery-group-by.sql   |  7 +++
 .../scalar-subquery-group-by.sql.out   | 41 
 5 files changed, 175 insertions(+), 28 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 2bc6785aa40c..3a18b0c4d2ea 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -905,26 +905,31 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   // SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
   // are not part of the correlated columns.
 
-  // Note: groupByCols does not contain outer refs - grouping by an outer 
ref is always ok
-  val groupByCols = 
AttributeSet(agg.groupingExpressions.flatMap(_.references))
-  // Collect the inner query attributes that are guaranteed to have a 
single value for each
-  // outer row. See comment on getCorrelatedEquivalentInnerColumns.
-  val correlatedEquivalentCols = getCorrelatedEquivalentInnerColumns(query)
-  val nonEquivalentGroupByCols = groupByCols -- correlatedEquivalentCols
+  // Collect the inner query expressions that are guaranteed to have a 
single value for each
+  // outer row. See comment on getCorrelatedEquivalentInnerExpressions.
+  val correlatedEquivalentExprs = 
getCorrelatedEquivalentInnerExpressions(query)
+  // Grouping expressions, except outer refs and constant expressions - 
grouping by an
+  // outer ref or a constant is always ok
+  val groupByExprs =
+ExpressionSet(agg.groupingExpressions.filter(x => 
!x.isInstanceOf[OuterReference] &&
+  x.references.nonEmpty))
+  val nonEquivalentGroupByExprs = groupByExprs -- correlatedEquivalentExprs
 
   val invalidCols = if (!SQLConf.get.getConf(
 
SQLConf.LEGACY_SCALAR_SUBQUERY_ALLOW_GROUP_BY_NON_EQUALITY_CORRELATED_PREDICATE))
 {
-nonEquivalentGroupByCols
+nonEquivalentGroupByExprs
   } else {
 // Legacy incorrect logic for checking for invalid group-by columns 
(see SPARK-48503).
 // Allows any inner attribute that appears in a correlated predicate, 
even if it is a
 // non-equality predicate or under an operator that can change the 
values of the attribute
 // (see comments on getCorrelatedEquivalentInnerColumns for examples).
+// Note: groupByCols does not contain outer refs - grouping by an 
outer ref is always ok
+val groupByCols = 
AttributeSet(agg.groupingExpressions.flatMap(_.references))
 val subqueryColumns = 
getCorrelatedPredicates(query).flatMap(_.references)
   .filterNot(conditions.flatMap(_.references).contains)
 val correlatedCols = A

Re: [External Email] [VOTE] Release Spark 3.5.2 (RC2)

2024-07-25 Thread Wenchen Fan
I'm changing my vote to -1 as we found a regression that breaks Delta
Lake's generated column feature. The fix was merged just now:
https://github.com/apache/spark/pull/47483

Can we cut a new RC?

On Thu, Jul 25, 2024 at 3:13 PM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Tue, Jul 23, 2024 at 9:51 PM Kent Yao  wrote:
>
>> +1(non-binding), I have checked:
>>
>> - Download links are OK
>> - Signatures, Checksums, and the KEYS file are OK
>> - LICENSE and NOTICE are present
>> - No unexpected binary files in source releases
>> - Successfully built from source
>>
>> Thanks,
>> Kent Yao
>>
>> On 2024/07/23 06:55:28 yangjie01 wrote:
>> > +1, Thanks Kent Yao ~
>> >
>> > On 2024/7/22 17:01, Kent Yao <y...@apache.org> wrote:
>> 写入:
>> >
>> >
>> > Hi dev,
>> >
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.5.2.
>> >
>> >
>> > The vote is open until Jul 25, 09:00:00 AM UTC, and passes if a
>> majority +1
>> > PMC votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> >
>> > [ ] +1 Release this package as Apache Spark 3.5.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> 
>> >
>> >
>> > The tag to be voted on is v3.5.2-rc2 (commit
>> > 6d8f511430881fa7a3203405260da174df424103):
>> > https://github.com/apache/spark/tree/v3.5.2-rc2
>> >
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/
>> >
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1458/
>> 
>> >
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-docs/
>> >
>> >
>> > The list of bug fixes going into 3.5.2 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353980
>> >
>> >
>> > FAQ
>> >
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC via "pip install
>> >
>> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/pyspark-3.5.2.tar.gz"
>> > and see if anything important breaks.
>> > In the Java/Scala, you can add the staging repository to your projects
>> > resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out of date RC going forward).
>> >
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.5.2?
>> > ===
>> >
>> >
>> > The current list of open tickets targeted at 3.5.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for
>> > "Target Version/s" = 3.5.2
>> >
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > Thanks,
>> > Kent Yao
>> >
>> >

[jira] [Resolved] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48761.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47301
[https://github.com/apache/spark/pull/47301]

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using the DataFrameWriter API.






[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48761:
---

Assignee: Jiaheng Tang

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using the DataFrameWriter API.






(spark) branch master updated: [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bafce5de394b [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter 
API for Scala
bafce5de394b is described below

commit bafce5de394bafe63226ab5dae0ce3ac4245a793
Author: Jiaheng Tang 
AuthorDate: Thu Jul 25 12:58:11 2024 +0800

[SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala

### What changes were proposed in this pull request?

Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API 
for both the DataFrameWriter V1 and V2, as well as Spark Connect.

### Why are the changes needed?

Introduce more ways for users to interact with clustered tables.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying 
the clustering columns when writing DataFrames.
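
For illustration, a minimal sketch of the new writer calls. The table names and data
are hypothetical, whether a given provider accepts clustering depends on the catalog,
and the V2 chaining shown below follows the existing partitionedBy-style pattern.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clusterby-writer").getOrCreate()
import spark.implicits._

val df = Seq((1, "2024-07-24"), (2, "2024-07-25")).toDF("id", "event_date")

// DataFrameWriter (V1): declare clustering columns instead of partitioning.
df.write
  .clusterBy("event_date")
  .saveAsTable("events_v1")

// DataFrameWriterV2: same idea on the create-table path.
df.writeTo("events_v2")
  .using("parquet")
  .clusterBy("event_date")
  .create()
```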

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47301 from zedtang/clusterby-scala-api.

Authored-by: Jiaheng Tang 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json | 14 
 .../sql/connect/planner/SparkConnectPlanner.scala  | 10 +++
 .../org/apache/spark/sql/connect/dsl/package.scala |  4 ++
 .../connect/planner/SparkConnectProtoSuite.scala   | 42 
 .../org/apache/spark/sql/DataFrameWriter.scala | 19 ++
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 23 +++
 .../org/apache/spark/sql/ClientDatasetSuite.scala  |  4 ++
 project/MimaExcludes.scala |  4 +-
 .../spark/sql/catalyst/catalog/interface.scala | 31 +
 .../spark/sql/errors/QueryCompilationErrors.scala  | 30 +
 .../sql/connector/catalog/InMemoryBaseTable.scala  |  4 ++
 .../org/apache/spark/sql/DataFrameWriter.scala | 66 +--
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 40 ++-
 .../execution/datasources/DataSourceUtils.scala|  5 ++
 .../spark/sql/execution/datasources/rules.scala| 19 +-
 .../apache/spark/sql/DataFrameWriterV2Suite.scala  | 50 +-
 .../execution/command/DescribeTableSuiteBase.scala | 51 ++
 .../sql/test/DataFrameReaderWriterSuite.scala  | 77 +-
 18 files changed, 482 insertions(+), 11 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 44f0a59a4b48..65e063518054 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -471,6 +471,20 @@
 ],
 "sqlState" : "0A000"
   },
+  "CLUSTERING_COLUMNS_MISMATCH" : {
+"message" : [
+  "Specified clustering does not match that of the existing table 
.",
+  "Specified clustering columns: [].",
+  "Existing clustering columns: []."
+],
+"sqlState" : "42P10"
+  },
+  "CLUSTERING_NOT_SUPPORTED" : {
+"message" : [
+  "'' does not support clustering."
+],
+"sqlState" : "42000"
+  },
   "CODEC_NOT_AVAILABLE" : {
 "message" : [
   "The codec  is not available."
diff --git 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 013f0d83391c..e790a25ec97f 100644
--- 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -3090,6 +3090,11 @@ class SparkConnectPlanner(
   w.partitionBy(names.toSeq: _*)
 }
 
+if (writeOperation.getClusteringColumnsCount > 0) {
+  val names = writeOperation.getClusteringColumnsList.asScala
+  w.clusterBy(names.head, names.tail.toSeq: _*)
+}
+
 if (writeOperation.hasSource) {
   w.format(writeOperation.getSource)
 }
@@ -3153,6 +3158,11 @@ class SparkConnectPlanner(
   w.partitionedBy(names.head, names.tail: _*)
 }
 
+if (writeOperation.getClusteringColumnsCount > 0) {
+  val names = writeOperation.getClusteringColumnsList.asScala
+  w.clusterBy(names.head, names.tail.toSeq: _*)
+}
+
 writeOperation.getMode match {
   case proto.WriteOperationV2.Mode.MODE_CREATE =>
 if (writeOperation.hasProvider) {
diff --git 
a/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala 
b/connect/server/src

[jira] [Assigned] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48990:
---

Assignee: BingKun Pan

> Unified variable related SQL syntax keywords
> 
>
> Key: SPARK-48990
> URL: https://issues.apache.org/jira/browse/SPARK-48990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48990.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47469
[https://github.com/apache/spark/pull/47469]

> Unified variable related SQL syntax keywords
> 
>
> Key: SPARK-48990
> URL: https://issues.apache.org/jira/browse/SPARK-48990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-48990][SQL] Unified variable related SQL syntax keywords

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 34e65a8e7251 [SPARK-48990][SQL] Unified variable related SQL syntax 
keywords
34e65a8e7251 is described below

commit 34e65a8e72513d0445b5ff9b251e3388625ad1ec
Author: panbingkun 
AuthorDate: Wed Jul 24 22:52:04 2024 +0800

[SPARK-48990][SQL] Unified variable related SQL syntax keywords

### What changes were proposed in this pull request?
The PR aims to unify the variable-related SQL syntax keywords, enabling the 
syntax `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` to support the 
keyword `VAR` (not only `VARIABLE`).

### Why are the changes needed?
When setting variables, we support `(VARIABLE | VAR)`, but when declaring 
and dropping variables, we only support the keyword `VARIABLE` (not the 
keyword `VAR`).

https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa


https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72


https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220

The syntax seems a bit weird and gives end-users an inconsistent experience 
with variable-related SQL, so I propose to unify it.
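
Hedged illustration of the unified keywords after this change; the session setup is
hypothetical, and the statements follow the updated syntax documented above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-variables").getOrCreate()

// DECLARE and DROP now accept VAR as well as VARIABLE, matching SET.
spark.sql("DECLARE OR REPLACE VAR v INT DEFAULT 10")
spark.sql("SET VAR v = v + 1")
spark.sql("SELECT v").show()                 // 11
spark.sql("DROP TEMPORARY VAR IF EXISTS v")
```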

### Does this PR introduce _any_ user-facing change?
Yes, it enables end-users to write variable-related SQL with consistent 
keywords.

### How was this patch tested?
Updated existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47469 from panbingkun/SPARK-48990.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-syntax-ddl-declare-variable.md|   2 +-
 docs/sql-ref-syntax-ddl-drop-variable.md   |   2 +-
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  17 +-
 .../analyzer-results/sql-session-variables.sql.out |   4 +-
 .../sql-tests/inputs/sql-session-variables.sql |   4 +-
 .../results/sql-session-variables.sql.out  |   4 +-
 .../command/DeclareVariableParserSuite.scala   | 188 +
 .../command/DropVariableParserSuite.scala  |  56 ++
 8 files changed, 263 insertions(+), 14 deletions(-)

diff --git a/docs/sql-ref-syntax-ddl-declare-variable.md 
b/docs/sql-ref-syntax-ddl-declare-variable.md
index 518770c4496e..ba9857bf1917 100644
--- a/docs/sql-ref-syntax-ddl-declare-variable.md
+++ b/docs/sql-ref-syntax-ddl-declare-variable.md
@@ -34,7 +34,7 @@ column default expressions, and generated column expressions.
 ### Syntax
 
 ```sql
-DECLARE [ OR REPLACE ] [ VARIABLE ]
+DECLARE [ OR REPLACE ] [ VAR | VARIABLE ]
 variable_name [ data_type ] [ { DEFAULT | = } default_expr ]
 ```
 
diff --git a/docs/sql-ref-syntax-ddl-drop-variable.md 
b/docs/sql-ref-syntax-ddl-drop-variable.md
index f58d944317d5..5b2b0da76945 100644
--- a/docs/sql-ref-syntax-ddl-drop-variable.md
+++ b/docs/sql-ref-syntax-ddl-drop-variable.md
@@ -27,7 +27,7 @@ be thrown if the variable does not exist.
 ### Syntax
 
 ```sql
-DROP TEMPORARY VARIABLE [ IF EXISTS ] variable_name
+DROP TEMPORARY { VAR | VARIABLE } [ IF EXISTS ] variable_name
 ```
 
 ### Parameters
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index a50051715e20..c7aa56cf920a 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -66,8 +66,8 @@ compoundStatement
 ;
 
 setStatementWithOptionalVarKeyword
-: SET (VARIABLE | VAR)? assignmentList  
#setVariableWithOptionalKeyword
-| SET (VARIABLE | VAR)? LEFT_PAREN multipartIdentifierList RIGHT_PAREN EQ
+: SET variable? assignmentList  
#setVariableWithOptionalKeyword
+| SET variable? LEFT_PAREN multipartIdentifierList RIGHT_PAREN EQ
 LEFT_PAREN query RIGHT_PAREN
#setVariableWithOptionalKeyword
 ;
 
@@ -215,9 +215,9 @@ statement
 routineCharacteristics
 RETURN (query | expression)
#createUserDefinedFunction
 | DROP TEMPORARY? FUNCTION (IF EXISTS)? identifierReference
#dropFunction
-| DECLARE (OR REPLACE)? VARIABLE?
+| DECLARE (OR REPLACE)? variable?
 identifierReference dataType? variableDefaultExpression?   
#createVariable
-| DROP TEMPORARY VARIABLE (IF EXISTS)? identifierReference 
#dropVariable
+| DROP TEMPORARY vari

[jira] [Resolved] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48833.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47252
[https://github.com/apache/spark/pull/47252]

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for it.






(spark) branch master updated: [SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan`

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c9b072e1808 [SPARK-48833][SQL][VARIANT] Support variant in 
`InMemoryTableScan`
0c9b072e1808 is described below

commit 0c9b072e1808e180d8670e4100f3344f039cb072
Author: Richard Chen 
AuthorDate: Wed Jul 24 18:02:38 2024 +0800

[SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan`

### What changes were proposed in this pull request?

Adds support for the variant type in `InMemoryTableScan` (i.e., `df.cache()`) by 
supporting writing variant values to an in-memory buffer.

### Why are the changes needed?

Prior to this PR, calling `df.cache()` on a DataFrame that has a variant column 
would fail because `InMemoryTableScan` did not support reading variant types.
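
For illustration, a minimal sketch of the now-supported pattern; the data and
session setup are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("variant-cache").getOrCreate()

val df = spark.sql("""SELECT parse_json('{"a": 1, "b": "hello"}') AS v""")

// Before this change, materializing the cache failed because the in-memory
// columnar format had no VARIANT column type; now the value round-trips.
df.cache()
df.count()                                            // forces materialization
df.selectExpr("variant_get(v, '$.a', 'int') AS a").show()
```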

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

added UTs
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47252 from richardc-db/variant_dfcache_support.

Authored-by: Richard Chen 
Signed-off-by: Wenchen Fan 
---
 .../sql/execution/columnar/ColumnAccessor.scala|  6 +-
 .../sql/execution/columnar/ColumnBuilder.scala |  4 ++
 .../spark/sql/execution/columnar/ColumnStats.scala | 15 +
 .../spark/sql/execution/columnar/ColumnType.scala  | 43 -
 .../columnar/GenerateColumnAccessor.scala  |  3 +-
 .../scala/org/apache/spark/sql/VariantSuite.scala  | 72 ++
 6 files changed, 139 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
index 9652a48e5270..2074649cc986 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.errors.QueryExecutionErrors
 import 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnAccessor
 import org.apache.spark.sql.execution.vectorized.WritableColumnVector
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.CalendarInterval
+import org.apache.spark.unsafe.types.{CalendarInterval, VariantVal}
 
 /**
  * An `Iterator` like trait used to extract values from columnar byte buffer. 
When a value is
@@ -111,6 +111,10 @@ private[columnar] class IntervalColumnAccessor(buffer: 
ByteBuffer)
   extends BasicColumnAccessor[CalendarInterval](buffer, CALENDAR_INTERVAL)
   with NullableColumnAccessor
 
+private[columnar] class VariantColumnAccessor(buffer: ByteBuffer)
+  extends BasicColumnAccessor[VariantVal](buffer, VARIANT)
+  with NullableColumnAccessor
+
 private[columnar] class CompactDecimalColumnAccessor(buffer: ByteBuffer, 
dataType: DecimalType)
   extends NativeColumnAccessor(buffer, COMPACT_DECIMAL(dataType))
 
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
index 9fafdb794841..b65ef12f12d5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
@@ -131,6 +131,9 @@ class BinaryColumnBuilder extends ComplexColumnBuilder(new 
BinaryColumnStats, BI
 private[columnar]
 class IntervalColumnBuilder extends ComplexColumnBuilder(new 
IntervalColumnStats, CALENDAR_INTERVAL)
 
+private[columnar]
+class VariantColumnBuilder extends ComplexColumnBuilder(new 
VariantColumnStats, VARIANT)
+
 private[columnar] class CompactDecimalColumnBuilder(dataType: DecimalType)
   extends NativeColumnBuilder(new DecimalColumnStats(dataType), 
COMPACT_DECIMAL(dataType))
 
@@ -189,6 +192,7 @@ private[columnar] object ColumnBuilder {
   case s: StringType => new StringColumnBuilder(s)
   case BinaryType => new BinaryColumnBuilder
   case CalendarIntervalType => new IntervalColumnBuilder
+  case VariantType => new VariantColumnBuilder
   case dt: DecimalType if dt.precision <= Decimal.MAX_LONG_DIGITS =>
 new CompactDecimalColumnBuilder(dt)
   case dt: DecimalType => new DecimalColumnBuilder(dt)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
index 45f489cb13c2..4e4b3667fa24 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
@@ -297,6 +297,2

[jira] [Assigned] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48833:
---

Assignee: Richard Chen

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for it.






[jira] [Resolved] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48338.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47404
[https://github.com/apache/spark/pull/47404]

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.





