[spark] branch master updated: [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode

2022-07-18 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5bd47640dc4 [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of 
CAST(DECIMAL AS STRING) under ANSI SQL mode
5bd47640dc4 is described below

commit 5bd47640dc455ae27c94825743a127bbb20e59bf
Author: Gengliang Wang 
AuthorDate: Tue Jul 19 12:14:43 2022 +0900

[SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS 
STRING) under ANSI SQL mode

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/37160, which 
changed the default behavior of `CAST(DECIMAL AS STRING)` to always use the 
plain string representation, for ANSI SQL compliance.
This PR moves the new behavior under ANSI SQL mode; the default behavior 
remains unchanged.
### Why are the changes needed?

After offline discussions, changing the default behavior brings upgrade risks 
to Spark 3.4.0: there are existing tables that store the results of casting 
decimals to strings, and running aggregations or joins over those tables could 
produce incorrect results.
Since the motivation for the new behavior is ANSI SQL compliance, it makes 
more sense for the new behavior to exist only under ANSI SQL mode.
Users who enable ANSI SQL mode are expected to accept such a breaking change.
### Does this PR introduce _any_ user-facing change?

Yes. Although the previous change has not been released yet, this PR makes 
casting decimal values to string more reasonable by always using the plain 
string representation under ANSI SQL mode.
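
For illustration, a minimal PySpark sketch of the two behaviors (not part of 
this patch; the decimal literal and the commented outputs are only indicative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default (non-ANSI) behavior: the legacy java.math.BigDecimal.toString
# representation, which may use scientific notation, e.g. 1E-9.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(0.000000001BD AS STRING)").show()

# Under ANSI SQL mode, the plain string representation is used instead,
# e.g. 0.000000001.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST(0.000000001BD AS STRING)").show()
```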
### How was this patch tested?

UT

Closes #37221 from gengliangwang/ansiDecimalToString.

Authored-by: Gengliang Wang 
Signed-off-by: Hyukjin Kwon 
---
 docs/sql-migration-guide.md   |  1 -
 docs/sql-ref-ansi-compliance.md   |  5 +++--
 .../org/apache/spark/sql/catalyst/expressions/Cast.scala  | 11 ---
 .../main/scala/org/apache/spark/sql/internal/SQLConf.scala| 11 ---
 .../apache/spark/sql/catalyst/expressions/CastSuiteBase.scala |  8 
 .../spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala |  5 +
 .../spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala  |  5 +
 7 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 0d83ee3aeb3..75f7f6c9f8c 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -26,7 +26,6 @@ license: |
   
   - Since Spark 3.4, Number or Number(\*) from Teradata will be treated as 
Decimal(38,18). In Spark 3.3 or earlier, Number or Number(\*) from Teradata 
will be treated as Decimal(38, 0), in which case the fractional part will be 
removed.
   - Since Spark 3.4, v1 database, table, permanent view and function 
identifier will include 'spark_catalog' as the catalog name if database is 
defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore 
the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`.
-  - Since Spark 3.4, the results of casting Decimal values as String type will 
not contain exponential notations. To restore the legacy behavior, which uses 
scientific notation if the adjusted exponent is less than -6, set 
`spark.sql.legacy.castDecimalToString.enabled` to `true`.
 
 ## Upgrading from Spark SQL 3.2 to 3.3
 
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 6ad8210ed7e..65ed5caf833 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -81,7 +81,7 @@ Besides, the ANSI SQL mode disallows the following type conversions which are al
 
 | Source\Target | Numeric | String | Date | Timestamp | Interval | Boolean | Binary | Array | Map | Struct |
 |---------------|---------|--------|------|-----------|----------|---------|--------|-------|-----|--------|
-| Numeric   | **Y** | Y     | N | **Y** | N | Y | N | N | N | N |
+| Numeric   | **Y** | **Y** | N | **Y** | N | Y | N | N | N | N |
 | String    | **Y** | Y | **Y** | **Y** | **Y** | **Y** | Y | N | N | N |
 | Date      | N     | Y | Y | Y | N | N | N | N | N | N |
 | Timestamp | **Y** | Y | Y | Y | N | N | N | N | N | N |
@@ -92,7 +92,7 @@ Besides, the ANSI SQL mode disallows the following type conversions which are al
 | Map       | N     | Y | N | N | N | N | N | N | **Y** | N |
 | Struct    | N     | Y | N | N | N | N | N | N | N | **Y** |
 
-In the table above, all the `CAST`s that can 

[spark-website] branch asf-site updated: [SPARK-39512] Document docker image release steps (#400)

2022-07-18 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new a89536e5f [SPARK-39512] Document docker image release steps (#400)
a89536e5f is described below

commit a89536e5fc4498945149de3e0fb3ec8dc456b908
Author: Holden Karau 
AuthorDate: Mon Jul 18 18:27:13 2022 -0700

[SPARK-39512] Document docker image release steps (#400)

Document the docker image release steps for the release manager to follow 
when finalizing the release.
---
 release-process.md| 18 --
 site/release-process.html | 15 +--
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/release-process.md b/release-process.md
index 7c1e9ff2d..e8bba053f 100644
--- a/release-process.md
+++ b/release-process.md
@@ -35,6 +35,9 @@ If you are a new Release Manager, you can read up on the 
process from the follow
 - gpg for signing https://www.apache.org/dev/openpgp.html
 - svn https://www.apache.org/dev/version-control.html#https-svn
 
+
+You should also get access to the ASF Dockerhub. Access can be requested by 
filing an INFRA JIRA ticket (see https://issues.apache.org/jira/browse/INFRA-21282 
for an example).
+
 Preparing gpg key
 
 You can skip this section if you have already uploaded your key.
@@ -175,10 +178,11 @@ To cut a release candidate, there are 4 steps:
 1. Package the release binaries & sources, and upload them to the Apache 
staging SVN repo.
 1. Create the release docs, and upload them to the Apache staging SVN repo.
 1. Publish a snapshot to the Apache staging Maven repo.
+1. Create an RC docker image tag (e.g. `3.4.0-rc1`)
 
-The process of cutting a release candidate has been automated via the 
`dev/create-release/do-release-docker.sh` script.
+The process of cutting a release candidate has been mostly automated via the 
`dev/create-release/do-release-docker.sh` script.
 Run this script, type information it requires, and wait until it finishes. You 
can also do a single step via the `-s` option.
-Please run `do-release-docker.sh -h` and see more details.
+Please run `do-release-docker.sh -h` and see more details. It does not 
currently generate the RC docker image tag.
 
 Call a vote on the release candidate
 
@@ -387,6 +391,16 @@ $ git shortlog v1.1.1 --grep "$EXPR" > contrib.txt
 $ git log v1.1.1 --grep "$expr" --shortstat --oneline | grep -B 1 -e 
"[3-9][0-9][0-9] insert" -e "[1-9][1-9][1-9][1-9] insert" | grep SPARK > 
large-patches.txt
 ```
 
+Create and upload Spark Docker Images
+
+The Spark docker images are created using the `./bin/docker-image-tool.sh` 
that is included in the release artifacts.
+
+
+You should install `docker buildx` so that you can cross-compile for multiple 
archs, as ARM is becoming increasingly popular. If you have access to both an ARM 
and an x86 machine, you should set up a [remote builder as described 
here](https://scalingpythonml.com/2020/12/11/some-sharp-corners-with-docker-buildx.html);
 if you only have one machine, [docker buildx with QEMU works fine as we don't use 
cgo](https://docs.docker.com/buildx/working-with-buildx/).
+
+
+Once you have your cross-platform docker build environment set up, extract the 
build artifact (e.g. `tar -xvf spark-3.3.0-bin-hadoop3.tgz`), go into the 
directory (e.g. `cd spark-3.3.0-bin-hadoop3`), then build the containers and 
publish them to the Spark dockerhub (e.g. `./bin/docker-image-tool.sh -r 
docker.io/apache -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile 
-t v3.3.0 -X -b java_image_tag=11-jre-slim build`).
+
 Create an announcement
 
 Once everything is working (website docs, website changes) create an 
announcement on the website
diff --git a/site/release-process.html b/site/release-process.html
index 236dd4631..dbecf8ff3 100644
--- a/site/release-process.html
+++ b/site/release-process.html
@@ -163,6 +163,8 @@
   svn https://www.apache.org/dev/version-control.html#https-svn
 
 
+You should also get access to the ASF Dockerhub. Access can be requested by 
filing an INFRA JIRA ticket (see 
https://issues.apache.org/jira/browse/INFRA-21282 for an example).
+
 Preparing gpg key
 
 You can skip this section if you have already uploaded your key.
@@ -295,11 +297,12 @@ Note that not all permutations are run on PR therefore it 
is important to check
   Package the release binaries  sources, and upload them to the 
Apache staging SVN repo.
   Create the release docs, and upload them to the Apache staging SVN 
repo.
   Publish a snapshot to the Apache staging Maven repo.
+  Create an RC docker image tag (e.g. 3.4.0-rc1)
 
 
-The process of cutting a release candidate has been automated via the dev/create-release/do-release-docker.sh script.
+The process of cutting a release candidate has been mostly automated via 
the dev/create-release/do-release-docker.sh script.
 Run this script, type information it 

[GitHub] [spark-website] gengliangwang merged pull request #400: [SPARK-39512] Document docker image release steps

2022-07-18 Thread GitBox


gengliangwang merged PR #400:
URL: https://github.com/apache/spark-website/pull/400





[GitHub] [spark-website] gengliangwang commented on pull request #400: [SPARK-39512] Document docker image release steps

2022-07-18 Thread GitBox


gengliangwang commented on PR #400:
URL: https://github.com/apache/spark-website/pull/400#issuecomment-1188495099

   Merging to asf-site





[spark] branch master updated: [SPARK-39809][PYTHON] Support CharType in PySpark

2022-07-18 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 00d7094dc30 [SPARK-39809][PYTHON] Support CharType in PySpark
00d7094dc30 is described below

commit 00d7094dc3024ae594605b311dcc55e95d277d5f
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 19 10:22:04 2022 +0900

[SPARK-39809][PYTHON] Support CharType in PySpark

### What changes were proposed in this pull request?
Support CharType in PySpark

### Why are the changes needed?
for function parity

### Does this PR introduce _any_ user-facing change?
Yes, a new data type (`CharType`) is added.
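
For illustration, a minimal sketch of the new type (the field name is 
hypothetical, not from this patch):

```python
from pyspark.sql.types import CharType, StructField, StructType

# CharType carries a fixed length and is compared by that declared length.
assert CharType(10) == CharType(10)
assert CharType(10) != CharType(20)

# It can be used like any other atomic type when declaring a schema.
schema = StructType([StructField("country_code", CharType(3), True)])
print(schema.simpleString())  # struct<country_code:char(3)>
```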

### How was this patch tested?
added UT

Closes #37215 from zhengruifeng/py_add_char.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/test_types.py | 26 ++---
 python/pyspark/sql/types.py| 42 +++---
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index 218cfc413db..b1609417a0c 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -38,6 +38,7 @@ from pyspark.sql.types import (
 DayTimeIntervalType,
 MapType,
 StringType,
+CharType,
 VarcharType,
 StructType,
 StructField,
@@ -740,9 +741,12 @@ class TypesTests(ReusedSQLTestCase):
 from pyspark.sql.types import _all_atomic_types, _parse_datatype_string
 
 for k, t in _all_atomic_types.items():
-if k != "varchar":
+if k != "varchar" and k != "char":
 self.assertEqual(t(), _parse_datatype_string(k))
 self.assertEqual(IntegerType(), _parse_datatype_string("int"))
+self.assertEqual(CharType(1), _parse_datatype_string("char(1)"))
+self.assertEqual(CharType(10), _parse_datatype_string("char( 10   )"))
+self.assertEqual(CharType(11), _parse_datatype_string("char( 11)"))
 self.assertEqual(VarcharType(1), _parse_datatype_string("varchar(1)"))
 self.assertEqual(VarcharType(10), _parse_datatype_string("varchar( 10  
 )"))
 self.assertEqual(VarcharType(11), _parse_datatype_string("varchar( 
11)"))
@@ -1033,6 +1037,7 @@ class TypesTests(ReusedSQLTestCase):
 instances = [
 NullType(),
 StringType(),
+CharType(10),
 VarcharType(10),
 BinaryType(),
 BooleanType(),
@@ -1138,6 +1143,15 @@ class DataTypeTests(unittest.TestCase):
 t3 = DecimalType(8)
 self.assertNotEqual(t2, t3)
 
+def test_char_type(self):
+v1 = CharType(10)
+v2 = CharType(20)
+self.assertTrue(v2 is not v1)
+self.assertNotEqual(v1, v2)
+v3 = CharType(10)
+self.assertEqual(v1, v3)
+self.assertFalse(v1 is v3)
+
 def test_varchar_type(self):
 v1 = VarcharType(10)
 v2 = VarcharType(20)
@@ -1221,14 +1235,18 @@ class DataTypeVerificationTests(unittest.TestCase):
 success_spec = [
 # String
 ("", StringType()),
-("", StringType()),
 (1, StringType()),
 (1.0, StringType()),
 ([], StringType()),
 ({}, StringType()),
+# Char
+("", CharType(10)),
+(1, CharType(10)),
+(1.0, CharType(10)),
+([], CharType(10)),
+({}, CharType(10)),
 # Varchar
 ("", VarcharType(10)),
-("", VarcharType(10)),
 (1, VarcharType(10)),
 (1.0, VarcharType(10)),
 ([], VarcharType(10)),
@@ -1289,6 +1307,8 @@ class DataTypeVerificationTests(unittest.TestCase):
 failure_spec = [
 # String (match anything but None)
 (None, StringType(), ValueError),
+# CharType (match anything but None)
+(None, CharType(10), ValueError),
 # VarcharType (match anything but None)
 (None, VarcharType(10), ValueError),
 # UDT
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 7ab8f7c9c2d..e034ff75e10 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -56,6 +56,7 @@ U = TypeVar("U")
 __all__ = [
 "DataType",
 "NullType",
+"CharType",
 "StringType",
 "VarcharType",
 "BinaryType",
@@ -182,6 +183,28 @@ class StringType(AtomicType, metaclass=DataTypeSingleton):
 pass
 
 
+class CharType(AtomicType):
+"""Char data type
+
+Parameters
+--
+length : int
+the length limitation.
+"""
+
+def __init__(self, length: int):
+self.length = length
+
+def simpleString(self) -> str:
+return "char(%d)" % 

[spark] branch branch-3.3 updated: [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query

2022-07-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new aeafb175875 [SPARK-39806][SQL] Accessing `_metadata` on partitioned 
table can crash a query
aeafb175875 is described below

commit aeafb175875c00519e03e0ea5b5f22f765dc3607
Author: Ala Luszczak 
AuthorDate: Tue Jul 19 09:04:03 2022 +0800

[SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a 
query

This change alters the projection used in `FileScanRDD` to attach file 
metadata to a row produced by the reader. This projection used to remove the 
partitioning columns from the produced row, so the produced row had a 
different schema than the consumers expected and was missing part of the 
data, which resulted in query failures.

This is a bug: `FileScanRDD` should produce rows that match the expected 
schema and contain all the requested data. Queries should not crash due to 
internal errors.

No.

Adds a new test in `FileMetadataStructSuite.scala` that reproduces the 
issue.

Closes #37214 from ala/metadata-partition-by.

Authored-by: Ala Luszczak 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 385f1c8e4037928afafbf6664e30dc268510c05e)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/DataSourceScanExec.scala   |  4 ++--
 .../sql/execution/datasources/FileScanRDD.scala|  4 ++--
 .../datasources/FileMetadataStructSuite.scala  | 26 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index 9e8ae9a714d..40d29af28f9 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -621,7 +621,7 @@ case class FileSourceScanExec(
 }
 
 new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions,
-  requiredSchema, metadataColumns)
+  new StructType(requiredSchema.fields ++ 
fsRelation.partitionSchema.fields), metadataColumns)
   }
 
   /**
@@ -678,7 +678,7 @@ case class FileSourceScanExec(
   FilePartition.getFilePartitions(relation.sparkSession, splitFiles, 
maxSplitBytes)
 
 new FileScanRDD(fsRelation.sparkSession, readFile, partitions,
-  requiredSchema, metadataColumns)
+  new StructType(requiredSchema.fields ++ 
fsRelation.partitionSchema.fields), metadataColumns)
   }
 
   // Filters unused DynamicPruningExpression expressions - one which has been 
replaced
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
index 20c393a5c0e..b65b36ef393 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -68,7 +68,7 @@ class FileScanRDD(
 @transient private val sparkSession: SparkSession,
 readFunction: (PartitionedFile) => Iterator[InternalRow],
 @transient val filePartitions: Seq[FilePartition],
-val readDataSchema: StructType,
+val readSchema: StructType,
 val metadataColumns: Seq[AttributeReference] = Seq.empty)
   extends RDD[InternalRow](sparkSession.sparkContext, Nil) {
 
@@ -126,7 +126,7 @@ class FileScanRDD(
   // an unsafe projection to convert a joined internal row to an unsafe row
   private lazy val projection = {
 val joinedExpressions =
-  readDataSchema.fields.map(_.dataType) ++ 
metadataColumns.map(_.dataType)
+  readSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType)
 UnsafeProjection.create(joinedExpressions)
   }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
index 410fc985dd3..6afea42ee83 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
@@ -21,6 +21,7 @@ import java.io.File
 import java.sql.Timestamp
 import java.text.SimpleDateFormat
 
+import org.apache.spark.TestUtils
 import org.apache.spark.sql.{AnalysisException, Column, DataFrame, QueryTest, 
Row}
 import org.apache.spark.sql.execution.FileSourceScanExec
 import org.apache.spark.sql.functions._
@@ -30,6 +31,8 @@ import org.apache.spark.sql.types.{IntegerType, LongType, 
StringType, StructFiel
 
 class FileMetadataStructSuite extends QueryTest with 

[spark] branch master updated: [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query

2022-07-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 385f1c8e403 [SPARK-39806][SQL] Accessing `_metadata` on partitioned 
table can crash a query
385f1c8e403 is described below

commit 385f1c8e4037928afafbf6664e30dc268510c05e
Author: Ala Luszczak 
AuthorDate: Tue Jul 19 09:04:03 2022 +0800

[SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a 
query

### What changes were proposed in this pull request?

This change alters the projection used in `FileScanRDD` to attach file 
metadata to a row produced by the reader. This projection used to remove the 
partitioning columns from the produced row, so the produced row had a 
different schema than the consumers expected and was missing part of the 
data, which resulted in query failures.

### Why are the changes needed?

This is a bug: `FileScanRDD` should produce rows that match the expected 
schema and contain all the requested data. Queries should not crash due to 
internal errors.
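
For context, a minimal PySpark sketch (path and column names are hypothetical) 
of the shape of query affected, i.e. selecting the hidden `_metadata` column 
from a partitioned file source:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
path = "/tmp/spark-39806-demo"  # hypothetical location

# Write a small partitioned table, then read it back while projecting a data
# column, a partition column, and a _metadata subfield in the same query.
spark.range(10).withColumn("part", col("id") % 2) \
    .write.mode("overwrite").partitionBy("part").parquet(path)
spark.read.parquet(path).select("id", "part", "_metadata.file_name").show()
```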

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adds a new test in `FileMetadataStructSuite.scala` that reproduces the 
issue.

Closes #37214 from ala/metadata-partition-by.

Authored-by: Ala Luszczak 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/DataSourceScanExec.scala   |  7 --
 .../sql/execution/datasources/FileScanRDD.scala|  4 ++--
 .../datasources/FileMetadataStructSuite.scala  | 26 ++
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index 9e316cc88cf..5950136e79a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -650,7 +650,9 @@ case class FileSourceScanExec(
 }
 
 new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions,
-  requiredSchema, metadataColumns, new 
FileSourceOptions(CaseInsensitiveMap(relation.options)))
+  new StructType(requiredSchema.fields ++ 
fsRelation.partitionSchema.fields), metadataColumns,
+  new FileSourceOptions(CaseInsensitiveMap(relation.options)))
+
   }
 
   /**
@@ -707,7 +709,8 @@ case class FileSourceScanExec(
   FilePartition.getFilePartitions(relation.sparkSession, splitFiles, 
maxSplitBytes)
 
 new FileScanRDD(fsRelation.sparkSession, readFile, partitions,
-  requiredSchema, metadataColumns, new 
FileSourceOptions(CaseInsensitiveMap(relation.options)))
+  new StructType(requiredSchema.fields ++ 
fsRelation.partitionSchema.fields), metadataColumns,
+  new FileSourceOptions(CaseInsensitiveMap(relation.options)))
   }
 
   // Filters unused DynamicPruningExpression expressions - one which has been 
replaced
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
index 97776413509..4c3f5629e78 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -69,7 +69,7 @@ class FileScanRDD(
 @transient private val sparkSession: SparkSession,
 readFunction: (PartitionedFile) => Iterator[InternalRow],
 @transient val filePartitions: Seq[FilePartition],
-val readDataSchema: StructType,
+val readSchema: StructType,
 val metadataColumns: Seq[AttributeReference] = Seq.empty,
 options: FileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(Map.empty)))
   extends RDD[InternalRow](sparkSession.sparkContext, Nil) {
@@ -128,7 +128,7 @@ class FileScanRDD(
   // an unsafe projection to convert a joined internal row to an unsafe row
   private lazy val projection = {
 val joinedExpressions =
-  readDataSchema.fields.map(_.dataType) ++ 
metadataColumns.map(_.dataType)
+  readSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType)
 UnsafeProjection.create(joinedExpressions)
   }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
index 410fc985dd3..6afea42ee83 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala
@@ -21,6 +21,7 @@ import java.io.File
 

[GitHub] [spark-website] xinrong-meng opened a new pull request, #407: Update Spark 3.4 release window

2022-07-18 Thread GitBox


xinrong-meng opened a new pull request, #407:
URL: https://github.com/apache/spark-website/pull/407

   This PR proposes to update the Spark 3.4 release window. 





[spark] branch master updated: [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior

2022-07-18 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dcccbf4f9dd [SPARK-39807][PYTHON][PS] Respect Series.concat sort 
parameter to follow 1.4.3 behavior
dcccbf4f9dd is described below

commit dcccbf4f9ddd22dc59e6199a940625f677b23a81
Author: Yikun Jiang 
AuthorDate: Tue Jul 19 09:34:32 2022 +0900

[SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 
1.4.3 behavior

### What changes were proposed in this pull request?

Respect Series.concat sort parameter when `num_series == 1` to follow 1.4.3 
behavior.

### Why are the changes needed?
In https://github.com/apache/spark/pull/36711, we followed the pandas 1.4.2 
behavior of respecting the Series.concat sort parameter, except for the 
`num_series == 1` case.

[pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixes 
https://github.com/pandas-dev/pandas/issues/47127, which also covers the 
`num_series == 1` bug, so this PR follows the pandas 1.4.3 behavior.

### Does this PR introduce _any_ user-facing change?
Yes, we already cover this case in:

https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
```
In Spark 3.4, the Series.concat sort parameter will be respected to follow 
pandas 1.4 behaviors.
```
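
For illustration, a minimal pandas-on-Spark sketch of the single-Series case 
this follow-up covers (the data values are only illustrative):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"C": [1, 2], "A": [3, 4]})

# Concatenating a DataFrame with a single Series: with this change, sort=True
# sorts the resulting columns (A before C), matching pandas 1.4.3.
print(ps.concat([psdf, psdf["C"]], sort=True))
```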

### How was this patch tested?
- CI passed
- test_concat_index_axis passed with pandas 1.3.5, 1.4.2, and 1.4.3.

Closes #37217 from Yikun/SPARK-39807.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/namespace.py|  5 ++---
 python/pyspark/pandas/tests/test_namespace.py | 20 +++-
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/namespace.py 
b/python/pyspark/pandas/namespace.py
index 7691bf465e7..0f0dc606c52 100644
--- a/python/pyspark/pandas/namespace.py
+++ b/python/pyspark/pandas/namespace.py
@@ -2621,9 +2621,8 @@ def concat(
 
 assert len(merged_columns) > 0
 
-# If sort is True, always sort when there are more than two Series,
-# and if there is only one Series, never sort to follow pandas 
1.4+ behavior.
-if sort and num_series != 1:
+# If sort is True, always sort
+if sort:
 # FIXME: better ordering
 merged_columns = sorted(merged_columns, key=name_like_string)
 
diff --git a/python/pyspark/pandas/tests/test_namespace.py 
b/python/pyspark/pandas/tests/test_namespace.py
index 4db756c6e66..ac033f7828b 100644
--- a/python/pyspark/pandas/tests/test_namespace.py
+++ b/python/pyspark/pandas/tests/test_namespace.py
@@ -334,19 +334,21 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils):
 ([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
 ([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
 ([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
-# only one Series
-([psdf, psdf["C"]], [pdf, pdf["C"]]),
-([psdf["C"], psdf], [pdf["C"], pdf]),
 # more than two Series
 ([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
 ]
 
-if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
-# more than two Series
-psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], 
pdf["A"]])
-for ignore_index, join, sort in itertools.product(ignore_indexes, 
joins, sorts):
-# See also https://github.com/pandas-dev/pandas/issues/47127
-if (join, sort) != ("outer", True):
+# See also https://github.com/pandas-dev/pandas/issues/47127
+if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
+series_objs = [
+# more than two Series
+([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
+# only one Series
+([psdf, psdf["C"]], [pdf, pdf["C"]]),
+([psdf["C"], psdf], [pdf["C"], pdf]),
+]
+for psdfs, pdfs in series_objs:
+for ignore_index, join, sort in 
itertools.product(ignore_indexes, joins, sorts):
 self.assert_eq(
 ps.concat(psdfs, ignore_index=ignore_index, join=join, 
sort=sort),
 pd.concat(pdfs, ignore_index=ignore_index, join=join, 
sort=sort),





[spark] branch master updated (29ed337d52c -> 88380b6e1e8)

2022-07-18 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 29ed337d52c [SPARK-35579][SQL] Bump janino to 3.1.7
 add 88380b6e1e8 [SPARK-39803][SQL] Use `LevenshteinDistance` instead of 
`StringUtils.getLevenshteinDistance`

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)





[spark] branch branch-3.2 updated (ea5af851e78 -> b36b21464ce)

2022-07-18 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


from ea5af851e78 [SPARK-39758][SQL][3.2] Fix NPE from the regexp functions 
on invalid patterns
 add b36b21464ce [SPARK-39647][CORE][3.2] Register the executor with ESS 
before registering the BlockManager

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/storage/BlockManager.scala| 30 --
 .../apache/spark/storage/BlockManagerSuite.scala   | 36 ++
 2 files changed, 56 insertions(+), 10 deletions(-)





[spark] branch master updated: [SPARK-35579][SQL] Bump janino to 3.1.7

2022-07-18 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 29ed337d52c [SPARK-35579][SQL] Bump janino to 3.1.7
29ed337d52c is described below

commit 29ed337d52c5e821f628ff38d4ae01ebba95751a
Author: Prashant Singh 
AuthorDate: Mon Jul 18 08:11:27 2022 -0500

[SPARK-35579][SQL] Bump janino to 3.1.7

### What changes were proposed in this pull request?

Upgrade janino from 3.0.16 to 3.1.7.

### Why are the changes needed?

- The proposed version contains a janino bug fix by maropu.
   - https://github.com/janino-compiler/janino/pull/148
- It also provides a `getBytecodes` method, which simplifies how bytecodes are 
obtained from `ClassBodyEvaluator` in the 
`CodeGenerator#updateAndGetCompilationStats` method (by LuciferYang).
   - https://github.com/apache/spark/pull/32536

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs

Closes #37202 from singhpk234/upgrade/bump-janino.

Authored-by: Prashant Singh 
Signed-off-by: Sean Owen 
---
 dev/deps/spark-deps-hadoop-2-hive-2.3  |  4 ++--
 dev/deps/spark-deps-hadoop-3-hive-2.3  |  4 ++--
 pom.xml|  2 +-
 .../sql/catalyst/expressions/codegen/CodeGenerator.scala   | 14 +++---
 .../org/apache/spark/sql/errors/QueryExecutionErrors.scala |  3 +--
 .../sql/catalyst/expressions/CodeGenerationSuite.scala |  1 -
 6 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 
b/dev/deps/spark-deps-hadoop-2-hive-2.3
index ccc0d607010..af1db05ff4b 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -39,7 +39,7 @@ commons-cli/1.5.0//commons-cli-1.5.0.jar
 commons-codec/1.15//commons-codec-1.15.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
-commons-compiler/3.0.16//commons-compiler-3.0.16.jar
+commons-compiler/3.1.7//commons-compiler-3.1.7.jar
 commons-compress/1.21//commons-compress-1.21.jar
 commons-configuration/1.6//commons-configuration-1.6.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
@@ -128,7 +128,7 @@ jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
 jakarta.validation-api/2.0.2//jakarta.validation-api-2.0.2.jar
 jakarta.ws.rs-api/2.1.6//jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar
-janino/3.0.16//janino-3.0.16.jar
+janino/3.1.7//janino-3.1.7.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.inject/1//javax.inject-1.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 31ada151e26..28644989282 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -40,7 +40,7 @@ commons-cli/1.5.0//commons-cli-1.5.0.jar
 commons-codec/1.15//commons-codec-1.15.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
-commons-compiler/3.0.16//commons-compiler-3.0.16.jar
+commons-compiler/3.1.7//commons-compiler-3.1.7.jar
 commons-compress/1.21//commons-compress-1.21.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
@@ -116,7 +116,7 @@ jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
 jakarta.validation-api/2.0.2//jakarta.validation-api-2.0.2.jar
 jakarta.ws.rs-api/2.1.6//jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar
-janino/3.0.16//janino-3.0.16.jar
+janino/3.1.7//janino-3.1.7.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
diff --git a/pom.xml b/pom.xml
index 1976ea9db92..ffd2baae8aa 100644
--- a/pom.xml
+++ b/pom.xml
@@ -189,7 +189,7 @@
 2.11.1
 4.1.17
 14.0.1
-3.0.16
+3.1.7
 2.35
 2.10.14
 3.5.2
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 592783a54e5..3f3e9f75cfa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -18,7 +18,6 @@
 package org.apache.spark.sql.catalyst.expressions.codegen
 
 import java.io.ByteArrayInputStream
-import java.util.{Map => JavaMap}
 
 import scala.annotation.tailrec
 import scala.collection.JavaConverters._
@@ -28,8 +27,8 @@ import scala.util.control.NonFatal
 
 import 

[spark] branch master updated: [SPARK-39760][PYTHON] Support Varchar in PySpark

2022-07-18 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 08808fb5079 [SPARK-39760][PYTHON] Support Varchar in PySpark
08808fb5079 is described below

commit 08808fb507947b51ea7656496612a81e11fe66bd
Author: Ruifeng Zheng 
AuthorDate: Mon Jul 18 15:55:55 2022 +0800

[SPARK-39760][PYTHON] Support Varchar in PySpark

### What changes were proposed in this pull request?
Support Varchar in PySpark

### Why are the changes needed?
function parity

### Does this PR introduce _any_ user-facing change?
yes, new datatype supported

### How was this patch tested?
1. Added UT.
2. Manually checked against the Scala side:

```python

In [1]: from pyspark.sql.types import *
   ...: from pyspark.sql.functions import *
   ...:
   ...: df = spark.createDataFrame([(1,), (11,)], ["value"])
   ...: ret = df.select(col("value").cast(VarcharType(10))).collect()
   ...:
22/07/13 17:17:07 WARN CharVarcharUtils: The Spark cast operator does not 
support char/varchar type and simply treats them as string type. Please use 
string type directly to avoid confusion. Otherwise, you can set 
spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as 
string type as same as Spark 3.0 and earlier

In [2]:

In [2]: schema = StructType([StructField("a", IntegerType(), True), 
(StructField("v", VarcharType(10), True))])
   ...: description = "this a table created via Catalog.createTable()"
   ...: table = spark.catalog.createTable("tab3_via_catalog", 
schema=schema, description=description)
   ...: table.schema
   ...:
Out[2]: StructType([StructField('a', IntegerType(), True), StructField('v', 
StringType(), True)])
```

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.range(0, 10).selectExpr(" id AS value")
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val ret = df.select(col("value").cast(VarcharType(10))).collect()
22/07/13 17:28:56 WARN CharVarcharUtils: The Spark cast operator does not 
support char/varchar type and simply treats them as string type. Please use 
string type directly to avoid confusion. Otherwise, you can set 
spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as 
string type as same as Spark 3.0 and earlier
ret: Array[org.apache.spark.sql.Row] = Array([0], [1], [2], [3], [4], [5], 
[6], [7], [8], [9])

scala>

scala> val schema = StructType(StructField("a", IntegerType, true) :: 
(StructField("v", VarcharType(10), true) :: Nil))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(a,IntegerType,true),StructField(v,VarcharType(10),true))

scala> val description = "this a table created via Catalog.createTable()"
description: String = this a table created via Catalog.createTable()

scala> val table = spark.catalog.createTable("tab3_via_catalog", 
source="json", schema=schema, description=description, 
options=Map.empty[String, String])
table: org.apache.spark.sql.DataFrame = [a: int, v: string]

scala> table.schema
res0: org.apache.spark.sql.types.StructType = 
StructType(StructField(a,IntegerType,true),StructField(v,StringType,true))
```

Closes #37173 from zhengruifeng/py_add_varchar.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../source/reference/pyspark.sql/data_types.rst|  1 +
 python/pyspark/sql/tests/test_types.py | 26 +++-
 python/pyspark/sql/types.py| 46 --
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/data_types.rst 
b/python/docs/source/reference/pyspark.sql/data_types.rst
index d146c640477..775f0bf394a 100644
--- a/python/docs/source/reference/pyspark.sql/data_types.rst
+++ b/python/docs/source/reference/pyspark.sql/data_types.rst
@@ -40,6 +40,7 @@ Data Types
 NullType
 ShortType
 StringType
+VarcharType
 StructField
 StructType
 TimestampType
diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index ef0ad82dbb9..218cfc413db 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -38,6 +38,7 @@ from pyspark.sql.types import (
 DayTimeIntervalType,
 MapType,
 StringType,
+VarcharType,
 StructType,
 StructField,
 ArrayType,
@@ -739,8 +740,12 @@ class TypesTests(ReusedSQLTestCase):
 from pyspark.sql.types import _all_atomic_types,