[spark] branch master updated: [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5bd47640dc4 [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode 5bd47640dc4 is described below commit 5bd47640dc455ae27c94825743a127bbb20e59bf Author: Gengliang Wang AuthorDate: Tue Jul 19 12:14:43 2022 +0900 [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/37160, which changed the default behavior of `CAST(DECIMAL AS STRING)` to always use the plain string representation for ANSI SQL compliance. This PR moves the new behavior under ANSI SQL mode; the default behavior remains unchanged. ### Why are the changes needed? After offline discussions, changing the default behavior brings upgrade risks to Spark 3.4.0. AFAIK, there are existing tables storing the results of casting decimals as strings. Running aggregations or joins over these tables can produce incorrect results. Since the motivation for the new behavior is ANSI SQL compliance, it makes more sense for the new behavior to exist only under ANSI SQL mode. Users who enable ANSI SQL mode are expected to accept such a breaking change. ### Does this PR introduce _any_ user-facing change? Yes. Though not yet released, this PR makes the behavior of casting decimal values to string more reasonable by always using the plain string representation under ANSI SQL mode. ### How was this patch tested? UT Closes #37221 from gengliangwang/ansiDecimalToString. 
Authored-by: Gengliang Wang Signed-off-by: Hyukjin Kwon --- docs/sql-migration-guide.md | 1 - docs/sql-ref-ansi-compliance.md | 5 +++-- .../org/apache/spark/sql/catalyst/expressions/Cast.scala | 11 --- .../main/scala/org/apache/spark/sql/internal/SQLConf.scala| 11 --- .../apache/spark/sql/catalyst/expressions/CastSuiteBase.scala | 8 .../spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala | 5 + .../spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala | 5 + 7 files changed, 21 insertions(+), 25 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 0d83ee3aeb3..75f7f6c9f8c 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -26,7 +26,6 @@ license: | - Since Spark 3.4, Number or Number(\*) from Teradata will be treated as Decimal(38,18). In Spark 3.3 or earlier, Number or Number(\*) from Teradata will be treated as Decimal(38, 0), in which case the fractional part will be removed. - Since Spark 3.4, v1 database, table, permanent view and function identifier will include 'spark_catalog' as the catalog name if database is defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`. - - Since Spark 3.4, the results of casting Decimal values as String type will not contain exponential notations. To restore the legacy behavior, which uses scientific notation if the adjusted exponent is less than -6, set `spark.sql.legacy.castDecimalToString.enabled` to `true`. 
## Upgrading from Spark SQL 3.2 to 3.3 diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md index 6ad8210ed7e..65ed5caf833 100644 --- a/docs/sql-ref-ansi-compliance.md +++ b/docs/sql-ref-ansi-compliance.md @@ -81,7 +81,7 @@ Besides, the ANSI SQL mode disallows the following type conversions which are al | Source\Target | Numeric | String | Date | Timestamp | Interval | Boolean | Binary | Array | Map | Struct | |---|-||--|---|--|-||---|-|| -| Numeric | **Y** | Y | N| **Y** | N | Y | N | N | N | N | +| Numeric | **Y** | **Y** | N| **Y** | N | Y | N | N | N | N | | String| **Y** | Y | **Y** | **Y** | **Y** | **Y** | Y | N | N | N | | Date | N | Y | Y| Y | N| N | N | N | N | N | | Timestamp | **Y** | Y | Y| Y | N| N | N | N | N | N | @@ -92,7 +92,7 @@ Besides, the ANSI SQL mode disallows the following type conversions which are al | Map | N | Y | N| N | N| N | N | N | **Y** | N | | Struct| N | Y | N| N | N| N | N | N | N | **Y** | -In the table above, all the `CAST`s that can
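The scientific-notation rule mentioned above (switch to exponent form when the adjusted exponent is less than -6) can be illustrated outside Spark with Python's `decimal` module, which follows the same convention in its default string conversion. This is only a sketch of the two behaviors; the helper name is ours, not Spark's:

```python
from decimal import Decimal

def decimal_to_plain_string(d: Decimal) -> str:
    """Format a Decimal without scientific notation, mimicking the
    ANSI-mode CAST(DECIMAL AS STRING) behavior described above."""
    return format(d, "f")

d = Decimal("1E-7")
# Legacy behavior: like java.math.BigDecimal.toString(), Python's str()
# uses scientific notation once the adjusted exponent drops below -6.
print(str(d))                      # 1E-7
# ANSI-mode behavior: always the plain string representation.
print(decimal_to_plain_string(d))  # 0.0000001
```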
[spark-website] branch asf-site updated: [SPARK-39512] Document docker image release steps (#400)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new a89536e5f [SPARK-39512] Document docker image release steps (#400) a89536e5f is described below commit a89536e5fc4498945149de3e0fb3ec8dc456b908 Author: Holden Karau AuthorDate: Mon Jul 18 18:27:13 2022 -0700 [SPARK-39512] Document docker image release steps (#400) Document the docker image release steps for the release manager to follow when finalizing the release. --- release-process.md| 18 -- site/release-process.html | 15 +-- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/release-process.md b/release-process.md index 7c1e9ff2d..e8bba053f 100644 --- a/release-process.md +++ b/release-process.md @@ -35,6 +35,9 @@ If you are a new Release Manager, you can read up on the process from the follow - gpg for signing https://www.apache.org/dev/openpgp.html - svn https://www.apache.org/dev/version-control.html#https-svn + +You should also get access to the ASF Dockerhub. You can get access by filing an INFRA JIRA ticket (see an example ticket https://issues.apache.org/jira/browse/INFRA-21282 ). + Preparing gpg key You can skip this section if you have already uploaded your key. @@ -175,10 +178,11 @@ To cut a release candidate, there are 4 steps: 1. Package the release binaries & sources, and upload them to the Apache staging SVN repo. 1. Create the release docs, and upload them to the Apache staging SVN repo. 1. Publish a snapshot to the Apache staging Maven repo. +1. Create an RC docker image tag (e.g. `3.4.0-rc1`) -The process of cutting a release candidate has been automated via the `dev/create-release/do-release-docker.sh` script. +The process of cutting a release candidate has been mostly automated via the `dev/create-release/do-release-docker.sh` script. 
Run this script, enter the information it requires, and wait until it finishes. You can also do a single step via the `-s` option. -Please run `do-release-docker.sh -h` and see more details. +Please run `do-release-docker.sh -h` and see more details. It does not currently generate the RC docker image tag. Call a vote on the release candidate @@ -387,6 +391,16 @@ $ git shortlog v1.1.1 --grep "$EXPR" > contrib.txt $ git log v1.1.1 --grep "$expr" --shortstat --oneline | grep -B 1 -e "[3-9][0-9][0-9] insert" -e "[1-9][1-9][1-9][1-9] insert" | grep SPARK > large-patches.txt ``` +Create and upload Spark Docker Images + +The Spark docker images are created using the `./bin/docker-image-tool.sh` script that is included in the release artifacts. + + +You should install `docker buildx` so that you can cross-compile for multiple archs as ARM is becoming increasingly popular. If you have access to both an ARM and an x86 machine you should set up a [remote builder as described here](https://scalingpythonml.com/2020/12/11/some-sharp-corners-with-docker-buildx.html), but if you only have one [docker buildx with QEMU works fine as we don't use cgo](https://docs.docker.com/buildx/working-with-buildx/). + + +Once you have your cross-platform docker build environment set up, extract the build artifact (e.g. `tar -xvf spark-3.3.0-bin-hadoop3.tgz`), go into the directory (e.g. `cd spark-3.3.0-bin-hadoop3`) and build the containers and publish them to the Spark dockerhub (e.g. 
`./bin/docker-image-tool.sh -r docker.io/apache -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile -t v3.3.0 -X -b java_image_tag=11-jre-slim build`) + Create an announcement Once everything is working (website docs, website changes) create an announcement on the website diff --git a/site/release-process.html b/site/release-process.html index 236dd4631..dbecf8ff3 100644 --- a/site/release-process.html +++ b/site/release-process.html @@ -163,6 +163,8 @@ svn https://www.apache.org/dev/version-control.html#https-svn +You should also get access to the ASF Dockerhub. You can get access by filing an INFRA JIRA ticket (see an example ticket https://issues.apache.org/jira/browse/INFRA-21282 ). + Preparing gpg key You can skip this section if you have already uploaded your key. @@ -295,11 +297,12 @@ Note that not all permutations are run on PR therefore it is important to check Package the release binaries & sources, and upload them to the Apache staging SVN repo. Create the release docs, and upload them to the Apache staging SVN repo. Publish a snapshot to the Apache staging Maven repo. + Create an RC docker image tag (e.g. 3.4.0-rc1) -The process of cutting a release candidate has been automated via the dev/create-release/do-release-docker.sh script. +The process of cutting a release candidate has been mostly automated via the dev/create-release/do-release-docker.sh script. Run this script, type information it
[GitHub] [spark-website] gengliangwang merged pull request #400: [SPARK-39512] Document docker image release steps
gengliangwang merged PR #400: URL: https://github.com/apache/spark-website/pull/400 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] gengliangwang commented on pull request #400: [SPARK-39512] Document docker image release steps
gengliangwang commented on PR #400: URL: https://github.com/apache/spark-website/pull/400#issuecomment-1188495099 Merging to asf-site
[spark] branch master updated: [SPARK-39809][PYTHON] Support CharType in PySpark
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00d7094dc30 [SPARK-39809][PYTHON] Support CharType in PySpark 00d7094dc30 is described below commit 00d7094dc3024ae594605b311dcc55e95d277d5f Author: Ruifeng Zheng AuthorDate: Tue Jul 19 10:22:04 2022 +0900 [SPARK-39809][PYTHON] Support CharType in PySpark ### What changes were proposed in this pull request? Support CharType in PySpark ### Why are the changes needed? for function parity ### Does this PR introduce _any_ user-facing change? yes, new type added ### How was this patch tested? added UT Closes #37215 from zhengruifeng/py_add_char. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_types.py | 26 ++--- python/pyspark/sql/types.py| 42 +++--- 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py index 218cfc413db..b1609417a0c 100644 --- a/python/pyspark/sql/tests/test_types.py +++ b/python/pyspark/sql/tests/test_types.py @@ -38,6 +38,7 @@ from pyspark.sql.types import ( DayTimeIntervalType, MapType, StringType, +CharType, VarcharType, StructType, StructField, @@ -740,9 +741,12 @@ class TypesTests(ReusedSQLTestCase): from pyspark.sql.types import _all_atomic_types, _parse_datatype_string for k, t in _all_atomic_types.items(): -if k != "varchar": +if k != "varchar" and k != "char": self.assertEqual(t(), _parse_datatype_string(k)) self.assertEqual(IntegerType(), _parse_datatype_string("int")) +self.assertEqual(CharType(1), _parse_datatype_string("char(1)")) +self.assertEqual(CharType(10), _parse_datatype_string("char( 10 )")) +self.assertEqual(CharType(11), _parse_datatype_string("char( 11)")) self.assertEqual(VarcharType(1), _parse_datatype_string("varchar(1)")) self.assertEqual(VarcharType(10), 
_parse_datatype_string("varchar( 10 )")) self.assertEqual(VarcharType(11), _parse_datatype_string("varchar( 11)")) @@ -1033,6 +1037,7 @@ class TypesTests(ReusedSQLTestCase): instances = [ NullType(), StringType(), +CharType(10), VarcharType(10), BinaryType(), BooleanType(), @@ -1138,6 +1143,15 @@ class DataTypeTests(unittest.TestCase): t3 = DecimalType(8) self.assertNotEqual(t2, t3) +def test_char_type(self): +v1 = CharType(10) +v2 = CharType(20) +self.assertTrue(v2 is not v1) +self.assertNotEqual(v1, v2) +v3 = CharType(10) +self.assertEqual(v1, v3) +self.assertFalse(v1 is v3) + def test_varchar_type(self): v1 = VarcharType(10) v2 = VarcharType(20) @@ -1221,14 +1235,18 @@ class DataTypeVerificationTests(unittest.TestCase): success_spec = [ # String ("", StringType()), -("", StringType()), (1, StringType()), (1.0, StringType()), ([], StringType()), ({}, StringType()), +# Char +("", CharType(10)), +(1, CharType(10)), +(1.0, CharType(10)), +([], CharType(10)), +({}, CharType(10)), # Varchar ("", VarcharType(10)), -("", VarcharType(10)), (1, VarcharType(10)), (1.0, VarcharType(10)), ([], VarcharType(10)), @@ -1289,6 +1307,8 @@ class DataTypeVerificationTests(unittest.TestCase): failure_spec = [ # String (match anything but None) (None, StringType(), ValueError), +# CharType (match anything but None) +(None, CharType(10), ValueError), # VarcharType (match anything but None) (None, VarcharType(10), ValueError), # UDT diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index 7ab8f7c9c2d..e034ff75e10 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -56,6 +56,7 @@ U = TypeVar("U") __all__ = [ "DataType", "NullType", +"CharType", "StringType", "VarcharType", "BinaryType", @@ -182,6 +183,28 @@ class StringType(AtomicType, metaclass=DataTypeSingleton): pass +class CharType(AtomicType): +"""Char data type + +Parameters +-- +length : int +the length limitation. 
+""" + +def __init__(self, length: int): +self.length = length + +def simpleString(self) -> str: +return "char(%d)" %
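The equality semantics exercised by `test_char_type` above can be sketched with a self-contained stand-in (simplified and illustrative only — the real `CharType` extends PySpark's `AtomicType`): length-parameterized types compare by value, so two `CharType(10)` instances are equal without being identical.

```python
# Minimal sketch of a length-parameterized data type, mirroring the
# CharType behavior the tests above check. Not PySpark's actual class.
class CharType:
    def __init__(self, length: int):
        self.length = length

    def simpleString(self) -> str:
        return "char(%d)" % self.length

    def __eq__(self, other: object) -> bool:
        # Equality is by length, not identity.
        return isinstance(other, CharType) and self.length == other.length

    def __hash__(self) -> int:
        return hash((type(self), self.length))

v1, v2, v3 = CharType(10), CharType(20), CharType(10)
assert v1 != v2                    # different lengths are unequal
assert v1 == v3 and v1 is not v3   # equal by value, distinct instances
assert v1.simpleString() == "char(10)"
```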
[spark] branch branch-3.3 updated: [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new aeafb175875 [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query aeafb175875 is described below commit aeafb175875c00519e03e0ea5b5f22f765dc3607 Author: Ala Luszczak AuthorDate: Tue Jul 19 09:04:03 2022 +0800 [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query This change alters the projection used in `FileScanRDD` to attach file metadata to a row produced by the reader. This projection used to remove the partitioning columns from the produced row. The produced row had a different schema than the consumers expected, and was missing part of the data, which resulted in query failure. This is a bug. `FileScanRDD` should produce rows matching the expected schema, and containing all the requested data. Queries should not crash due to internal errors. No. Adds a new test in `FileMetadataStructSuite.scala` that reproduces the issue. Closes #37214 from ala/metadata-partition-by. 
Authored-by: Ala Luszczak Signed-off-by: Wenchen Fan (cherry picked from commit 385f1c8e4037928afafbf6664e30dc268510c05e) Signed-off-by: Wenchen Fan --- .../spark/sql/execution/DataSourceScanExec.scala | 4 ++-- .../sql/execution/datasources/FileScanRDD.scala| 4 ++-- .../datasources/FileMetadataStructSuite.scala | 26 ++ 3 files changed, 30 insertions(+), 4 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala index 9e8ae9a714d..40d29af28f9 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala @@ -621,7 +621,7 @@ case class FileSourceScanExec( } new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions, - requiredSchema, metadataColumns) + new StructType(requiredSchema.fields ++ fsRelation.partitionSchema.fields), metadataColumns) } /** @@ -678,7 +678,7 @@ case class FileSourceScanExec( FilePartition.getFilePartitions(relation.sparkSession, splitFiles, maxSplitBytes) new FileScanRDD(fsRelation.sparkSession, readFile, partitions, - requiredSchema, metadataColumns) + new StructType(requiredSchema.fields ++ fsRelation.partitionSchema.fields), metadataColumns) } // Filters unused DynamicPruningExpression expressions - one which has been replaced diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala index 20c393a5c0e..b65b36ef393 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala @@ -68,7 +68,7 @@ class FileScanRDD( @transient private val sparkSession: SparkSession, readFunction: (PartitionedFile) => Iterator[InternalRow], @transient val 
filePartitions: Seq[FilePartition], -val readDataSchema: StructType, +val readSchema: StructType, val metadataColumns: Seq[AttributeReference] = Seq.empty) extends RDD[InternalRow](sparkSession.sparkContext, Nil) { @@ -126,7 +126,7 @@ class FileScanRDD( // an unsafe projection to convert a joined internal row to an unsafe row private lazy val projection = { val joinedExpressions = - readDataSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType) + readSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType) UnsafeProjection.create(joinedExpressions) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala index 410fc985dd3..6afea42ee83 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala @@ -21,6 +21,7 @@ import java.io.File import java.sql.Timestamp import java.text.SimpleDateFormat +import org.apache.spark.TestUtils import org.apache.spark.sql.{AnalysisException, Column, DataFrame, QueryTest, Row} import org.apache.spark.sql.execution.FileSourceScanExec import org.apache.spark.sql.functions._ @@ -30,6 +31,8 @@ import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructFiel class FileMetadataStructSuite extends QueryTest with
[spark] branch master updated: [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 385f1c8e403 [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query 385f1c8e403 is described below commit 385f1c8e4037928afafbf6664e30dc268510c05e Author: Ala Luszczak AuthorDate: Tue Jul 19 09:04:03 2022 +0800 [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query ### What changes were proposed in this pull request? This change alters the projection used in `FileScanRDD` to attach file metadata to a row produced by the reader. This projection used to remove the partitioning columns from the produced row. The produced row had a different schema than the consumers expected, and was missing part of the data, which resulted in query failure. ### Why are the changes needed? This is a bug. `FileScanRDD` should produce rows matching the expected schema, and containing all the requested data. Queries should not crash due to internal errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Adds a new test in `FileMetadataStructSuite.scala` that reproduces the issue. Closes #37214 from ala/metadata-partition-by. 
Authored-by: Ala Luszczak Signed-off-by: Wenchen Fan --- .../spark/sql/execution/DataSourceScanExec.scala | 7 -- .../sql/execution/datasources/FileScanRDD.scala| 4 ++-- .../datasources/FileMetadataStructSuite.scala | 26 ++ 3 files changed, 33 insertions(+), 4 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala index 9e316cc88cf..5950136e79a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala @@ -650,7 +650,9 @@ case class FileSourceScanExec( } new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions, - requiredSchema, metadataColumns, new FileSourceOptions(CaseInsensitiveMap(relation.options))) + new StructType(requiredSchema.fields ++ fsRelation.partitionSchema.fields), metadataColumns, + new FileSourceOptions(CaseInsensitiveMap(relation.options))) + } /** @@ -707,7 +709,8 @@ case class FileSourceScanExec( FilePartition.getFilePartitions(relation.sparkSession, splitFiles, maxSplitBytes) new FileScanRDD(fsRelation.sparkSession, readFile, partitions, - requiredSchema, metadataColumns, new FileSourceOptions(CaseInsensitiveMap(relation.options))) + new StructType(requiredSchema.fields ++ fsRelation.partitionSchema.fields), metadataColumns, + new FileSourceOptions(CaseInsensitiveMap(relation.options))) } // Filters unused DynamicPruningExpression expressions - one which has been replaced diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala index 97776413509..4c3f5629e78 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala @@ -69,7 +69,7 @@ 
class FileScanRDD( @transient private val sparkSession: SparkSession, readFunction: (PartitionedFile) => Iterator[InternalRow], @transient val filePartitions: Seq[FilePartition], -val readDataSchema: StructType, +val readSchema: StructType, val metadataColumns: Seq[AttributeReference] = Seq.empty, options: FileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(Map.empty))) extends RDD[InternalRow](sparkSession.sparkContext, Nil) { @@ -128,7 +128,7 @@ class FileScanRDD( // an unsafe projection to convert a joined internal row to an unsafe row private lazy val projection = { val joinedExpressions = - readDataSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType) + readSchema.fields.map(_.dataType) ++ metadataColumns.map(_.dataType) UnsafeProjection.create(joinedExpressions) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala index 410fc985dd3..6afea42ee83 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala @@ -21,6 +21,7 @@ import java.io.File
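A rough, Spark-free sketch of why the fix above matters: the unsafe projection is built from the read schema, so if that schema omits the partition columns the reader appended to each row, the projection silently drops them. All names below are illustrative, not Spark's:

```python
def project(row, read_schema, metadata_values):
    # Like UnsafeProjection here: keep exactly one value per field in the
    # read schema, then append the requested metadata columns.
    return tuple(row[:len(read_schema)]) + tuple(metadata_values)

required_schema = ["value"]   # data columns requested from the file
partition_schema = ["year"]   # partition columns appended by the reader
row = (42, 2022)              # what the reader actually produces
meta = ("file:///a.parquet",)

# Before the fix: projecting with only the data schema loses the partition value.
broken = project(row, required_schema, meta)
# After the fix: read schema = requiredSchema ++ partitionSchema.
fixed = project(row, required_schema + partition_schema, meta)

assert broken == (42, "file:///a.parquet")       # partition column dropped
assert fixed == (42, 2022, "file:///a.parquet")  # complete row
```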
[GitHub] [spark-website] xinrong-meng opened a new pull request, #407: Update Spark 3.4 release window
xinrong-meng opened a new pull request, #407: URL: https://github.com/apache/spark-website/pull/407 This PR proposes to update the Spark 3.4 release window.
[spark] branch master updated: [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dcccbf4f9dd [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior dcccbf4f9dd is described below commit dcccbf4f9ddd22dc59e6199a940625f677b23a81 Author: Yikun Jiang AuthorDate: Tue Jul 19 09:34:32 2022 +0900 [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior ### What changes were proposed in this pull request? Respect the Series.concat sort parameter when `num_series == 1` to follow the pandas 1.4.3 behavior. ### Why are the changes needed? In https://github.com/apache/spark/pull/36711, we followed the pandas 1.4.2 behavior of respecting the Series.concat sort parameter, except in the `num_series == 1` case. [pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixed the issue https://github.com/pandas-dev/pandas/issues/47127, including the `num_series == 1` bug, so this PR follows the pandas 1.4.3 behavior. ### Does this PR introduce _any_ user-facing change? Yes, we already cover this case in: https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst ``` In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors. ``` ### How was this patch tested? - CI passed - test_concat_index_axis passed with pandas 1.3.5, 1.4.2, 1.4.3. Closes #37217 from Yikun/SPARK-39807. 
Authored-by: Yikun Jiang Signed-off-by: Hyukjin Kwon --- python/pyspark/pandas/namespace.py| 5 ++--- python/pyspark/pandas/tests/test_namespace.py | 20 +++- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py index 7691bf465e7..0f0dc606c52 100644 --- a/python/pyspark/pandas/namespace.py +++ b/python/pyspark/pandas/namespace.py @@ -2621,9 +2621,8 @@ def concat( assert len(merged_columns) > 0 -# If sort is True, always sort when there are more than two Series, -# and if there is only one Series, never sort to follow pandas 1.4+ behavior. -if sort and num_series != 1: +# If sort is True, always sort +if sort: # FIXME: better ordering merged_columns = sorted(merged_columns, key=name_like_string) diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py index 4db756c6e66..ac033f7828b 100644 --- a/python/pyspark/pandas/tests/test_namespace.py +++ b/python/pyspark/pandas/tests/test_namespace.py @@ -334,19 +334,21 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils): ([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]), ([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]), ([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]), -# only one Series -([psdf, psdf["C"]], [pdf, pdf["C"]]), -([psdf["C"], psdf], [pdf["C"], pdf]), # more than two Series ([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]), ] -if LooseVersion(pd.__version__) >= LooseVersion("1.4"): -# more than two Series -psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]) -for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts): -# See also https://github.com/pandas-dev/pandas/issues/47127 -if (join, sort) != ("outer", True): +# See also https://github.com/pandas-dev/pandas/issues/47127 +if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"): +series_objs = [ +# more than two Series +([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], 
pdf["A"]]), +# only one Series +([psdf, psdf["C"]], [pdf, pdf["C"]]), +([psdf["C"], psdf], [pdf["C"], pdf]), +] +for psdfs, pdfs in series_objs: +for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts): self.assert_eq( ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort), pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),
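The column-merging rule being restored in `namespace.py` above can be sketched in plain Python (an illustration of the semantics only, not pandas-on-Spark's implementation): with `sort=True` the merged column labels are sorted unconditionally, including the previously special-cased single-Series input.

```python
def merge_columns(column_lists, sort):
    # Union the column labels of all concatenated objects, preserving
    # first-seen order; sort them only when sort=True.
    merged = []
    for cols in column_lists:
        for c in cols:
            if c not in merged:
                merged.append(c)
    return sorted(merged) if sort else merged

# One DataFrame plus a single Series -- the num_series == 1 case.
assert merge_columns([["C", "A"], ["C"]], sort=True) == ["A", "C"]
assert merge_columns([["C", "A"], ["C"]], sort=False) == ["C", "A"]
```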
[spark] branch master updated (29ed337d52c -> 88380b6e1e8)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 29ed337d52c [SPARK-35579][SQL] Bump janino to 3.1.7 add 88380b6e1e8 [SPARK-39803][SQL] Use `LevenshteinDistance` instead of `StringUtils.getLevenshteinDistance` No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
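For reference, the metric behind both APIs named in this commit is the classic Levenshtein edit distance. Here is a standard dynamic-programming version as a sketch of the computation (our own illustration, not either Commons implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein("spark", "spark") == 0
```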
[spark] branch branch-3.2 updated (ea5af851e78 -> b36b21464ce)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git from ea5af851e78 [SPARK-39758][SQL][3.2] Fix NPE from the regexp functions on invalid patterns add b36b21464ce [SPARK-39647][CORE][3.2] Register the executor with ESS before registering the BlockManager No new revisions were added by this update. Summary of changes: .../org/apache/spark/storage/BlockManager.scala| 30 -- .../apache/spark/storage/BlockManagerSuite.scala | 36 ++ 2 files changed, 56 insertions(+), 10 deletions(-)
[spark] branch master updated: [SPARK-35579][SQL] Bump janino to 3.1.7
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 29ed337d52c [SPARK-35579][SQL] Bump janino to 3.1.7 29ed337d52c is described below commit 29ed337d52c5e821f628ff38d4ae01ebba95751a Author: Prashant Singh AuthorDate: Mon Jul 18 08:11:27 2022 -0500 [SPARK-35579][SQL] Bump janino to 3.1.7 ### What changes were proposed in this pull request? Upgrade janino from 3.0.16 to 3.1.7. ### Why are the changes needed? - The proposed version contains a bug fix in janino by maropu. - https://github.com/janino-compiler/janino/pull/148 - contains a `getBytecodes` method which can be used to simplify how bytecodes are obtained from `ClassBodyEvaluator` in the `CodeGenerator#updateAndGetCompilationStats` method. (by LuciferYang) - https://github.com/apache/spark/pull/32536 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs Closes #37202 from singhpk234/upgrade/bump-janino. 
    Authored-by: Prashant Singh
    Signed-off-by: Sean Owen
---
 dev/deps/spark-deps-hadoop-2-hive-2.3                      |  4 ++--
 dev/deps/spark-deps-hadoop-3-hive-2.3                      |  4 ++--
 pom.xml                                                    |  2 +-
 .../sql/catalyst/expressions/codegen/CodeGenerator.scala   | 14 +++---
 .../org/apache/spark/sql/errors/QueryExecutionErrors.scala |  3 +--
 .../sql/catalyst/expressions/CodeGenerationSuite.scala     |  1 -
 6 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3
index ccc0d607010..af1db05ff4b 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -39,7 +39,7 @@ commons-cli/1.5.0//commons-cli-1.5.0.jar
 commons-codec/1.15//commons-codec-1.15.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
-commons-compiler/3.0.16//commons-compiler-3.0.16.jar
+commons-compiler/3.1.7//commons-compiler-3.1.7.jar
 commons-compress/1.21//commons-compress-1.21.jar
 commons-configuration/1.6//commons-configuration-1.6.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
@@ -128,7 +128,7 @@ jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
 jakarta.validation-api/2.0.2//jakarta.validation-api-2.0.2.jar
 jakarta.ws.rs-api/2.1.6//jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar
-janino/3.0.16//janino-3.0.16.jar
+janino/3.1.7//janino-3.1.7.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.inject/1//javax.inject-1.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 31ada151e26..28644989282 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -40,7 +40,7 @@ commons-cli/1.5.0//commons-cli-1.5.0.jar
 commons-codec/1.15//commons-codec-1.15.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
-commons-compiler/3.0.16//commons-compiler-3.0.16.jar
+commons-compiler/3.1.7//commons-compiler-3.1.7.jar
 commons-compress/1.21//commons-compress-1.21.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
@@ -116,7 +116,7 @@ jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
 jakarta.validation-api/2.0.2//jakarta.validation-api-2.0.2.jar
 jakarta.ws.rs-api/2.1.6//jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar
-janino/3.0.16//janino-3.0.16.jar
+janino/3.1.7//janino-3.1.7.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
diff --git a/pom.xml b/pom.xml
index 1976ea9db92..ffd2baae8aa 100644
--- a/pom.xml
+++ b/pom.xml
@@ -189,7 +189,7 @@
     2.11.1
     4.1.17
     14.0.1
-    3.0.16
+    3.1.7
     2.35
     2.10.14
     3.5.2
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 592783a54e5..3f3e9f75cfa 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -18,7 +18,6 @@ package org.apache.spark.sql.catalyst.expressions.codegen

 import java.io.ByteArrayInputStream
-import java.util.{Map => JavaMap}

 import scala.annotation.tailrec
 import scala.collection.JavaConverters._
@@ -28,8 +27,8 @@ import scala.util.control.NonFatal
 import
[spark] branch master updated: [SPARK-39760][PYTHON] Support Varchar in PySpark
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 08808fb5079  [SPARK-39760][PYTHON] Support Varchar in PySpark
08808fb5079 is described below

commit 08808fb507947b51ea7656496612a81e11fe66bd
Author: Ruifeng Zheng
AuthorDate: Mon Jul 18 15:55:55 2022 +0800

    [SPARK-39760][PYTHON] Support Varchar in PySpark

    ### What changes were proposed in this pull request?

    Support Varchar in PySpark.

    ### Why are the changes needed?

    Function parity with the Scala API.

    ### Does this PR introduce _any_ user-facing change?

    Yes, a new data type is supported.

    ### How was this patch tested?

    1. Added UTs.
    2. Manually checked against the Scala side:

    ```python
    In [1]: from pyspark.sql.types import *
       ...: from pyspark.sql.functions import *
       ...:
       ...: df = spark.createDataFrame([(1,), (11,)], ["value"])
       ...: ret = df.select(col("value").cast(VarcharType(10))).collect()
       ...:
    22/07/13 17:17:07 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier

    In [2]: schema = StructType([StructField("a", IntegerType(), True), (StructField("v", VarcharType(10), True))])
       ...: description = "this a table created via Catalog.createTable()"
       ...: table = spark.catalog.createTable("tab3_via_catalog", schema=schema, description=description)
       ...: table.schema
       ...:
    Out[2]: StructType([StructField('a', IntegerType(), True), StructField('v', StringType(), True)])
    ```

    ```scala
    scala> import org.apache.spark.sql.types._
    import org.apache.spark.sql.types._

    scala> import org.apache.spark.sql.functions._
    import org.apache.spark.sql.functions._

    scala> val df = spark.range(0, 10).selectExpr("id AS value")
    df: org.apache.spark.sql.DataFrame = [value: bigint]

    scala> val ret = df.select(col("value").cast(VarcharType(10))).collect()
    22/07/13 17:28:56 WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. Please use string type directly to avoid confusion. Otherwise, you can set spark.sql.legacy.charVarcharAsString to true, so that Spark treat them as string type as same as Spark 3.0 and earlier
    ret: Array[org.apache.spark.sql.Row] = Array([0], [1], [2], [3], [4], [5], [6], [7], [8], [9])

    scala> val schema = StructType(StructField("a", IntegerType, true) :: (StructField("v", VarcharType(10), true) :: Nil))
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,VarcharType(10),true))

    scala> val description = "this a table created via Catalog.createTable()"
    description: String = this a table created via Catalog.createTable()

    scala> val table = spark.catalog.createTable("tab3_via_catalog", source="json", schema=schema, description=description, options=Map.empty[String, String])
    table: org.apache.spark.sql.DataFrame = [a: int, v: string]

    scala> table.schema
    res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true),StructField(v,StringType,true))
    ```

    Closes #37173 from zhengruifeng/py_add_varchar.
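For readers who only want the shape of the new API, the type added by this PR behaves roughly as sketched below. This is a stdlib-only stand-in, not the real class — the actual `VarcharType` lives in `pyspark.sql.types` and plugs into PySpark's atomic-type machinery; only the constructor and string forms are mirrored here.

```python
# Stdlib-only sketch of the VarcharType surface added in this PR.
# Illustration only; the real implementation is in pyspark.sql.types.
class VarcharType:
    """A bounded string type, mirroring SQL VARCHAR(n)."""

    def __init__(self, length: int):
        self.length = length

    def simpleString(self) -> str:
        return "varchar(%d)" % self.length

    def __repr__(self) -> str:
        return "VarcharType(%d)" % self.length


v = VarcharType(10)
print(v.simpleString())  # varchar(10)
print(repr(v))           # VarcharType(10)
```

As the REPL sessions above show, the cast operator itself still treats varchar as string; the new type mainly gives Python users the same schema vocabulary as the Scala side.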
    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 .../source/reference/pyspark.sql/data_types.rst |  1 +
 python/pyspark/sql/tests/test_types.py          | 26 +++-
 python/pyspark/sql/types.py                     | 46 --
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/data_types.rst b/python/docs/source/reference/pyspark.sql/data_types.rst
index d146c640477..775f0bf394a 100644
--- a/python/docs/source/reference/pyspark.sql/data_types.rst
+++ b/python/docs/source/reference/pyspark.sql/data_types.rst
@@ -40,6 +40,7 @@ Data Types
     NullType
     ShortType
     StringType
+    VarcharType
     StructField
     StructType
     TimestampType
diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py
index ef0ad82dbb9..218cfc413db 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -38,6 +38,7 @@ from pyspark.sql.types import (
     DayTimeIntervalType,
     MapType,
     StringType,
+    VarcharType,
     StructType,
     StructField,
     ArrayType,
@@ -739,8 +740,12 @@ class TypesTests(ReusedSQLTestCase):
         from pyspark.sql.types import _all_atomic_types,