[spark] branch branch-3.0 updated: [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 5c2a268  [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus
5c2a268 is described below

commit 5c2a26874ac816d81381e3ebd016b56def3a4355
Author: yi.wu
AuthorDate: Fri Mar 19 13:23:06 2021 +0800

    [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus

    ### What changes were proposed in this pull request?

    This PR proposes an alternative way to fix the memory leak of `ExecutionListenerBus`, which would automatically clean it up. Basically, the idea is to add `registerSparkListenerForCleanup` to `ContextCleaner`, so we can remove the `ExecutionListenerBus` from `LiveListenerBus` when the `SparkSession` is GC'ed. On the other hand, to make the `SparkSession` GC-able, we need to get rid of the reference to `SparkSession` in `ExecutionListenerBus`. Therefore, we introduced `sessionUUID`, a unique identifier for a `SparkSession`, to replace the `SparkSession` object.

    Note that the proposal doesn't take effect when `spark.cleaner.referenceTracking=false`, since it depends on `ContextCleaner`.

    ### Why are the changes needed?

    Fix the memory leak caused by `ExecutionListenerBus` mentioned in SPARK-34087.

    ### Does this PR introduce _any_ user-facing change?

    Yes, it saves memory for users.

    ### How was this patch tested?

    Added unit test.

    Closes #31839 from Ngone51/fix-mem-leak-of-ExecutionListenerBus.

    Authored-by: yi.wu
    Signed-off-by: HyukjinKwon

    Closes #31881 from Ngone51/SPARK-34087-3.1.
Authored-by: yi.wu
Signed-off-by: Wenchen Fan
---
 .../scala/org/apache/spark/ContextCleaner.scala    | 21 +
 .../scala/org/apache/spark/sql/SparkSession.scala  |  3 ++
 .../spark/sql/util/QueryExecutionListener.scala    | 12 
 .../spark/sql/SparkSessionBuilderSuite.scala       | 35 +-
 4 files changed, 65 insertions(+), 6 deletions(-)

```diff
diff --git a/core/src/main/scala/org/apache/spark/ContextCleaner.scala b/core/src/main/scala/org/apache/spark/ContextCleaner.scala
index 7c3d6d9..5da4679 100644
--- a/core/src/main/scala/org/apache/spark/ContextCleaner.scala
+++ b/core/src/main/scala/org/apache/spark/ContextCleaner.scala
@@ -27,6 +27,7 @@ import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config._
 import org.apache.spark.rdd.{RDD, ReliableRDDCheckpointData}
+import org.apache.spark.scheduler.SparkListener
 import org.apache.spark.shuffle.api.ShuffleDriverComponents
 import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, ThreadUtils, Utils}
@@ -39,6 +40,7 @@ private case class CleanShuffle(shuffleId: Int) extends CleanupTask
 private case class CleanBroadcast(broadcastId: Long) extends CleanupTask
 private case class CleanAccum(accId: Long) extends CleanupTask
 private case class CleanCheckpoint(rddId: Int) extends CleanupTask
+private case class CleanSparkListener(listener: SparkListener) extends CleanupTask

 /**
  * A WeakReference associated with a CleanupTask.
@@ -175,6 +177,13 @@ private[spark] class ContextCleaner(
     referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
   }

+  /** Register a SparkListener to be cleaned up when its owner is garbage collected. */
+  def registerSparkListenerForCleanup(
+      listenerOwner: AnyRef,
+      listener: SparkListener): Unit = {
+    registerForCleanup(listenerOwner, CleanSparkListener(listener))
+  }
+
   /** Keep cleaning RDD, shuffle, and broadcast state. */
   private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
     while (!stopped) {
@@ -197,6 +206,8 @@ private[spark] class ContextCleaner(
           doCleanupAccum(accId, blocking = blockOnCleanupTasks)
         case CleanCheckpoint(rddId) =>
           doCleanCheckpoint(rddId)
+        case CleanSparkListener(listener) =>
+          doCleanSparkListener(listener)
       }
     }
   }
@@ -272,6 +283,16 @@ private[spark] class ContextCleaner(
     }
   }

+  def doCleanSparkListener(listener: SparkListener): Unit = {
+    try {
+      logDebug(s"Cleaning Spark listener $listener")
+      sc.listenerBus.removeListener(listener)
+      logDebug(s"Cleaned Spark listener $listener")
+    } catch {
+      case e: Exception => logError(s"Error cleaning Spark listener $listener", e)
+    }
+  }
+
   private def broadcastManager = sc.env.broadcastManager

   private def mapOutputTrackerMaster = sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
 }
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index f89e58c..28b3a0d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -1
```
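The cleanup mechanism in this commit hinges on weak references: `ContextCleaner` holds only a weak reference to the owner (the `SparkSession`), so the owner stays collectable, and once it is collected the associated `CleanSparkListener` task removes the listener from the bus. A minimal Python sketch of that pattern (class and method names here are illustrative stand-ins, not Spark's actual API):

```python
import gc
import weakref


class ListenerBus:
    """Toy stand-in for Spark's LiveListenerBus."""

    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def remove_listener(self, listener):
        self.listeners.remove(listener)


class CleanerSketch:
    """Removes a listener from the bus once its owner is garbage collected,
    loosely mirroring ContextCleaner.registerSparkListenerForCleanup."""

    def __init__(self, bus):
        self.bus = bus
        # Like ContextCleaner's referenceBuffer: keeps the weak references
        # themselves alive so their callbacks can fire.
        self._refs = []

    def register_listener_for_cleanup(self, owner, listener):
        # The callback runs after `owner` is collected -- the analogue of
        # polling the referenceQueue and dispatching CleanSparkListener.
        ref = weakref.ref(owner, lambda _ref: self.bus.remove_listener(listener))
        self._refs.append(ref)


class Session:
    """Stand-in for SparkSession; holds no strong back-reference anywhere."""


bus = ListenerBus()
cleaner = CleanerSketch(bus)
session = Session()
listener = object()

bus.add_listener(listener)
cleaner.register_listener_for_cleanup(session, listener)
assert listener in bus.listeners

del session   # drop the last strong reference to the owner
gc.collect()  # make collection deterministic for the sketch
assert listener not in bus.listeners  # cleanup fired automatically
```

This is also why the commit notes that `spark.cleaner.referenceTracking=false` disables the fix: without the cleaner's reference-tracking loop, nothing observes the owner's collection.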
[spark] branch branch-3.0 updated: [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 0da0d3d  [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
0da0d3d is described below

commit 0da0d3d7ab6855b8dc7be5db68905d8109b536f8
Author: ulysses-you
AuthorDate: Fri Mar 19 12:51:43 2021 +0800

    [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context

    Change the context classloader to the Spark classloader in `RebaseDateTime.loadRebaseRecords`.

    With custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`, Spark uses the date formatter in `HiveShim` that converts `date` to `string`. If we set `spark.sql.legacy.timeParserPolicy=LEGACY` and the partition type is `date`, the `RebaseDateTime` code is invoked. At that moment, if `RebaseDateTime` is being initialized for the first time, the context classloader is `IsolatedClientLoader`, and errors such as these are thrown:

    ```
    java.lang.IllegalArgumentException: argument "src" is null
      at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4413)
      at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3157)
      at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue(ScalaObjectMapper.scala:187)
      at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue$(ScalaObjectMapper.scala:186)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$$anon$1.readValue(RebaseDateTime.scala:267)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.loadRebaseRecords(RebaseDateTime.scala:269)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<init>(RebaseDateTime.scala:291)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<clinit>(RebaseDateTime.scala)
      at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
      at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
      at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
      at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
    ```

    ```
    java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.util.RebaseDateTime$
      at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
      at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
      at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
      at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
      at scala.collection.immutable.Stream.flatMap(Stream.scala:493)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
      at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:749)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
      at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
      at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
      at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
      at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:747)
      at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1273)
    ```

    The reproduce steps:
    1. Set custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`.
    2. `CREATE TABLE t (c int) PARTITIONED BY (p date)`
    3. `SET spark.sql.legacy.timeParserPolicy=LEGACY`
    4. `SELECT * FROM t WHERE p='2021-01-01'`

    Yes, bug fix. Passes `org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite` and adds a new unit test to `HiveSparkSubmitSuite.scala`.

    Closes #31864 from ulysses-you/SPARK-34772.

    Authored-by: ulysses-you
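The failure mode above is a resource lookup going through the wrong loader: when the thread's context classloader is Hive's `IsolatedClientLoader`, the rebase-records JSON resource is not visible, the lookup returns null, and Jackson surfaces `argument "src" is null`. The fix pins the lookup to the loader that actually ships the resource. A language-neutral sketch of that bug-and-fix shape in Python (the loader classes, resource name, and JSON payload are all made up for illustration, not Spark's real API):

```python
import io
import json


class IsolatedLoader:
    """Stand-in for a context classloader that cannot see the resource."""

    def get_resource_as_stream(self, name):
        # Fails silently, like ClassLoader.getResourceAsStream returning null.
        return None


class SparkLoader:
    """Stand-in for the loader that actually ships the resource."""

    def __init__(self, resources):
        self._resources = resources

    def get_resource_as_stream(self, name):
        data = self._resources.get(name)
        return io.BytesIO(data) if data is not None else None


def load_rebase_records(loader, name="rebase-micros.json"):
    # The buggy version handed the *thread-context* loader in here; the fix
    # is to always use the loader known to contain the resource.
    stream = loader.get_resource_as_stream(name)
    if stream is None:
        # What surfaced as Jackson's IllegalArgumentException in the trace.
        raise ValueError('argument "src" is null')
    return json.load(stream)


resources = {"rebase-micros.json": b'[{"tz": "UTC", "switches": [0]}]'}

try:
    load_rebase_records(IsolatedLoader())  # context loader -> reported error
    failed = False
except ValueError:
    failed = True

records = load_rebase_records(SparkLoader(resources))  # pinned loader works
assert failed and records[0]["tz"] == "UTC"
```

The same shape explains the second trace: once static initialization of `RebaseDateTime$` fails this way, every later touch of the class reports `NoClassDefFoundError: Could not initialize class`.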
[spark] branch branch-3.1 updated: [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 5f46bee  [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus
5f46bee is described below

commit 5f46bee4d09a655e38a89f9134138f1826855870
Author: yi.wu
AuthorDate: Fri Mar 19 13:23:06 2021 +0800

    [SPARK-34087][3.1][SQL] Fix memory leak of ExecutionListenerBus

    ### What changes were proposed in this pull request?

    This PR proposes an alternative way to fix the memory leak of `ExecutionListenerBus`, which would automatically clean it up. Basically, the idea is to add `registerSparkListenerForCleanup` to `ContextCleaner`, so we can remove the `ExecutionListenerBus` from `LiveListenerBus` when the `SparkSession` is GC'ed. On the other hand, to make the `SparkSession` GC-able, we need to get rid of the reference to `SparkSession` in `ExecutionListenerBus`. Therefore, we introduced `sessionUUID`, a unique identifier for a `SparkSession`, to replace the `SparkSession` object.

    Note that the proposal doesn't take effect when `spark.cleaner.referenceTracking=false`, since it depends on `ContextCleaner`.

    ### Why are the changes needed?

    Fix the memory leak caused by `ExecutionListenerBus` mentioned in SPARK-34087.

    ### Does this PR introduce _any_ user-facing change?

    Yes, it saves memory for users.

    ### How was this patch tested?

    Added unit test.

    Closes #31839 from Ngone51/fix-mem-leak-of-ExecutionListenerBus.

    Authored-by: yi.wu
    Signed-off-by: HyukjinKwon

    Closes #31881 from Ngone51/SPARK-34087-3.1.
Authored-by: yi.wu
Signed-off-by: Wenchen Fan
---
 .../scala/org/apache/spark/ContextCleaner.scala    | 21 +
 .../scala/org/apache/spark/sql/SparkSession.scala  |  3 ++
 .../spark/sql/util/QueryExecutionListener.scala    | 12 
 .../spark/sql/SparkSessionBuilderSuite.scala       | 35 +-
 4 files changed, 65 insertions(+), 6 deletions(-)

```diff
diff --git a/core/src/main/scala/org/apache/spark/ContextCleaner.scala b/core/src/main/scala/org/apache/spark/ContextCleaner.scala
index cfa1139..34b3089 100644
--- a/core/src/main/scala/org/apache/spark/ContextCleaner.scala
+++ b/core/src/main/scala/org/apache/spark/ContextCleaner.scala
@@ -27,6 +27,7 @@ import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config._
 import org.apache.spark.rdd.{RDD, ReliableRDDCheckpointData}
+import org.apache.spark.scheduler.SparkListener
 import org.apache.spark.shuffle.api.ShuffleDriverComponents
 import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, ThreadUtils, Utils}
@@ -39,6 +40,7 @@ private case class CleanShuffle(shuffleId: Int) extends CleanupTask
 private case class CleanBroadcast(broadcastId: Long) extends CleanupTask
 private case class CleanAccum(accId: Long) extends CleanupTask
 private case class CleanCheckpoint(rddId: Int) extends CleanupTask
+private case class CleanSparkListener(listener: SparkListener) extends CleanupTask

 /**
  * A WeakReference associated with a CleanupTask.
@@ -175,6 +177,13 @@ private[spark] class ContextCleaner(
     referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
   }

+  /** Register a SparkListener to be cleaned up when its owner is garbage collected. */
+  def registerSparkListenerForCleanup(
+      listenerOwner: AnyRef,
+      listener: SparkListener): Unit = {
+    registerForCleanup(listenerOwner, CleanSparkListener(listener))
+  }
+
   /** Keep cleaning RDD, shuffle, and broadcast state. */
   private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
     while (!stopped) {
@@ -197,6 +206,8 @@ private[spark] class ContextCleaner(
           doCleanupAccum(accId, blocking = blockOnCleanupTasks)
         case CleanCheckpoint(rddId) =>
           doCleanCheckpoint(rddId)
+        case CleanSparkListener(listener) =>
+          doCleanSparkListener(listener)
       }
     }
   }
@@ -276,6 +287,16 @@ private[spark] class ContextCleaner(
     }
   }

+  def doCleanSparkListener(listener: SparkListener): Unit = {
+    try {
+      logDebug(s"Cleaning Spark listener $listener")
+      sc.listenerBus.removeListener(listener)
+      logDebug(s"Cleaned Spark listener $listener")
+    } catch {
+      case e: Exception => logError(s"Error cleaning Spark listener $listener", e)
+    }
+  }
+
   private def broadcastManager = sc.env.broadcastManager

   private def mapOutputTrackerMaster = sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
 }
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache
```
[spark] branch branch-3.1 updated: [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 9918568  [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
9918568 is described below

commit 9918568d873541d83526652ae448269e973c326d
Author: ulysses-you
AuthorDate: Fri Mar 19 12:51:43 2021 +0800

    [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context

    Change the context classloader to the Spark classloader in `RebaseDateTime.loadRebaseRecords`.

    With custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`, Spark uses the date formatter in `HiveShim` that converts `date` to `string`. If we set `spark.sql.legacy.timeParserPolicy=LEGACY` and the partition type is `date`, the `RebaseDateTime` code is invoked. At that moment, if `RebaseDateTime` is being initialized for the first time, the context classloader is `IsolatedClientLoader`, and errors such as these are thrown:

    ```
    java.lang.IllegalArgumentException: argument "src" is null
      at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4413)
      at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3157)
      at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue(ScalaObjectMapper.scala:187)
      at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue$(ScalaObjectMapper.scala:186)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$$anon$1.readValue(RebaseDateTime.scala:267)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.loadRebaseRecords(RebaseDateTime.scala:269)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<init>(RebaseDateTime.scala:291)
      at org.apache.spark.sql.catalyst.util.RebaseDateTime$.<clinit>(RebaseDateTime.scala)
      at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
      at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
      at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
      at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
    ```

    ```
    java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.util.RebaseDateTime$
      at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
      at org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
      at org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
      at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
      at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
      at scala.collection.immutable.Stream.flatMap(Stream.scala:493)
      at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
      at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:749)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
      at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
      at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
      at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
      at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:747)
      at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1273)
    ```

    The reproduce steps:
    1. Set custom `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`.
    2. `CREATE TABLE t (c int) PARTITIONED BY (p date)`
    3. `SET spark.sql.legacy.timeParserPolicy=LEGACY`
    4. `SELECT * FROM t WHERE p='2021-01-01'`

    Yes, bug fix. Passes `org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite` and adds a new unit test to `HiveSparkSubmitSuite.scala`.

    Closes #31864 from ulysses-you/SPARK-34772.

    Authored-by: ulysses-you
[spark] branch master updated (a48b208 -> 5850956)
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from a48b208  [SPARK-34761][SQL] Support add/subtract of a day-time interval to/from a timestamp
     add 5850956  [SPARK-34772][SQL] RebaseDateTime loadRebaseRecords should use Spark classloader instead of context

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/RebaseDateTime.scala |  3 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala    | 40 +-
 2 files changed, 41 insertions(+), 2 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (0a58029 -> a48b208)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 0a58029  [SPARK-31897][SQL] Enable codegen for GenerateExec
     add a48b208  [SPARK-34761][SQL] Support add/subtract of a day-time interval to/from a timestamp

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 12 +++--
 .../catalyst/expressions/datetimeExpressions.scala | 24 ++---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    | 22 
 .../expressions/DateExpressionsSuite.scala         | 58 +-
 .../catalyst/expressions/LiteralGenerator.scala    | 11 ++--
 .../sql/catalyst/util/DateTimeUtilsSuite.scala     | 35 +
 .../apache/spark/sql/ColumnExpressionSuite.scala   | 48 ++
 7 files changed, 189 insertions(+), 21 deletions(-)
[spark] branch master updated (07ee732 -> 0a58029)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 07ee732  [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
     add 0a58029  [SPARK-31897][SQL] Enable codegen for GenerateExec

No new revisions were added by this update.

Summary of changes:
 .../GenerateExecBenchmark-jdk11-results.txt        | 12 
 .../benchmarks/GenerateExecBenchmark-results.txt   | 12 
 .../apache/spark/sql/execution/GenerateExec.scala  | 23 +++
 .../sql/execution/WholeStageCodegenSuite.scala     | 79 ++
 .../benchmark/GenerateExecBenchmark.scala          | 31 ++---
 5 files changed, 134 insertions(+), 23 deletions(-)
 create mode 100644 sql/core/benchmarks/GenerateExecBenchmark-jdk11-results.txt
 create mode 100644 sql/core/benchmarks/GenerateExecBenchmark-results.txt
 copy external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala => sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/GenerateExecBenchmark.scala (56%)
[GitHub] [spark-website] attilapiros commented on a change in pull request #328: Add Attila Zsolt Piros to committers
attilapiros commented on a change in pull request #328:
URL: https://github.com/apache/spark-website/pull/328#discussion_r597374832

File path: site/committers.html

```
@@ -338,6 +338,10 @@ Current Committers
 Holden Karau
 Apple
+
+Attila Zsolt Piros
+Cloudera
+
```

Review comment:
   I am testing my check action ;)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [spark-website] HyukjinKwon commented on a change in pull request #328: Add Attila Zsolt Piros to committers
HyukjinKwon commented on a change in pull request #328:
URL: https://github.com/apache/spark-website/pull/328#discussion_r597374541

File path: site/committers.html

```
@@ -338,6 +338,10 @@ Current Committers
 Holden Karau
 Apple
+
+Attila Zsolt Piros
+Cloudera
+
```

Review comment:
   Don't forget to update htmls too man :-)
[GitHub] [spark-website] HeartSaVioR commented on a change in pull request #328: Add Attila Zsolt Piros to committers
HeartSaVioR commented on a change in pull request #328:
URL: https://github.com/apache/spark-website/pull/328#discussion_r597371962

File path: committers.md

```
@@ -42,6 +42,7 @@ navigation:
 |Kazuaki Ishizaki|IBM|
 |Xingbo Jiang|Databricks|
 |Holden Karau|Apple|
+|Attila Zsolt Piros|Cloudera|
```

Review comment:
   Same comment here. (I agree this is quite confusing.) Actually, there's a convention in the list if I understand correctly (I don't know where this is documented): the list is alphabetically sorted by last name, not first name. Could you please check the sequence and update it accordingly if the list follows that pattern?
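The convention being pointed out — committers sorted alphabetically by last name rather than first name — is easy to check mechanically. A small sketch using the names visible in the diff context above (the naive `split()[-1]` key is an assumption; it would mishandle multi-word surnames):

```python
committers = [
    "Kazuaki Ishizaki",
    "Xingbo Jiang",
    "Holden Karau",
    "Attila Zsolt Piros",
]


def last_name(full_name):
    # Sort key: the final whitespace-separated token, i.e. the last name.
    return full_name.split()[-1]


# The slice of the table shown in the diff is already in last-name order,
# so the proposed insertion point after "Holden Karau" is correct.
assert [last_name(n) for n in committers] == ["Ishizaki", "Jiang", "Karau", "Piros"]
assert committers == sorted(committers, key=last_name)
```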
[GitHub] [spark-website] attilapiros opened a new pull request #328: Add Attila Zsolt Piros to committers
attilapiros opened a new pull request #328:
URL: https://github.com/apache/spark-website/pull/328
[GitHub] [spark-website] hzxiongyinke commented on pull request #325: Add Kent Yao to committers
hzxiongyinke commented on pull request #325:
URL: https://github.com/apache/spark-website/pull/325#issuecomment-802494343

   Congrats!
[spark] branch branch-3.1 updated (1b70aad -> c2629a7)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 1b70aad  [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
     add c2629a7  [SPARK-34719][SQL][3.1] Correctly resolve the view query with duplicated column names

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/analysis/view.scala |  44 +---
 .../spark/sql/execution/SQLViewTestSuite.scala    |  48 ++
 2 files changed, 86 insertions(+), 6 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new a3f55ca  [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
a3f55ca is described below

commit a3f55ca4258b2f48ad9025be20afa9c6c46a189a
Author: Kousuke Saruta
AuthorDate: Fri Mar 19 10:19:26 2021 +0900

    [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document

    ### What changes were proposed in this pull request?

    This PR fixes an issue where the virtual operators (`||`, `!=`, `<>`, `between` and `case`) are absent from the Spark SQL built-in functions document.

    ### Why are the changes needed?

    The document should explain all the supported built-in operators.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Built the document with `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 bundler exec jekyll build` and then confirmed the document.

    ![neq1](https://user-images.githubusercontent.com/4736016/92859-e2e76380-85fc-11eb-89c9-75916a5e856a.png)
    ![neq2](https://user-images.githubusercontent.com/4736016/92874-e7ac1780-85fc-11eb-9a9b-c504265b373f.png)
    ![between](https://user-images.githubusercontent.com/4736016/92898-eda1f880-85fc-11eb-992d-cf80c544ec27.png)
    ![case](https://user-images.githubusercontent.com/4736016/92918-f266ac80-85fc-11eb-9306-5dbc413a0cdb.png)
    ![double_pipe](https://user-images.githubusercontent.com/4736016/92952-fb577e00-85fc-11eb-932e-385e5c2a5205.png)

    Closes #31841 from sarutak/builtin-op-doc.
Authored-by: Kousuke Saruta
Signed-off-by: HyukjinKwon
(cherry picked from commit 07ee73234f1d1ecd1e5edcce3bc510c59a59cb00)
Signed-off-by: HyukjinKwon
---
 .../spark/sql/execution/command/functions.scala |   6 +-
 sql/gen-sql-api-docs.py                         | 102 -
 2 files changed, 105 insertions(+), 3 deletions(-)

```diff
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
index 8ab7d3d..fd11c3b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
@@ -229,8 +229,10 @@ case class ShowFunctionsCommand(
       case (f, "USER") if showUserFunctions => f.unquotedString
       case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
     }
-    // Hard code "<>", "!=", "between", and "case" for now as there is no corresponding functions.
-    // "<>", "!=", "between", and "case" is SystemFunctions, only show when showSystemFunctions=true
+    // Hard code "<>", "!=", "between", "case", and "||"
+    // for now as there is no corresponding functions.
+    // "<>", "!=", "between", "case", and "||" is SystemFunctions,
+    // only show when showSystemFunctions=true
     if (showSystemFunctions) {
       (functionNames ++
         StringUtils.filterPattern(FunctionsCommand.virtualOperators, pattern.getOrElse("*")))
diff --git a/sql/gen-sql-api-docs.py b/sql/gen-sql-api-docs.py
index 6132899..8465b9c 100644
--- a/sql/gen-sql-api-docs.py
+++ b/sql/gen-sql-api-docs.py
@@ -24,6 +24,106 @@ from pyspark.java_gateway import launch_gateway
 ExpressionInfo = namedtuple(
     "ExpressionInfo",
     "className name usage arguments examples note since deprecated")

+_virtual_operator_infos = [
+    ExpressionInfo(
+        className="",
+        name="!=",
+        usage="expr1 != expr2 - Returns true if `expr1` is not equal to `expr2`, " +
+            "or false otherwise.",
+        arguments="\nArguments:\n " +
+            """* expr1, expr2 - the two expressions must be same type or can be casted to
+                a common type, and must be a type that can be used in equality comparison.
+                Map type is not supported. For complex types such array/struct,
+                the data types of fields must be orderable.""",
+        examples="\nExamples:\n " +
+            "> SELECT 1 != 2;\n " +
+            " true\n " +
+            "> SELECT 1 != '2';\n " +
+            " true\n " +
+            "> SELECT true != NULL;\n " +
+            " NULL\n " +
+            "> SELECT NULL != NULL;\n " +
+            " NULL",
+        note="",
+        since="1.0.0",
+        deprecated=""),
+    ExpressionInfo(
+        className="",
+        name="<>",
+        usage="expr1 != expr2 - Returns true if `expr1` is not equal to `expr2`, " +
+            "or false otherwise.",
+        arguments="\nArgu
```
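The doc entries being generated above spell out SQL's three-valued semantics for `!=`: comparing anything with NULL yields NULL, not false. A small Python model of just that rule, mirroring the `SELECT` examples embedded in the `_virtual_operator_infos` entry (using `None` to stand in for SQL NULL):

```python
def sql_neq(a, b):
    """SQL `!=` under three-valued logic: NULL if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


# Mirrors the examples in the generated doc entry:
assert sql_neq(1, 2) is True        # SELECT 1 != 2;        -> true
assert sql_neq(True, None) is None  # SELECT true != NULL;  -> NULL
assert sql_neq(None, None) is None  # SELECT NULL != NULL;  -> NULL
```

(The remaining documented example, `SELECT 1 != '2'`, involves Spark's implicit casting of `'2'` to an integer, which this sketch deliberately does not model.)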
[spark] branch branch-3.1 updated: [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 1b70aad  [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document
1b70aad is described below

commit 1b70aadb62ba87b3ffe130ad2049a4aa0aa877aa
Author: Kousuke Saruta
AuthorDate: Fri Mar 19 10:19:26 2021 +0900

    [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document

    ### What changes were proposed in this pull request?

    This PR fixes an issue where the virtual operators (`||`, `!=`, `<>`, `between` and `case`) are absent from the Spark SQL built-in functions document.

    ### Why are the changes needed?

    The document should explain all the supported built-in operators.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Built the document with `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 bundler exec jekyll build` and then confirmed the document.

    ![neq1](https://user-images.githubusercontent.com/4736016/92859-e2e76380-85fc-11eb-89c9-75916a5e856a.png)
    ![neq2](https://user-images.githubusercontent.com/4736016/92874-e7ac1780-85fc-11eb-9a9b-c504265b373f.png)
    ![between](https://user-images.githubusercontent.com/4736016/92898-eda1f880-85fc-11eb-992d-cf80c544ec27.png)
    ![case](https://user-images.githubusercontent.com/4736016/92918-f266ac80-85fc-11eb-9306-5dbc413a0cdb.png)
    ![double_pipe](https://user-images.githubusercontent.com/4736016/92952-fb577e00-85fc-11eb-932e-385e5c2a5205.png)

    Closes #31841 from sarutak/builtin-op-doc.
Authored-by: Kousuke Saruta Signed-off-by: HyukjinKwon (cherry picked from commit 07ee73234f1d1ecd1e5edcce3bc510c59a59cb00) Signed-off-by: HyukjinKwon --- .../spark/sql/execution/command/functions.scala| 6 +- sql/gen-sql-api-docs.py| 102 - 2 files changed, 105 insertions(+), 3 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala index 5c511ec..07c8148 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala @@ -229,8 +229,10 @@ case class ShowFunctionsCommand( case (f, "USER") if showUserFunctions => f.unquotedString case (f, "SYSTEM") if showSystemFunctions => f.unquotedString } -// Hard code "<>", "!=", "between", and "case" for now as there is no corresponding functions. -// "<>", "!=", "between", and "case" is SystemFunctions, only show when showSystemFunctions=true +// Hard code "<>", "!=", "between", "case", and "||" +// for now as there is no corresponding functions. 
+// "<>", "!=", "between", "case", and "||" is SystemFunctions, +// only show when showSystemFunctions=true if (showSystemFunctions) { (functionNames ++ StringUtils.filterPattern(FunctionsCommand.virtualOperators, pattern.getOrElse("*"))) diff --git a/sql/gen-sql-api-docs.py b/sql/gen-sql-api-docs.py index 2f73409..17631a7 100644 --- a/sql/gen-sql-api-docs.py +++ b/sql/gen-sql-api-docs.py @@ -24,6 +24,106 @@ from pyspark.java_gateway import launch_gateway ExpressionInfo = namedtuple( "ExpressionInfo", "className name usage arguments examples note since deprecated") +_virtual_operator_infos = [ +ExpressionInfo( +className="", +name="!=", +usage="expr1 != expr2 - Returns true if `expr1` is not equal to `expr2`, " + + "or false otherwise.", +arguments="\nArguments:\n " + + """* expr1, expr2 - the two expressions must be same type or can be casted to + a common type, and must be a type that can be used in equality comparison. + Map type is not supported. For complex types such array/struct, + the data types of fields must be orderable.""", +examples="\nExamples:\n " + + "> SELECT 1 != 2;\n " + + " true\n " + + "> SELECT 1 != '2';\n " + + " true\n " + + "> SELECT true != NULL;\n " + + " NULL\n " + + "> SELECT NULL != NULL;\n " + + " NULL", +note="", +since="1.0.0", +deprecated=""), +ExpressionInfo( +className="", +name="<>", +usage="expr1 != expr2 - Returns true if `expr1` is not equal to `expr2`, " + + "or false otherwise.", +arguments="\nArgu
[spark] branch master updated (8207e2f -> 07ee732)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8207e2f [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE add 07ee732 [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/functions.scala| 6 +- sql/gen-sql-api-docs.py| 102 - 2 files changed, 105 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8207e2f [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE 8207e2f is described below commit 8207e2f65cc2ce2d87ee60ee05a2c1ee896cf93e Author: Cheng Su AuthorDate: Fri Mar 19 09:41:52 2021 +0900 [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE ### What changes were proposed in this pull request? In `EliminateJoinToEmptyRelation.scala`, we can extend it to cover more cases for LEFT SEMI and LEFT ANTI joins: * Join is a left semi join, the join's right side is non-empty, and the condition is empty. Eliminate the join to its left side. * Join is a left anti join and the join's right side is empty. Eliminate the join to its left side. Given we eliminate the join to its left side here, the current optimization rule is renamed to `EliminateUnnecessaryJoin`. In addition, also change to use `checkRowCount()` to check the runtime row count, instead of using `EmptyHashedRelation`, so this can cover `BroadcastNestedLoopJoin` as well (`BroadcastNestedLoopJoin`'s broadcast side is `Array[InternalRow]`, not `HashedRelation`). ### Why are the changes needed? Cover more join cases, and improve query performance for affected queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests in `AdaptiveQueryExecSuite.scala`. Closes #31873 from c21/aqe-join.
Authored-by: Cheng Su Signed-off-by: Takeshi Yamamuro --- .../sql/execution/adaptive/AQEOptimizer.scala | 2 +- .../adaptive/EliminateJoinToEmptyRelation.scala| 71 - .../adaptive/EliminateUnnecessaryJoin.scala| 91 ++ .../spark/sql/DynamicPartitionPruningSuite.scala | 2 +- .../adaptive/AdaptiveQueryExecSuite.scala | 51 5 files changed, 127 insertions(+), 90 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala index 04b8ade..901637d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala @@ -29,7 +29,7 @@ class AQEOptimizer(conf: SQLConf) extends RuleExecutor[LogicalPlan] { private val defaultBatches = Seq( Batch("Demote BroadcastHashJoin", Once, DemoteBroadcastHashJoin), -Batch("Eliminate Join to Empty Relation", Once, EliminateJoinToEmptyRelation) +Batch("Eliminate Unnecessary Join", Once, EliminateUnnecessaryJoin) ) final override protected def batches: Seq[Batch] = { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala deleted file mode 100644 index d6df522..000 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.execution.adaptive - -import org.apache.spark.sql.catalyst.planning.ExtractSingleColumnNullAwareAntiJoin -import org.apache.spark.sql.catalyst.plans.{Inner, LeftAnti, LeftSemi} -import org.apache.spark.sql.catalyst.plans.logical.{Join, LocalRelation, LogicalPlan} -import org.apache.spark.sql.catalyst.rules.Rule -import org.apache.spark.sql.execution.joins.{EmptyHashedRelation, HashedRelation, HashedRelationWithAllNullKeys} - -/** - * This optimization rule detects and converts a Join to an empty [[LocalRelation]]: - * 1. Join is single column NULL-aware anti join (NAAJ), and broadcasted [[HashedRelation]] - *is [[HashedRelationWithAllNullKeys]]. - * - * 2. Join is inner or left semi join, and broadcasted [[HashedRelation]] - *
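The elimination cases described in the commit message can be sketched as a plain decision function. This is an illustrative Python model only, not Spark's actual Scala rule; the function and return values are hypothetical:

```python
# Sketch of when a left semi/anti join can be rewritten, per the commit
# description: the runtime row count of the (broadcast) right side plus the
# join condition determine the rewrite.

def eliminate_join(join_type, condition, right_row_count):
    """Return what the join should be rewritten to, or None to keep it."""
    if join_type == "left_semi" and condition is None and right_row_count > 0:
        return "left_child"      # every left row trivially finds a match
    if join_type == "left_semi" and right_row_count == 0:
        return "empty_relation"  # no left row can possibly find a match
    if join_type == "left_anti" and right_row_count == 0:
        return "left_child"      # nothing on the right to anti-filter against
    return None                  # keep the join as-is

print(eliminate_join("left_semi", None, 10))  # left_child
```

Checking a runtime row count rather than inspecting `EmptyHashedRelation` is what lets the real rule cover `BroadcastNestedLoopJoin`, whose broadcast side is not a `HashedRelation` at all.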
[spark] branch branch-3.0 updated: [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 71a2f48 [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample 71a2f48 is described below commit 71a2f48c6b61822f775e9603ef604187c9e0d081 Author: zengruios <578395...@qq.com> AuthorDate: Thu Mar 18 22:53:58 2021 +0800 [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample ### What changes were proposed in this pull request? In JavaSparkSQLExample, executing 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");' throws an exception: 'Exception in thread "main" org.apache.spark.sql.AnalysisException: partition column favorite_color is not defined in table people_partitioned_bucketed, defined table columns are: age, name;' Change the column favorite_color to age. ### Why are the changes needed? Run JavaSparkSQLExample successfully. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested in JavaSparkSQLExample. Closes #31851 from zengruios/SPARK-34760.
Authored-by: zengruios <578395...@qq.com> Signed-off-by: Kent Yao (cherry picked from commit 5570f817b2862c2680546f35c412bb06779ae1c9) Signed-off-by: Kent Yao --- .../spark/examples/sql/JavaSQLDataSourceExample.java | 6 +++--- .../apache/spark/examples/sql/JavaSparkSQLExample.java | 16 examples/src/main/python/sql/datasource.py | 4 ++-- 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java index 2295225..f4d4329 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java @@ -188,15 +188,15 @@ public class JavaSQLDataSourceExample { .save("namesPartByColor.parquet"); // $example off:write_partitioning$ // $example on:write_partition_and_bucket$ -peopleDF +usersDF .write() .partitionBy("favorite_color") .bucketBy(42, "name") - .saveAsTable("people_partitioned_bucketed"); + .saveAsTable("users_partitioned_bucketed"); // $example off:write_partition_and_bucket$ spark.sql("DROP TABLE IF EXISTS people_bucketed"); -spark.sql("DROP TABLE IF EXISTS people_partitioned_bucketed"); +spark.sql("DROP TABLE IF EXISTS users_partitioned_bucketed"); } private static void runBasicParquetExample(SparkSession spark) { diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java index 8605852..86a9045 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java @@ -65,7 +65,7 @@ public class JavaSparkSQLExample { // $example on:create_ds$ public static class Person implements Serializable { private String name; -private int age; +private long age; public String 
getName() { return name; @@ -75,11 +75,11 @@ public class JavaSparkSQLExample { this.name = name; } -public int getAge() { +public long getAge() { return age; } -public void setAge(int age) { +public void setAge(long age) { this.age = age; } } @@ -225,11 +225,11 @@ public class JavaSparkSQLExample { // +---++ // Encoders for most common types are provided in class Encoders -Encoder integerEncoder = Encoders.INT(); -Dataset primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder); -Dataset transformedDS = primitiveDS.map( -(MapFunction) value -> value + 1, -integerEncoder); +Encoder longEncoder = Encoders.LONG(); +Dataset primitiveDS = spark.createDataset(Arrays.asList(1L, 2L, 3L), longEncoder); +Dataset transformedDS = primitiveDS.map( +(MapFunction) value -> value + 1L, +longEncoder); transformedDS.collect(); // Returns [2, 3, 4] // DataFrames can be converted to a Dataset by providing a class. Mapping based on name diff --git a/examples/src/main/python/sql/datasource.py b/examples/src/main/python/sql/datasource.py index 9f8fdd7..29bea14 100644 --- a/examples/src/main/python/sql/datasource.py +++ b/examples/src/main/python/sql/datasource.py @@ -86,7 +86,7 @@ def basic_datasource_example(spark): .write .partitionBy("favorite_color") .bucketBy(42
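The root cause of the `AnalysisException` above is that every column passed to `partitionBy()` must be defined in the DataFrame being written. A small Python stand-in for that analysis check (hypothetical names, not Spark's actual code):

```python
# Minimal sketch of the validation that made the original example fail:
# a partition column absent from the table's columns is rejected.

def check_partition_columns(table_columns, partition_columns):
    missing = [c for c in partition_columns if c not in table_columns]
    if missing:
        raise ValueError(
            "partition column %s is not defined in table, "
            "defined table columns are: %s"
            % (missing[0], ", ".join(table_columns)))

# peopleDF only carries (age, name), so partitioning by favorite_color fails...
try:
    check_partition_columns(["age", "name"], ["favorite_color"])
except ValueError as e:
    print("rejected:", e)

# ...while usersDF, which carries favorite_color, is accepted.
check_partition_columns(["name", "favorite_color"], ["favorite_color"])
```

This is why the fix switches the example from `peopleDF` to `usersDF`: the `users` dataset actually contains the `favorite_color` column being partitioned on.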
[spark] branch branch-3.1 updated: [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new a00235b [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample a00235b is described below commit a00235b004ceb345e599096f176b2176b0501ea4 Author: zengruios <578395...@qq.com> AuthorDate: Thu Mar 18 22:53:58 2021 +0800 [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample ### What changes were proposed in this pull request? In JavaSparkSQLExample, executing 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");' throws an exception: 'Exception in thread "main" org.apache.spark.sql.AnalysisException: partition column favorite_color is not defined in table people_partitioned_bucketed, defined table columns are: age, name;' Change the column favorite_color to age. ### Why are the changes needed? Run JavaSparkSQLExample successfully. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested in JavaSparkSQLExample. Closes #31851 from zengruios/SPARK-34760.
Authored-by: zengruios <578395...@qq.com> Signed-off-by: Kent Yao (cherry picked from commit 5570f817b2862c2680546f35c412bb06779ae1c9) Signed-off-by: Kent Yao --- .../spark/examples/sql/JavaSQLDataSourceExample.java | 6 +++--- .../apache/spark/examples/sql/JavaSparkSQLExample.java | 16 examples/src/main/python/sql/datasource.py | 4 ++-- 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java index 46e740d..cb34db1 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java @@ -204,15 +204,15 @@ public class JavaSQLDataSourceExample { .save("namesPartByColor.parquet"); // $example off:write_partitioning$ // $example on:write_partition_and_bucket$ -peopleDF +usersDF .write() .partitionBy("favorite_color") .bucketBy(42, "name") - .saveAsTable("people_partitioned_bucketed"); + .saveAsTable("users_partitioned_bucketed"); // $example off:write_partition_and_bucket$ spark.sql("DROP TABLE IF EXISTS people_bucketed"); -spark.sql("DROP TABLE IF EXISTS people_partitioned_bucketed"); +spark.sql("DROP TABLE IF EXISTS users_partitioned_bucketed"); } private static void runBasicParquetExample(SparkSession spark) { diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java index 8605852..86a9045 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java @@ -65,7 +65,7 @@ public class JavaSparkSQLExample { // $example on:create_ds$ public static class Person implements Serializable { private String name; -private int age; +private long age; public String 
getName() { return name; @@ -75,11 +75,11 @@ public class JavaSparkSQLExample { this.name = name; } -public int getAge() { +public long getAge() { return age; } -public void setAge(int age) { +public void setAge(long age) { this.age = age; } } @@ -225,11 +225,11 @@ public class JavaSparkSQLExample { // +---++ // Encoders for most common types are provided in class Encoders -Encoder integerEncoder = Encoders.INT(); -Dataset primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder); -Dataset transformedDS = primitiveDS.map( -(MapFunction) value -> value + 1, -integerEncoder); +Encoder longEncoder = Encoders.LONG(); +Dataset primitiveDS = spark.createDataset(Arrays.asList(1L, 2L, 3L), longEncoder); +Dataset transformedDS = primitiveDS.map( +(MapFunction) value -> value + 1L, +longEncoder); transformedDS.collect(); // Returns [2, 3, 4] // DataFrames can be converted to a Dataset by providing a class. Mapping based on name diff --git a/examples/src/main/python/sql/datasource.py b/examples/src/main/python/sql/datasource.py index 8c146ba..3bc31a0 100644 --- a/examples/src/main/python/sql/datasource.py +++ b/examples/src/main/python/sql/datasource.py @@ -104,7 +104,7 @@ def basic_datasource_example(spark): .write .partitionBy("favorite_color") .bucketBy(
[spark] branch branch-3.1 updated (99d4ed6 -> a4f6049)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git. from 99d4ed6 [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly add a4f6049 [SPARK-34766][SQL][3.1] Do not capture maven config for views No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/sql/execution/command/views.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new fa13a01 [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly fa13a01 is described below commit fa13a013a2bb316a7d6fcce58f0d721bcc19588f Author: yangjie01 AuthorDate: Thu Mar 18 07:33:23 2021 -0500 [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly ### What changes were proposed in this pull request? After SPARK-34507, executing the `change-scala-version.sh` script will update `scala.version` in the parent POM, but if we execute the following commands in order: ``` dev/change-scala-version.sh 2.13 dev/change-scala-version.sh 2.12 git status ``` the following git diff is generated: ``` diff --git a/pom.xml b/pom.xml index ddc4ce2f68..f43d8c8f78 100644 --- a/pom.xml +++ b/pom.xml -162,7 +162,7 3.4.1 3.2.2 -2.12.10 +2.13.5 2.12 2.0.0 --test ``` It seems the 'scala.version' property was not updated correctly. So this PR adds an extra 'scala.version' property to the scala-2.12 profile to ensure change-scala-version.sh can update the public `scala.version` property correctly. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? **Manual test** Execute the following commands in order: ``` dev/change-scala-version.sh 2.13 dev/change-scala-version.sh 2.12 git status ``` **Before** ``` diff --git a/pom.xml b/pom.xml index ddc4ce2f68..f43d8c8f78 100644 --- a/pom.xml +++ b/pom.xml -162,7 +162,7 3.4.1 3.2.2 -2.12.10 +2.13.5 2.12 2.0.0 --test ``` **After** No git diff. Closes #31865 from LuciferYang/SPARK-34774.
Authored-by: yangjie01 Signed-off-by: Sean Owen (cherry picked from commit 2e836cdb598255d6b43b386e98fcb79b70338e69) Signed-off-by: Sean Owen --- pom.xml | 7 +++ 1 file changed, 7 insertions(+) diff --git a/pom.xml b/pom.xml index 80fcc55..1a42165 100644 --- a/pom.xml +++ b/pom.xml @@ -3173,6 +3173,13 @@ scala-2.12 + + +2.12.10 + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
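The property rewrite that `dev/change-scala-version.sh` performs can be sketched in Python. This is a rough analogue only (an assumption about the script's effect, not its actual sed command); the real script also updates `scala.binary.version`, and after this fix it relies on a copy of the property pinned inside the scala-2.12 profile so that switching to 2.13 and back round-trips cleanly:

```python
import re

# Rewrite the public <scala.version> property in a POM fragment, as the
# version-switch script effectively does across the build files.
def set_scala_version(pom_text, new_version):
    return re.sub(r"<scala\.version>[^<]+</scala\.version>",
                  "<scala.version>%s</scala.version>" % new_version,
                  pom_text)

pom = "<properties><scala.version>2.12.10</scala.version></properties>"
print(set_scala_version(pom, "2.13.5"))
```

Because a blind textual substitution hits every occurrence, keeping a per-profile `scala.version` override is what lets the public property survive a 2.13 → 2.12 round trip unchanged.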
[spark] branch branch-3.1 updated (dd825e8 -> 99d4ed6)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git. from dd825e8 [SPARK-34731][CORE] Avoid ConcurrentModificationException when redacting properties in EventLoggingListener add 99d4ed6 [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly No new revisions were added by this update. Summary of changes: pom.xml | 7 +++ 1 file changed, 7 insertions(+)
[spark] branch master updated (d99135b -> 2e836cd)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d99135b [SPARK-34741][SQL] MergeIntoTable should avoid ambiguous reference in UpdateAction add 2e836cd [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly No new revisions were added by this update. Summary of changes: pom.xml | 7 +++ 1 file changed, 7 insertions(+)
[spark] branch master updated (86ea520 -> d99135b)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 86ea520 [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit add d99135b [SPARK-34741][SQL] MergeIntoTable should avoid ambiguous reference in UpdateAction No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/analysis/Analyzer.scala | 3 +++ .../spark/sql/catalyst/plans/logical/v2Commands.scala | 1 + .../spark/sql/catalyst/analysis/AnalysisSuite.scala | 13 + .../ReplaceNullWithFalseInPredicateSuite.scala| 19 +-- 4 files changed, 30 insertions(+), 6 deletions(-)