[jira] [Commented] (SPARK-42966) Make memory stream implement SupportsReportStatistics
[ https://issues.apache.org/jira/browse/SPARK-42966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706520#comment-17706520 ]

Shixiong Zhu commented on SPARK-42966:
--------------------------------------

cc [~kabhwan]

> Make memory stream implement SupportsReportStatistics
> -----------------------------------------------------
>
> Key: SPARK-42966
> URL: https://issues.apache.org/jira/browse/SPARK-42966

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42966) Make memory stream implement SupportsReportStatistics
Shixiong Zhu created SPARK-42966:
---------------------------------

Summary: Make memory stream implement SupportsReportStatistics
Key: SPARK-42966
URL: https://issues.apache.org/jira/browse/SPARK-42966
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Shixiong Zhu

Memory stream is very convenient for writing Structured Streaming tests. However, it doesn't implement SupportsReportStatistics. It would be great to implement SupportsReportStatistics so that we can also use memory stream to write tests that need to use stats.
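To make the request concrete, here is a minimal, Spark-free sketch of the shape such an implementation could take. The `Statistics` and `SupportsReportStatistics` traits below only mirror the names of the DataSource V2 interfaces in `org.apache.spark.sql.connector.read`; `ToyMemoryStream`, its `addData` method, and the 2-bytes-per-char size estimate are purely illustrative assumptions, not Spark code.

```scala
// Spark-free model of the DSv2 statistics contract. In real Spark these
// traits live in org.apache.spark.sql.connector.read.
import java.util.OptionalLong

trait Statistics {
  def sizeInBytes: OptionalLong
  def numRows: OptionalLong
}

trait SupportsReportStatistics {
  def estimateStatistics(): Statistics
}

// A toy in-memory "stream" that tracks the rows added so far and reports
// them as statistics, the way a MemoryStream-backed scan could.
class ToyMemoryStream extends SupportsReportStatistics {
  private var rows = Vector.empty[String]

  def addData(data: String*): Unit = rows ++= data

  override def estimateStatistics(): Statistics = new Statistics {
    // Rough size estimate: 2 bytes per char (an assumption for illustration).
    override def sizeInBytes: OptionalLong =
      OptionalLong.of(rows.map(_.length.toLong * 2).sum)
    override def numRows: OptionalLong =
      OptionalLong.of(rows.size.toLong)
  }
}
```

A real implementation would report the statistics of the data backing the current scan, so tests that exercise stats-driven code paths could feed them from a memory stream.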
[jira] [Created] (SPARK-41885) --packages may not work on Windows 11
Shixiong Zhu created SPARK-41885:
---------------------------------

Summary: --packages may not work on Windows 11
Key: SPARK-41885
URL: https://issues.apache.org/jira/browse/SPARK-41885
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.1
Environment: Hadoop 2.7 on Windows 11
Reporter: Shixiong Zhu

Gastón Ortiz reported an issue when using Spark 3.2.1 and Hadoop 2.7 on Windows 11. See [https://github.com/delta-io/delta/issues/1059]

It looks like the executor cannot fetch the jar files. See the critical part of the stack trace below (the full stack trace is in [https://github.com/delta-io/delta/issues/1059]):

{code:java}
org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366)
	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:762)
	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:549)
	at org.apache.spark.executor.Executor.$anonfun$updateDependencies$13(Executor.scala:962)
	at org.apache.spark.executor.Executor.$anonfun$updateDependencies$13$adapted(Executor.scala:954)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
	at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:954)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:247)
	at
{code}

This is not a Delta Lake issue, as it can also be reproduced by running `pyspark --packages org.apache.kafka:kafka-clients:2.8.1`. I don't have a Windows 11 environment to debug in, so I helped Gastón Ortiz create this ticket; it would be great if anyone who has a Windows 11 environment could help investigate.
[jira] [Resolved] (SPARK-41040) Self-union streaming query may fail when using readStream.table
[ https://issues.apache.org/jira/browse/SPARK-41040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-41040.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request https://github.com/apache/spark/pull/38553

> Self-union streaming query may fail when using readStream.table
> ---------------------------------------------------------------
>
> Key: SPARK-41040
> URL: https://issues.apache.org/jira/browse/SPARK-41040
[jira] [Resolved] (SPARK-41045) Pre-compute to eliminate ScalaReflection calls after deserializer is created
[ https://issues.apache.org/jira/browse/SPARK-41045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-41045.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request https://github.com/apache/spark/pull/38556

> Pre-compute to eliminate ScalaReflection calls after deserializer is created
> ----------------------------------------------------------------------------
>
> Key: SPARK-41045
> URL: https://issues.apache.org/jira/browse/SPARK-41045
[jira] [Assigned] (SPARK-41045) Pre-compute to eliminate ScalaReflection calls after deserializer is created
[ https://issues.apache.org/jira/browse/SPARK-41045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reassigned SPARK-41045:
------------------------------------
Assignee: Shixiong Zhu

> Pre-compute to eliminate ScalaReflection calls after deserializer is created
> ----------------------------------------------------------------------------
>
> Key: SPARK-41045
> URL: https://issues.apache.org/jira/browse/SPARK-41045
[jira] [Created] (SPARK-41045) Pre-compute to eliminate ScalaReflection calls after deserializer is created
Shixiong Zhu created SPARK-41045:
---------------------------------

Summary: Pre-compute to eliminate ScalaReflection calls after deserializer is created
Key: SPARK-41045
URL: https://issues.apache.org/jira/browse/SPARK-41045
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Shixiong Zhu

Currently, when `ScalaReflection` returns a deserializer for a few complex types, such as array, map, and UDT, it creates functions that may still touch `ScalaReflection` after the deserializer is created. `ScalaReflection` is a performance bottleneck for multiple threads, as it holds multiple global locks. We can refactor `ScalaReflection.deserializerFor` to pre-compute everything that needs to touch `ScalaReflection` before creating the deserializer. After this, once the deserializer is created, it can be reused by multiple threads without touching `ScalaReflection.deserializerFor` any more, and it will be much faster.
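The proposed refactoring can be illustrated without Spark. In the sketch below, `SlowReflection` is a stand-in for `ScalaReflection` (an expensive lookup behind a global lock), and the two hypothetical `deserializer*` builders show the before/after shapes: the first re-enters the lock on every row, the second touches it exactly once, at build time.

```scala
// Stand-in for ScalaReflection: a global lock guarding an expensive lookup.
object SlowReflection {
  private val lock = new Object
  def elementConverter(typeName: String): String => Any = lock.synchronized {
    typeName match {
      case "long" => (s: String) => s.toLong
      case _      => (s: String) => s
    }
  }
}

// Problematic shape: the returned function touches SlowReflection (and its
// lock) on every single call, from every thread.
def deserializerLazy(typeName: String): String => Any =
  (s: String) => SlowReflection.elementConverter(typeName)(s)

// Pre-computed shape: resolve the converter once while building the
// deserializer; the returned closure is lock-free afterwards and can be
// shared by many threads.
def deserializerPrecomputed(typeName: String): String => Any = {
  val convert = SlowReflection.elementConverter(typeName) // one-time cost
  (s: String) => convert(s)
}
```

Both builders produce the same results; only where the lock is taken differs.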
[jira] [Assigned] (SPARK-41040) Self-union streaming query may fail when using readStream.table
[ https://issues.apache.org/jira/browse/SPARK-41040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reassigned SPARK-41040:
------------------------------------
Assignee: Shixiong Zhu

> Self-union streaming query may fail when using readStream.table
> ---------------------------------------------------------------
>
> Key: SPARK-41040
> URL: https://issues.apache.org/jira/browse/SPARK-41040
[jira] [Created] (SPARK-41040) Self-union streaming query may fail when using readStream.table
Shixiong Zhu created SPARK-41040:
---------------------------------

Summary: Self-union streaming query may fail when using readStream.table
Key: SPARK-41040
URL: https://issues.apache.org/jira/browse/SPARK-41040
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Shixiong Zhu

In a self-union query, the batch plan created by the source will be shared by multiple nodes in the plan. When we transform the plan, the batch plan will be visited multiple times. Hence, the first visit will set the CatalogTable and the second visit will try to set it again and fail the query at [this line|https://github.com/apache/spark/blob/406d0e243cfec9b29df946e1a0e20ed5fe25e152/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L625].
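The failure mode can be modeled with toy types: in a self-union, the same source node object appears at two places in the plan, so a transform visits it twice, and a strict "set exactly once" assertion fails on the second visit. A guard that skips nodes whose CatalogTable is already populated makes the transform idempotent. `ToyBatchPlan` and both setters are illustrative names, not Spark's real plan classes.

```scala
// Toy model of a batch plan node that carries an optional CatalogTable.
final class ToyBatchPlan {
  private var catalogTable: Option[String] = None

  // Strict setter, like the failing code path: assumes each node is visited
  // exactly once, and throws on a second visit.
  def setCatalogTable(name: String): Unit = {
    require(catalogTable.isEmpty, s"CatalogTable already set: $catalogTable")
    catalogTable = Some(name)
  }

  // Idempotent setter: safe when the same node object is shared by multiple
  // parents (self-union) and therefore visited more than once.
  def setCatalogTableIfMissing(name: String): Unit =
    if (catalogTable.isEmpty) catalogTable = Some(name)

  def table: Option[String] = catalogTable
}
```

With a shared node visited twice, the strict setter fails on the second visit while the idempotent one succeeds.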
[jira] [Commented] (SPARK-40937) Support programming APIs for update/delete/merge
[ https://issues.apache.org/jira/browse/SPARK-40937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625249#comment-17625249 ]

Shixiong Zhu commented on SPARK-40937:
--------------------------------------

cc [~cloud_fan]

> Support programming APIs for update/delete/merge
> ------------------------------------------------
>
> Key: SPARK-40937
> URL: https://issues.apache.org/jira/browse/SPARK-40937
[jira] [Created] (SPARK-40937) Support programming APIs for update/delete/merge
Shixiong Zhu created SPARK-40937:
---------------------------------

Summary: Support programming APIs for update/delete/merge
Key: SPARK-40937
URL: https://issues.apache.org/jira/browse/SPARK-40937
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 3.4.0
Reporter: Shixiong Zhu

Today Spark supports update/delete/merge commands in SQL. It would be great to add programming APIs (Scala, Java, Python, etc.) so that users can call these commands directly rather than manually building SQL statements in their non-SQL applications.
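As a purely hypothetical illustration of what callers would gain (typed method calls instead of hand-assembled SQL strings), here is a toy in-memory table with `update`/`delete` methods. None of these names exist in Spark, and a real API would look different; this only makes the feature request concrete.

```scala
// Hypothetical sketch only: ToyTable stands in for a data source table, and
// update/delete are the programmatic counterparts of the SQL commands.
final class ToyTable(private var rows: Vector[(Int, String)]) {

  // Programmatic equivalent of: DELETE FROM t WHERE <predicate>
  def delete(predicate: ((Int, String)) => Boolean): Long = {
    val before = rows.size
    rows = rows.filterNot(predicate)
    (before - rows.size).toLong // number of rows deleted
  }

  // Programmatic equivalent of: UPDATE t SET <assignments> WHERE <predicate>
  def update(predicate: ((Int, String)) => Boolean,
             set: ((Int, String)) => (Int, String)): Long = {
    var updated = 0L
    rows = rows.map { r => if (predicate(r)) { updated += 1; set(r) } else r }
    updated // number of rows updated
  }

  def contents: Vector[(Int, String)] = rows
}
```

The appeal is that conditions and assignments are ordinary typed functions, so they compose and refactor like any other code instead of living inside SQL strings.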
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582457#comment-17582457 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

Yeah. I would consider this a bug, since the doc of `repartition` explicitly says

{code:java}
Returns a new Dataset that has exactly `numPartitions` partitions.
{code}

> Dataset.repartition(N) may not create N partitions
> --------------------------------------------------
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576319#comment-17576319 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(10, 0).repartition(5).rdd.getNumPartitions
res0: Int = 0
{code}

> Dataset.repartition(N) may not create N partitions
> --------------------------------------------------
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572685#comment-17572685 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

cc [~cloud_fan]

> Dataset.repartition(N) may not create N partitions
> --------------------------------------------------
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
[jira] [Created] (SPARK-39915) Dataset.repartition(N) may not create N partitions
Shixiong Zhu created SPARK-39915:
---------------------------------

Summary: Dataset.repartition(N) may not create N partitions
Key: SPARK-39915
URL: https://issues.apache.org/jira/browse/SPARK-39915
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Shixiong Zhu

It looks like there is a behavior change in Dataset.repartition in 3.3.0. For example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 in Spark 3.2.0, but 0 in Spark 3.3.0.
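For context on the repro: `spark.range(10, 0)` has start = 10 and end = 0, so it is an empty dataset, just like Scala's own `10L until 0L`. The question is whether `repartition(5)` on an empty dataset should still produce exactly 5 (empty) partitions, as its doc promises. The sketch below models that contract on a local collection; `roundRobin` is an illustrative helper, not Spark code.

```scala
// spark.range(10, 0) is empty because start > end, same as a plain range.
val r = 10L until 0L
assert(r.isEmpty)

// Local model of repartition's documented contract: distribute rows
// round-robin into exactly numPartitions buckets.
def roundRobin[A](xs: Seq[A], numPartitions: Int): Seq[Seq[A]] =
  (0 until numPartitions).map { i =>
    xs.zipWithIndex.collect { case (x, j) if j % numPartitions == i => x }
  }

// Even with no data, the contract yields exactly N (empty) partitions --
// which is what Spark 3.2.0 did, and what the doc promises.
assert(roundRobin(r, 5).size == 5)
assert(roundRobin(r, 5).forall(_.isEmpty))
```

With non-empty input the same model spreads rows across exactly N buckets, e.g. `roundRobin(Seq(1, 2, 3, 4, 5, 6, 7), 3)` gives bucket sizes 3, 2, 2.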
[jira] [Resolved] (SPARK-34978) Support time-travel SQL syntax
[ https://issues.apache.org/jira/browse/SPARK-34978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-34978.
----------------------------------
Resolution: Duplicate

This has been added by SPARK-37219.

> Support time-travel SQL syntax
> ------------------------------
>
> Key: SPARK-34978
> URL: https://issues.apache.org/jira/browse/SPARK-34978
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Yann Byron
> Priority: Major
>
> Some data sources have the ability to query an older snapshot, like Delta. The syntax may be like `TIMESTAMP AS OF '2018-10-18 22:15:12'` to query the snapshot closest to this time point, or `VERSION AS OF 12` to query the 12th snapshot.
[jira] [Updated] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
[ https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-38080:
---------------------------------
Description:

{code:java}
[info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** (14 seconds, 304 milliseconds)
[info]   Did not throw SparkException when expected. Expected exception org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but org.scalatest.exceptions.TestFailedException was thrown (StreamTest.scala:935)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935)
[info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
[info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
[info]   at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:221)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
[info]   at org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.S
{code}
[jira] [Updated] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
[ https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-38080:
---------------------------------
Description:

{code:java}
[info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** (14 seconds, 893 milliseconds)
[info]   Did not throw SparkException when expected. Expected exception org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but no exception was thrown (StreamTest.scala:935)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:766)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935)
[info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
[info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
[info]   at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:207)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
[info]   at org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(En
{code}
[jira] [Created] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
Shixiong Zhu created SPARK-38080:
---------------------------------

Summary: Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
Key: SPARK-38080
URL: https://issues.apache.org/jira/browse/SPARK-38080
Project: Spark
Issue Type: Test
Components: Structured Streaming, Tests
Affects Versions: 3.2.1
Reporter: Shixiong Zhu

{code:java}
[info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** (18 seconds, 333 milliseconds)
[info]   Did not throw SparkException when expected. Expected exception org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but no exception was thrown (StreamTest.scala:935)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:766)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935)
[info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
[info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
[info]   at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:207)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397)
[info]   at org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
[info]   at org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$
{code}
[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reassigned SPARK-37646:

Assignee: Shixiong Zhu

> Avoid touching Scala reflection APIs in the lit function
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Major
>
> Currently lit is slow when the concurrency is high as it needs to hit the
> Scala reflection code which hits global locks. For example, running the
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
>
> val parallelism = 50
>
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> println("warmup")
> testLiteral()
> testLit()
>
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
> // Exiting paste mode, now interpreting.
>
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit
> {code}
>

-- This message was sent by Atlassian Jira (v8.20.1#820001)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
Shixiong Zhu created SPARK-37646:

Summary: Avoid touching Scala reflection APIs in the lit function
Key: SPARK-37646
URL: https://issues.apache.org/jira/browse/SPARK-37646
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Shixiong Zhu

Currently lit is slow when the concurrency is high as it needs to hit the Scala reflection code which hits global locks. For example, running the following test locally using Spark 3.2 shows the difference:

{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

val parallelism = 50

def testLiteral(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          new Column(Literal(0L))
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

def testLit(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          lit(0L)
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

println("warmup")
testLiteral()
testLit()

println("lit: false")
spark.time {
  testLiteral()
}
println("lit: true")
spark.time {
  testLit()
}
// Exiting paste mode, now interpreting.

warmup
lit: false
Time taken: 8 ms
lit: true
Time taken: 682 ms
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
parallelism: Int = 50
testLiteral: ()Unit
testLit: ()Unit
{code}

-- This message was sent by Atlassian Jira (v8.20.1#820001)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
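The fix direction for this ticket is to avoid the reflective path for common literal types. A standalone sketch of the idea (names such as `LitSketch`, `fastTag`, and `reflectiveTag` are illustrative, not Spark's actual code): recognize well-known runtime types with a plain pattern match and fall back to the expensive, lock-protected path only for unknown types.

```scala
// Sketch: avoid a slow, lock-protected fallback (standing in for Scala
// reflection) by recognizing common literal types with a pattern match.
object LitSketch {
  // Placeholder for the slow path that would hit global reflection locks.
  def reflectiveTag(v: Any): String = v.getClass.getName

  def fastTag(v: Any): String = v match {
    case _: Int     => "IntegerType"
    case _: Long    => "LongType"
    case _: Double  => "DoubleType"
    case _: String  => "StringType"
    case _: Boolean => "BooleanType"
    case other      => reflectiveTag(other) // only unknown types pay the cost
  }
}
```

Because the match touches no shared state, many threads can call `fastTag` concurrently without contending on a global lock, which is exactly the contention the benchmark above exposes.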
[jira] [Created] (SPARK-36519) Store the RocksDB format in the checkpoint for a streaming query
Shixiong Zhu created SPARK-36519:

Summary: Store the RocksDB format in the checkpoint for a streaming query
Key: SPARK-36519
URL: https://issues.apache.org/jira/browse/SPARK-36519
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu

RocksDB provides backward compatibility, but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint, so that we have enough information to provide a rollback guarantee when a new Spark version upgrades the RocksDB version and that upgrade introduces an incompatible format change.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version ships a new RocksDB version that may use a new format. The user then hits some bug and decides to roll back. But in the old Spark version, the old RocksDB version cannot read the new format.

To handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This ensures the user can roll back their Spark version if needed.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
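The mechanism described above can be sketched standalone (the `format-version` file name and single-integer layout are assumptions for illustration, not Spark's actual checkpoint format): record the format version on the first run, and on restart pin the engine to the recorded version rather than the newest one the library supports.

```scala
import java.nio.file.{Files, Path}

// Sketch: pin a storage format version in the checkpoint so that after a
// rollback, an old reader is never asked to decode a newer format.
object FormatVersionSketch {
  // Illustrative: the newest format version this build would write.
  val SupportedVersion = 5

  def versionFile(checkpointDir: Path): Path = checkpointDir.resolve("format-version")

  // On restart, reuse the recorded version; on a fresh checkpoint, record ours.
  def resolveVersion(checkpointDir: Path): Int = {
    val f = versionFile(checkpointDir)
    if (Files.exists(f)) {
      val v = new String(Files.readAllBytes(f), "UTF-8").trim.toInt
      require(v <= SupportedVersion,
        s"checkpoint uses format $v, newer than supported $SupportedVersion")
      v // force the recorded (possibly older) format for rollback safety
    } else {
      Files.write(f, SupportedVersion.toString.getBytes("UTF-8"))
      SupportedVersion
    }
  }
}
```

The key design point is that the recorded version, not the library default, wins on restart, which is what makes the checkpoint readable after a downgrade.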
[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319631#comment-17319631 ] Shixiong Zhu commented on SPARK-31923: -- Do you have a reproduction? It's weird to see `java.util.Collections$SynchronizedSet cannot be cast to java.util.List` since before calling `v.asScala.toList`, the pattern match `case v: java.util.List[_]` should not accept `java.util.Collections$SynchronizedSet`. > Event log cannot be generated when some internal accumulators use unexpected > types > -- > > Key: SPARK-31923 > URL: https://issues.apache.org/jira/browse/SPARK-31923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > A user may use internal accumulators by adding the "internal.metrics." prefix > to the accumulator name to hide sensitive information from UI (Accumulators > except internal ones will be shown in Spark UI). > However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an > internal accumulator has only 3 possible types: int, long, and > java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an > unexpected type, it will crash. > An event log that contains such accumulator will be dropped because it cannot > be converted to JSON, and it will cause weird UI issue when rendering in > Spark History Server. For example, if `SparkListenerTaskEnd` is dropped > because of this issue, the user will see the task is still running even if it > was finished. > It's better to make *accumValueToJson* more robust. 
>
> How to reproduce it:
> - Enable Spark event log
> - Run the following command:
> {code}
> scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
> accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)
>
> scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
> 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
> java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
> at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
> at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
> at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
> at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
> at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
> at scala.collection.immutable.List.map(List.scala:284)
> at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
> at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
> at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
> at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
> at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
> at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEvent
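The "make accumValueToJson more robust" direction suggested in the description can be sketched in plain Scala (the function below is an illustration, not Spark's actual implementation): recognize the expected internal types, and fall back to a string rendering instead of letting a `ClassCastException` escape into the listener.

```scala
// Sketch: defensive conversion of an accumulator value for event logging.
// Expected internal types are handled specially; anything else becomes a
// JSON string instead of crashing the event-logging listener.
def robustAccumValueToJson(value: Any): String = value match {
  case i: java.lang.Integer => i.toString // expected internal type
  case l: java.lang.Long    => l.toString // expected internal type
  case _: java.util.List[_] => "[]"       // placeholder for the (BlockId, BlockStatus) list case
  case other                => "\"" + other.toString + "\"" // fallback: never throw
}
```

With this shape, the `DoubleAccumulator` from the reproduction above would serialize as the JSON string "1.0" rather than killing the `EventLoggingListener`.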
[jira] [Commented] (SPARK-19256) Hive bucketing write support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300594#comment-17300594 ] Shixiong Zhu commented on SPARK-19256: -- [~chengsu] Thanks for the answers. Sounds like Metastore is a requirement to distinguish two formats. If I use Spark to write to a table using the table path directly, it will break the table. Right? > Hive bucketing write support > > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0 >Reporter: Tejas Patil >Priority: Minor > > Update (2020 by Cheng Su): > We use this JIRA to track progress for Hive bucketing write support in Spark. > The goal is for Spark to write Hive bucketed table, to be compatible with > other compute engines (Hive and Presto). > > Current status for Hive bucketed table in Spark: > Not support for reading Hive bucketed table: read bucketed table as > non-bucketed table. > Wrong behavior for writing Hive ORC and Parquet bucketed table: write > orc/parquet bucketed table as non-bucketed table (code path: > InsertIntoHadoopFsRelationCommand -> FileFormatWriter). > Do not allow for writing Hive non-ORC/Parquet bucketed table: throw exception > by default if writing non-orc/parquet bucketed table (code path: > InsertIntoHiveTable), and exception can be disabled by setting config > `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will > write as non-bucketed table. > > Current status for Hive bucketed table in Hive: > Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash > (https://issues.apache.org/jira/browse/HIVE-18910). > Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash. > Hive on Tez: support zero and multiple files per bucket > (https://issues.apache.org/jira/browse/HIVE-14014). 
And more code pointer on > read path - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212] > . > > Current status for Hive bucketed table in Presto (take presto-sql here): > Support writing bucketed table with Hive murmur3hash and hivehash > ([https://github.com/prestosql/presto/pull/1697]). > Support zero and multiple files per bucket > ([https://github.com/prestosql/presto/pull/822]). > > TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and > Hive. Here with this JIRA, we need to add support writing Hive bucketed table > with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and > 2.x.y). > > To allow Spark efficiently read Hive bucketed table, this needs more radical > change and we decide to wait until data source v2 supports bucketing, and do > the read path on data source v2. Read path will not covered by this JIRA. > > Original description (2017 by Tejas Patil): > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19256) Hive bucketing write support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300578#comment-17300578 ]

Shixiong Zhu commented on SPARK-19256:
-

I read [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing] but didn't find any section about compatibility. I'm curious about both the forward and backward compatibility. For example:
* What happens if using an old Spark version to read this new format? Does it treat it as a non-bucketed table, or read it as a bucketed table but may return different results?
* How does Spark know which bucket format to use when reading an existing table, and how does it decide whether to use the old format or the new format to read and write?

> Hive bucketing write support
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
> Reporter: Tejas Patil
> Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. The goal is for Spark to write Hive bucketed table, to be compatible with other compute engines (Hive and Presto).
>
> Current status for Hive bucketed table in Spark:
> No support for reading Hive bucketed table: read bucketed table as non-bucketed table.
> Wrong behavior for writing Hive ORC and Parquet bucketed table: write orc/parquet bucketed table as non-bucketed table (code path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Do not allow for writing Hive non-ORC/Parquet bucketed table: throw exception by default if writing non-orc/parquet bucketed table (code path: InsertIntoHiveTable), and exception can be disabled by setting config `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will write as non-bucketed table.
>
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on read path - [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212] .
>
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket ([https://github.com/prestosql/presto/pull/822]).
>
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and Hive. Here with this JIRA, we need to add support writing Hive bucketed table with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y).
>
> To allow Spark to efficiently read Hive bucketed table, this needs a more radical change and we decided to wait until data source v2 supports bucketing, and do the read path on data source v2. The read path will not be covered by this JIRA.
>
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support in Spark.
> Proposal: [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34626) UnresolvedAttribute.sql may return an incorrect sql
[ https://issues.apache.org/jira/browse/SPARK-34626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-34626: - Description: UnresolvedAttribute("a" :: "b" :: Nil).sql returns {code:java} `a.b`{code} The correct one should be either a.b or {code:java} `a`.`b`{code} was: UnresolvedAttribute("a" :: "b" :: Nil).sql returns {{`a.b`}} The correct one should be either a.b or {{`a`.`b`}} > UnresolvedAttribute.sql may return an incorrect sql > --- > > Key: SPARK-34626 > URL: https://issues.apache.org/jira/browse/SPARK-34626 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Shixiong Zhu >Priority: Major > > UnresolvedAttribute("a" :: "b" :: Nil).sql > returns > {code:java} > `a.b`{code} > The correct one should be either a.b or > {code:java} > `a`.`b`{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34626) UnresolvedAttribute.sql may return an incorrect sql
Shixiong Zhu created SPARK-34626:

Summary: UnresolvedAttribute.sql may return an incorrect sql
Key: SPARK-34626
URL: https://issues.apache.org/jira/browse/SPARK-34626
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Shixiong Zhu

UnresolvedAttribute("a" :: "b" :: Nil).sql returns {{`a.b`}}. The correct one should be either a.b or {{`a`.`b`}}.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
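The expected output can be produced by quoting each name part separately and escaping embedded backticks by doubling them. A small standalone helper illustrating this (not Spark's actual implementation):

```scala
// Sketch: render a multi-part identifier as `a`.`b`, escaping backticks
// inside a single part by doubling them.
def quoteParts(parts: Seq[String]): String =
  parts.map(p => "`" + p.replace("`", "``") + "`").mkString(".")
```

With this scheme, Seq("a", "b") renders as `a`.`b`, while a single column literally named a.b renders as `a.b`, so the two cases stay distinguishable.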
[jira] [Created] (SPARK-34599) INSERT INTO OVERWRITE doesn't support partition columns containing dot for DSv2
Shixiong Zhu created SPARK-34599:

Summary: INSERT INTO OVERWRITE doesn't support partition columns containing dot for DSv2
Key: SPARK-34599
URL: https://issues.apache.org/jira/browse/SPARK-34599
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Shixiong Zhu

ResolveInsertInto.staticDeleteExpression uses an incorrect method to create UnresolvedAttribute, so it doesn't support partition column names that contain dots.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
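The underlying distinction can be illustrated standalone (the helpers below are illustrative; they mimic the contrast between a constructor that parses a name, splitting on dots, and one that keeps the whole string as a single column name, as `UnresolvedAttribute.quoted` does in Spark):

```scala
// Sketch: why creating an attribute from a raw string breaks names with dots.
// parsedParts mimics a name-parsing constructor; singlePart mimics one that
// treats the string as a single part.
def parsedParts(name: String): Seq[String] = name.split('.').toSeq
def singlePart(name: String): Seq[String] = Seq(name)
```

For a partition column literally named a.b, the parsing variant yields Seq("a", "b"), i.e. a lookup of field b under column a, while the single-part variant preserves the intended Seq("a.b").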
[jira] [Created] (SPARK-34556) Checking duplicate static partition columns doesn't respect case sensitive conf
Shixiong Zhu created SPARK-34556:

Summary: Checking duplicate static partition columns doesn't respect case sensitive conf
Key: SPARK-34556
URL: https://issues.apache.org/jira/browse/SPARK-34556
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.2, 2.4.7, 3.1.1
Reporter: Shixiong Zhu

When parsing the partition spec, Spark calls `org.apache.spark.sql.catalyst.parser.ParserUtils.checkDuplicateKeys` to check whether the list contains duplicate partition column names. But this method is always case sensitive, so it misses duplicate partition column names that differ only in case.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
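A duplicate check that honors the case-sensitivity conf can be sketched as follows (a standalone illustration, not the actual checkDuplicateKeys code): normalize the keys before grouping when comparisons should be case insensitive.

```scala
// Sketch: find duplicate partition column names, optionally ignoring case.
def findDuplicateKeys(keys: Seq[String], caseSensitive: Boolean): Seq[String] = {
  val normalized = if (caseSensitive) keys else keys.map(_.toLowerCase)
  normalized.groupBy(identity).collect {
    case (k, vs) if vs.size > 1 => k // appears more than once after normalization
  }.toSeq.sorted
}
```

Under a case-insensitive resolver, a spec like PARTITION (c1=1, C1=2) would then be rejected as a duplicate instead of slipping through.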
[jira] [Commented] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260095#comment-17260095 ] Shixiong Zhu commented on SPARK-34034: -- NVM. I just realized `show create table` didn't return the correct create table command for v2 table in Spark 3.0.1 (missing the table schema). Yep, it makes sense to disallow it since it returns an incorrect answer. > "show create table" doesn't work for v2 table > - > > Key: SPARK-34034 > URL: https://issues.apache.org/jira/browse/SPARK-34034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: regression > > I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" > doesn't work for v2 table. > But when using Spark 3.0.1, "show create table" works for v2 table. > Steps to test: > {code:java} > /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > scala> spark.sql("create table foo(i INT) using delta") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("show create table foo").show(false) > +---+ > |createtab_stmt | > +---+ > |CREATE TABLE `default`.`foo` ( > ) > USING delta > | > +---+ > {code} > Looks like it's caused by [https://github.com/apache/spark/pull/30321] > which blocks "show create table" for v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-34034. -- Resolution: Not A Problem > "show create table" doesn't work for v2 table > - > > Key: SPARK-34034 > URL: https://issues.apache.org/jira/browse/SPARK-34034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: regression > > I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" > doesn't work for v2 table. > But when using Spark 3.0.1, "show create table" works for v2 table. > Steps to test: > {code:java} > /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > scala> spark.sql("create table foo(i INT) using delta") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("show create table foo").show(false) > +---+ > |createtab_stmt | > +---+ > |CREATE TABLE `default`.`foo` ( > ) > USING delta > | > +---+ > {code} > Looks like it's caused by [https://github.com/apache/spark/pull/30321] > which blocks "show create table" for v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-34034:
-
Description:
I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" doesn't work for v2 table.
But when using Spark 3.0.1, "show create table" works for v2 table.
Steps to test:
{code:java}
/bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

scala> spark.sql("create table foo(i INT) using delta")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show create table foo").show(false)
+---+
|createtab_stmt |
+---+
|CREATE TABLE `default`.`foo` (
)
USING delta
|
+---+
{code}
Looks like it's caused by [https://github.com/apache/spark/pull/30321] which blocks "show create table" for v2.

was:
I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" doesn't work for v2 table.
But when using Spark 3.0.1, "show create table" works for v2 table.
Steps to test:
{code:java}
/bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

scala> spark.sql("create table foo(i INT) using delta")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show create table foo").show(false)
+---+
|createtab_stmt |
+---+
|CREATE TABLE `default`.`foo` (
)
USING delta
|
+---+
{code}
Looks like it's caused by [https://github.com/apache/spark/pull/30321]

> "show create table" doesn't work for v2 table
>
> Key: SPARK-34034
> URL: https://issues.apache.org/jira/browse/SPARK-34034
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Shixiong Zhu
> Priority: Blocker
> Labels: regression
>
> I was QAing Spark 3.1.0 RC1 and found one regression: "show create table"
> doesn't work for v2 table.
> But when using Spark 3.0.1, "show create table" works for v2 table.
> Steps to test:
> {code:java}
> /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>
> scala> spark.sql("create table foo(i INT) using delta")
> res0: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("show create table foo").show(false)
> +---+
> |createtab_stmt |
> +---+
> |CREATE TABLE `default`.`foo` (
> )
> USING delta
> |
> +---+
> {code}
> Looks like it's caused by [https://github.com/apache/spark/pull/30321]
> which blocks "show create table" for v2.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260061#comment-17260061 ] Shixiong Zhu commented on SPARK-34034: -- [~cloud_fan] > "show create table" doesn't work for v2 table > - > > Key: SPARK-34034 > URL: https://issues.apache.org/jira/browse/SPARK-34034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Shixiong Zhu >Priority: Blocker > > I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" > doesn't work for v2 table. > But when using Spark 3.0.1, "show create table" works for v2 table. > Steps to test: > {code:java} > /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > scala> spark.sql("create table foo(i INT) using delta") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("show create table foo").show(false) > +---+ > |createtab_stmt | > +---+ > |CREATE TABLE `default`.`foo` ( > ) > USING delta > | > +---+ > {code} > Looks like it's caused by [https://github.com/apache/spark/pull/30321] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-34034: - Labels: regression (was: regresion) > "show create table" doesn't work for v2 table > - > > Key: SPARK-34034 > URL: https://issues.apache.org/jira/browse/SPARK-34034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: regression > > I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" > doesn't work for v2 table. > But when using Spark 3.0.1, "show create table" works for v2 table. > Steps to test: > {code:java} > /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > scala> spark.sql("create table foo(i INT) using delta") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("show create table foo").show(false) > +---+ > |createtab_stmt | > +---+ > |CREATE TABLE `default`.`foo` ( > ) > USING delta > | > +---+ > {code} > Looks like it's caused by [https://github.com/apache/spark/pull/30321] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34034) "show create table" doesn't work for v2 table
[ https://issues.apache.org/jira/browse/SPARK-34034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-34034: - Labels: regresion (was: ) > "show create table" doesn't work for v2 table > - > > Key: SPARK-34034 > URL: https://issues.apache.org/jira/browse/SPARK-34034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: regresion > > I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" > doesn't work for v2 table. > But when using Spark 3.0.1, "show create table" works for v2 table. > Steps to test: > {code:java} > /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > scala> spark.sql("create table foo(i INT) using delta") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("show create table foo").show(false) > +---+ > |createtab_stmt | > +---+ > |CREATE TABLE `default`.`foo` ( > ) > USING delta > | > +---+ > {code} > Looks like it's caused by [https://github.com/apache/spark/pull/30321] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34034) "show create table" doesn't work for v2 table
Shixiong Zhu created SPARK-34034: Summary: "show create table" doesn't work for v2 table Key: SPARK-34034 URL: https://issues.apache.org/jira/browse/SPARK-34034 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Shixiong Zhu I was QAing Spark 3.1.0 RC1 and found one regression: "show create table" doesn't work for v2 table. But when using Spark 3.0.1, "show create table" works for v2 table. Steps to test: {code:java} /bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" scala> spark.sql("create table foo(i INT) using delta") res0: org.apache.spark.sql.DataFrame = [] scala> spark.sql("show create table foo").show(false) +---+ |createtab_stmt | +---+ |CREATE TABLE `default`.`foo` ( ) USING delta | +---+ {code} Looks like it's caused by [https://github.com/apache/spark/pull/30321] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32896) Add DataStreamWriter.toTable API
[ https://issues.apache.org/jira/browse/SPARK-32896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242844#comment-17242844 ] Shixiong Zhu commented on SPARK-32896: -- The method name is changed to `toTable` in [https://github.com/apache/spark/pull/30571] > Add DataStreamWriter.toTable API > > > Key: SPARK-32896 > URL: https://issues.apache.org/jira/browse/SPARK-32896 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.1.0 > > > For now, there's no way to write to the table (especially catalog table) even > the table is capable to handle streaming write. > We can add DataStreamWriter. -table- toTable API to let end users specify > table as provider, and let streaming query write into the table. That is just > to specify the table, and the overall usage of DataStreamWriter isn't changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32896) Add DataStreamWriter.toTable API
[ https://issues.apache.org/jira/browse/SPARK-32896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-32896: - Description: For now, there's no way to write to the table (especially catalog table) even the table is capable to handle streaming write. We can add DataStreamWriter. -table- toTable API to let end users specify table as provider, and let streaming query write into the table. That is just to specify the table, and the overall usage of DataStreamWriter isn't changed. was: For now, there's no way to write to the table (especially catalog table) even the table is capable to handle streaming write. We can add DataStreamWriter.table API to let end users specify table as provider, and let streaming query write into the table. That is just to specify the table, and the overall usage of DataStreamWriter isn't changed. > Add DataStreamWriter.toTable API > > > Key: SPARK-32896 > URL: https://issues.apache.org/jira/browse/SPARK-32896 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.1.0 > > > For now, there's no way to write to the table (especially catalog table) even > the table is capable to handle streaming write. > We can add DataStreamWriter. -table- toTable API to let end users specify > table as provider, and let streaming query write into the table. That is just > to specify the table, and the overall usage of DataStreamWriter isn't changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32896) Add DataStreamWriter.toTable API
[ https://issues.apache.org/jira/browse/SPARK-32896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-32896: - Summary: Add DataStreamWriter.toTable API (was: Add DataStreamWriter.table API) > Add DataStreamWriter.toTable API > > > Key: SPARK-32896 > URL: https://issues.apache.org/jira/browse/SPARK-32896 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.1.0 > > > For now, there's no way to write to the table (especially catalog table) even > the table is capable to handle streaming write. > We can add DataStreamWriter.table API to let end users specify table as > provider, and let streaming query write into the table. That is just to > specify the table, and the overall usage of DataStreamWriter isn't changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
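The rename recorded above (DataStreamWriter.table to DataStreamWriter.toTable) makes the method read as a terminal action rather than another setter. A Spark-free, hypothetical Python sketch of that builder shape — the class and method names below are illustrative only, not Spark's actual implementation:

```python
# Hypothetical, Spark-free sketch of the builder pattern behind
# DataStreamWriter.toTable: chained option() calls configure the writer,
# and toTable(name) is the terminal call that names the target table and
# "starts" the query. Only the terminal call specifies the table; the rest
# of the builder usage stays unchanged, matching the intent described above.
class StreamWriter:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        self._options[key] = value
        return self  # chainable, like the real DataStreamWriter

    def toTable(self, table_name):
        # In real Spark this would start a streaming query writing to the
        # catalog table; here we just return the resolved configuration.
        return {"table": table_name, "options": dict(self._options)}

query = (StreamWriter()
         .option("checkpointLocation", "/tmp/ckpt")
         .toTable("events"))
print(query["table"])  # -> events
```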
[jira] [Resolved] (SPARK-31953) Add Spark Structured Streaming History Server Support
[ https://issues.apache.org/jira/browse/SPARK-31953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-31953. -- Fix Version/s: 3.1.0 Resolution: Done > Add Spark Structured Streaming History Server Support > - > > Key: SPARK-31953 > URL: https://issues.apache.org/jira/browse/SPARK-31953 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.6, 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31953) Add Spark Structured Streaming History Server Support
[ https://issues.apache.org/jira/browse/SPARK-31953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-31953: Assignee: Genmao Yu > Add Spark Structured Streaming History Server Support > - > > Key: SPARK-31953 > URL: https://issues.apache.org/jira/browse/SPARK-31953 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.6, 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33587) Kill the executor on nested fatal errors
Shixiong Zhu created SPARK-33587: Summary: Kill the executor on nested fatal errors Key: SPARK-33587 URL: https://issues.apache.org/jira/browse/SPARK-33587 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Shixiong Zhu Currently we kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, such as: - java.util.concurrent.ExecutionException, com.google.common.util.concurrent.UncheckedExecutionException, or com.google.common.util.concurrent.ExecutionError when using a Guava cache or a Java thread pool. - SparkException thrown from this line: https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 we will still keep the executor running. Fatal errors are usually unrecoverable (such as OutOfMemoryError), and some components may be left in a broken state after hitting one. Hence, it's better to detect a nested fatal error as well and kill the executor; then we can rely on Spark's fault tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
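The detection described above — unwrapping wrapper exceptions to see whether a fatal error is hiding inside — can be sketched in a few lines. This is an illustrative, Spark-free Python analogue; in the JVM the walk would follow Throwable.getCause(), and the function name, fatal-type list, and depth limit here are all assumptions for the sketch:

```python
# Illustrative sketch of nested-fatal-error detection: walk the exception
# cause chain up to a depth limit and return the first "fatal" link found.
# FATAL_TYPES and MAX_DEPTH are assumptions for this sketch; the JVM version
# would check for OutOfMemoryError and friends via Throwable.getCause().
FATAL_TYPES = (MemoryError, SystemError)
MAX_DEPTH = 5

def find_fatal(exc, max_depth=MAX_DEPTH):
    depth = 0
    while exc is not None and depth < max_depth:
        if isinstance(exc, FATAL_TYPES):
            return exc
        # follow an explicit "raise ... from ..." first, then implicit context
        exc = exc.__cause__ or exc.__context__
        depth += 1
    return None

# A fatal error wrapped by a generic exception is still detected:
try:
    try:
        raise MemoryError("heap exhausted")
    except MemoryError as inner:
        raise RuntimeError("task failed") from inner
except RuntimeError as wrapped:
    found = find_fatal(wrapped)

assert isinstance(found, MemoryError)
```

The depth limit guards against pathological or cyclic cause chains; an unwrapped, non-fatal exception simply returns None and leaves the executor alone.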
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216375#comment-17216375 ] Shixiong Zhu commented on SPARK-21065: -- If you are seeing many active batches, it's likely your streaming application is too slow. You can try to look at the UI and see if there is anything obvious that you can optimize. > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) 
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216341#comment-17216341 ] Shixiong Zhu commented on SPARK-21065: -- `spark.streaming.concurrentJobs` is not safe. Fixing it requires fundamental system changes. We don't have any plan for this. > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) 
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
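The reporter's last suggestion above — an existence check on runningBatchUIData before dereferencing it — can be modeled in a few lines. The real listener is Scala and keyed by batch time on a mutable.HashMap; this is a minimal Python analogue whose names merely mirror the quoted snippet:

```python
# Minimal model of the race described above: onBatchCompleted removes the
# batch from runningBatchUIData, so a late onOutputOperationCompleted for the
# same batch time must not index the map directly. Using .get() with a guard
# mirrors the suggested existence check; a bare running_batches[batch_time]
# would raise KeyError (the Python analogue of NoSuchElementException).
running_batches = {"batch-1000": {"ops": 0}}

def on_output_operation_completed(batch_time):
    batch = running_batches.get(batch_time)  # existence check, no KeyError
    if batch is None:
        return  # batch already completed and removed; ignore the late event
    batch["ops"] += 1

def on_batch_completed(batch_time):
    running_batches.pop(batch_time, None)

on_output_operation_completed("batch-1000")   # normal path
on_batch_completed("batch-1000")              # batch completes, entry removed
on_output_operation_completed("batch-1000")   # late event: safely ignored
```

Whether the right fix is this guard or a longer `ssc.remember(...)` window is the open question in the report; the guard only makes the listener tolerant of the ordering, it does not change which batches the UI retains.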
[jira] [Commented] (SPARK-32256) Hive may fail to detect Hadoop version when using isolated classloader
[ https://issues.apache.org/jira/browse/SPARK-32256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155018#comment-17155018 ] Shixiong Zhu commented on SPARK-32256: -- Yep. This doesn't happen in Hadoop 3.1.0 and above > Hive may fail to detect Hadoop version when using isolated classloader > -- > > Key: SPARK-32256 > URL: https://issues.apache.org/jira/browse/SPARK-32256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker > > Spark allows the user to set `spark.sql.hive.metastore.jars` to specify jars > to access Hive Metastore. These jars are loaded by the isolated classloader. > Because we also share Hadoop classes with the isolated classloader, the user > doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`, which > means when we are using the isolated classloader, hadoop-common jar is not > available in this case. If Hadoop VersionInfo is not initialized before we > switch to the isolated classloader, and we try to initialize it using the > isolated classloader (the current thread context classloader), it will fail > and report `Unknown` which causes Hive to throw the following exception: > {code} > java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* > format) > at > org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147) > at > org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122) > at > org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88) > at > org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) > at > 
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:58) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84) > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:219) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:67) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108) > at > 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3349) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:217) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:204) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:331) > at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:292) > at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:262) > at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:247) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543) > at > org.apache.hadoop.hive.ql.session.S
[jira] [Updated] (SPARK-32256) Hive may fail to detect Hadoop version when using isolated classloader
[ https://issues.apache.org/jira/browse/SPARK-32256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-32256: - Description: Spark allows the user to set `spark.sql.hive.metastore.jars` to specify jars to access Hive Metastore. These jars are loaded by the isolated classloader. Because we also share Hadoop classes with the isolated classloader, the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`, which means when we are using the isolated classloader, hadoop-common jar is not available in this case. If Hadoop VersionInfo is not initialized before we switch to the isolated classloader, and we try to initialize it using the isolated classloader (the current thread context classloader), it will fail and report `Unknown` which causes Hive to throw the following exception: {code} java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format) at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88) at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:58) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544) at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:219) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:67) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3349) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:217) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:204) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:331) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:292) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:262) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:247) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543) at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:128) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.Isolate
[jira] [Updated] (SPARK-32256) Hive may fail to detect Hadoop version when using isolated classloader
[ https://issues.apache.org/jira/browse/SPARK-32256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-32256: - Description: Spark allows the user to set `spark.sql.hive.metastore.jars` to specify jars to access Hive Metastore. These jars are loaded by the isolated classloader. Because we also share Hadoop classes with the isolated classloader, the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`, which means when we are using the isolated classloader, hadoop-common jar is not available in this case. If Hadoop VersionInfo is not initialized before we switch to the isolated classloader, and we try to initialize it using the isolated classloader (the current thread context classloader), it will fail and report `Unknown` which causes Hive to throw the following exception: {code} 08:49:33.242 ERROR org.apache.hadoop.hive.shims.ShimLoader: Error loading shims java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format) at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88) at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:58) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482) at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:219) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:67) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:128) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71) {code} Technically, T
[jira] [Assigned] (SPARK-32256) Hive may fail to detect Hadoop version when using isolated classloader
[ https://issues.apache.org/jira/browse/SPARK-32256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-32256:

Assignee: Shixiong Zhu

> Hive may fail to detect Hadoop version when using isolated classloader
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32256
>                 URL: https://issues.apache.org/jira/browse/SPARK-32256
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Blocker
>
> TODO
> {code}
> 08:49:33.242 ERROR org.apache.hadoop.hive.shims.ShimLoader: Error loading shims
> java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
> 	at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147)
> 	at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
> 	at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
> 	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377)
> 	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268)
> 	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
> 	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
> 	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
> 	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370)
> 	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
> 	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> 	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
> 	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:67)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548)
> 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
> 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
> 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
> 	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080)
> 	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108)
> 	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543)
> 	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511)
> 	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175)
> 	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301)
> 	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431)
> 	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71)
> {code}
> I marked this as a blocker because it's a regression in 3.0.0.
[jira] [Created] (SPARK-32256) Hive may fail to detect Hadoop version when using isolated classloader
Shixiong Zhu created SPARK-32256:

             Summary: Hive may fail to detect Hadoop version when using isolated classloader
                 Key: SPARK-32256
                 URL: https://issues.apache.org/jira/browse/SPARK-32256
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Shixiong Zhu

I marked this as a blocker because it's a regression in 3.0.0.
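The failing check at the top of the trace quoted above can be reproduced in miniature. The sketch below is an illustration, not Hive's actual ShimLoader source (the class name `ShimVersionCheck` and the parsing logic are my own): it only mimics the "A.B.*" major-version parse that rejects a version string of "Unknown", which is what Hadoop's VersionInfo reports when its version-info properties cannot be resolved, e.g. when they are not visible through an isolated classloader.

```java
public class ShimVersionCheck {
    // Simplified stand-in (not Hive's actual ShimLoader.getMajorVersion):
    // take the leading "A.B" of the Hadoop version string, or fail with the
    // same error shape seen in the log above.
    static String getMajorVersion(String version) {
        String[] parts = version.split("\\.");
        if (parts.length < 2) {
            throw new RuntimeException(
                "Illegal Hadoop Version: " + version + " (expected A.B.* format)");
        }
        return parts[0] + "." + parts[1];
    }

    public static void main(String[] args) {
        System.out.println(getMajorVersion("2.7.4"));  // prints 2.7
        // getMajorVersion("Unknown") throws: "Unknown" has no '.' separator,
        // which is exactly the situation when the version cannot be detected.
    }
}
```

This suggests why the bug is classloader-related rather than a genuine version mismatch: the version string itself, not the parser, is the thing that degrades to "Unknown".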
[jira] [Commented] (SPARK-28489) KafkaOffsetRangeCalculator.getRanges may drop offsets
[ https://issues.apache.org/jira/browse/SPARK-28489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145719#comment-17145719 ] Shixiong Zhu commented on SPARK-28489:

[~klaustelenius] No. The "minPartitions" option for batch queries was added in Spark 3.0. Batch queries before Spark 3.0 ignore the "minPartitions" option.

> KafkaOffsetRangeCalculator.getRanges may drop offsets
> -----------------------------------------------------
>
>                 Key: SPARK-28489
>                 URL: https://issues.apache.org/jira/browse/SPARK-28489
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Blocker
>              Labels: correctness, dataloss
>             Fix For: 2.4.4, 3.0.0
>
> KafkaOffsetRangeCalculator.getRanges may drop offsets due to round-off errors.
> This only affects queries using the "minPartitions" option. A workaround is to remove the "minPartitions" option from the query.
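The round-off failure mode described above can be illustrated with a toy split. This is hypothetical code, not Spark's actual KafkaOffsetRangeCalculator (the class `OffsetSplit` and both methods are made up for illustration): dividing a half-open offset range with truncating integer division silently loses the remainder offsets unless the final slice is pinned to the real end offset.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSplit {
    // Naive split of the half-open offset range [from, until) into n pieces
    // using truncating division. The remainder ((until - from) % n) offsets
    // at the tail are silently dropped -- the class of round-off bug this
    // ticket describes.
    static List<long[]> naiveSplit(long from, long until, int n) {
        long len = (until - from) / n;  // truncates: 10 / 3 == 3
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ranges.add(new long[] {from + i * len, from + (i + 1) * len});
        }
        return ranges;
    }

    // One possible fix: pin the last piece's end to `until`, so every offset
    // in [from, until) is covered exactly once.
    static List<long[]> safeSplit(long from, long until, int n) {
        List<long[]> ranges = naiveSplit(from, until, n);
        ranges.get(ranges.size() - 1)[1] = until;
        return ranges;
    }

    public static void main(String[] args) {
        // Splitting [0, 10) into 3 pieces: the naive split ends at offset 9,
        // so offset 9 itself is never read.
        System.out.println(naiveSplit(0, 10, 3).get(2)[1]);  // prints 9
        System.out.println(safeSplit(0, 10, 3).get(2)[1]);   // prints 10
    }
}
```

This also shows why removing "minPartitions" is a valid workaround: without extra subdivision there is no per-slice division, so no remainder to lose.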
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923:

Affects Version/s: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4

> Event log cannot be generated when some internal accumulators use unexpected types
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-31923
>                 URL: https://issues.apache.org/jira/browse/SPARK-31923
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0, 2.4.7
>
> A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from the UI (accumulators other than internal ones are shown in the Spark UI).
> However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash.
> An event that contains such an accumulator will be dropped because it cannot be converted to JSON, and this causes weird UI issues when rendering in the Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task as still running even though it has finished.
> It's better to make *accumValueToJson* more robust.
>
> How to reproduce it:
> - Enable the Spark event log
> - Run the following commands:
> {code}
> scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
> accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)
>
> scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
> 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
> java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
> 	at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
> 	at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
> 	at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
> 	at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
> 	at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
> 	at scala.collection.immutable.List.map(List.scala:284)
> 	at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
> 	at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
> 	at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> 	at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> 	at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
> 	at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
> 	at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> 	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> 	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> 	at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> 	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
> 	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
> 	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
> 	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
> 	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> 	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
> 	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
> 	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
> 	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
> {code}
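The type assumption and the suggested hardening can be sketched as follows. This is illustrative Java, not Spark's Scala JsonProtocol source (the class `AccumJson` is made up): the point is the fallback branch that renders an unexpected value, such as the Double from the repro above, as a string instead of casting it to List and throwing ClassCastException.

```java
import java.util.List;

public class AccumJson {
    // Illustrative version of the conversion: the buggy code assumed an
    // "internal.metrics." accumulator value is an int, a long, or a List,
    // and cast anything else to List. A robust version adds a fallback
    // branch for unexpected types instead of crashing the event-log listener.
    static String accumValueToJson(Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return value.toString();   // plain JSON number
        } else if (value instanceof List) {
            return "[...]";            // elided: real code serializes elements
        } else {
            // Fallback: render unexpected types (e.g. Double) as a JSON
            // string rather than throwing ClassCastException.
            return "\"" + value + "\"";
        }
    }

    public static void main(String[] args) {
        System.out.println(accumValueToJson(42L));  // prints 42
        System.out.println(accumValueToJson(1.0));  // prints "1.0"
    }
}
```

With a fallback like this, an odd accumulator degrades to a slightly lossy JSON value instead of dropping the whole `SparkListenerTaskEnd` event from the log.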
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923:

Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
[jira] [Issue Comment Deleted] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923:

Comment: was deleted (was: User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/28744)
[jira] [Resolved] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-31923.

Assignee: Shixiong Zhu (was: Apache Spark)
Resolution: Fixed
[jira] [Assigned] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-31923:

Assignee: Shixiong Zhu
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Fix Version/s: 2.4.7 > Event log cannot be generated when some internal accumulators use unexpected > types > -- > > Key: SPARK-31923 > URL: https://issues.apache.org/jira/browse/SPARK-31923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: Shixiong Zhu >Priority: Major > Fix For: 3.0.1, 3.1.0, 2.4.7 > > > A user may use internal accumulators by adding the "internal.metrics." prefix > to the accumulator name to hide sensitive information from UI (Accumulators > except internal ones will be shown in Spark UI). > However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an > internal accumulator has only 3 possible types: int, long, and > java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an > unexpected type, it will crash. > An event log that contains such accumulator will be dropped because it cannot > be converted to JSON, and it will cause weird UI issue when rendering in > Spark History Server. For example, if `SparkListenerTaskEnd` is dropped > because of this issue, the user will see the task is still running even if it > was finished. > It's better to make *accumValueToJson* more robust. 
> > How to reproduce it: > - Enable Spark event log > - Run the following command: > {code} > scala> val accu = sc.doubleAccumulator("internal.metrics.foo") > accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, > name: Some(internal.metrics.foo), value: 0.0) > scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } > 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at scala.collection.immutable.List.map(List.scala:284) > at > org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at > org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) -
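The shape of the fix requested above can be sketched as a defensive pattern match with a fallback, instead of the unconditional cast that raises the ClassCastException in the repro. This is an illustrative Scala (2.13) sketch only, not Spark's actual JsonProtocol code; BlockId and BlockStatus below are simplified stand-ins for Spark's storage classes, and the JSON is rendered as plain strings to keep the example self-contained:

```scala
import scala.jdk.CollectionConverters._

// Simplified stand-ins for Spark's storage classes (not the real ones).
case class BlockId(name: String)
case class BlockStatus(memSize: Long)

// Defensive variant of the conversion: internal accumulators may hold an
// Int, a Long, or a java.util.List[(BlockId, BlockStatus)]; anything else
// (e.g. the Double from the repro above) falls back to JSON null instead
// of crashing with a ClassCastException.
def accumValueToJsonString(name: Option[String], value: Any): String = {
  val internal = name.exists(_.startsWith("internal.metrics."))
  value match {
    case v: Int  => v.toString
    case v: Long => v.toString
    case v: java.util.List[_] if internal =>
      v.asScala.collect {
        case (id: BlockId, status: BlockStatus) =>
          s"""{"Block ID":"${id.name}","Memory Size":${status.memSize}}"""
      }.mkString("[", ",", "]")
    case _ if internal => "null" // fallback: keep the event serializable
    case other => "\"" + other.toString + "\""
  }
}
```

With a fallback like this, an internal accumulator holding a Double would serialize as JSON null rather than aborting event-log generation, so events such as `SparkListenerTaskEnd` would no longer be dropped.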
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Fix Version/s: 3.1.0 3.0.1
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make *accumValueToJson* more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this is
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will not be able to convert to json, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the ta
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will not be able to convert to json, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, na
[jira] [Created] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
Shixiong Zhu created SPARK-31923: Summary: Event log cannot be generated when some internal accumulators use unexpected types Key: SPARK-31923 URL: https://issues.apache.org/jira/browse/SPARK-31923 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.6 Reporter: Shixiong Zhu A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30915) FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID
[ https://issues.apache.org/jira/browse/SPARK-30915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-30915. -- Fix Version/s: 3.1.0 Assignee: Jungtaek Lim Resolution: Fixed > FileStreamSinkLog: Avoid reading the metadata log file when finding the > latest batch ID > --- > > Key: SPARK-30915 > URL: https://issues.apache.org/jira/browse/SPARK-30915 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 3.1.0 > Reporter: Jungtaek Lim > Assignee: Jungtaek Lim > Priority: Major > Fix For: 3.1.0 > > > FileStreamSink.addBatch checks the latest batch ID before writing outputs, so > that it can skip writing a batch that was already committed. > While it's valid to compare the current batch with the latest batch ID, the > getLatest() method returns both the batch ID and the content, which means the > latest metadata log file has to be read and deserialized. This introduces heavy > latency when the latest batch is a compacted batch. > We could instead locate the metadata log file for the latest batch ID and do > only the minimal check, without reading its content.
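The minimal check proposed above can be sketched by parsing batch IDs out of metadata file names without opening any file. This is an illustrative Scala sketch under an assumed layout (borrowed from the usual metadata-log convention of one file per batch, named "<batchId>" or "<batchId>.compact"); it is not Spark's actual patch:

```scala
import java.io.File
import scala.util.Try

// Find the latest batch ID from file names alone, under the assumed
// layout: one metadata file per batch, "<batchId>" or "<batchId>.compact".
def latestBatchId(metadataDir: File): Option[Long] = {
  // listFiles() returns null on I/O error, hence the Option wrapper.
  val files = Option(metadataDir.listFiles()).getOrElse(Array.empty[File])
  val ids = files.flatMap { f =>
    // Inspect only the name; the file is never opened, so a large
    // compacted latest batch adds no deserialization latency.
    Try(f.getName.stripSuffix(".compact").toLong).toOption
  }
  if (ids.isEmpty) None else Some(ids.max)
}
```

FileStreamSink.addBatch could then skip an already-committed batch with a check like `latestBatchId(dir).exists(_ >= batchId)`, reading the full metadata content only when it actually needs to write.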
[jira] [Resolved] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-31542. -- Resolution: Not A Problem > Backport SPARK-25692 Remove static initialization of worker eventLoop > handling chunk fetch requests within TransportContext. This fixes > ChunkFetchIntegrationSuite as well > > > Key: SPARK-31542 > URL: https://issues.apache.org/jira/browse/SPARK-31542 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25692 Remove static initialization of worker eventLoop > handling chunk fetch requests within TransportContext. This fixes > ChunkFetchIntegrationSuite as well. > While the test was only flaky in the 3.0 branch, it seems possible the same > code path could be triggered in 2.4 so consider for backport. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091060#comment-17091060 ] Shixiong Zhu commented on SPARK-31542: -- [~holden] The flaky test was caused by a new improvement in 3.0: SPARK-24355 It doesn't impact branch-2.4. > Backport SPARK-25692 Remove static initialization of worker eventLoop > handling chunk fetch requests within TransportContext. This fixes > ChunkFetchIntegrationSuite as well > > > Key: SPARK-31542 > URL: https://issues.apache.org/jira/browse/SPARK-31542 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25692 Remove static initialization of worker eventLoop > handling chunk fetch requests within TransportContext. This fixes > ChunkFetchIntegrationSuite as well. > While the test was only flaky in the 3.0 branch, it seems possible the same > code path could be triggered in 2.4 so consider for backport. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31245) Document the backward incompatibility issue for Encoders.kryo and Encoders.javaSerialization
Shixiong Zhu created SPARK-31245: Summary: Document the backward incompatibility issue for Encoders.kryo and Encoders.javaSerialization Key: SPARK-31245 URL: https://issues.apache.org/jira/browse/SPARK-31245 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu We should document that when a user uses Encoders.kryo or Encoders.javaSerialization for the state store, we don't guarantee binary compatibility across different Spark versions, because these two types of encoders can be broken easily. Maybe add a new section after http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query
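Why Kryo- or Java-serialized state "can be broken easily" comes down to the encoding carrying no schema: bytes written against one class layout are misread, or fail to read at all, against another. The toy sketch below models this with Python's `struct` module (the `encode_v1`/`decode_v2` names and field layouts are invented for illustration; this is not Kryo's actual wire format):

```python
import struct

def encode_v1(count: int, total: int) -> bytes:
    """'Version 1' of some state: two little-endian int32 fields."""
    return struct.pack("<ii", count, total)

def decode_v2(data: bytes):
    """'Version 2' widened `total` to int64 and reordered the fields.

    Because the binary format carries no schema (just like Kryo- or
    Java-serialized state), a v2 reader cannot recover v1 bytes: the
    sizes no longer line up and unpacking fails outright.
    """
    total, count = struct.unpack("<qi", data)
    return total, count
```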
[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059259#comment-17059259 ] Shixiong Zhu commented on SPARK-28556: -- This change has been reverted. See SPARK-31144 for the new fix. > Error should also be sent to QueryExecutionListener.onFailure > - > > Key: SPARK-28556 > URL: https://issues.apache.org/jira/browse/SPARK-28556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now Error is not sent to QueryExecutionListener.onFailure. If there is > any Error when running a query, QueryExecutionListener.onFailure cannot be > triggered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-28556: - Docs Text: (was: In Spark 3.0, the type of "error" parameter in the "org.apache.spark.sql.util.QueryExecutionListener.onFailure" method is changed to "java.lang.Throwable" from "java.lang.Exception" to accept more types of failures such as "java.lang.Error" and its subclasses.) > Error should also be sent to QueryExecutionListener.onFailure > - > > Key: SPARK-28556 > URL: https://issues.apache.org/jira/browse/SPARK-28556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now Error is not sent to QueryExecutionListener.onFailure. If there is > any Error when running a query, QueryExecutionListener.onFailure cannot be > triggered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-28556: - Labels: (was: release-notes) > Error should also be sent to QueryExecutionListener.onFailure > - > > Key: SPARK-28556 > URL: https://issues.apache.org/jira/browse/SPARK-28556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now Error is not sent to QueryExecutionListener.onFailure. If there is > any Error when running a query, QueryExecutionListener.onFailure cannot be > triggered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31144) Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31144: - Issue Type: Bug (was: Test) > Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure > --- > > Key: SPARK-31144 > URL: https://issues.apache.org/jira/browse/SPARK-31144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Major > > SPARK-28556 changed the method QueryExecutionListener.onFailure to allow > Spark sending java.lang.Error to this method. As this change breaks APIs, we > cannot fix branch-2.4. > [~marmbrus] suggested to wrap java.lang.Error with an exception instead to > avoid a breaking change. A bonus of this solution is we can also fix the > issue (if a query throws java.lang.Error, QueryExecutionListener doesn't get > notified) in branch-2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31144) Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure
Shixiong Zhu created SPARK-31144: Summary: Wrap java.lang.Error with an exception for QueryExecutionListener.onFailure Key: SPARK-31144 URL: https://issues.apache.org/jira/browse/SPARK-31144 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Shixiong Zhu SPARK-28556 changed the method QueryExecutionListener.onFailure to allow Spark to send java.lang.Error to this method. As this change breaks APIs, we cannot fix branch-2.4. [~marmbrus] suggested wrapping java.lang.Error with an exception instead to avoid a breaking change. A bonus of this solution is that we can also fix the issue (if a query throws java.lang.Error, QueryExecutionListener doesn't get notified) in branch-2.4.
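The wrapping approach suggested above can be modeled compactly: rather than widening the listener signature to Throwable (a breaking API change), a fatal error is wrapped in an Exception subclass before the listener is notified. The sketch below is a hypothetical Python analogue (the `QueryTerminatedWithError` and `notify_on_failure` names are invented; Python's BaseException-vs-Exception split stands in for Java's Throwable-vs-Exception):

```python
class QueryTerminatedWithError(Exception):
    """Carries a fatal error to listeners whose API only accepts Exception."""
    def __init__(self, error: BaseException):
        super().__init__(f"Query terminated with error: {error!r}")
        self.error = error  # keep the original for inspection

def notify_on_failure(listener, func_name: str, error: BaseException) -> None:
    # Listeners are typed to Exception; wrap anything that is not one so
    # they still get notified instead of being silently skipped.
    exc = error if isinstance(error, Exception) else QueryTerminatedWithError(error)
    listener.on_failure(func_name, exc)
```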
[jira] [Created] (SPARK-30984) Add UI test for Structured Streaming UI
Shixiong Zhu created SPARK-30984: Summary: Add UI test for Structured Streaming UI Key: SPARK-30984 URL: https://issues.apache.org/jira/browse/SPARK-30984 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu It's better to add some basic UI test for Structured Streaming UI.
[jira] [Resolved] (SPARK-30943) Show "batch ID" in tool tip string for Structured Streaming UI graphs
[ https://issues.apache.org/jira/browse/SPARK-30943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-30943. -- Fix Version/s: 3.0.0 Assignee: Jungtaek Lim Resolution: Fixed > Show "batch ID" in tool tip string for Structured Streaming UI graphs > - > > Key: SPARK-30943 > URL: https://issues.apache.org/jira/browse/SPARK-30943 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > While we introduced the new UI for Structured Streaming, it was basically the > port of Streaming UI to Structured Streaming. > That missed the major difference between twos, ID of the "batch". In Spark > Streaming (DStream), timestamp is the key of the batch and you will need to > find any relevant information with such timestamp. > In Structured Streaming, batchId is the key of the batch. It's not a simple > difference because there're some known latency issues on periodical batches > (at least in FileStreamSource and FileStreamSink) and it's bothering to find > the right batchId if you only know about timestamp where the most of things > are logged with the batchId in Structured Streaming world. > This issue tracks the efforts on showing batchId as main aspect of x axis, > but only in tooltip string as changing x axis breaks coupling between Spark > Streaming and Structured Streaming and lots of code should be duplicated and > rewritten. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30936) Forwards-compatibility in JsonProtocol is broken
[ https://issues.apache.org/jira/browse/SPARK-30936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-30936: - Summary: Forwards-compatibility in JsonProtocol in broken (was: Fix the broken forwards-compatibility in JsonProtocol) > Forwards-compatibility in JsonProtocol in broken > > > Key: SPARK-30936 > URL: https://issues.apache.org/jira/browse/SPARK-30936 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Major > > JsonProtocol is supposed to provide strong backwards-compatibility and > forwards-compatibility guarantees: any version of Spark should be able to > read JSON output written by any other version, including newer versions. > However, the forwards-compatibility guarantee is broken for events parsed by > "ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" > (e.g., > https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33), > this event cannot be parsed by an old version of Spark History Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30936) Fix the broken forwards-compatibility in JsonProtocol
Shixiong Zhu created SPARK-30936: Summary: Fix the broken forwards-compatibility in JsonProtocol Key: SPARK-30936 URL: https://issues.apache.org/jira/browse/SPARK-30936 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Shixiong Zhu JsonProtocol is supposed to provide strong backwards-compatibility and forwards-compatibility guarantees: any version of Spark should be able to read JSON output written by any other version, including newer versions. However, the forwards-compatibility guarantee is broken for events parsed by "ObjectMapper". If a new field is added to an event parsed by "ObjectMapper" (e.g., https://github.com/apache/spark/commit/6dc5921e66d56885b95c07e56e687f9f6c1eaca7#diff-dc5c7a41fbb7479cef48b67eb41ad254R33), this event cannot be parsed by an old version of Spark History Server.
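The forwards-compatibility property the ticket describes — an old reader surviving fields added by a newer writer — hinges on the parser dropping unknown fields instead of rejecting them. A minimal sketch in plain Python (the `KNOWN_FIELDS` set and `parse_event` helper are invented for illustration; they are not Spark's JsonProtocol, whose events are parsed by a Jackson ObjectMapper):

```python
import json

# Fields this (hypothetical) "old" reader version knows about.
KNOWN_FIELDS = {"event", "taskId", "stageId"}

def parse_event(payload: str) -> dict:
    """Tolerant parse: keep known fields, silently drop unknown ones.

    A strict mapper that rejects unknown fields breaks as soon as a newer
    writer adds one; dropping unknowns preserves forwards-compatibility.
    """
    raw = json.loads(payload)
    return {k: v for k, v in raw.items() if k in KNOWN_FIELDS}
```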
[jira] [Created] (SPARK-30927) StreamingQueryManager should avoid keeping reference to terminated StreamingQuery
Shixiong Zhu created SPARK-30927: Summary: StreamingQueryManager should avoid keeping reference to terminated StreamingQuery Key: SPARK-30927 URL: https://issues.apache.org/jira/browse/SPARK-30927 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu Right now StreamingQueryManager will keep the last terminated query until "resetTerminated" is called. When the last terminated query holds lots of state (a large SQL plan, cached RDDs, etc.), this wastes memory unnecessarily.
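One way to model the improvement is to hold the terminated query only weakly, so the query object (and everything it pins: plans, cached data) becomes collectable while termination bookkeeping still works. This is an illustrative sketch, not Spark's actual implementation (the `QueryManager` class is invented):

```python
import weakref

class QueryManager:
    """Tracks the last terminated query without keeping it alive."""

    def __init__(self):
        self._last_terminated = None

    def on_terminated(self, query) -> None:
        # A weak reference lets the garbage collector reclaim the query
        # and its state once no strong references remain.
        self._last_terminated = weakref.ref(query)

    def last_terminated(self):
        """Return the query if it is still alive, else None."""
        return self._last_terminated() if self._last_terminated else None
```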
[jira] [Commented] (SPARK-15006) Generated JavaDoc should hide package private objects
[ https://issues.apache.org/jira/browse/SPARK-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033886#comment-17033886 ] Shixiong Zhu commented on SPARK-15006: -- cc [~jodersky] has this fixed in the upstream? > Generated JavaDoc should hide package private objects > - > > Key: SPARK-15006 > URL: https://issues.apache.org/jira/browse/SPARK-15006 > Project: Spark > Issue Type: Improvement > Components: Build, Documentation >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > After we switch to the official release of genjavadoc in SPARK-14511, package > private objects are no longer hidden in the generated JavaDoc. This JIRA is > to track this upstream issue and update genjavadoc in Spark when there comes > a fix in the upstream. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15006) Generated JavaDoc should hide package private objects
[ https://issues.apache.org/jira/browse/SPARK-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reopened SPARK-15006: -- Reopening this since the issue still exists (See http://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/util/Utils.html ) It would be great that we can fix this. > Generated JavaDoc should hide package private objects > - > > Key: SPARK-15006 > URL: https://issues.apache.org/jira/browse/SPARK-15006 > Project: Spark > Issue Type: Improvement > Components: Build, Documentation >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > After we switch to the official release of genjavadoc in SPARK-14511, package > private objects are no longer hidden in the generated JavaDoc. This JIRA is > to track this upstream issue and update genjavadoc in Spark when there comes > a fix in the upstream. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30779) Review and fix issues in Structured Streaming API docs
Shixiong Zhu created SPARK-30779: Summary: Review and fix issues in Structured Streaming API docs Key: SPARK-30779 URL: https://issues.apache.org/jira/browse/SPARK-30779 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu We should do an API review to ensure we don't leak public APIs unintentionally.
[jira] [Created] (SPARK-30778) Javadoc of Java classes doesn't show up in Scala doc
Shixiong Zhu created SPARK-30778: Summary: Javadoc of Java classes doesn't show up in Scala doc Key: SPARK-30778 URL: https://issues.apache.org/jira/browse/SPARK-30778 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.4.5 Reporter: Shixiong Zhu There are some APIs written in Java. However, Javadoc of Java classes doesn't show up in Scala doc. For example, http://spark.apache.org/docs/2.4.5/api/scala/index.html#org.apache.spark.sql.SaveMode Scala users need to look at Javadoc for these Java classes right now. This looks like a unidoc bug, but it's worth investigating whether we can fix it since we are adding more and more Java classes.
[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error
[ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028223#comment-17028223 ] Shixiong Zhu commented on SPARK-30657: -- [~tdas] Make sense. Agreed that the risk is high but the benefit is pretty low. We can backport it later whenever needed. > Streaming limit after streaming dropDuplicates can throw error > -- > > Key: SPARK-30657 > URL: https://issues.apache.org/jira/browse/SPARK-30657 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 3.0.0 > > > {{LocalLimitExec}} does not consume the iterator of the child plan. So if > there is a limit after a stateful operator like streaming dedup in append > mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of > streaming duplicate may not be committed (most stateful ops commit state > changes only after the generated iterator is fully consumed). This leads to > the next batch failing with {{java.lang.IllegalStateException: Error reading > delta file .../N.delta does not exist}} as the state store delta file was > never generated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30658) Limit on streaming dataframe before streaming agg returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-30658: - Fix Version/s: 3.0.0 > Limit after on streaming dataframe before streaming agg returns wrong results > - > > Key: SPARK-30658 > URL: https://issues.apache.org/jira/browse/SPARK-30658 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Priority: Critical > Fix For: 3.0.0 > > > Limit before a streaming aggregate (i.e. [[df.limit(5).groupBy().count()}}) > in complete mode was not being planned as a streaming limit. The planner rule > planned a logical limit with a stateful streaming limit plan only if the > query is in append mode. As a result, instead of allowing max 5 rows across > batches, the planned streaming query was allowing 5 rows in every batch thus > producing incorrect results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
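The SPARK-30658 bug above is about where the limit's state lives: a correct streaming limit allows at most n rows across *all* batches by persisting a running count, whereas the buggy plan allowed n rows in every batch. A minimal sketch of the intended semantics (the `StreamingGlobalLimit` class is invented for illustration; it is not Spark's stateful operator, whose count lives in a state store):

```python
class StreamingGlobalLimit:
    """Allow at most `n` rows in total across all processed batches."""

    def __init__(self, n: int):
        self.n = n
        self.emitted = 0  # stands in for the operator's persisted state

    def process_batch(self, rows: list) -> list:
        # Emit only as many rows as the global budget still allows.
        out = rows[: max(0, self.n - self.emitted)]
        self.emitted += len(out)
        return out
```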
[jira] [Updated] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error
[ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-30657: - Fix Version/s: 3.0.0 > Streaming limit after streaming dropDuplicates can throw error > -- > > Key: SPARK-30657 > URL: https://issues.apache.org/jira/browse/SPARK-30657 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 3.0.0 > > > {{LocalLimitExec}} does not consume the iterator of the child plan. So if > there is a limit after a stateful operator like streaming dedup in append > mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of > streaming duplicate may not be committed (most stateful ops commit state > changes only after the generated iterator is fully consumed). This leads to > the next batch failing with {{java.lang.IllegalStateException: Error reading > delta file .../N.delta does not exist}} as the state store delta file was > never generated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
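The SPARK-30657 failure mode above — a limit that stops early leaves an upstream stateful operator's commit hook unfired — can be modeled with a plain iterator. This is a hypothetical sketch (the `local_limit` function and `on_exhausted` callback are invented), not Spark's physical LocalLimitExec:

```python
from typing import Callable, Iterator, List

def local_limit(child: Iterator, n: int, on_exhausted: Callable[[], None]) -> List:
    """Take up to `n` rows but still drain the child iterator.

    Many stateful operators commit their state changes only after their
    generated iterator is fully consumed (modeled here by `on_exhausted`).
    Stopping after `n` rows would leave those changes uncommitted, so we
    keep iterating past `n` purely to exhaust the child.
    """
    taken = []
    for row in child:
        if len(taken) < n:
            taken.append(row)
        # rows beyond `n` are consumed and discarded, not emitted
    on_exhausted()
    return taken
```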
[jira] [Resolved] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1
[ https://issues.apache.org/jira/browse/SPARK-30656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-30656. -- Fix Version/s: 3.0.0 Resolution: Done > Support the "minPartitions" option in Kafka batch source and streaming source > v1 > > > Key: SPARK-30656 > URL: https://issues.apache.org/jira/browse/SPARK-30656 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now, the "minPartitions" option only works in Kafka streaming source > v2. It would be great that we can support it in batch and streaming source v1 > (v1 is the fallback mode when a user hits a regression in v2) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
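Conceptually, the "minPartitions" option asks the planner to split Kafka offset ranges into at least that many Spark partitions, proportionally to each range's size. The sketch below is a simplified illustration (the `split_offset_ranges` helper is invented; real Spark weighs ranges and handles rounding and locality differently):

```python
def split_offset_ranges(ranges, min_partitions):
    """Split (topic-partition, start, end) offset ranges into at least
    `min_partitions` pieces, proportionally to each range's size.
    """
    total = sum(end - start for _, start, end in ranges)
    # Nothing to do if there is no data or already enough partitions.
    if total == 0 or min_partitions <= len(ranges):
        return list(ranges)
    out = []
    for tp, start, end in ranges:
        size = end - start
        # Pieces for this range, proportional to its share of all offsets.
        parts = max(1, round(min_partitions * size / total))
        step = size / parts
        bounds = [start + int(i * step) for i in range(parts)] + [end]
        for a, b in zip(bounds, bounds[1:]):
            if b > a:
                out.append((tp, a, b))
    return out
```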
[jira] [Resolved] (SPARK-29543) Support Structured Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-29543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-29543. -- Fix Version/s: 3.0.0 Assignee: Genmao Yu Resolution: Done > Support Structured Streaming UI > --- > > Key: SPARK-29543 > URL: https://issues.apache.org/jira/browse/SPARK-29543 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming, Web UI >Affects Versions: 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Major > Fix For: 3.0.0 > > > Open this jira to support structured streaming UI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026098#comment-17026098 ] Shixiong Zhu commented on SPARK-28556: -- This also reminds me that we should also review all public APIs to see if there is any similar issue, since 3.0.0 is a good chance to fix API bugs. > Error should also be sent to QueryExecutionListener.onFailure > - > > Key: SPARK-28556 > URL: https://issues.apache.org/jira/browse/SPARK-28556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Right now Error is not sent to QueryExecutionListener.onFailure. If there is > any Error when running a query, QueryExecutionListener.onFailure cannot be > triggered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure
[ https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025655#comment-17025655 ] Shixiong Zhu commented on SPARK-28556: -- This is not deprecating the API. We just fixed an issue in an experimental API. Hence, I don't think we need any deprecation note. > Error should also be sent to QueryExecutionListener.onFailure > - > > Key: SPARK-28556 > URL: https://issues.apache.org/jira/browse/SPARK-28556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Right now Error is not sent to QueryExecutionListener.onFailure. If there is > any Error when running a query, QueryExecutionListener.onFailure cannot be > triggered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1
[ https://issues.apache.org/jira/browse/SPARK-30656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-30656: - Issue Type: Improvement (was: Bug) > Support the "minPartitions" option in Kafka batch source and streaming source > v1 > > > Key: SPARK-30656 > URL: https://issues.apache.org/jira/browse/SPARK-30656 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Right now, the "minPartitions" option only works in Kafka streaming source > v2. It would be great that we can support it in batch and streaming source v1 > (v1 is the fallback mode when a user hits a regression in v2) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.
[ https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26060: - Labels: release-notes (was: ) > Track SparkConf entries and make SET command reject such entries. > - > > Key: SPARK-26060 > URL: https://issues.apache.org/jira/browse/SPARK-26060 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Currently the {{SET}} command works without any warnings even if the > specified key is for {{SparkConf}} entries and it has no effect because the > command does not update {{SparkConf}}, but the behavior might confuse users. > We should track {{SparkConf}} entries and make the command reject for such > entries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19287) JavaPairRDD flatMapValues requires function returning Iterable, not Iterator
[ https://issues.apache.org/jira/browse/SPARK-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19287: - Docs Text: JavaPairRDD/JavaPairDStream.flatMapValues() now requires a FlatMapFunction as an argument. This means this function now must return an Iterator, not Iterable. This corrects a long-standing inconsistency between the Scala and Java API, and allows the caller to supply merely an Iterator, not a full Iterable. Existing functions passed to this method can simply invoke ".iterator()" on their existing return value to comply with the new signature. (was: JavaPairRDD.flatMapValues() now requires a FlatMapFunction as an argument. This means this function now must return an Iterator, not Iterable. This corrects a long-standing inconsistency between the Scala and Java API, and allows the caller to supply merely an Iterator, not a full Iterable. Existing functions passed to this method can simply invoke ".iterator()" on their existing return value to comply with the new signature.) > JavaPairRDD flatMapValues requires function returning Iterable, not Iterator > > > Key: SPARK-19287 > URL: https://issues.apache.org/jira/browse/SPARK-19287 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > SPARK-3369 corrected an old oversight in the Java API, wherein > {{FlatMapFunction}} required an {{Iterable}} rather than {{Iterator}}. 
As > reported by [~akrim], it seems that this same type of problem was overlooked > also in {{JavaPairRDD}} > (https://github.com/apache/spark/blob/6c00c069e3c3f5904abd122cea1d56683031cca0/core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677 > ): > {code} > def flatMapValues[U](f: JFunction[V, java.lang.Iterable[U]]): JavaPairRDD[K, > U] = > {code} > As in {{PairRDDFunctions.scala}}, whose {{flatMapValues}} operates on > {{TraversableOnce}}, this should really take a function that returns an > {{Iterator}} -- really, {{FlatMapFunction}}. > We can easily add an overload and deprecate the existing method. > {code} > def flatMapValues[U](f: FlatMapFunction[V, U]): JavaPairRDD[K, U] > {code} > This is source- and binary-backwards-compatible, in Java 7. It's > binary-backwards-compatible in Java 8, but not source-compatible. The > following natural usage with Java 8 lambdas becomes ambiguous and won't > compile -- Java won't figure out which to implement even based on the return > type unfortunately: > {code} > JavaPairRDD pairRDD = ... > JavaPairRDD mappedRDD = > pairRDD.flatMapValues(s -> Arrays.asList(s.length()).iterator()); > {code} > It can be resolved by explicitly casting the lambda. > We can at least document this. One day in Spark 3.x this can just be changed > outright. > It's conceivable to resolve this by making the new method called > "flatMapValues2" or something ugly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1
Shixiong Zhu created SPARK-30656: Summary: Support the "minPartitions" option in Kafka batch source and streaming source v1 Key: SPARK-30656 URL: https://issues.apache.org/jira/browse/SPARK-30656 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now, the "minPartitions" option only works in Kafka streaming source v2. It would be great if we could support it in batch and streaming source v1 (v1 is the fallback mode when a user hits a regression in v2) as well.
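To make the intent of "minPartitions" concrete, here is a simplified, Spark-free sketch of the underlying idea: splitting each Kafka topic-partition's offset range into enough pieces that at least minPartitions input splits are produced. The OffsetRange type and the proportional-splitting heuristic are illustrative assumptions, not Spark's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class MinPartitionsSketch {
    // One Kafka topic-partition's unread offset range (name is illustrative).
    public record OffsetRange(String topicPartition, long from, long until) {
        long size() { return until - from; }
    }

    // Split ranges so that at least minPartitions splits come out, dividing
    // each range proportionally to its share of the total backlog. This is
    // a sketch of the idea only, not Spark's actual algorithm.
    public static List<OffsetRange> split(List<OffsetRange> ranges, int minPartitions) {
        long total = ranges.stream().mapToLong(OffsetRange::size).sum();
        List<OffsetRange> out = new ArrayList<>();
        for (OffsetRange r : ranges) {
            // Pieces for this range, proportional to its size; at least one.
            int parts = Math.max(1,
                    (int) Math.round((double) r.size() / total * minPartitions));
            long step = Math.max(1, r.size() / parts);
            long start = r.from;
            for (int i = 0; i < parts; i++) {
                long end = (i == parts - 1) ? r.until : Math.min(r.until, start + step);
                out.add(new OffsetRange(r.topicPartition, start, end));
                start = end;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<OffsetRange> backlog = List.of(
                new OffsetRange("topic-0", 0, 100),
                new OffsetRange("topic-1", 0, 100));
        // 2 Kafka partitions, minPartitions = 4 -> 4 splits of 50 offsets each.
        split(backlog, 4).forEach(System.out::println);
    }
}
```

The point of the option is exactly this decoupling: the number of Spark input splits no longer has to equal the number of Kafka partitions, which helps when a few Kafka partitions hold most of the data.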
[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0
[ https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25382: - Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format("image").load(path)` instead. (was: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format(\"image\").load(path)` instead.) > Remove ImageSchema.readImages in 3.0 > > > Key: SPARK-25382 > URL: https://issues.apache.org/jira/browse/SPARK-25382 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > A follow-up task from SPARK-25345. We might need to support sampling > (SPARK-25383) in order to remove readImages.
[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0
[ https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25382: - Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed. Use `spark.read.format(\"image\").load(path)` instead. (was: In Spark 3.0.0, the deprecated ImageSchema class and its readImages methods have been removed.) > Remove ImageSchema.readImages in 3.0 > > > Key: SPARK-25382 > URL: https://issues.apache.org/jira/browse/SPARK-25382 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > A follow-up task from SPARK-25345. We might need to support sampling > (SPARK-25383) in order to remove readImages.
[jira] [Updated] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26132: - Docs Text: Scala 2.11 support is removed in Apache Spark 3.0.0. > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.
[jira] [Updated] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26132: - Labels: release-notes (was: ) > Remove support for Scala 2.11 in Spark 3.0.0 > > > Key: SPARK-26132 > URL: https://issues.apache.org/jira/browse/SPARK-26132 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Per some discussion on the mailing list, we are _considering_ formally not > supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.
[jira] [Updated] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-24516: - Labels: release-no (was: ) > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Assignee: Ilan Filonenko >Priority: Minor > Labels: release-no > Fix For: 3.0.0 > > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor; Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643].
[jira] [Updated] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-24516: - Labels: release-notes (was: release-no) > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Assignee: Ilan Filonenko >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor; Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643].
[jira] [Updated] (SPARK-25016) remove Support for hadoop 2.6
[ https://issues.apache.org/jira/browse/SPARK-25016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25016: - Docs Text: Hadoop 2.6.x support is removed. The default Hadoop version in Spark 3.0.0 is 2.7.x. > remove Support for hadoop 2.6 > - > > Key: SPARK-25016 > URL: https://issues.apache.org/jira/browse/SPARK-25016 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Assignee: Sean R. Owen >Priority: Major > Labels: release_notes > Fix For: 3.0.0 > > > Hadoop 2.6 is now old; no releases have been done in 2 years, and it doesn't > look like they are patching it for security or bug fixes. We should stop > supporting it in Spark.
[jira] [Updated] (SPARK-25016) remove Support for hadoop 2.6
[ https://issues.apache.org/jira/browse/SPARK-25016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25016: - Labels: release_notes (was: ) > remove Support for hadoop 2.6 > - > > Key: SPARK-25016 > URL: https://issues.apache.org/jira/browse/SPARK-25016 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Assignee: Sean R. Owen >Priority: Major > Labels: release_notes > Fix For: 3.0.0 > > > Hadoop 2.6 is now old; no releases have been done in 2 years, and it doesn't > look like they are patching it for security or bug fixes. We should stop > supporting it in Spark.
[jira] [Updated] (SPARK-29930) Remove SQL configs declared to be removed in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-29930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-29930: - Labels: release-notes (was: ) > Remove SQL configs declared to be removed in Spark 3.0 > -- > > Key: SPARK-29930 > URL: https://issues.apache.org/jira/browse/SPARK-29930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Need to remove the following SQL configs: > * spark.sql.fromJsonForceNullableSchema > * spark.sql.legacy.compareDateTimestampInTimestamp