[jira] [Updated] (SPARK-35267) nullable field is set to false for integer type when using reflection to get StructType for a case class
[ https://issues.apache.org/jira/browse/SPARK-35267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ganesh Chand updated SPARK-35267: - Description: {code:java} // code placeholder object Util { def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): StructType = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType] } case class MyTable(val a: Int, val b: String) // The following test fails because the schema returned via reflection sets nullable=false for the integer column. The test passes if you set nullable to false in the expected schema. it must "return a Spark Schema of type StructType for a case class" in { val schemaFromCaseClass: StructType = Util.toStructType[MyTable] val expectedSchema = new StructType().add(StructField("a", IntegerType, true)).add(StructField("b", StringType, true)) schemaFromCaseClass.size mustBe 2 schemaFromCaseClass mustBe expectedSchema } {code} was: {code:java} // code placeholder object Util { def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): StructType = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType] } case class MyTable(val a: Int, val b: String) // The following test fails because the schema returned via reflection sets nullable=false for the integer column. The test passes if you set nullable to false in the expected schema. it must "return a Spark Schema of type StructType for a case class" in { val schemaFromCaseClass: StructType = Util.toStructType[MyTable] val expectedSchema = new StructType().add(StructField("a", IntegerType, true)).add(StructField("b", StringType, true)) schemaFromCaseClass.size mustBe 2 schemaFromCaseClass mustBe expectedSchema } {code} > nullable field is set to false for integer type when using reflection to get > StructType for a case class > > > Key: SPARK-35267 > URL: https://issues.apache.org/jira/browse/SPARK-35267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: Scala version: 2.12.13 > sparkVersion = "3.1.1" >Reporter: Ganesh Chand >Priority: Major > > {code:java} > // code placeholder > object Util { > def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): StructType = > ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType] > } > case class MyTable(val a: Int, val b: String) > // The following test fails because the schema returned via reflection sets nullable=false for the integer column. > // The test passes if you set nullable to false in the expected schema. > it must "return a Spark Schema of type StructType for a case class" in { > val schemaFromCaseClass: StructType = Util.toStructType[MyTable] > val expectedSchema = new StructType().add(StructField("a", IntegerType, true)).add(StructField("b", StringType, true)) > schemaFromCaseClass.size mustBe 2 > schemaFromCaseClass mustBe expectedSchema > } > {code}
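Scala's {{Int}} is a JVM primitive and can never hold null, which is why reflection marks it nullable=false; only reference-typed fields such as {{String}} default to nullable=true. If a fully nullable schema is the goal, one workaround is to rewrite the reflected fields. A minimal sketch (the copy-based rewrite below is illustrative, not an official Spark API recommendation):

{code:scala}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class MyTable(a: Int, b: String)

// Reflection-derived schema: "a" comes back nullable=false because a
// primitive Int has no null representation.
val reflected = ScalaReflection.schemaFor[MyTable].dataType.asInstanceOf[StructType]

// Force every top-level field to nullable=true.
val allNullable = StructType(reflected.map(_.copy(nullable = true)))
assert(allNullable("a").nullable && allNullable("b").nullable)
{code}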
[jira] [Created] (SPARK-35267) nullable field is set to false for integer type when using reflection to get StructType for a case class
Ganesh Chand created SPARK-35267: Summary: nullable field is set to false for integer type when using reflection to get StructType for a case class Key: SPARK-35267 URL: https://issues.apache.org/jira/browse/SPARK-35267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Environment: Scala version: 2.12.13 sparkVersion = "3.1.1" Reporter: Ganesh Chand {code:java} // code placeholder object Util { def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): StructType = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType] } case class MyTable(val a: Int, val b: String) // The following test fails because the schema returned via reflection sets nullable=false for the integer column. The test passes if you set nullable to false in the expected schema. it must "return a Spark Schema of type StructType for a case class" in { val schemaFromCaseClass: StructType = Util.toStructType[MyTable] val expectedSchema = new StructType().add(StructField("a", IntegerType, true)).add(StructField("b", StringType, true)) schemaFromCaseClass.size mustBe 2 schemaFromCaseClass mustBe expectedSchema } {code}
[jira] [Resolved] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands
[ https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-35105. Fix Version/s: 3.2.0 Resolution: Fixed This issue was resolved in https://github.com/apache/spark/pull/32205. > Support multiple paths for ADD FILE/JAR/ARCHIVE commands > > > Key: SPARK-35105 > URL: https://issues.apache.org/jira/browse/SPARK-35105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path > arguments. > It would be great if those commands could take multiple paths, as Hive does.
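For reference, Hive accepts several whitespace-separated paths in a single ADD FILE/JAR statement. A hedged sketch of what the improved commands would look like from the SQL interface, assuming the same whitespace-separated form (the exact grammar is determined by the linked PR, and the paths are made up):

{code:scala}
// Hypothetical usage once multiple path arguments are accepted:
spark.sql("ADD FILE /tmp/deps/lookup1.txt /tmp/deps/lookup2.txt")
spark.sql("ADD JAR /tmp/deps/udf1.jar /tmp/deps/udf2.jar")
spark.sql("ADD ARCHIVE /tmp/deps/models.zip /tmp/deps/dicts.zip")
{code}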
[jira] [Resolved] (SPARK-35226) JDBC datasources should accept refreshKrb5Config parameter
[ https://issues.apache.org/jira/browse/SPARK-35226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-35226. Fix Version/s: 3.2.0 3.1.2 Resolution: Fixed This issue was resolved in https://github.com/apache/spark/pull/32344. > JDBC datasources should accept refreshKrb5Config parameter > -- > > Key: SPARK-35226 > URL: https://issues.apache.org/jira/browse/SPARK-35226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > In the current master, JDBC datasources can't accept the refreshKrb5Config parameter, which > is defined in Krb5LoginModule. > So even if we change krb5.conf after establishing a connection, the > change will not be reflected.
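{{refreshKrb5Config}} is a standard {{Krb5LoginModule}} option that forces JAAS to re-read krb5.conf before the next login. A sketch of how it would be supplied through the JDBC datasource, assuming the option name simply mirrors the JAAS setting ({{keytab}} and {{principal}} are existing Spark JDBC options; the URL and table are made up):

{code:scala}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/mydb")
  .option("dbtable", "public.accounts")
  .option("keytab", "/etc/security/keytabs/spark.keytab")
  .option("principal", "spark@EXAMPLE.COM")
  // Re-read krb5.conf at login time, so edits made after the first
  // connection are picked up.
  .option("refreshKrb5Config", "true")
  .load()
{code}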
[jira] [Commented] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335144#comment-17335144 ] Apache Spark commented on SPARK-35264: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/32391 > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (AQE) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
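The isolation described above implies two thresholds: the existing static one and a new AQE-side one applied against accurate runtime statistics. A sketch of how the split would look to users, assuming a {{spark.sql.adaptive.*}}-scoped config name (illustrative until the PR lands):

{code:scala}
// Static planner threshold: based on pre-execution size estimates.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")

// AQE-side threshold: applied to runtime shuffle statistics, so it can
// safely be more aggressive than the static one.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "50MB")
{code}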
[jira] [Assigned] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35264: Assignee: (was: Apache Spark) > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (AQE) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Assigned] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35264: Assignee: Apache Spark > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (AQE) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Commented] (SPARK-35266) Fix an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory
[ https://issues.apache.org/jira/browse/SPARK-35266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335143#comment-17335143 ] Byungsoo Oh commented on SPARK-35266: - I fixed this issue and checked that it works. If it's okay, I will submit a pull request for this. > Fix an error in BenchmarkBase.scala that occurs when creating a benchmark > file in a non-existent directory > -- > > Key: SPARK-35266 > URL: https://issues.apache.org/jira/browse/SPARK-35266 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.2.0 >Reporter: Byungsoo Oh >Priority: Minor > Labels: easyfix > > When submitting a benchmark job using the _org.apache.spark.benchmark.Benchmarks_ > class with the _SPARK_GENERATE_BENCHMARK_FILES=1_ option, an exception is raised > if the directory where the benchmark file will be generated does not exist. > For example, if you execute _BLASBenchmark_ with the command below, you get > an error unless you manually create the _benchmarks/_ directory under > _spark/mllib-local/_. > {code:java} > SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit \ > --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \ > --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | > paste -sd ',' -`" \ > "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \ > "org.apache.spark.ml.linalg.BLASBenchmark" > {code} > This is caused by the code in _BenchmarkBase.scala_, which attempts > to create the benchmark file without validating the path.
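The fix implied by the last sentence is small: make sure the parent directory exists before opening the output file. A sketch under that assumption (not necessarily the submitted patch):

{code:scala}
import java.io.{File, FileOutputStream}

val file = new File("benchmarks/BLASBenchmark-results.txt")
// Previously missing step: without this, new FileOutputStream(file) throws
// FileNotFoundException when benchmarks/ does not exist yet.
val dir = file.getParentFile
if (dir != null && !dir.exists()) {
  dir.mkdirs()
}
val out = new FileOutputStream(file)
try {
  // ... write benchmark results ...
} finally {
  out.close()
}
{code}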
[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-35264: Description: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (AQE) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. was: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (aqe) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (AQE) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Created] (SPARK-35266) Fix an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory
Byungsoo Oh created SPARK-35266: --- Summary: Fix an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory Key: SPARK-35266 URL: https://issues.apache.org/jira/browse/SPARK-35266 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.2.0 Reporter: Byungsoo Oh When submitting a benchmark job using the _org.apache.spark.benchmark.Benchmarks_ class with the _SPARK_GENERATE_BENCHMARK_FILES=1_ option, an exception is raised if the directory where the benchmark file will be generated does not exist. For example, if you execute _BLASBenchmark_ with the command below, you get an error unless you manually create the _benchmarks/_ directory under _spark/mllib-local/_. {code:java} SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit \ --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \ --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \ "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \ "org.apache.spark.ml.linalg.BLASBenchmark" {code} This is caused by the code in _BenchmarkBase.scala_, which attempts to create the benchmark file without validating the path.
[jira] [Resolved] (SPARK-35135) Duplicate code implementation of `WritablePartitionedIterator`
[ https://issues.apache.org/jira/browse/SPARK-35135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-35135. -- Fix Version/s: 3.2.0 Target Version/s: 3.2.0 Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/32232 > Duplicate code implementation of `WritablePartitionedIterator` > -- > > Key: SPARK-35135 > URL: https://issues.apache.org/jira/browse/SPARK-35135 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.2.0 > > > `WritablePartitionedIterator` is defined in > `WritablePartitionedPairCollection.scala`, and there are two implementations of > this trait, but the code for the two implementations is duplicated
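One way to remove the duplication, assuming the two anonymous implementations differ only in the iterator they wrap, is a single concrete class parameterized by that iterator. A sketch ({{PairsWriter}} is reduced to a minimal trait here so the example stays self-contained):

{code:scala}
trait PairsWriter {
  def write(key: Any, value: Any): Unit
}

// One shared implementation instead of two copies of the same logic.
class WritablePartitionedIterator[K, V](it: Iterator[((Int, K), V)]) {
  private[this] var cur: ((Int, K), V) = if (it.hasNext) it.next() else null

  def writeNext(writer: PairsWriter): Unit = {
    writer.write(cur._1._2, cur._2) // write (key, value), dropping the partition id
    cur = if (it.hasNext) it.next() else null
  }

  def hasNext: Boolean = cur != null

  def nextPartition(): Int = cur._1._1
}
{code}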
[jira] [Assigned] (SPARK-35135) Duplicate code implementation of `WritablePartitionedIterator`
[ https://issues.apache.org/jira/browse/SPARK-35135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi reassigned SPARK-35135: Assignee: Yang Jie (was: Apache Spark) > Duplicate code implementation of `WritablePartitionedIterator` > -- > > Key: SPARK-35135 > URL: https://issues.apache.org/jira/browse/SPARK-35135 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > `WritablePartitionedIterator` is defined in > `WritablePartitionedPairCollection.scala`, and there are two implementations of > this trait, but the code for the two implementations is duplicated
[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-35264: Description: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (aqe) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. was: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > Actually, we do not fully trust static stats for deciding whether a broadcast hash join can be built. In our experience it's very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible (once we convert a join to a broadcast hash join the first time, we (aqe) cannot optimize it again), it makes sense to decide whether we can do a broadcast on the AQE side using a different SQL config. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal
[ https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335110#comment-17335110 ] Apache Spark commented on SPARK-34786: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/32390 > read parquet uint64 as decimal > -- > > Key: SPARK-34786 > URL: https://issues.apache.org/jira/browse/SPARK-34786 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.0 > > > Currently Spark can't read Parquet uint64 because it doesn't fit Spark's long > type. We can read uint64 as decimal as a workaround.
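Why decimal works: uint64 values go up to 18446744073709551615, roughly twice {{Long.MaxValue}}, so they cannot be stored in Spark's signed {{LongType}}, while a {{Decimal(20, 0)}} covers the whole range. A sketch of the bit reinterpretation (illustrative, not the merged code):

{code:scala}
import java.math.{BigDecimal, BigInteger}

// Reinterpret the raw 8 bytes of a Parquet uint64 (surfaced as a signed
// Long) as an unsigned value.
def uint64ToDecimal(rawBits: Long): BigDecimal =
  new BigDecimal(new BigInteger(java.lang.Long.toUnsignedString(rawBits)))

uint64ToDecimal(-1L) // == 18446744073709551615, not -1
{code}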
[jira] [Updated] (SPARK-35265) abs returns negative
[ https://issues.apache.org/jira/browse/SPARK-35265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuzhenjie updated SPARK-35265: --- Component/s: (was: Spark Core) PySpark Affects Version/s: 3.1.1 > abs returns negative > --- > > Key: SPARK-35265 > URL: https://issues.apache.org/jira/browse/SPARK-35265 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.1 >Reporter: liuzhenjie >Priority: Major > > from pyspark.sql.functions import lit, abs, concat, hash, col > df = df.withColumn('partition_id', lit(-2147483648)) > df = df.withColumn('abs_id', abs(col('partition_id'))) > df.select('abs_id','partition_id').show() > > when the number is -2147483648, the abs method returns a negative value:
> +-----------+------------+
> |     abs_id|partition_id|
> +-----------+------------+
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> +-----------+------------+
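This is two's-complement overflow rather than a Spark-specific bug: the 32-bit integer range is asymmetric (-2147483648 to 2147483647), so the absolute value of the minimum has no Int representation and wraps back to itself; {{java.lang.Math.abs}} behaves the same way. A workaround sketch, widening to a 64-bit column before taking the absolute value (shown in Scala rather than the reporter's PySpark, with the column names from the report):

{code:scala}
import org.apache.spark.sql.functions.{abs, col, lit}

val df = spark.range(1).select(lit(-2147483648).as("partition_id"))

// Cast to long first, so abs(-2147483648) = 2147483648 fits.
df.select(abs(col("partition_id").cast("long")).as("abs_id")).show()
// +----------+
// |    abs_id|
// +----------+
// |2147483648|
// +----------+
{code}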
[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal
[ https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335108#comment-17335108 ] Apache Spark commented on SPARK-34786: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/32390 > read parquet uint64 as decimal > -- > > Key: SPARK-34786 > URL: https://issues.apache.org/jira/browse/SPARK-34786 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.0 > > > Currently Spark can't read Parquet uint64 because it doesn't fit Spark's long > type. We can read uint64 as decimal as a workaround.
[jira] [Created] (SPARK-35265) abs returns negative
liuzhenjie created SPARK-35265: -- Summary: abs returns negative Key: SPARK-35265 URL: https://issues.apache.org/jira/browse/SPARK-35265 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: liuzhenjie from pyspark.sql.functions import lit, abs, concat, hash, col df = df.withColumn('partition_id', lit(-2147483648)) df = df.withColumn('abs_id', abs(col('partition_id'))) df.select('abs_id','partition_id').show() when the number is -2147483648, the abs method returns a negative value
[jira] [Updated] (SPARK-35265) abs returns negative
[ https://issues.apache.org/jira/browse/SPARK-35265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuzhenjie updated SPARK-35265: --- Description: from pyspark.sql.functions import lit, abs, concat, hash, col df = df.withColumn('partition_id', lit(-2147483648)) df = df.withColumn('abs_id', abs(col('partition_id'))) df.select('abs_id','partition_id').show() when the number is -2147483648, the abs method returns a negative value:
+-----------+------------+
|     abs_id|partition_id|
+-----------+------------+
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
+-----------+------------+
was: from pyspark.sql.functions import lit, abs, concat, hash, col df = df.withColumn('partition_id', lit(-2147483648)) df = df.withColumn('abs_id', abs(col('partition_id'))) df.select('abs_id','partition_id').show() when the number is -2147483648, the abs method returns a negative value > abs returns negative > --- > > Key: SPARK-35265 > URL: https://issues.apache.org/jira/browse/SPARK-35265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: liuzhenjie >Priority: Major > > from pyspark.sql.functions import lit, abs, concat, hash, col > df = df.withColumn('partition_id', lit(-2147483648)) > df = df.withColumn('abs_id', abs(col('partition_id'))) > df.select('abs_id','partition_id').show() > > when the number is -2147483648, the abs method returns a negative value:
> +-----------+------------+
> |     abs_id|partition_id|
> +-----------+------------+
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> +-----------+------------+
[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold
[ https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-35264: Description: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. was: The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint. > Support AQE side broadcastJoin threshold > > > Key: SPARK-35264 > URL: https://issues.apache.org/jira/browse/SPARK-35264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Major > > The main idea here is to make the join configs isolated between the normal planner > and the AQE planner, which share the same code path. > In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. > For now we only support selecting a strategy for equi-joins, in this order: > 1. mark the join as a broadcast hash join if possible > 2. mark the join as a shuffled hash join if possible > Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Created] (SPARK-35264) Support AQE side broadcastJoin threshold
ulysses you created SPARK-35264: --- Summary: Support AQE side broadcastJoin threshold Key: SPARK-35264 URL: https://issues.apache.org/jira/browse/SPARK-35264 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: ulysses you The main idea here is to make the join configs isolated between the normal planner and the AQE planner, which share the same code path. In order to achieve this, we insert a specific join hint in advance within the AQE framework, and the JoinSelection side will then take and follow the inserted hint. For now we only support selecting a strategy for equi-joins, in this order: 1. mark the join as a broadcast hash join if possible 2. mark the join as a shuffled hash join if possible Note that we don't override the join strategy if the user specifies a join hint.
[jira] [Commented] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335094#comment-17335094 ] L. C. Hsieh commented on SPARK-35227: - This issue was resolved by https://github.com/apache/spark/pull/32346. > Replace Bintray with the new repository service for the spark-packages > resolver in SparkSubmit > -- > > Key: SPARK-35227 > URL: https://issues.apache.org/jira/browse/SPARK-35227 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0 > > > As Bintray is being shut down, we have set up a new repository service at > repos.spark-packages.org. We need to replace Bintray with the new service for > the spark-packages resolver in SparkSubmit.
[jira] [Updated] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35227: Issue Type: Improvement (was: Task) > Replace Bintray with the new repository service for the spark-packages > resolver in SparkSubmit > -- > > Key: SPARK-35227 > URL: https://issues.apache.org/jira/browse/SPARK-35227 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0 > > > As Bintray is being shut down, we have set up a new repository service at > repos.spark-packages.org. We need to replace Bintray with the new service for > the spark-packages resolver in SparkSubmit.
[jira] [Assigned] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35227: --- Assignee: Bo Zhang > Replace Bintray with the new repository service for the spark-packages > resolver in SparkSubmit > -- > > Key: SPARK-35227 > URL: https://issues.apache.org/jira/browse/SPARK-35227 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0 > > > As Bintray is being shut down, we have set up a new repository service at > repos.spark-packages.org. We need to replace Bintray with the new service for > the spark-packages resolver in SparkSubmit.
[jira] [Resolved] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35227. - Resolution: Fixed > Replace Bintray with the new repository service for the spark-packages > resolver in SparkSubmit > -- > > Key: SPARK-35227 > URL: https://issues.apache.org/jira/browse/SPARK-35227 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0 >Reporter: Bo Zhang >Priority: Major > Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0 > > > As Bintray is being shut down, we have set up a new repository service at > repos.spark-packages.org. We need to replace Bintray with the new service for > the spark-packages resolver in SparkSubmit.
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implementation Class of DataSourceV2: sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Summary: PartitionReaderFactory's Implementation Class of DataSourceV2: sqlConf parameter is null (was: PartitionReaderFactory's Implementation Class of DataSourceV2 sqlConf parameter is null) > PartitionReaderFactory's Implementation Class of DataSourceV2: sqlConf > parameter is null > -- > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: lynn >Priority: Major > Attachments: spark-sqlconf-isnull.png > > > The code of "MyPartitionReaderFactory": > {code:scala} > // Implementation Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > import org.apache.spark.sql.internal.SQLConf.buildConf > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) > val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) > override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar reader.") > MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = enableVectorized > } > object MyPartitionReaderFactory { > val MY_VECTORIZED_READER_ENABLED = > buildConf("spark.sql.my.enableVectorizedReader") > .doc("Enables vectorized my source scan.") > .version("1.0.0") > .booleanConf > .createWithDefault(false) > val MY_VECTORIZED_READER_BATCH_SIZE = > buildConf("spark.sql.my.columnarReaderBatchSize") > .doc("The number of rows to include in a my source vectorized reader batch. The number should " + > "be carefully chosen to minimize overhead and avoid OOMs in reading data.") > .version("1.0.0") > .intConf > .createWithDefault(4096) > } > {code} > The driver constructs an RDD instance (DataSourceRDD), and the sqlConf parameter > passed to MyPartitionReaderFactory is not null. > But when the executor deserializes the RDD, the sqlConf parameter is null. > The code is as follows: > {code:scala} > // RunTask.scala > override def runTask(context: TaskContext): U = { > // Deserialize the RDD and the func using the broadcast variables. > val threadMXBean = ManagementFactory.getThreadMXBean > val deserializeStartTimeNs = System.nanoTime() > val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime > } else 0L > val ser = SparkEnv.get.closureSerializer.newInstance() > // the rdd > val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)]( > ByteBuffer.wrap(taskBinary.value), > Thread.currentThread.getContextClassLoader) > _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs > _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime > } else 0L > func(context, rdd.iterator(partition, context)) > } > {code}
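A common workaround for this class of problem (sketched here under the assumption that only a few settings are needed; this is not the reporter's actual fix): resolve the SQLConf values on the driver and store them in the factory as plain serializable fields, so nothing depends on SQLConf surviving deserialization on the executor:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.types.StructType

// The Boolean/Int fields are computed from SQLConf on the driver at
// planning time; only primitives travel to the executors.
case class MyPartitionReaderFactory(
    enableVectorized: Boolean,
    batchSize: Int,
    dataSchema: StructType,
    readSchema: StructType) extends PartitionReaderFactory {

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    MyRowReader(batchSize, dataSchema, readSchema) // reader class from the report

  override def supportColumnarReads(partition: InputPartition): Boolean = enableVectorized
}
{code}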
[jira] [Commented] (SPARK-34705) Add code-gen for all join types of sort merge join
[ https://issues.apache.org/jira/browse/SPARK-34705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335022#comment-17335022 ] Cheng Su commented on SPARK-34705: -- [~advancedxy] - We saw ~10% CPU performance improvement for targeted queries. I think it makes sense to update the benchmark after the feature is merged. > Add code-gen for all join types of sort merge join > -- > > Key: SPARK-34705 > URL: https://issues.apache.org/jira/browse/SPARK-34705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Currently sort merge join only supports code-gen for the inner join type > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374] > ). We added code-gen for other join types internally in our fork and saw > a clear CPU performance improvement. Creating this Jira to propose to merge > it back to upstream.
[jira] [Commented] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334987#comment-17334987 ] Apache Spark commented on SPARK-35263: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/32389 > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... > {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here.
[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35263: Assignee: (was: Apache Spark) > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... > {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here.
[jira] [Commented] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334986#comment-17334986 ] Apache Spark commented on SPARK-35263: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/32389 > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... > {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here.
[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35263: Assignee: Apache Spark > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... > {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here.
[jira] [Updated] (SPARK-35262) Memory leak when dataset is being persisted
[ https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Amelin updated SPARK-35262: Priority: Critical (was: Major) > Memory leak when dataset is being persisted > --- > > Key: SPARK-35262 > URL: https://issues.apache.org/jira/browse/SPARK-35262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Igor Amelin >Priority: Critical > > If a Java or Scala application with a SparkSession runs for a long time and > persists a lot of datasets, it can crash because of a memory leak. > I've noticed the following. When we have a dataset and persist it, the > SparkSession used to load that dataset is cloned in CacheManager, and this > clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But > this clone isn't removed from the list of listeners after that, e.g. after > unpersisting the dataset. If we persist a lot of datasets, the SparkSession > is cloned and added to `ListenerBus` many times. This leads to a memory leak > since the `listenersPlusTimers` list becomes very large. > I've found out that the SparkSession is cloned in CacheManager when the > parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and > `spark.sql.adaptive.enabled` are true. The first one is true by default, and > this default behavior leads to the problem. When auto bucketed scan is > disabled, the SparkSession isn't cloned, and there are no duplicates in > ListenerBus, so the memory leak doesn't occur. > Here is a small Java application to reproduce the memory leak: > [https://github.com/iamelin/spark-memory-leak]
[jira] [Created] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
Erik Krogen created SPARK-35263: --- Summary: Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code Key: SPARK-35263 URL: https://issues.apache.org/jira/browse/SPARK-35263 Project: Spark Issue Type: Improvement Components: Shuffle, Tests Affects Versions: 3.1.1 Reporter: Erik Krogen {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like: {code} val iterator = new ShuffleBlockFetcherIterator( taskContext, transfer, blockManager, blocksByAddress, (_, in) => in, 48 * 1024 * 1024, Int.MaxValue, Int.MaxValue, Int.MaxValue, true, false, metrics, false) {code} It's challenging to tell what the interesting parts are vs. what is just being set to some default/unused value. Similarly but not as bad, there are 10 calls like: {code} verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), any()) {code} and 7 like {code} when(transfer.fetchBlocks(any(), any(), any(), any(), any(), any())).thenAnswer ... {code} This can result in about 10% reduction in both lines and characters in the file: {code} # Before > wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala 1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala # After > wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala 928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala {code} It also helps readability: {code} val iterator = createShuffleBlockIteratorWithDefaults( transfer, blocksByAddress, maxBytesInFlight = 1000L ) {code} Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're interested in here.
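The helper shown at the end relies on Scala default arguments: every uninteresting constructor parameter gets a default, and each test overrides only what it exercises. A sketch of its shape, abbreviated from the 13-argument constructor quoted above (defaults are illustrative; taskContext, blockManager, and metrics are assumed to be suite-level fixtures):

{code:scala}
private def createShuffleBlockIteratorWithDefaults(
    transfer: BlockTransferService,
    blocksByAddress: Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])],
    maxBytesInFlight: Long = Long.MaxValue,
    maxReqsInFlight: Int = Int.MaxValue,
    maxBlocksInFlightPerAddress: Int = Int.MaxValue,
    detectCorrupt: Boolean = true): ShuffleBlockFetcherIterator = {
  new ShuffleBlockFetcherIterator(
    taskContext, transfer, blockManager, blocksByAddress,
    (_, in) => in, maxBytesInFlight, maxReqsInFlight,
    maxBlocksInFlightPerAddress, Int.MaxValue, detectCorrupt,
    false, metrics, false)
}
{code}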
[jira] [Created] (SPARK-35262) Memory leak when dataset is being persisted
Igor Amelin created SPARK-35262: --- Summary: Memory leak when dataset is being persisted Key: SPARK-35262 URL: https://issues.apache.org/jira/browse/SPARK-35262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Igor Amelin If a Java or Scala application with a SparkSession runs for a long time and persists a lot of datasets, it can crash because of a memory leak. I've noticed the following. When we have a dataset and persist it, the SparkSession used to load that dataset is cloned in CacheManager, and this clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But this clone isn't removed from the list of listeners after that, e.g. after unpersisting the dataset. If we persist a lot of datasets, the SparkSession is cloned and added to `ListenerBus` many times. This leads to a memory leak since the `listenersPlusTimers` list becomes very large. I've found out that the SparkSession is cloned in CacheManager when the parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and `spark.sql.adaptive.enabled` are true. The first one is true by default, and this default behavior leads to the problem. When auto bucketed scan is disabled, the SparkSession isn't cloned, and there are no duplicates in ListenerBus, so the memory leak doesn't occur. Here is a small Java application to reproduce the memory leak: [https://github.com/iamelin/spark-memory-leak]
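Until the leak is fixed, the analysis above suggests a mitigation: turn off auto bucketed scan so CacheManager stops cloning the session. A sketch of a long-running job using the config named in the report (whether this is acceptable depends on whether the job benefits from bucketed scans):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-heavy-job")
  // Per the report, cloning happens only when this is true (the default)
  // together with spark.sql.adaptive.enabled.
  .config("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
  .getOrCreate()

// The persist/unpersist loop that previously accumulated cloned-session
// listeners in ListenerBus.
for (_ <- 1 to 10000) {
  val ds = spark.range(0, 1000).toDF("id")
  ds.persist()
  ds.count()
  ds.unpersist()
}
{code}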
[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name
[ https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-35259: Description: Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: {code} // Time latency for open block request in ms private final Timer openBlockRequestLatencyMillis = new Timer(); // Time latency for executor registration latency in ms private final Timer registerExecutorRequestLatencyMillis = new Timer(); // Time latency for processing finalize shuffle merge request latency in ms private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); {code} However these Dropwizard Timers by default use nanoseconds ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). It's certainly possible to extract milliseconds from them, but it seems misleading to have millis in the name here. {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named metrics since it doesn't export any timing information from these metrics (which I am trying to address in SPARK-35258), but these names still result in kind of misleading metric names like {{finalizeShuffleMergeLatencyMillis_count}} -- a count doesn't have a unit. It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name accordingly. was: Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: {code} // Time latency for open block request in ms private final Timer openBlockRequestLatencyMillis = new Timer(); // Time latency for executor registration latency in ms private final Timer registerExecutorRequestLatencyMillis = new Timer(); // Time latency for processing finalize shuffle merge request latency in ms private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); {code} However these Dropwizard Timers by default use nanoseconds ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). It's certainly possible to extract milliseconds from them, but it seems misleading to have millis in the name here. {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named metrics since it doesn't export any timing information from these metrics (which I am trying to address in SPARK-35258), but these names still result in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count doesn't have a unit. It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name accordingly. > ExternalBlockHandler metrics have misleading unit in the name > - > > Key: SPARK-35259 > URL: https://issues.apache.org/jira/browse/SPARK-35259 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: > {code} > // Time latency for open block request in ms > private final Timer openBlockRequestLatencyMillis = new Timer(); > // Time latency for executor registration latency in ms > private final Timer registerExecutorRequestLatencyMillis = new Timer(); > // Time latency for processing finalize shuffle merge request latency in > ms > private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); > {code} > However these Dropwizard Timers by default use nanoseconds > ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). 
> It's certainly possible to extract milliseconds from them, but it seems > misleading to have millis in the name here. > {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named > metrics since it doesn't export any timing information from these metrics > (which I am trying to address in SPARK-35258), but these names still result > in kind of misleading metric names like > {{finalizeShuffleMergeLatencyMillis_count}} -- a count doesn't have a unit. > It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, > to decide the unit and adjust the name accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
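To illustrate the proposed division of responsibility, a hypothetical exporter (not the actual {{YarnShuffleServiceMetrics}} code) can convert the Timer's nanosecond snapshot into whatever unit it advertises in the exported name:
{code}
import java.util.concurrent.TimeUnit
import com.codahale.metrics.Timer

// Sketch only: the Timer itself stays unit-agnostic (Snapshot values are
// nanoseconds by default); the exporter chooses the unit and names the
// exported metric accordingly.
def exportTimer(name: String, timer: Timer): Unit = {
  val snapshot = timer.getSnapshot
  val p99Millis = TimeUnit.NANOSECONDS.toMillis(snapshot.get99thPercentile().toLong)
  println(s"${name}_p99Millis = $p99Millis")
  // A count is unitless, so its name carries no unit suffix.
  println(s"${name}_count = ${timer.getCount}")
}
{code}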
[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name
[ https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-35259: Description: Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: {code} // Time latency for open block request in ms private final Timer openBlockRequestLatencyMillis = new Timer(); // Time latency for executor registration latency in ms private final Timer registerExecutorRequestLatencyMillis = new Timer(); // Time latency for processing finalize shuffle merge request latency in ms private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); {code} However these Dropwizard Timers by default use nanoseconds ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). It's certainly possible to extract milliseconds from them, but it seems misleading to have millis in the name here. {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named metrics since it doesn't export any timing information from these metrics (which I am trying to address in SPARK-35258), but these names still result in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count doesn't have a unit. It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name accordingly. was: Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: {code} // Time latency for open block request in ms private final Timer openBlockRequestLatencyMillis = new Timer(); // Time latency for executor registration latency in ms private final Timer registerExecutorRequestLatencyMillis = new Timer(); // Time latency for processing finalize shuffle merge request latency in ms private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); {code} However these Dropwizard Timers by default use nanoseconds ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). It's certainly possible to extract milliseconds from them, but it seems misleading to have millis in the name here. {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics since it doesn't export any timing information from these metrics (which I am trying to address in SPARK-35258), but these names still result in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count doesn't have a unit. It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name accordingly. > ExternalBlockHandler metrics have misleading unit in the name > - > > Key: SPARK-35259 > URL: https://issues.apache.org/jira/browse/SPARK-35259 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: > {code} > // Time latency for open block request in ms > private final Timer openBlockRequestLatencyMillis = new Timer(); > // Time latency for executor registration latency in ms > private final Timer registerExecutorRequestLatencyMillis = new Timer(); > // Time latency for processing finalize shuffle merge request latency in > ms > private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); > {code} > However these Dropwizard Timers by default use nanoseconds > ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). > It's certainly possible to extract milliseconds from them, but it seems > misleading to have millis in the name here. 
> {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named > metrics since it doesn't export any timing information from these metrics > (which I am trying to address in SPARK-35258), but these names still result > in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} > -- a count doesn't have a unit. It should be up to the metrics exporter, like > {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name > accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name
[ https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334932#comment-17334932 ] Erik Krogen commented on SPARK-35259: - I have a PR for this but it is based on the PR for SPARK-35258 so I will hold off posting it for now. While that goes through -- [~rxin] or [~jlaskowski] -- I see you participated in the discussions on SPARK-16405 when these were added, do you have any comment here? Maybe I am missing something? > ExternalBlockHandler metrics have misleading unit in the name > - > > Key: SPARK-35259 > URL: https://issues.apache.org/jira/browse/SPARK-35259 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: > {code} > // Time latency for open block request in ms > private final Timer openBlockRequestLatencyMillis = new Timer(); > // Time latency for executor registration latency in ms > private final Timer registerExecutorRequestLatencyMillis = new Timer(); > // Time latency for processing finalize shuffle merge request latency in > ms > private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); > {code} > However these Dropwizard Timers by default use nanoseconds > ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). > It's certainly possible to extract milliseconds from them, but it seems > misleading to have millis in the name here. > {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics > since it doesn't export any timing information from these metrics (which I am > trying to address in SPARK-35258), but these names still result in kind of > misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count > doesn't have a unit. It should be up to the metrics exporter, like > {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name > accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35261) Support static invoke for stateless UDF
Chao Sun created SPARK-35261: Summary: Support static invoke for stateless UDF Key: SPARK-35261 URL: https://issues.apache.org/jira/browse/SPARK-35261 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun For UDFs that are stateless, we should allow users to define the "magic method" as a static Java method, which removes the extra cost of dynamic dispatch and gives better performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
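A sketch of what such a UDF might look like, assuming the {{ScalarFunction}} API from SPARK-27658 and its "invoke" magic-method convention; the names and signatures below are assumptions, not the final design:
{code}
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

// Assumed API shape: a bound scalar function plus a static magic method.
class IntAdd extends ScalarFunction[Int] {
  override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "int_add"
}

object IntAdd {
  // Stateless magic method. Scala object methods get static forwarders on
  // the companion class, so Spark could bind this via StaticInvoke and
  // avoid dynamic dispatch on the hot path.
  def invoke(left: Int, right: Int): Int = left + right
}
{code}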
[jira] [Assigned] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms
[ https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35258: Assignee: Apache Spark > Enhance ESS ExternalBlockHandler with additional block rate-based metrics and > histograms > > > Key: SPARK-35258 > URL: https://issues.apache.org/jira/browse/SPARK-35258 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN > Affects Versions: 3.1.1 > Reporter: Erik Krogen > Assignee: Apache Spark > Priority: Major > > Today the {{ExternalBlockHandler}} component of ESS exposes some useful > metrics, but lacks metrics for the rate of block transfers. We > have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric > to tell us the rate of _blocks_, which is especially relevant when running > the ESS on HDDs that are sensitive to random reads. Many small block > transfers can have a negative impact on performance, but won't show up as a > spike in {{blockTransferRateBytes}} since the sizes are small. > We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style > metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today > it is only exposing the count and rate, but not timing information from the > {{Snapshot}}. > These two changes can make it easier to monitor the health of the ESS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms
[ https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35258: Assignee: (was: Apache Spark) > Enhance ESS ExternalBlockHandler with additional block rate-based metrics and > histograms > > > Key: SPARK-35258 > URL: https://issues.apache.org/jira/browse/SPARK-35258 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN > Affects Versions: 3.1.1 > Reporter: Erik Krogen > Priority: Major > > Today the {{ExternalBlockHandler}} component of ESS exposes some useful > metrics, but lacks metrics for the rate of block transfers. We > have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric > to tell us the rate of _blocks_, which is especially relevant when running > the ESS on HDDs that are sensitive to random reads. Many small block > transfers can have a negative impact on performance, but won't show up as a > spike in {{blockTransferRateBytes}} since the sizes are small. > We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style > metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today > it is only exposing the count and rate, but not timing information from the > {{Snapshot}}. > These two changes can make it easier to monitor the health of the ESS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms
[ https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334929#comment-17334929 ] Apache Spark commented on SPARK-35258: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/32388 > Enhance ESS ExternalBlockHandler with additional block rate-based metrics and > histograms > > > Key: SPARK-35258 > URL: https://issues.apache.org/jira/browse/SPARK-35258 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN > Affects Versions: 3.1.1 > Reporter: Erik Krogen > Priority: Major > > Today the {{ExternalBlockHandler}} component of ESS exposes some useful > metrics, but lacks metrics for the rate of block transfers. We > have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric > to tell us the rate of _blocks_, which is especially relevant when running > the ESS on HDDs that are sensitive to random reads. Many small block > transfers can have a negative impact on performance, but won't show up as a > spike in {{blockTransferRateBytes}} since the sizes are small. > We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style > metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today > it is only exposing the count and rate, but not timing information from the > {{Snapshot}}. > These two changes can make it easier to monitor the health of the ESS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34981: - Parent: SPARK-35260 Issue Type: Sub-task (was: Improvement) > Implement V2 function resolution and evaluation > > > Key: SPARK-34981 > URL: https://issues.apache.org/jira/browse/SPARK-34981 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims > at implementing the function resolution (in analyzer) and evaluation by > wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35260) DataSourceV2 Function Catalog implementation
Chao Sun created SPARK-35260: Summary: DataSourceV2 Function Catalog implementation Key: SPARK-35260 URL: https://issues.apache.org/jira/browse/SPARK-35260 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun This tracks the implementation and follow-up work for the V2 Function Catalog introduced in SPARK-27658. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334925#comment-17334925 ] Apache Spark commented on SPARK-35244: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32387 > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35259) ExternalBlockHandler metrics have incorrect unit in the name
Erik Krogen created SPARK-35259: --- Summary: ExternalBlockHandler metrics have incorrect unit in the name Key: SPARK-35259 URL: https://issues.apache.org/jira/browse/SPARK-35259 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.1.1 Reporter: Erik Krogen Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: {code} // Time latency for open block request in ms private final Timer openBlockRequestLatencyMillis = new Timer(); // Time latency for executor registration latency in ms private final Timer registerExecutorRequestLatencyMillis = new Timer(); // Time latency for processing finalize shuffle merge request latency in ms private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); {code} However these Dropwizard Timers by default use nanoseconds ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). It's certainly possible to extract milliseconds from them, but it seems misleading to have millis in the name here. {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics since it doesn't export any timing information from these metrics (which I am trying to address in SPARK-35258), but these names still result in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count doesn't have a unit. It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name
[ https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-35259: Summary: ExternalBlockHandler metrics have misleading unit in the name (was: ExternalBlockHandler metrics have incorrect unit in the name) > ExternalBlockHandler metrics have misleading unit in the name > - > > Key: SPARK-35259 > URL: https://issues.apache.org/jira/browse/SPARK-35259 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics: > {code} > // Time latency for open block request in ms > private final Timer openBlockRequestLatencyMillis = new Timer(); > // Time latency for executor registration latency in ms > private final Timer registerExecutorRequestLatencyMillis = new Timer(); > // Time latency for processing finalize shuffle merge request latency in > ms > private final Timer finalizeShuffleMergeLatencyMillis = new Timer(); > {code} > However these Dropwizard Timers by default use nanoseconds > ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]). > It's certainly possible to extract milliseconds from them, but it seems > misleading to have millis in the name here. > {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics > since it doesn't export any timing information from these metrics (which I am > trying to address in SPARK-35258), but these names still result in kind of > misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count > doesn't have a unit. It should be up to the metrics exporter, like > {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name > accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms
Erik Krogen created SPARK-35258: --- Summary: Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms Key: SPARK-35258 URL: https://issues.apache.org/jira/browse/SPARK-35258 Project: Spark Issue Type: Improvement Components: Shuffle, YARN Affects Versions: 3.1.1 Reporter: Erik Krogen Today the {{ExternalBlockHandler}} component of ESS exposes some useful metrics, but lacks metrics for the rate of block transfers. We have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, which is especially relevant when running the ESS on HDDs that are sensitive to random reads. Many small block transfers can have a negative impact on performance, but won't show up as a spike in {{blockTransferRateBytes}} since the sizes are small. We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today it is only exposing the count and rate, but not timing information from the {{Snapshot}}. These two changes can make it easier to monitor the health of the ESS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
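For the first proposed change, a sketch with hypothetical names: a block-count {{Meter}} marked once per block, next to the existing byte-rate meter, makes many small transfers visible even when the byte rate stays flat:
{code}
import com.codahale.metrics.Meter

// Hypothetical metric pair inside the block handler.
val blockTransferRateBytes = new Meter() // existing: bytes per second
val blockTransferRate = new Meter()      // proposed: blocks per second

def onBlockTransferred(sizeInBytes: Long): Unit = {
  blockTransferRate.mark()                 // one block, whatever its size
  blockTransferRateBytes.mark(sizeInBytes) // its bytes
}
{code}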
[jira] [Commented] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334918#comment-17334918 ] Apache Spark commented on SPARK-34887: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32386 > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34887: Assignee: (was: Apache Spark) > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34887: Assignee: Apache Spark > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34887: Assignee: (was: Apache Spark) > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334915#comment-17334915 ] Xinrong Meng commented on SPARK-34887: -- May I work on this ticket? > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34981: --- Assignee: Chao Sun > Implement V2 function resolution and evaluation > > > Key: SPARK-34981 > URL: https://issues.apache.org/jira/browse/SPARK-34981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims > at implementing the function resolution (in analyzer) and evaluation by > wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34981. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32082 [https://github.com/apache/spark/pull/32082] > Implement V2 function resolution and evaluation > > > Key: SPARK-34981 > URL: https://issues.apache.org/jira/browse/SPARK-34981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims > at implementing the function resolution (in analyzer) and evaluation by > wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34705) Add code-gen for all join types of sort merge join
[ https://issues.apache.org/jira/browse/SPARK-34705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334830#comment-17334830 ] Xianjin YE commented on SPARK-34705: [~chengsu] could you share some numbers on the CPU performance improvement? > Add code-gen for all join types of sort merge join > -- > > Key: SPARK-34705 > URL: https://issues.apache.org/jira/browse/SPARK-34705 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Cheng Su > Priority: Minor > > Currently sort merge join only supports code-gen for the inner join type > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374]). > We added code-gen for other join types internally in our fork and saw > an obvious CPU performance improvement. Creating this Jira to propose merging this > back to upstream. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334826#comment-17334826 ] Apache Spark commented on SPARK-18188: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32385 > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement > Reporter: Davies Liu > Assignee: Davies Liu > Priority: Major > Fix For: 2.1.0 > > > There has been a long-standing issue > (https://issues.apache.org/jira/browse/SPARK-4105): without any checksum for > the blocks, it's very hard for us to identify where the bug came from. > Shuffle blocks are compressed separately (and have a checksum in them), but > broadcast blocks are compressed together; we should add a checksum for each of > them separately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
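For illustration, the per-block checksum could be as simple as the sketch below; Adler-32 is assumed here for speed, and the actual PR may choose a different algorithm or integration point:
{code}
import java.nio.ByteBuffer
import java.util.zip.Adler32

// Checksum each broadcast block when it is written, so a corrupted block
// can be detected on the fetch side instead of failing mysteriously later.
def blockChecksum(block: ByteBuffer): Int = {
  val adler = new Adler32()
  adler.update(block.duplicate()) // duplicate() leaves the caller's position intact
  adler.getValue.toInt
}
{code}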
[jira] [Commented] (SPARK-35257) Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
[ https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334798#comment-17334798 ] Apache Spark commented on SPARK-35257: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/32384 > Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up > -- > > Key: SPARK-35257 > URL: https://issues.apache.org/jira/browse/SPARK-35257 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 3.2.0 > Reporter: Yang Jie > Priority: Minor > > `HadoopVersionInfoSuite` uses a separate ivyPath to download jars; we can > use `HiveClientBuilder.ivyPath` and specify the environment variable > `SPARK_VERSIONS_SUITE_IVY_PATH`, as `VersionsSuite` does, to avoid downloading > jars repeatedly and speed up the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35257) Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
[ https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35257: Assignee: Apache Spark > Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up > -- > > Key: SPARK-35257 > URL: https://issues.apache.org/jira/browse/SPARK-35257 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 3.2.0 > Reporter: Yang Jie > Assignee: Apache Spark > Priority: Minor > > `HadoopVersionInfoSuite` uses a separate ivyPath to download jars; we can > use `HiveClientBuilder.ivyPath` and specify the environment variable > `SPARK_VERSIONS_SUITE_IVY_PATH`, as `VersionsSuite` does, to avoid downloading > jars repeatedly and speed up the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35257) Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
[ https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334797#comment-17334797 ] Apache Spark commented on SPARK-35257: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/32384 > Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up > -- > > Key: SPARK-35257 > URL: https://issues.apache.org/jira/browse/SPARK-35257 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 3.2.0 > Reporter: Yang Jie > Priority: Minor > > `HadoopVersionInfoSuite` uses a separate ivyPath to download jars; we can > use `HiveClientBuilder.ivyPath` and specify the environment variable > `SPARK_VERSIONS_SUITE_IVY_PATH`, as `VersionsSuite` does, to avoid downloading > jars repeatedly and speed up the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35257) Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
[ https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35257: Assignee: (was: Apache Spark) > Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up > -- > > Key: SPARK-35257 > URL: https://issues.apache.org/jira/browse/SPARK-35257 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 3.2.0 > Reporter: Yang Jie > Priority: Minor > > `HadoopVersionInfoSuite` uses a separate ivyPath to download jars; we can > use `HiveClientBuilder.ivyPath` and specify the environment variable > `SPARK_VERSIONS_SUITE_IVY_PATH`, as `VersionsSuite` does, to avoid downloading > jars repeatedly and speed up the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35257) Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
Yang Jie created SPARK-35257: Summary: Let `HadoopVersionInfoSuite` use SPARK_VERSIONS_SUITE_IVY_PATH to speed up Key: SPARK-35257 URL: https://issues.apache.org/jira/browse/SPARK-35257 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.2.0 Reporter: Yang Jie `HadoopVersionInfoSuite` uses a separate ivyPath to download jars; we can use `HiveClientBuilder.ivyPath` and specify the environment variable `SPARK_VERSIONS_SUITE_IVY_PATH`, as `VersionsSuite` does, to avoid downloading jars repeatedly and speed up the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
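The change amounts to resolving the ivy path the way {{VersionsSuite}} does; a sketch, with the field name assumed:
{code}
// Honor SPARK_VERSIONS_SUITE_IVY_PATH when set, so repeated test runs
// reuse one local ivy cache instead of re-downloading jars into a fresh
// temporary directory each time.
private lazy val ivyPath: Option[String] =
  sys.env.get("SPARK_VERSIONS_SUITE_IVY_PATH")
{code}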
[jira] [Updated] (SPARK-35256) str_to_map + split performance regression
[ https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ondrej Kokes updated SPARK-35256: - Description: I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline that does mostly str_to_map, split and a few other operations - all projections, no joins or aggregations (it's here only to trigger the pipeline). I cut it down to the simplest reproducible example I could - anything I remove from this changes the runtime difference quite dramatically. (even moving those two expressions from f.when to standalone columns makes the difference disappear) {code:java} import time import os import pyspark from pyspark.sql import SparkSession import pyspark.sql.functions as f if __name__ == '__main__': print(pyspark.__version__) spark = SparkSession.builder.getOrCreate() filename = 'regression.csv' if not os.path.isfile(filename): with open(filename, 'wt') as fw: fw.write('foo\n') for _ in range(10_000_000): fw.write('foo=bar=bak=f,o,1:2:3\n') df = spark.read.option('header', True).csv(filename) t = time.time() dd = (df .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) .withColumn('extracted', # without this top level split it is only 50% slower, with it # the runtime almost doubles f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0] ) .select( f.when( f.col("extracted").startswith("foo"), f.col("extracted") ).otherwise( f.concat(f.lit("foo"), f.col("extracted")) ).alias("foo") ) ) # dd.explain(True) _ = dd.groupby("foo").count().count() print("elapsed", time.time() - t) {code} Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my local macOS) {code:java} 3.0.1 elapsed 21.262351036071777 3.1.1 elapsed 40.26582884788513 {code} (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1) Feel free to make the CSV smaller to get a quicker feedback loop - it scales linearly (I developed this with 2M rows). It might be related to my previous issue - SPARK-32989 - there are similar operations, nesting etc. (splitting on the original column, not on a map, makes the difference disappear) I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 3.1.1 produced identical plans. was: I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline that does mostly str_to_map, split and a few other operations - all projections, no joins or aggregations (it's here only to trigger the pipeline). I cut it down to the simplest reproducible example I could - anything I remove from this changes the runtime difference quite dramatically. 
(even moving those two expressions from f.when to standalone columns makes the difference disappear) {code:java} import time import os import pyspark from pyspark.sql import SparkSession import pyspark.sql.functions as f if __name__ == '__main__': print(pyspark.__version__) spark = SparkSession.builder.getOrCreate() filename = 'regression.csv' if not os.path.isfile(filename): with open(filename, 'wt') as fw: fw.write('foo\n') for _ in range(10_000_000): fw.write('foo=bar=bak=f,o,1:2:3\n') df = spark.read.option('header', True).csv(filename) t = time.time() dd = (df .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) .withColumn('extracted', # without this top level split (so just `f.split(f.col("my_map")["bar"], ",")[2]`), it's only 50% slower, with it it's 100% f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0] ) .select( f.when( f.col("extracted").startswith("foo"), f.col("extracted") ).otherwise( f.concat(f.lit("foo"), f.col("extracted")) ).alias("foo") ) ) # dd.explain(True) _ = dd.groupby("foo").count().count() print("elapsed", time.time() - t) {code} Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my local macOS) {code:java} 3.0.1 elapsed 21.262351036071777 3.1.1 elapsed 40.26582884788513 {code} (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1) Feel free to make the CSV smaller to get a quicker feedback loop - it scales linearly (I developed this with 2M rows). It might be related to my previous issue - SPARK-32989 - there are similar operations, nesting etc. (splitting on the original column, not on a map, makes the difference disappear) I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 3.1.1 produced identical plans. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35256) str_to_map + split performance regression
Ondrej Kokes created SPARK-35256: Summary: str_to_map + split performance regression Key: SPARK-35256 URL: https://issues.apache.org/jira/browse/SPARK-35256 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Ondrej Kokes I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline that does mostly str_to_map, split and a few other operations - all projections, no joins or aggregations (it's here only to trigger the pipeline). I cut it down to the simplest reproducible example I could - anything I remove from this changes the runtime difference quite dramatically. (even moving those two expressions from f.when to standalone columns makes the difference disappear) {code:java} import time import os import pyspark from pyspark.sql import SparkSession import pyspark.sql.functions as f if __name__ == '__main__': print(pyspark.__version__) spark = SparkSession.builder.getOrCreate() filename = 'regression.csv' if not os.path.isfile(filename): with open(filename, 'wt') as fw: fw.write('foo\n') for _ in range(10_000_000): fw.write('foo=bar=bak=f,o,1:2:3\n') df = spark.read.option('header', True).csv(filename) t = time.time() dd = (df .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) .withColumn('extracted', # without this top level split (so just `f.split(f.col("my_map")["bar"], ",")[2]`), it's only 50% slower, with it it's 100% f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0] ) .select( f.when( f.col("extracted").startswith("foo"), f.col("extracted") ).otherwise( f.concat(f.lit("foo"), f.col("extracted")) ).alias("foo") ) ) # dd.explain(True) _ = dd.groupby("foo").count().count() print("elapsed", time.time() - t) {code} Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my local macOS) {code:java} 3.0.1 elapsed 21.262351036071777 3.1.1 elapsed 40.26582884788513 {code} (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1) Feel free to make the CSV smaller to get a quicker feedback loop - it scales linearly (I developed this with 2M rows). It might be related to my previous issue - SPARK-32989 - there are similar operations, nesting etc. (splitting on the original column, not on a map, makes the difference disappear) I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 3.1.1 produced identical plans. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.
[ https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334738#comment-17334738 ] Apache Spark commented on SPARK-35255: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/32383 > Automated formatting for Scala Code for Blank Lines. > > > Key: SPARK-35255 > URL: https://issues.apache.org/jira/browse/SPARK-35255 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.2.0 > Reporter: Zhu, Lipeng > Priority: Major > > Based on the Databricks Scala style guide's rules for blank lines: > [https://github.com/databricks/scala-style-guide#blanklines] > Add a configuration that controls whether to enforce a blank line before and/or > after a top-level statement spanning a certain number of lines. > Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.
[ https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35255: Assignee: (was: Apache Spark) > Automated formatting for Scala Code for Blank Lines. > > > Key: SPARK-35255 > URL: https://issues.apache.org/jira/browse/SPARK-35255 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.2.0 > Reporter: Zhu, Lipeng > Priority: Major > > Based on the Databricks Scala style guide's rules for blank lines: > [https://github.com/databricks/scala-style-guide#blanklines] > Add a configuration that controls whether to enforce a blank line before and/or > after a top-level statement spanning a certain number of lines. > Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.
[ https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35255: Assignee: Apache Spark > Automated formatting for Scala Code for Blank Lines. > > > Key: SPARK-35255 > URL: https://issues.apache.org/jira/browse/SPARK-35255 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.2.0 > Reporter: Zhu, Lipeng > Assignee: Apache Spark > Priority: Major > > Based on the Databricks Scala style guide's rules for blank lines: > [https://github.com/databricks/scala-style-guide#blanklines] > Add a configuration that controls whether to enforce a blank line before and/or > after a top-level statement spanning a certain number of lines. > Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.
[ https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334736#comment-17334736 ] Apache Spark commented on SPARK-35255: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/32383 > Automated formatting for Scala Code for Blank Lines. > > > Key: SPARK-35255 > URL: https://issues.apache.org/jira/browse/SPARK-35255 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.2.0 > Reporter: Zhu, Lipeng > Priority: Major > > Based on the Databricks Scala style guide's rules for blank lines: > [https://github.com/databricks/scala-style-guide#blanklines] > Add a configuration that controls whether to enforce a blank line before and/or > after a top-level statement spanning a certain number of lines. > Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.
Zhu, Lipeng created SPARK-35255: --- Summary: Automated formatting for Scala Code for Blank Lines. Key: SPARK-35255 URL: https://issues.apache.org/jira/browse/SPARK-35255 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Zhu, Lipeng Based on the Databricks Scala style guide's rules for blank lines: [https://github.com/databricks/scala-style-guide#blanklines] Add a configuration that controls whether to enforce a blank line before and/or after a top-level statement spanning a certain number of lines. Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35254) Upgrade SBT to 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35254: Assignee: (was: Apache Spark) > Upgrade SBT to 1.5.1 > > > Key: SPARK-35254 > URL: https://issues.apache.org/jira/browse/SPARK-35254 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Zhu, Lipeng >Priority: Major > > https://github.com/sbt/sbt/releases/tag/v1.5.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35254) Upgrade SBT to 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334687#comment-17334687 ] Apache Spark commented on SPARK-35254: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/32382 > Upgrade SBT to 1.5.1 > > > Key: SPARK-35254 > URL: https://issues.apache.org/jira/browse/SPARK-35254 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Zhu, Lipeng >Priority: Major > > https://github.com/sbt/sbt/releases/tag/v1.5.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35254) Upgrade SBT to 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35254: Assignee: Apache Spark > Upgrade SBT to 1.5.1 > > > Key: SPARK-35254 > URL: https://issues.apache.org/jira/browse/SPARK-35254 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Zhu, Lipeng >Assignee: Apache Spark >Priority: Major > > https://github.com/sbt/sbt/releases/tag/v1.5.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35254) Upgrade SBT to 1.5.1
Zhu, Lipeng created SPARK-35254: --- Summary: Upgrade SBT to 1.5.1 Key: SPARK-35254 URL: https://issues.apache.org/jira/browse/SPARK-35254 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Zhu, Lipeng https://github.com/sbt/sbt/releases/tag/v1.5.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35159) extract doc of hive format
[ https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-35159: - Fix Version/s: 3.1.2 3.0.3 > extract doc of hive format > -- > > Key: SPARK-35159 > URL: https://issues.apache.org/jira/browse/SPARK-35159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > extract doc of hive format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline
[ https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334633#comment-17334633 ] Apache Spark commented on SPARK-35229: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32381 > Spark Job web page is extremely slow while there are more than 1500 events in > timeline > -- > > Key: SPARK-35229 > URL: https://issues.apache.org/jira/browse/SPARK-35229 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.1 > Reporter: Wang Yuan > Priority: Blocker > > In a Spark Streaming application with 1000+ executors, more than 2000 > events (executor events, job events) are generated, and the jobs/job web pages > of Spark fail to render. The browser (Chrome, Firefox, Safari) freezes. > I had to open another window to reach the stages and executors pages by their > addresses manually. > The jobs page is the home page, so this is rather annoying. The problem is > that the vis-timeline rendering is too slow. > There are some suggestions: > 1) the page should not render the timeline while the page is loading, unless > the user clicks the link. > 2) the executor group and job group should be separated, and the user can > choose to show one, both, or neither. > 3) the executor group should display executor events by time horizontally. > Currently executors are displayed one line per executor; with more than 100 > executors, the page is not that good. > 4) the vis-timeline library has not been maintained since 2017. It should be > replaced with a new one like [https://github.com/visjs/vis-timeline] > 5) it would also be good to fetch only recent events, e.g. 500, and load more > if the user wants to see more; alternatively, the data could all be loaded at > once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline
[ https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334632#comment-17334632 ] Apache Spark commented on SPARK-35229: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32381 > Spark Job web page is extremely slow while there are more than 1500 events in > timeline > -- > > Key: SPARK-35229 > URL: https://issues.apache.org/jira/browse/SPARK-35229 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.1 > Reporter: Wang Yuan > Priority: Blocker > > In a Spark Streaming application with 1000+ executors, more than 2000 > events (executor events, job events) are generated, and the jobs/job web pages > of Spark fail to render. The browser (Chrome, Firefox, Safari) freezes. > I had to open another window to reach the stages and executors pages by their > addresses manually. > The jobs page is the home page, so this is rather annoying. The problem is > that the vis-timeline rendering is too slow. > There are some suggestions: > 1) the page should not render the timeline while the page is loading, unless > the user clicks the link. > 2) the executor group and job group should be separated, and the user can > choose to show one, both, or neither. > 3) the executor group should display executor events by time horizontally. > Currently executors are displayed one line per executor; with more than 100 > executors, the page is not that good. > 4) the vis-timeline library has not been maintained since 2017. It should be > replaced with a new one like [https://github.com/visjs/vis-timeline] > 5) it would also be good to fetch only recent events, e.g. 500, and load more > if the user wants to see more; alternatively, the data could all be loaded at > once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline
[ https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35229: Assignee: Apache Spark > Spark Job web page is extremely slow while there are more than 1500 events in > timeline > -- > > Key: SPARK-35229 > URL: https://issues.apache.org/jira/browse/SPARK-35229 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Wang Yuan >Assignee: Apache Spark >Priority: Blocker > > In a Spark Streaming application with 1000+ executors, more than 2000 > events (executor events, job events) are generated, and the jobs/job web page > of Spark then fails to render. The browser (Chrome, Firefox, Safari) freezes. > I had to open another window and reach the stages and executors pages by their > addresses manually. > The jobs page is the home page, so this is rather annoying. The problem is > that the vis-timeline rendering is too slow. > Some suggestions: > 1) the page should not render the timeline while it is loading, unless the user > clicks the link. > 2) the executor group and job group should be separated, and the user can choose > to show one, both, or neither. > 3) the executor group should display executor events by time horizontally. > Currently executors are displayed one line per executor; with more than 100 > executors, the page does not hold up well. > 4) the vis-timeline library has not been maintained since 2017 and should be > replaced with a newer one like [https://github.com/visjs/vis-timeline] > 5) it would also be good to fetch only recent events, e.g. 500, and load more if > the user wants to see more, even though the data could all be loaded at once. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline
[ https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35229: Assignee: (was: Apache Spark) > Spark Job web page is extremely slow while there are more than 1500 events in > timeline > -- > > Key: SPARK-35229 > URL: https://issues.apache.org/jira/browse/SPARK-35229 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Wang Yuan >Priority: Blocker > > In a Spark Streaming application with 1000+ executors, more than 2000 > events (executor events, job events) are generated, and the jobs/job web page > of Spark then fails to render. The browser (Chrome, Firefox, Safari) freezes. > I had to open another window and reach the stages and executors pages by their > addresses manually. > The jobs page is the home page, so this is rather annoying. The problem is > that the vis-timeline rendering is too slow. > Some suggestions: > 1) the page should not render the timeline while it is loading, unless the user > clicks the link. > 2) the executor group and job group should be separated, and the user can choose > to show one, both, or neither. > 3) the executor group should display executor events by time horizontally. > Currently executors are displayed one line per executor; with more than 100 > executors, the page does not hold up well. > 4) the vis-timeline library has not been maintained since 2017 and should be > replaced with a newer one like [https://github.com/visjs/vis-timeline] > 5) it would also be good to fetch only recent events, e.g. 500, and load more if > the user wants to see more, even though the data could all be loaded at once. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34781) Eliminate LEFT SEMI/ANTI join to its left child side with AQE
[ https://issues.apache.org/jira/browse/SPARK-34781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334631#comment-17334631 ] Apache Spark commented on SPARK-34781: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/32380 > Eliminate LEFT SEMI/ANTI join to its left child side with AQE > - > > Key: SPARK-34781 > URL: https://issues.apache.org/jira/browse/SPARK-34781 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.2.0 > > > `EliminateJoinToEmptyRelation.scala` can be extended to cover more cases > for LEFT SEMI and LEFT ANTI joins: > # If the join is a left semi join, the right side is non-empty, and the condition is > empty, eliminate the join in favor of its left side. > # If the join is a left anti join and the right side is empty, eliminate the join in > favor of its left side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
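As a rough illustration of the two elimination cases described in SPARK-34781 above, here is a minimal Scala sketch; the `Plan`/`SimpleJoin` types are simplified stand-ins, not Spark's actual `LogicalPlan` or AQE classes:
{code:java}
// Simplified stand-ins for Spark's logical plan classes (illustrative only).
sealed trait Plan
case class Relation(name: String, rowCount: Long) extends Plan
case class SimpleJoin(left: Plan, right: Plan, joinType: String,
                      condition: Option[String]) extends Plan

def eliminate(plan: Plan): Plan = plan match {
  // Case 1: LEFT SEMI join, non-empty right side, empty condition
  //         => every left row qualifies, so keep only the left child.
  case SimpleJoin(left, Relation(_, n), "LeftSemi", None) if n > 0 => left
  // Case 2: LEFT ANTI join, empty right side
  //         => nothing can ever match, so keep only the left child.
  case SimpleJoin(left, Relation(_, 0L), "LeftAnti", _) => left
  case other => other
}
{code}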
[jira] [Commented] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
[ https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334577#comment-17334577 ] Nick Hryhoriev commented on SPARK-11844: [~Xu_Guang_Lv] My issue is also always reproducible because the file itself is damaged: it fails with any Parquet reader, not only Spark. This means the data was written damaged. In my case, I write with Spark, and Spark writes corrupted files silently, without any exception or job failure. Because the data itself is damaged, there is no workaround or patch on the read path; you need to rewrite the corrupted file. > can not read class org.apache.parquet.format.PageHeader: don't know what > type: 13 > - > > Key: SPARK-11844 > URL: https://issues.apache.org/jira/browse/SPARK-11844 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Minor > Labels: bulk-closed > > I got the following error once when I was running a query > {code} > java.io.IOException: can not read class org.apache.parquet.format.PageHeader: > don't know what type: 13 > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know > what type: 13 > at > parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) > at > org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) > at > parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108) > {code} > The next retry was good. Right now, seems not critical. But, let's still
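Following up on the comment above: since the corruption lives in the written file, one way to locate damaged files is to probe each Parquet part-file individually. A rough sketch, assuming an active `SparkSession` named `spark`; the helper name and directory layout are hypothetical:
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Hypothetical helper: read every Parquet part-file under `dir` one by one
// and return the paths that cannot be fully scanned.
def findCorruptedParquet(dir: String): Seq[String] = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(dir))
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
    .filter { file =>
      // count() forces a full scan, so a damaged page header fails here
      Try(spark.read.parquet(file).count()).isFailure
    }
    .toSeq
}
{code}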
[jira] [Commented] (SPARK-35159) extract doc of hive format
[ https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334569#comment-17334569 ] Apache Spark commented on SPARK-35159: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32379 > extract doc of hive format > -- > > Key: SPARK-35159 > URL: https://issues.apache.org/jira/browse/SPARK-35159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > extract doc of hive format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
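For context on what "hive format" refers to in this doc sub-task: it is the Hive-style row-format clause accepted by CREATE TABLE (and by TRANSFORM). A small illustrative example, assuming a Spark session with Hive support; the table name is made up:
{code:java}
// Hive-style row format syntax of the kind this doc sub-task covers.
// Requires Hive support; `hive_fmt_demo` is an illustrative name.
spark.sql("""
  CREATE TABLE hive_fmt_demo (key STRING, value STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
""")
{code}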
[jira] [Commented] (SPARK-35159) extract doc of hive format
[ https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334567#comment-17334567 ] Apache Spark commented on SPARK-35159: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32378 > extract doc of hive format > -- > > Key: SPARK-35159 > URL: https://issues.apache.org/jira/browse/SPARK-35159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > extract doc of hive format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35159) extract doc of hive format
[ https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334566#comment-17334566 ] Apache Spark commented on SPARK-35159: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32378 > extract doc of hive format > -- > > Key: SPARK-35159 > URL: https://issues.apache.org/jira/browse/SPARK-35159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > extract doc of hive format -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35021) Group exception messages in connector/catalog
[ https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35021: Assignee: Apache Spark > Group exception messages in connector/catalog > - > > Key: SPARK-35021 > URL: https://issues.apache.org/jira/browse/SPARK-35021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog' > || Filename || Count || > | CatalogManager.scala | 2 | > | CatalogV2Implicits.scala | 3 | > | CatalogV2Util.scala | 8 | > | LookupCatalog.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35021) Group exception messages in connector/catalog
[ https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35021: Assignee: (was: Apache Spark) > Group exception messages in connector/catalog > - > > Key: SPARK-35021 > URL: https://issues.apache.org/jira/browse/SPARK-35021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog' > || Filename || Count || > | CatalogManager.scala | 2 | > | CatalogV2Implicits.scala | 3 | > | CatalogV2Util.scala | 8 | > | LookupCatalog.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35021) Group exception messages in connector/catalog
[ https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334551#comment-17334551 ] Apache Spark commented on SPARK-35021: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/32377 > Group exception messages in connector/catalog > - > > Key: SPARK-35021 > URL: https://issues.apache.org/jira/browse/SPARK-35021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog' > || Filename || Count || > | CatalogManager.scala | 2 | > | CatalogV2Implicits.scala | 3 | > | CatalogV2Util.scala | 8 | > | LookupCatalog.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
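For context, this umbrella effort replaces ad-hoc `throw new AnalysisException(...)` call sites with centralized error methods. A hedged sketch of the pattern; the object and method names below are illustrative, modeled on Spark's error-object style, not the exact code:
{code:java}
// Illustrative "grouped error messages" pattern: one object owns the
// consistently worded errors, and call sites just throw its results.
class AnalysisException(message: String) extends Exception(message)

object CatalogErrors {
  def catalogNotFoundError(name: String): AnalysisException =
    new AnalysisException(s"Catalog '$name' was not found")

  def namespaceNotEmptyError(ns: Seq[String]): AnalysisException =
    new AnalysisException(
      s"Cannot drop namespace ${ns.mkString(".")} because it is not empty")
}

// A call site then becomes:
// throw CatalogErrors.catalogNotFoundError("my_catalog")
{code}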
[jira] [Issue Comment Deleted] (SPARK-35021) Group exception messages in connector/catalog
[ https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-35021: --- Comment: was deleted (was: I'm working on.) > Group exception messages in connector/catalog > - > > Key: SPARK-35021 > URL: https://issues.apache.org/jira/browse/SPARK-35021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog' > || Filename || Count || > | CatalogManager.scala | 2 | > | CatalogV2Implicits.scala | 3 | > | CatalogV2Util.scala | 8 | > | LookupCatalog.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35214) OptimizeSkewedJoin support ShuffledHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-35214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35214. -- Fix Version/s: 3.2.0 Assignee: ulysses you Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32328 > OptimizeSkewedJoin support ShuffledHashJoinExec > --- > > Key: SPARK-35214 > URL: https://issues.apache.org/jira/browse/SPARK-35214 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Major > Fix For: 3.2.0 > > > Currently, all join types can already be selected through hints, which makes > it easy to choose the join implementation. > We would choose `ShuffledHashJoin` if one table is not big but is over the > broadcast threshold. It would be better if `OptimizeSkewedJoin` could also > optimize this case. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
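The shuffled-hash path referred to above can already be requested through a join hint; a short example with AQE skew handling enabled. The `orders`/`users` DataFrames and the `user_id` key are assumptions for illustration:
{code:java}
// Enable AQE and its skew-join optimization (real Spark 3.x config keys).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// SHUFFLE_HASH is a real Spark 3.x hint selecting ShuffledHashJoinExec;
// the DataFrames and join key here are illustrative.
val joined = orders
  .hint("SHUFFLE_HASH")
  .join(users, Seq("user_id"))
{code}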
[jira] [Updated] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,
[ https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33976: - Fix Version/s: 3.1.2 3.0.3 > Add a dedicated SQL document page for the TRANSFORM-related functionality, > -- > > Key: SPARK-33976 > URL: https://issues.apache.org/jira/browse/SPARK-33976 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Add doc about transform > https://github.com/apache/spark/pull/30973#issuecomment-753715318 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
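For readers of this doc task: `TRANSFORM` pipes each row through an external command, Hive-style. A minimal example; the table `src` is assumed to exist, and `cat` simply echoes its input (support without Hive enabled varies by Spark version):
{code:java}
// Pipe (key, value) rows through the external `cat` command and read the
// output back as two string columns; `src` is an assumed table.
val out = spark.sql("""
  SELECT TRANSFORM(key, value)
  USING 'cat'
  AS (k STRING, v STRING)
  FROM src
""")
out.show()
{code}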
[jira] [Commented] (SPARK-35021) Group exception messages in connector/catalog
[ https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334524#comment-17334524 ] jiaan.geng commented on SPARK-35021: I'm working on this. > Group exception messages in connector/catalog > - > > Key: SPARK-35021 > URL: https://issues.apache.org/jira/browse/SPARK-35021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog' > || Filename || Count || > | CatalogManager.scala | 2 | > | CatalogV2Implicits.scala | 3 | > | CatalogV2Util.scala | 8 | > | LookupCatalog.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline
[ https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334511#comment-17334511 ] Kousuke Saruta commented on SPARK-35229: [~tiehexue] Thanks for the report. I'll try to mitigate this issue by limiting the maximum number of jobs/executors shown. The maximum number of tasks is already limited by spark.ui.timeline.tasks.maximum, so I'll try a similar approach. > Spark Job web page is extremely slow while there are more than 1500 events in > timeline > -- > > Key: SPARK-35229 > URL: https://issues.apache.org/jira/browse/SPARK-35229 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Wang Yuan >Priority: Blocker > > In a Spark Streaming application with 1000+ executors, more than 2000 > events (executor events, job events) are generated, and the jobs/job web page > of Spark then fails to render. The browser (Chrome, Firefox, Safari) freezes. > I had to open another window and reach the stages and executors pages by their > addresses manually. > The jobs page is the home page, so this is rather annoying. The problem is > that the vis-timeline rendering is too slow. > Some suggestions: > 1) the page should not render the timeline while it is loading, unless the user > clicks the link. > 2) the executor group and job group should be separated, and the user can choose > to show one, both, or neither. > 3) the executor group should display executor events by time horizontally. > Currently executors are displayed one line per executor; with more than 100 > executors, the page does not hold up well. > 4) the vis-timeline library has not been maintained since 2017 and should be > replaced with a newer one like [https://github.com/visjs/vis-timeline] > 5) it would also be good to fetch only recent events, e.g. 500, and load more if > the user wants to see more, even though the data could all be loaded at once. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
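The existing task cap mentioned in the comment, together with the proposed job/executor analogues, would look like this in a `SparkConf`; note that only `spark.ui.timeline.tasks.maximum` is an established key here, while the other two names merely follow the same pattern and had not shipped at the time of this comment:
{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Established key: caps how many tasks the timeline renders (default 1000).
  .set("spark.ui.timeline.tasks.maximum", "1000")
  // Proposed analogues discussed in this ticket, not yet released here.
  .set("spark.ui.timeline.jobs.maximum", "500")
  .set("spark.ui.timeline.executors.maximum", "250")
{code}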
[jira] [Commented] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,
[ https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334491#comment-17334491 ] Apache Spark commented on SPARK-33976: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32376 > Add a dedicated SQL document page for the TRANSFORM-related functionality, > -- > > Key: SPARK-33976 > URL: https://issues.apache.org/jira/browse/SPARK-33976 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > Add doc about transform > https://github.com/apache/spark/pull/30973#issuecomment-753715318 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334488#comment-17334488 ] Apache Spark commented on SPARK-35253: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/32374 > Upgrade Janino from 3.0.x to 3.1.x > -- > > Key: SPARK-35253 > URL: https://issues.apache.org/jira/browse/SPARK-35253 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > From the [change log|http://janino-compiler.github.io/janino/changelog.html], > the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35253: Assignee: (was: Apache Spark) > Upgrade Janino from 3.0.x to 3.1.x > -- > > Key: SPARK-35253 > URL: https://issues.apache.org/jira/browse/SPARK-35253 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > From the [change log|http://janino-compiler.github.io/janino/changelog.html], > the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334486#comment-17334486 ] Apache Spark commented on SPARK-35253: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/32374 > Upgrade Janino from 3.0.x to 3.1.x > -- > > Key: SPARK-35253 > URL: https://issues.apache.org/jira/browse/SPARK-35253 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > From the [change log|http://janino-compiler.github.io/janino/changelog.html], > the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35253: Assignee: Apache Spark > Upgrade Janino from 3.0.x to 3.1.x > -- > > Key: SPARK-35253 > URL: https://issues.apache.org/jira/browse/SPARK-35253 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > From the [change log|http://janino-compiler.github.io/janino/changelog.html], > the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,
[ https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334485#comment-17334485 ] Apache Spark commented on SPARK-33976: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32375 > Add a dedicated SQL document page for the TRANSFORM-related functionality, > -- > > Key: SPARK-33976 > URL: https://issues.apache.org/jira/browse/SPARK-33976 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > Add doc about transform > https://github.com/apache/spark/pull/30973#issuecomment-753715318 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x
Yang Jie created SPARK-35253: Summary: Upgrade Janino from 3.0.x to 3.1.x Key: SPARK-35253 URL: https://issues.apache.org/jira/browse/SPARK-35253 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 3.2.0 Reporter: Yang Jie From the [change log|http://janino-compiler.github.io/janino/changelog.html], the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
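For anyone pinning the dependency while the upgrade is in flight, the Maven coordinates stay the same and only the version moves to the 3.1.x line; an sbt sketch (the patch version shown is an example, not necessarily what the PR settles on):
{code:java}
// build.sbt fragment: moving off the deprecated 3.0.x line.
// 3.1.4 is an example 3.1.x version, not a confirmed pick.
libraryDependencies ++= Seq(
  "org.codehaus.janino" % "janino" % "3.1.4",
  "org.codehaus.janino" % "commons-compiler" % "3.1.4"
)
{code}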
[jira] [Updated] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-35244: Fix Version/s: 3.1.2 3.0.3 > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
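The title of SPARK-35244 refers to a common reflection pitfall: the real failure arrives wrapped in `InvocationTargetException`. A generic Scala sketch of unwrapping it (illustrative, not Spark's actual `Invoke` expression code):
{code:java}
import java.lang.reflect.{InvocationTargetException, Method}

// Rethrow the original cause instead of the reflective wrapper so callers
// see the real error. Illustrative only.
def invokeUnwrapped(method: Method, target: AnyRef, args: AnyRef*): AnyRef =
  try {
    method.invoke(target, args: _*)
  } catch {
    case e: InvocationTargetException if e.getCause != null =>
      throw e.getCause
  }
{code}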
[jira] [Resolved] (SPARK-35085) Get columns operation should handle ANSI interval column properly
[ https://issues.apache.org/jira/browse/SPARK-35085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35085. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32345 [https://github.com/apache/spark/pull/32345] > Get columns operation should handle ANSI interval column properly > - > > Key: SPARK-35085 > URL: https://issues.apache.org/jira/browse/SPARK-35085 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > # Write tests for ANSI intervals similar to test("get columns operation > should handle interval column properly"). > # Views can contain ANSI interval columns, which should be handled properly > via SparkGetColumnsOperation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
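A view of the kind this ticket targets, with ANSI interval columns, can be created as below (Spark 3.2+ interval types; the view name is illustrative):
{code:java}
// Expose ANSI year-month and day-time interval columns through a view;
// SparkGetColumnsOperation should describe both types correctly.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW interval_view AS
  SELECT INTERVAL '1-2' YEAR TO MONTH AS ym,
         INTERVAL '3 04:05:06' DAY TO SECOND AS dt
""")
spark.table("interval_view").printSchema()
{code}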
[jira] [Assigned] (SPARK-35085) Get columns operation should handle ANSI interval column properly
[ https://issues.apache.org/jira/browse/SPARK-35085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-35085: Assignee: jiaan.geng > Get columns operation should handle ANSI interval column properly > - > > Key: SPARK-35085 > URL: https://issues.apache.org/jira/browse/SPARK-35085 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > > # Write tests for ANSI intervals similar to test("get columns operation > should handle interval column properly"). > # Views can contain ANSI interval columns, which should be handled properly > via SparkGetColumnsOperation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org