[jira] [Updated] (SPARK-35267) nullable field is set to false for integer type when using reflection to get StructType for a case class

2021-04-28 Thread Ganesh Chand (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-35267:
-
Description: 
{code:java}
// code placeholder

object Util {
  def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): StructType =
    ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
}

case class MyTable(val a: Int, val b: String)

// The following test fails because the schema returned via reflection sets
// nullable = false for the Int column "a". It passes if the expected schema
// also uses nullable = false for that field.
it must "return a Spark Schema of type StructType for a case class" in {
  val schemaFromCaseClass: StructType = Util.toStructType[MyTable]
  val expectedSchema = new StructType()
    .add(StructField("a", IntegerType, true))
    .add(StructField("b", StringType, true))
  schemaFromCaseClass.size mustBe 2
  schemaFromCaseClass mustBe expectedSchema
}




{code}

  was:
{code:java}
// code placeholder

object Util {
def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): 
StructType =
  ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
}

case class MyTable(val a: Int, val b: String)

// The following test fails because schema returned using reflection sets 
nullable=false for integer column. The test passes if you set it to false

it must "return a Spark Schema of type StructType for a case class" in {
 val schemaFromCaseClass: StructType = Util.toStructType[MyTable]
 val expectedSchema = new StructType().add(StructField("a", IntegerType, 
true)).add(StructField("b", StringType, true))
 schemaFromCaseClass.size mustBe 2
 schemaFromCaseClass mustBe expectedSchema
}




{code}


> nullable field is set to false for integer type when using reflection to get 
> StructType for a case class
> 
>
> Key: SPARK-35267
> URL: https://issues.apache.org/jira/browse/SPARK-35267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Scala version: 2.12.13
> sparkVersion = "3.1.1"
>Reporter: Ganesh Chand
>Priority: Major
>
> {code:java}
> // code placeholder
> object Util {
>   def toStructType[T](implicit typeTags: 
> ScalaReflection.universe.TypeTag[T]): StructType =
>   ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
> }
> case class MyTable(val a: Int, val b: String)
> // The following test fails because schema returned using reflection sets 
> nullable=false for integer column. The test passes if you set it to false
> it must "return a Spark Schema of type StructType for a case class" in {
>  val schemaFromCaseClass: StructType = Util.toStructType[MyTable]
>  val expectedSchema = new StructType().add(StructField("a", IntegerType, 
> true)).add(StructField("b", StringType, true))
>  schemaFromCaseClass.size mustBe 2
>  schemaFromCaseClass mustBe expectedSchema
> }
> {code}
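
For reference, this matches how ScalaReflection maps Scala types: a Scala Int is a 
primitive and can never hold null, so it maps to nullable = false, while Option[Int] 
(and reference types such as String) map to nullable = true. A minimal sketch using 
only the public APIs already shown in the report (MyTableOpt and NullableDemo are 
hypothetical names added for illustration):
{code:scala}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._

case class MyTable(a: Int, b: String)
case class MyTableOpt(a: Option[Int], b: String)   // hypothetical variant for comparison

object NullableDemo {
  def schemaOf[T](implicit tt: ScalaReflection.universe.TypeTag[T]): StructType =
    ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]

  def main(args: Array[String]): Unit = {
    val s1 = schemaOf[MyTable]
    println(s1("a").nullable)   // false: Int is a primitive and cannot be null
    println(s1("b").nullable)   // true: String is a reference type

    val s2 = schemaOf[MyTableOpt]
    println(s2("a").nullable)   // true: Option[Int] maps to a nullable IntegerType

    // If every field should be nullable, one workaround is to rebuild the schema:
    val allNullable = StructType(s1.map(_.copy(nullable = true)))
    println(allNullable)
  }
}
{code}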



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35267) nullable field is set to false for integer type when using reflection to get StructType for a case class

2021-04-28 Thread Ganesh Chand (Jira)
Ganesh Chand created SPARK-35267:


 Summary: nullable field is set to false for integer type when 
using reflection to get StructType for a case class
 Key: SPARK-35267
 URL: https://issues.apache.org/jira/browse/SPARK-35267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
 Environment: Scala version: 2.12.13

sparkVersion = "3.1.1"
Reporter: Ganesh Chand


{code:java}
// code placeholder

object Util {
def toStructType[T](implicit typeTags: ScalaReflection.universe.TypeTag[T]): 
StructType =
  ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
}

case class MyTable(val a: Int, val b: String)

// The following test fails because schema returned using reflection sets 
nullable=false for integer column. The test passes if you set it to false

it must "return a Spark Schema of type StructType for a case class" in {
 val schemaFromCaseClass: StructType = Util.toStructType[MyTable]
 val expectedSchema = new StructType().add(StructField("a", IntegerType, 
true)).add(StructField("b", StringType, true))
 schemaFromCaseClass.size mustBe 2
 schemaFromCaseClass mustBe expectedSchema
}




{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands

2021-04-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35105.

Fix Version/s: 3.2.0
   Resolution: Fixed

This issue was resolved in https://github.com/apache/spark/pull/32205.

> Support multiple paths for ADD FILE/JAR/ARCHIVE commands
> 
>
> Key: SPARK-35105
> URL: https://issues.apache.org/jira/browse/SPARK-35105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path 
> arguments.
> It would be great if those commands could take multiple paths, as Hive does.
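
Assuming the implemented syntax follows Hive's (several whitespace-separated paths in 
one statement); the grammar and paths below are assumptions for illustration, not a 
reference:
{code:scala}
// Hypothetical usage once multiple paths are supported (Spark 3.2.0 per Fix Version).
spark.sql("ADD JAR /tmp/libs/first.jar /tmp/libs/second.jar")
spark.sql("ADD FILE /tmp/data/a.csv /tmp/data/b.csv")
spark.sql("LIST JARS").show(truncate = false)   // verify both jars were registered
{code}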



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35226) JDBC datasources should accept refreshKrb5Config parameter

2021-04-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35226.

Fix Version/s: 3.2.0
   3.1.2
   Resolution: Fixed

This issue was resolved in https://github.com/apache/spark/pull/32344.

> JDBC datasources should accept refreshKrb5Config parameter
> --
>
> Key: SPARK-35226
> URL: https://issues.apache.org/jira/browse/SPARK-35226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> In the current master, JDBC datasources can't accept the refreshKrb5Config 
> parameter, which is defined in Krb5LoginModule.
> So even if we change krb5.conf after establishing a connection, the change is 
> not reflected.
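
A hedged sketch of how the option would be passed once available (3.1.2/3.2.0 per Fix 
Versions). The URL, table, keytab and principal values are placeholders; keytab and 
principal are the existing Kerberos-related JDBC options:
{code:scala}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/mydb")   // placeholder URL
  .option("dbtable", "my_table")                                 // placeholder table
  .option("keytab", "/etc/security/keytabs/client.keytab")       // placeholder path
  .option("principal", "client@EXAMPLE.COM")                     // placeholder principal
  .option("refreshKrb5Config", "true")   // pick up krb5.conf changes made after the first connection
  .load()
{code}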



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335144#comment-17335144
 ] 

Apache Spark commented on SPARK-35264:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32391

> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
> Actually we don not very trust using the static stat to consider if it can 
> build broadcast hash join. In our experience it's very common that Spark 
> throw broadcast timeout or driver side OOM exception when execute a bit large 
> plan. And due to braodcast join is not reversed which means if we covert join 
> to braodcast hash join at first time, we(AQE) can not optimize it again, so 
> it should make sense to decide if we can do broadcast at aqe side using 
> different sql config.
> In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
> Note that, we don't override join strategy if user specifies a join hint.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35264:


Assignee: (was: Apache Spark)

> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
> Actually we don not very trust using the static stat to consider if it can 
> build broadcast hash join. In our experience it's very common that Spark 
> throw broadcast timeout or driver side OOM exception when execute a bit large 
> plan. And due to braodcast join is not reversed which means if we covert join 
> to braodcast hash join at first time, we(AQE) can not optimize it again, so 
> it should make sense to decide if we can do broadcast at aqe side using 
> different sql config.
> In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
> Note that, we don't override join strategy if user specifies a join hint.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35264:


Assignee: Apache Spark

> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
> Actually we don not very trust using the static stat to consider if it can 
> build broadcast hash join. In our experience it's very common that Spark 
> throw broadcast timeout or driver side OOM exception when execute a bit large 
> plan. And due to braodcast join is not reversed which means if we covert join 
> to braodcast hash join at first time, we(AQE) can not optimize it again, so 
> it should make sense to decide if we can do broadcast at aqe side using 
> different sql config.
> In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
> Note that, we don't override join strategy if user specifies a join hint.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35266) Fix an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory

2021-04-28 Thread Byungsoo Oh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335143#comment-17335143
 ] 

Byungsoo Oh commented on SPARK-35266:
-

I fixed this issue and checked it's working fine. If it's okay, I will submit a 
pull request for this.

> Fix an error in BenchmarkBase.scala that occurs when creating a benchmark 
> file in a non-existent directory
> --
>
> Key: SPARK-35266
> URL: https://issues.apache.org/jira/browse/SPARK-35266
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Byungsoo Oh
>Priority: Minor
>  Labels: easyfix
>
> When submitting a benchmark job using _org.apache.spark.benchmark.Benchmarks_ 
> class with _SPARK_GENERATE_BENCHMARK_FILES=1_ option, an exception is raised 
> if the directory where the benchmark file will be generated does not exist.
>  For example, if you execute _BLASBenchmark_ like the command below, you get 
> an error unless you manually create _benchmarks/_ directory under 
> _spark/mllib-local/_.
> {code:java}
> SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit \
> --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
> --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | 
> paste -sd ',' -`" \
> "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
> "org.apache.spark.ml.linalg.BLASBenchmark"
> {code}
> This is caused by the code in _BenchmarkBase.scala_ where an attempt is made 
> to create the benchmark file without validating the path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-35264:

Description: 
The main idea is to isolate the join configuration between the normal planner 
and the AQE planner, which share the same code path.

We do not fully trust the static statistics when deciding whether a broadcast 
hash join can be built. In our experience it is very common for Spark to throw 
a broadcast timeout or a driver-side OOM when executing a fairly large plan. 
And because a broadcast join is not reversible, once we convert a join to a 
broadcast hash join up front, AQE cannot optimize it again. So it makes sense 
to decide whether to broadcast on the AQE side, using a separate SQL config.

To achieve this, we insert a specific join hint in advance inside the AQE 
framework, and JoinSelection then picks up and follows the inserted hint.

For now we only select the strategy for equi joins, in this order:
 1. mark the join as a broadcast hash join if possible
 2. mark the join as a shuffled hash join if possible

Note that we do not override the join strategy if the user specifies a join hint.

  was:
The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.

Actually we don not very trust using the static stat to consider if it can 
build broadcast hash join. In our experience it's very common that Spark throw 
broadcast timeout or driver side OOM exception when execute a bit large plan. 
And due to braodcast join is not reversed which means if we covert join to 
braodcast hash join at first time, we(aqe) can not optimize it again, so it 
should make sense to decide if we can do broadcast at aqe side using different 
sql config.


 In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible

Note that, we don't override join strategy if user specifies a join hint.

 


> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
> Actually we don not very trust using the static stat to consider if it can 
> build broadcast hash join. In our experience it's very common that Spark 
> throw broadcast timeout or driver side OOM exception when execute a bit large 
> plan. And due to braodcast join is not reversed which means if we covert join 
> to braodcast hash join at first time, we(AQE) can not optimize it again, so 
> it should make sense to decide if we can do broadcast at aqe side using 
> different sql config.
> In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
> Note that, we don't override join strategy if user specifies a join hint.
>  
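
A hedged sketch of the intended usage: keep the static planner conservative and let 
AQE decide from runtime statistics. The AQE-side config name below is an assumption 
based on this proposal (targeting 3.2.0), not a documented setting, and the DataFrames 
are placeholders:
{code:scala}
// Disable the compile-time broadcast decision and let AQE decide at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")             // static planner: never broadcast
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "50m")   // assumed AQE-side threshold

val joined = ordersDf.join(customersDf, Seq("customer_id"))  // equi join; DataFrames are placeholders
joined.explain()   // with accurate runtime stats, AQE may now plan a broadcast hash join
{code}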



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35266) Fix an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory

2021-04-28 Thread Byungsoo Oh (Jira)
Byungsoo Oh created SPARK-35266:
---

 Summary: Fix an error in BenchmarkBase.scala that occurs when 
creating a benchmark file in a non-existent directory
 Key: SPARK-35266
 URL: https://issues.apache.org/jira/browse/SPARK-35266
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.2.0
Reporter: Byungsoo Oh


When submitting a benchmark job using _org.apache.spark.benchmark.Benchmarks_ 
class with _SPARK_GENERATE_BENCHMARK_FILES=1_ option, an exception is raised if 
the directory where the benchmark file will be generated does not exist.
 For example, if you execute _BLASBenchmark_ with the command below, you get an 
error unless you manually create the _benchmarks/_ directory under 
_spark/mllib-local/_.
{code:java}
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit \
--driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
--jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | 
paste -sd ',' -`" \
"`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
"org.apache.spark.ml.linalg.BLASBenchmark"
{code}
This is caused by the code in _BenchmarkBase.scala_ where an attempt is made to 
create the benchmark file without validating the path.
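
A minimal sketch of the kind of fix described above (the real change in 
BenchmarkBase.scala may differ in detail): create any missing parent directories 
before opening the output file. The helper name is hypothetical.
{code:scala}
import java.io.{File, FileOutputStream}

// Ensure e.g. "benchmarks/" exists before writing the result file.
def openBenchmarkOutput(path: String): FileOutputStream = {
  val file = new File(path)
  Option(file.getParentFile).foreach(_.mkdirs())   // no-op if the directory already exists
  new FileOutputStream(file)
}

// Usage sketch:
// val out = openBenchmarkOutput("benchmarks/BLASBenchmark-results.txt")
{code}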



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35135) Duplicate code implementation of `WritablePartitionedIterator`

2021-04-28 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-35135.
--
   Fix Version/s: 3.2.0
Target Version/s: 3.2.0
  Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/32232

> Duplicate code implementation of `WritablePartitionedIterator`
> --
>
> Key: SPARK-35135
> URL: https://issues.apache.org/jira/browse/SPARK-35135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> `WritablePartitionedIterator` is defined in 
> `WritablePartitionedPairCollection.scala` and there are two implementations of 
> this trait, but the code of the two implementations is duplicated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35135) Duplicate code implementation of `WritablePartitionedIterator`

2021-04-28 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi reassigned SPARK-35135:


Assignee: Yang Jie  (was: Apache Spark)

> Duplicate code implementation of `WritablePartitionedIterator`
> --
>
> Key: SPARK-35135
> URL: https://issues.apache.org/jira/browse/SPARK-35135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> `WritablePartitionedIterator` define in 
> `WritablePartitionedPairCollection.scala` and there are two implementation of 
> these trait, but the code for these two implementations is duplicate code



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-35264:

Description: 
The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.

Actually we don not very trust using the static stat to consider if it can 
build broadcast hash join. In our experience it's very common that Spark throw 
broadcast timeout or driver side OOM exception when execute a bit large plan. 
And due to braodcast join is not reversed which means if we covert join to 
braodcast hash join at first time, we(aqe) can not optimize it again, so it 
should make sense to decide if we can do broadcast at aqe side using different 
sql config.


 In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible

Note that, we don't override join strategy if user specifies a join hint.

 

  was:
The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.
 In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible


 Note that, we don't override join strategy if user specifies a join hint.

 


> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
> Actually we don not very trust using the static stat to consider if it can 
> build broadcast hash join. In our experience it's very common that Spark 
> throw broadcast timeout or driver side OOM exception when execute a bit large 
> plan. And due to braodcast join is not reversed which means if we covert join 
> to braodcast hash join at first time, we(aqe) can not optimize it again, so 
> it should make sense to decide if we can do broadcast at aqe side using 
> different sql config.
>  In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
> Note that, we don't override join strategy if user specifies a join hint.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335110#comment-17335110
 ] 

Apache Spark commented on SPARK-34786:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32390

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.
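
A quick check of why LongType cannot hold uint64 while an unscaled 20-digit decimal 
can (the exact decimal precision Spark ends up using is per the linked PR, not 
asserted here):
{code:scala}
val maxUInt64 = BigInt(2).pow(64) - 1          // 18446744073709551615, the largest uint64 value
println(maxUInt64 > BigInt(Long.MaxValue))     // true: it overflows a signed 64-bit LongType
println(maxUInt64.toString.length)             // 20 digits, so DecimalType(20, 0) can represent it
{code}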



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35265) abs return negative

2021-04-28 Thread liuzhenjie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhenjie updated SPARK-35265:
---
  Component/s: (was: Spark Core)
   PySpark
Affects Version/s: 3.1.1

> abs return negative
> ---
>
> Key: SPARK-35265
> URL: https://issues.apache.org/jira/browse/SPARK-35265
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: liuzhenjie
>Priority: Major
>
> from pyspark.sql.functions import lit, abs, concat, hash,col
> df = df.withColumn('partition_id', lit(-2147483648))
>  df = df.withColumn('abs_id', abs(col('partition_id')))
>  df.select('abs_id','partition_id').show()
>  
> when the number is  -2147483648,method abs return negative 
>  +---++
> | abs_id        |partition_id |
> +---++
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> +---++
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335108#comment-17335108
 ] 

Apache Spark commented on SPARK-34786:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32390

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35265) abs return negative

2021-04-28 Thread liuzhenjie (Jira)
liuzhenjie created SPARK-35265:
--

 Summary: abs return negative
 Key: SPARK-35265
 URL: https://issues.apache.org/jira/browse/SPARK-35265
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: liuzhenjie


from pyspark.sql.functions import lit, abs, concat, hash,col

df = df.withColumn('partition_id', lit(-2147483648))
df = df.withColumn('abs_id', abs(col('partition_id')))
df.select('abs_id','partition_id').show()

 

When the number is -2147483648, method abs returns a negative value.
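
This is the usual two's-complement overflow: -2147483648 (Int.MinValue) has no 
positive counterpart in 32 bits, so abs wraps around to itself. A small sketch in 
Scala, with a possible workaround (widening to 64 bits first) marked as an 
assumption rather than the official resolution:
{code:scala}
println(math.abs(Int.MinValue))          // -2147483648: abs overflows for Int.MinValue
println(math.abs(Int.MinValue.toLong))   // 2147483648: fine once widened to Long

// Possible workaround in Spark (assumption): cast the column to LongType before abs,
// e.g. in the Scala API:
//   df.withColumn("abs_id", abs(col("partition_id").cast("long")))
{code}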

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35265) abs return negative

2021-04-28 Thread liuzhenjie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhenjie updated SPARK-35265:
---
Description: 
from pyspark.sql.functions import lit, abs, concat, hash, col

df = df.withColumn('partition_id', lit(-2147483648))
df = df.withColumn('abs_id', abs(col('partition_id')))
df.select('abs_id', 'partition_id').show()

When the number is -2147483648, method abs returns a negative value:

+-----------+------------+
|     abs_id|partition_id|
+-----------+------------+
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
|-2147483648| -2147483648|
+-----------+------------+

  was:
from pyspark.sql.functions import lit, abs, concat, hash,col

df = df.withColumn('partition_id', lit(-2147483648))
df = df.withColumn('abs_id', abs(col('partition_id')))
df.select('abs_id','partition_id').show()

 

when the number is  -2147483648,method abs return negative 

 

 


> abs return negative
> ---
>
> Key: SPARK-35265
> URL: https://issues.apache.org/jira/browse/SPARK-35265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: liuzhenjie
>Priority: Major
>
> from pyspark.sql.functions import lit, abs, concat, hash,col
> df = df.withColumn('partition_id', lit(-2147483648))
>  df = df.withColumn('abs_id', abs(col('partition_id')))
>  df.select('abs_id','partition_id').show()
>  
> when the number is  -2147483648,method abs return negative 
>  +---++
> | abs_id        |partition_id |
> +---++
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> |-2147483648| -2147483648|
> +---++
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-35264:

Description: 
The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.
 In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible


 Note that, we don't override join strategy if user specifies a join hint.

 

  was:
The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.
In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible
Note that, we don't override join strategy if user specifies a join hint.

 


> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Major
>
> The main idea here is that make join config isolation between normal planner 
> and aqe planner which shared the same code path.
>  In order to achieve this we use a specific join hint in advance during AQE 
> framework and then at JoinSelection side it will take and follow the inserted 
> hint.
> For now we only support select strategy for equi join, and follow this order
>  1. mark join as broadcast hash join if possible
>  2. mark join as shuffled hash join if possible
>  Note that, we don't override join strategy if user specifies a join hint.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-04-28 Thread ulysses you (Jira)
ulysses you created SPARK-35264:
---

 Summary: Support AQE side broadcastJoin threshold
 Key: SPARK-35264
 URL: https://issues.apache.org/jira/browse/SPARK-35264
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


The main idea here is that make join config isolation between normal planner 
and aqe planner which shared the same code path.
In order to achieve this we use a specific join hint in advance during AQE 
framework and then at JoinSelection side it will take and follow the inserted 
hint.

For now we only support select strategy for equi join, and follow this order
 1. mark join as broadcast hash join if possible
 2. mark join as shuffled hash join if possible
Note that, we don't override join strategy if user specifies a join hint.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit

2021-04-28 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335094#comment-17335094
 ] 

L. C. Hsieh commented on SPARK-35227:
-

This issue was resolved by https://github.com/apache/spark/pull/32346.

> Replace Bintray with the new repository service for the spark-packages 
> resolver in SparkSubmit
> --
>
> Key: SPARK-35227
> URL: https://issues.apache.org/jira/browse/SPARK-35227
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>
>
> As Bintray is being shut down, we have set up a new repository service at 
> repos.spark-packages.org. We need to replace Bintray with the new service for 
> the spark-packages resolver in SparkSubmit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit

2021-04-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35227:

Issue Type: Improvement  (was: Task)

> Replace Bintray with the new repository service for the spark-packages 
> resolver in SparkSubmit
> --
>
> Key: SPARK-35227
> URL: https://issues.apache.org/jira/browse/SPARK-35227
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>
>
> As Bintray is being shut down, we have setup a new repository service at 
> repos.spark-packages.org. We need to replace Bintray with the new service for 
> the spark-packages resolver in SparkSumit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit

2021-04-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-35227:
---

Assignee: Bo Zhang

> Replace Bintray with the new repository service for the spark-packages 
> resolver in SparkSubmit
> --
>
> Key: SPARK-35227
> URL: https://issues.apache.org/jira/browse/SPARK-35227
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>
>
> As Bintray is being shut down, we have setup a new repository service at 
> repos.spark-packages.org. We need to replace Bintray with the new service for 
> the spark-packages resolver in SparkSumit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35227) Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit

2021-04-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35227.
-
Resolution: Fixed

> Replace Bintray with the new repository service for the spark-packages 
> resolver in SparkSubmit
> --
>
> Key: SPARK-35227
> URL: https://issues.apache.org/jira/browse/SPARK-35227
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.7, 3.0.3, 3.1.2, 3.2.0
>Reporter: Bo Zhang
>Priority: Major
> Fix For: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>
>
> As Bintray is being shut down, we have setup a new repository service at 
> repos.spark-packages.org. We need to replace Bintray with the new service for 
> the spark-packages resolver in SparkSumit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implementation Class of DataSourceV2: sqlConf parameter is null

2021-04-28 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Summary: PartitionReaderFactory's Implementation Class of DataSourceV2: 
sqlConf parameter is null  (was: PartitionReaderFactory's Implementation Class of 
DataSourceV2 sqlConf parameter is null)

> PartitionReaderFactory's Implementation Class of DataSourceV2: sqlConf 
> parameter is null
> --
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: lynn
>Priority: Major
> Attachments: spark-sqlconf-isnull.png
>
>
> The codes of "MyPartitionReaderFactory" :
> {code:scala}
> // Implemention Class
> package com.lynn.spark.sql.v2
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.catalyst.InternalRow
> import 
> com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE,
>  MY_VECTORIZED_READER_ENABLED}
> import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
> PartitionReaderFactory}
> import org.apache.spark.sql.internal.SQLConf
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.vectorized.ColumnarBatch
> import org.apache.spark.sql.internal.SQLConf.buildConf
> case class MyPartitionReaderFactory(sqlConf: SQLConf,
> dataSchema: StructType,
> readSchema: StructType)
>   extends PartitionReaderFactory with Logging {
>   val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false)
>   val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096)
>   override def createReader(partition: InputPartition): 
> PartitionReader[InternalRow] = {
> MyRowReader(batchSize, dataSchema, readSchema)
>   }
>   override def createColumnarReader(partition: InputPartition): 
> PartitionReader[ColumnarBatch] = {
> if(!supportColumnarReads(partition))
>   throw new UnsupportedOperationException("Cannot create columnar 
> reader.")
>MyColumnReader(batchSize, dataSchema, readSchema)
>   }
>   override def supportColumnarReads(partition: InputPartition) = 
> enableVectorized
> }
> object MyPartitionReaderFactory {
>   val MY_VECTORIZED_READER_ENABLED =
> buildConf("spark.sql.my.enableVectorizedReader")
>   .doc("Enables vectorized my source scan.")
>   .version("1.0.0")
>   .booleanConf
>   .createWithDefault(false)
>   val MY_VECTORIZED_READER_BATCH_SIZE =
> buildConf("spark.sql.my.columnarReaderBatchSize")
>   .doc("The number of rows to include in a my source vectorized reader 
> batch. The number should " +
> "be carefully chosen to minimize overhead and avoid OOMs in reading 
> data.")
>   .version("1.0.0")
>   .intConf
>   .createWithDefault(4096)
> }
> {code}
> The driver constructs an RDD instance (DataSourceRDD), and the sqlConf 
> parameter passed to MyPartitionReaderFactory is not null.
> But when the executor deserializes the RDD, the sqlConf parameter is null.
> The relevant code is as follows:
> {code:scala}
> // RunTask.scala
> override def runTask(context: TaskContext): U = {
> // Deserialize the RDD and the func using the broadcast variables.
> val threadMXBean = ManagementFactory.getThreadMXBean
> val deserializeStartTimeNs = System.nanoTime()
> val deserializeStartCpuTime = if 
> (threadMXBean.isCurrentThreadCpuTimeSupported) {
>   threadMXBean.getCurrentThreadCpuTime
> } else 0L
> val ser = SparkEnv.get.closureSerializer.newInstance()
>//  the rdd 
> val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => 
> U)](
>   ByteBuffer.wrap(taskBinary.value), 
> Thread.currentThread.getContextClassLoader)
> _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs
> _executorDeserializeCpuTime = if 
> (threadMXBean.isCurrentThreadCpuTimeSupported) {
>   threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
> } else 0L
> func(context, rdd.iterator(partition, context))
>   }
> {code}
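
A hedged sketch of one way around this, not the reporter's actual fix: resolve the 
config values on the driver, where the session's SQLConf is available, and pass only 
plain serializable values into the factory instead of the SQLConf object itself, 
which is not reliably reconstructed on executors. The helper name is hypothetical and 
the keys reuse the reporter's code:
{code:scala}
import org.apache.spark.sql.internal.SQLConf

// Runs on the driver, while the session's SQLConf is in scope.
def readerFactorySettings(conf: SQLConf): (Boolean, Int) = (
  conf.getConfString("spark.sql.my.enableVectorizedReader", "false").toBoolean,
  conf.getConfString("spark.sql.my.columnarReaderBatchSize", "4096").toInt
)

// The factory would then take the already-resolved values, e.g.
//   case class MyPartitionReaderFactory(enableVectorized: Boolean, batchSize: Int,
//       dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory { ... }
// so the task serializer only ships simple fields to executors.
{code}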



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34705) Add code-gen for all join types of sort merge join

2021-04-28 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335022#comment-17335022
 ] 

Cheng Su commented on SPARK-34705:
--

[~advancedxy] - We saw ~10% CPU performance improvement for targeted queries. I 
think it makes sense to update the benchmark after the feature is merged.

> Add code-gen for all join types of sort merge join
> --
>
> Key: SPARK-34705
> URL: https://issues.apache.org/jira/browse/SPARK-34705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> Currently sort merge join only supports code-gen for the inner join type 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374]).
> We added code-gen for the other join types internally in our fork and saw an 
> obvious CPU performance improvement. Creating this Jira to propose merging 
> this back upstream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334987#comment-17334987
 ] 

Apache Spark commented on SPARK-35263:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32389

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> 10633950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>  9283609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35263:


Assignee: (was: Apache Spark)

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> 10633950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>  9283609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334986#comment-17334986
 ] 

Apache Spark commented on SPARK-35263:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32389

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> 10633950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>  9283609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35263:


Assignee: Apache Spark

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> 10633950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>  9283609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35262) Memory leak when dataset is being persisted

2021-04-28 Thread Igor Amelin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Amelin updated SPARK-35262:

Priority: Critical  (was: Major)

> Memory leak when dataset is being persisted
> ---
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Igor Amelin
>Priority: Critical
>
> If a Java- or Scala-application with SparkSession runs a long time and 
> persists a lot of datasets, it can crash because of a memory leak.
>  I've noticed the following. When we have a dataset and persist it, the 
> SparkSession used to load that dataset is cloned in CacheManager, and this 
> clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But 
> this clone isn't removed from the list of listeners after that, e.g. 
> unpersisting the dataset. If we persist a lot of datasets, the SparkSession 
> is cloned and added to `ListenerBus` many times. This leads to a memory leak 
> since the `listenersPlusTimers` list become very large.
> I've found out that the SparkSession is cloned is CacheManager when the 
> parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
> `spark.sql.adaptive.enabled` are true. The first one is true by default, and 
> this default behavior leads to the problem. When auto bucketed scan is 
> disabled, the SparkSession isn't cloned, and there are no duplicates in 
> ListenerBus, so the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak: 
> [https://github.com/iamelin/spark-memory-leak]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-04-28 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-35263:
---

 Summary: Refactor ShuffleBlockFetcherIteratorSuite to reduce 
duplicated code
 Key: SPARK-35263
 URL: https://issues.apache.org/jira/browse/SPARK-35263
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Tests
Affects Versions: 3.1.1
Reporter: Erik Krogen


{{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like:
{code}
val iterator = new ShuffleBlockFetcherIterator(
  taskContext,
  transfer,
  blockManager,
  blocksByAddress,
  (_, in) => in,
  48 * 1024 * 1024,
  Int.MaxValue,
  Int.MaxValue,
  Int.MaxValue,
  true,
  false,
  metrics,
  false)
{code}
It's challenging to tell what the interesting parts are vs. what is just being 
set to some default/unused value.

Similarly but not as bad, there are 10 calls like:
{code}
verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), any())
{code}
and 7 like
{code}
when(transfer.fetchBlocks(any(), any(), any(), any(), any(), any())).thenAnswer 
...
{code}

This can result in about 10% reduction in both lines and characters in the file:
{code}
# Before
> wc 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
1063  3950  43201 
core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala

# After
> wc 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
 928  3609  39053 
core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
{code}

It also helps readability:
{code}
val iterator = createShuffleBlockIteratorWithDefaults(
  transfer,
  blocksByAddress,
  maxBytesInFlight = 1000L
)
{code}
Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
interested in here.
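
For illustration, a minimal sketch of how such a helper could look, assuming the constructor shape quoted above and the suite's existing {{taskContext}}, {{blockManager}} and {{metrics}} fields; the parameter types ({{BlockTransferService}}, {{BlockManagerId}}, {{BlockId}}) and the exact argument order are illustrative, not taken from the actual patch:
{code:scala}
// Sketch only: expose the handful of values tests actually vary as named
// parameters with defaults; everything else keeps the value used today.
private def createShuffleBlockIteratorWithDefaults(
    transfer: BlockTransferService,
    blocksByAddress: Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])],
    maxBytesInFlight: Long = 48 * 1024 * 1024,
    maxReqsInFlight: Int = Int.MaxValue,
    maxBlocksInFlightPerAddress: Int = Int.MaxValue,
    detectCorrupt: Boolean = true): ShuffleBlockFetcherIterator = {
  new ShuffleBlockFetcherIterator(
    taskContext,
    transfer,
    blockManager,
    blocksByAddress,
    (_, in) => in,
    maxBytesInFlight,
    maxReqsInFlight,
    maxBlocksInFlightPerAddress,
    Int.MaxValue,
    detectCorrupt,
    false,
    metrics,
    false)
}
{code}
With Scala named and default arguments, each test then spells out only the parameters it actually exercises, as in the {{maxBytesInFlight}} example above.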



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35262) Memory leak when dataset is being persisted

2021-04-28 Thread Igor Amelin (Jira)
Igor Amelin created SPARK-35262:
---

 Summary: Memory leak when dataset is being persisted
 Key: SPARK-35262
 URL: https://issues.apache.org/jira/browse/SPARK-35262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Igor Amelin


If a Java- or Scala-application with SparkSession runs a long time and persists 
a lot of datasets, it can crash because of a memory leak.
 I've noticed the following. When we have a dataset and persist it, the 
SparkSession used to load that dataset is cloned in CacheManager, and this 
clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But 
this clone isn't removed from the list of listeners afterwards, e.g. when the 
dataset is unpersisted. If we persist a lot of datasets, the SparkSession is 
cloned and added to `ListenerBus` many times. This leads to a memory leak since 
the `listenersPlusTimers` list becomes very large.

I've found out that the SparkSession is cloned in CacheManager when the 
parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
`spark.sql.adaptive.enabled` are true. The first one is true by default, and 
this default behavior leads to the problem. When auto bucketed scan is 
disabled, the SparkSession isn't cloned, and there are no duplicates in 
ListenerBus, so the memory leak doesn't occur.

Here is a small Java application to reproduce the memory leak: 
[https://github.com/iamelin/spark-memory-leak]
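
Until the clone is unregistered properly, the report above implies a workaround: disable auto bucketed scan so the session is not cloned on persist. A minimal Scala sketch of that workaround (illustrative only; it does not fix the leak itself):
{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround sketch: per the analysis above, with auto bucketed scan disabled
// CacheManager does not clone the SparkSession, so duplicate listeners do not
// accumulate on the ListenerBus.
val spark = SparkSession.builder()
  .appName("persist-without-session-clones")
  .config("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
  .getOrCreate()

val ds = spark.range(0, 1000000).toDF("id")
ds.persist()
ds.count()
ds.unpersist()
{code}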



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name

2021-04-28 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-35259:

Description: 
Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
{code}
// Time latency for open block request in ms
private final Timer openBlockRequestLatencyMillis = new Timer();
// Time latency for executor registration latency in ms
private final Timer registerExecutorRequestLatencyMillis = new Timer();
// Time latency for processing finalize shuffle merge request latency in ms
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
{code}
However these Dropwizard Timers by default use nanoseconds 
([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
 It's certainly possible to extract milliseconds from them, but it seems 
misleading to have millis in the name here.

{{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named 
metrics since it doesn't export any timing information from these metrics 
(which I am trying to address in SPARK-35258), but these names still result in 
kind of misleading metric names like 
{{finalizeShuffleMergeLatencyMillis_count}} -- a count doesn't have a unit. It 
should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, to 
decide the unit and adjust the name accordingly.

  was:
Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
{code}
// Time latency for open block request in ms
private final Timer openBlockRequestLatencyMillis = new Timer();
// Time latency for executor registration latency in ms
private final Timer registerExecutorRequestLatencyMillis = new Timer();
// Time latency for processing finalize shuffle merge request latency in ms
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
{code}
However these Dropwizard Timers by default use nanoseconds 
([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
 It's certainly possible to extract milliseconds from them, but it seems 
misleading to have millis in the name here.

{{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named 
metrics since it doesn't export any timing information from these metrics 
(which I am trying to address in SPARK-35258), but these names still result in 
kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a 
count doesn't have a unit. It should be up to the metrics exporter, like 
{{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
accordingly.


> ExternalBlockHandler metrics have misleading unit in the name
> -
>
> Key: SPARK-35259
> URL: https://issues.apache.org/jira/browse/SPARK-35259
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
> {code}
> // Time latency for open block request in ms
> private final Timer openBlockRequestLatencyMillis = new Timer();
> // Time latency for executor registration latency in ms
> private final Timer registerExecutorRequestLatencyMillis = new Timer();
> // Time latency for processing finalize shuffle merge request latency in 
> ms
> private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
> {code}
> However these Dropwizard Timers by default use nanoseconds 
> ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
>  It's certainly possible to extract milliseconds from them, but it seems 
> misleading to have millis in the name here.
> {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named 
> metrics since it doesn't export any timing information from these metrics 
> (which I am trying to address in SPARK-35258), but these names still result 
> in kind of misleading metric names like 
> {{finalizeShuffleMergeLatencyMillis_count}} -- a count doesn't have a unit. 
> It should be up to the metrics exporter, like {{YarnShuffleServiceMetrics}}, 
> to decide the unit and adjust the name accordingly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name

2021-04-28 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-35259:

Description: 
Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
{code}
// Time latency for open block request in ms
private final Timer openBlockRequestLatencyMillis = new Timer();
// Time latency for executor registration latency in ms
private final Timer registerExecutorRequestLatencyMillis = new Timer();
// Time latency for processing finalize shuffle merge request latency in ms
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
{code}
However these Dropwizard Timers by default use nanoseconds 
([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
 It's certainly possible to extract milliseconds from them, but it seems 
misleading to have millis in the name here.

{{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named 
metrics since it doesn't export any timing information from these metrics 
(which I am trying to address in SPARK-35258), but these names still result in 
kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a 
count doesn't have a unit. It should be up to the metrics exporter, like 
{{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
accordingly.

  was:
Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
{code}
// Time latency for open block request in ms
private final Timer openBlockRequestLatencyMillis = new Timer();
// Time latency for executor registration latency in ms
private final Timer registerExecutorRequestLatencyMillis = new Timer();
// Time latency for processing finalize shuffle merge request latency in ms
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
{code}
However these Dropwizard Timers by default use nanoseconds 
([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
 It's certainly possible to extract milliseconds from them, but it seems 
misleading to have millis in the name here.

{{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics 
since it doesn't export any timing information from these metrics (which I am 
trying to address in SPARK-35258), but these names still result in kind of 
misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count 
doesn't have a unit. It should be up to the metrics exporter, like 
{{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
accordingly.


> ExternalBlockHandler metrics have misleading unit in the name
> -
>
> Key: SPARK-35259
> URL: https://issues.apache.org/jira/browse/SPARK-35259
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
> {code}
> // Time latency for open block request in ms
> private final Timer openBlockRequestLatencyMillis = new Timer();
> // Time latency for executor registration latency in ms
> private final Timer registerExecutorRequestLatencyMillis = new Timer();
> // Time latency for processing finalize shuffle merge request latency in 
> ms
> private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
> {code}
> However these Dropwizard Timers by default use nanoseconds 
> ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
>  It's certainly possible to extract milliseconds from them, but it seems 
> misleading to have millis in the name here.
> {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrectly-named 
> metrics since it doesn't export any timing information from these metrics 
> (which I am trying to address in SPARK-35258), but these names still result 
> in kind of misleading metric names like {{finalizeShuffleMergeLatency_count}} 
> -- a count doesn't have a unit. It should be up to the metrics exporter, like 
> {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
> accordingly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name

2021-04-28 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334932#comment-17334932
 ] 

Erik Krogen commented on SPARK-35259:
-

I have a PR for this but it is based on the PR for SPARK-35258 so I will hold 
off posting it for now.

While that goes through -- [~rxin] or [~jlaskowski] -- I see you participated 
in the discussions on SPARK-16405 when these were added. Do you have any 
comment here? Maybe I am missing something?

> ExternalBlockHandler metrics have misleading unit in the name
> -
>
> Key: SPARK-35259
> URL: https://issues.apache.org/jira/browse/SPARK-35259
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
> {code}
> // Time latency for open block request in ms
> private final Timer openBlockRequestLatencyMillis = new Timer();
> // Time latency for executor registration latency in ms
> private final Timer registerExecutorRequestLatencyMillis = new Timer();
> // Time latency for processing finalize shuffle merge request latency in 
> ms
> private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
> {code}
> However these Dropwizard Timers by default use nanoseconds 
> ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
>  It's certainly possible to extract milliseconds from them, but it seems 
> misleading to have millis in the name here.
> {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics 
> since it doesn't export any timing information from these metrics (which I am 
> trying to address in SPARK-35258), but these names still result in kind of 
> misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count 
> doesn't have a unit. It should be up to the metrics exporter, like 
> {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
> accordingly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35261) Support static invoke for stateless UDF

2021-04-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-35261:


 Summary: Support static invoke for stateless UDF
 Key: SPARK-35261
 URL: https://issues.apache.org/jira/browse/SPARK-35261
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


For UDFs that are stateless, we should allow users to define the "magic method" as 
a static Java method, which removes the extra cost of dynamic dispatch and gives 
better performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35258:


Assignee: Apache Spark

> Enhance ESS ExternalBlockHandler with additional block rate-based metrics and 
> histograms
> 
>
> Key: SPARK-35258
> URL: https://issues.apache.org/jira/browse/SPARK-35258
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> Today the {{ExternalBlockHandler}} component of ESS exposes some useful 
> metrics, but is lacking around metrics for the rate of block transfers. We 
> have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric 
> to tell us the rate of _blocks_, which is especially relevant when running 
> the ESS on HDDs that are sensitive to random reads. Many small block 
> transfers can have a negative impact on performance, but won't show up as a 
> spike in {{blockTransferRateBytes}} since the sizes are small.
> We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style 
> metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today 
> it is only exposing the count and rate, but not timing information from the 
> {{Snapshot}}.
> These two changes can make it easier to monitor the health of the ESS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35258:


Assignee: (was: Apache Spark)

> Enhance ESS ExternalBlockHandler with additional block rate-based metrics and 
> histograms
> 
>
> Key: SPARK-35258
> URL: https://issues.apache.org/jira/browse/SPARK-35258
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today the {{ExternalBlockHandler}} component of ESS exposes some useful 
> metrics, but is lacking around metrics for the rate of block transfers. We 
> have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric 
> to tell us the rate of _blocks_, which is especially relevant when running 
> the ESS on HDDs that are sensitive to random reads. Many small block 
> transfers can have a negative impact on performance, but won't show up as a 
> spike in {{blockTransferRateBytes}} since the sizes are small.
> We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style 
> metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today 
> it is only exposing the count and rate, but not timing information from the 
> {{Snapshot}}.
> These two changes can make it easier to monitor the health of the ESS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334929#comment-17334929
 ] 

Apache Spark commented on SPARK-35258:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32388

> Enhance ESS ExternalBlockHandler with additional block rate-based metrics and 
> histograms
> 
>
> Key: SPARK-35258
> URL: https://issues.apache.org/jira/browse/SPARK-35258
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today the {{ExternalBlockHandler}} component of ESS exposes some useful 
> metrics, but is lacking around metrics for the rate of block transfers. We 
> have {{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric 
> to tell us the rate of _blocks_, which is especially relevant when running 
> the ESS on HDDs that are sensitive to random reads. Many small block 
> transfers can have a negative impact on performance, but won't show up as a 
> spike in {{blockTransferRateBytes}} since the sizes are small.
> We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style 
> metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today 
> it is only exposing the count and rate, but not timing information from the 
> {{Snapshot}}.
> These two changes can make it easier to monitor the health of the ESS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34981:
-
Parent: SPARK-35260
Issue Type: Sub-task  (was: Improvement)

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35260) DataSourceV2 Function Catalog implementation

2021-04-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-35260:


 Summary: DataSourceV2 Function Catalog implementation
 Key: SPARK-35260
 URL: https://issues.apache.org/jira/browse/SPARK-35260
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This tracks the implementation and follow-up work for the V2 Function Catalog 
introduced in SPARK-27658.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35244) invoke should throw the original exception

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334925#comment-17334925
 ] 

Apache Spark commented on SPARK-35244:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32387

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35259) ExternalBlockHandler metrics have incorrect unit in the name

2021-04-28 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-35259:
---

 Summary: ExternalBlockHandler metrics have incorrect unit in the 
name
 Key: SPARK-35259
 URL: https://issues.apache.org/jira/browse/SPARK-35259
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.1.1
Reporter: Erik Krogen


Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
{code}
// Time latency for open block request in ms
private final Timer openBlockRequestLatencyMillis = new Timer();
// Time latency for executor registration latency in ms
private final Timer registerExecutorRequestLatencyMillis = new Timer();
// Time latency for processing finalize shuffle merge request latency in ms
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
{code}
However these Dropwizard Timers by default use nanoseconds 
([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
 It's certainly possible to extract milliseconds from them, but it seems 
misleading to have millis in the name here.

{{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics 
since it doesn't export any timing information from these metrics (which I am 
trying to address in SPARK-35258), but these names still result in kind of 
misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count 
doesn't have a unit. It should be up to the metrics exporter, like 
{{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
accordingly.
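
As a sketch of the suggested split (Scala, illustrative names only, not the actual Spark code): the handler keeps a unit-free {{Timer}} name, and the exporter appends the unit of whatever value it actually publishes.
{code:scala}
import java.util.concurrent.TimeUnit
import com.codahale.metrics.Timer

// Handler side: no unit baked into the metric name.
val openBlockRequestLatency = new Timer()  // Dropwizard Timers record nanoseconds

// Exporter side: convert to the unit it reports and name the key accordingly.
def exportTimer(name: String, timer: Timer): Map[String, Any] = {
  val snapshot = timer.getSnapshot
  Map(
    s"${name}_count" -> timer.getCount,  // a plain count, no unit suffix
    s"${name}Millis_mean" -> TimeUnit.NANOSECONDS.toMillis(snapshot.getMean.toLong),
    s"${name}Millis_max" -> TimeUnit.NANOSECONDS.toMillis(snapshot.getMax))
}
{code}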



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35259) ExternalBlockHandler metrics have misleading unit in the name

2021-04-28 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-35259:

Summary: ExternalBlockHandler metrics have misleading unit in the name  
(was: ExternalBlockHandler metrics have incorrect unit in the name)

> ExternalBlockHandler metrics have misleading unit in the name
> -
>
> Key: SPARK-35259
> URL: https://issues.apache.org/jira/browse/SPARK-35259
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Today {{ExternalBlockHandler}} exposes a few {{Timer}} metrics:
> {code}
> // Time latency for open block request in ms
> private final Timer openBlockRequestLatencyMillis = new Timer();
> // Time latency for executor registration latency in ms
> private final Timer registerExecutorRequestLatencyMillis = new Timer();
> // Time latency for processing finalize shuffle merge request latency in 
> ms
> private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
> {code}
> However these Dropwizard Timers by default use nanoseconds 
> ([documentation|https://metrics.dropwizard.io/3.2.3/getting-started.html#timers]).
>  It's certainly possible to extract milliseconds from them, but it seems 
> misleading to have millis in the name here.
> {{YarnShuffleServiceMetrics}} currently doesn't expose any incorrect metrics 
> since it doesn't export any timing information from these metrics (which I am 
> trying to address in SPARK-35258), but these names still result in kind of 
> misleading metric names like {{finalizeShuffleMergeLatency_count}} -- a count 
> doesn't have a unit. It should be up to the metrics exporter, like 
> {{YarnShuffleServiceMetrics}}, to decide the unit and adjust the name 
> accordingly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35258) Enhance ESS ExternalBlockHandler with additional block rate-based metrics and histograms

2021-04-28 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-35258:
---

 Summary: Enhance ESS ExternalBlockHandler with additional block 
rate-based metrics and histograms
 Key: SPARK-35258
 URL: https://issues.apache.org/jira/browse/SPARK-35258
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, YARN
Affects Versions: 3.1.1
Reporter: Erik Krogen


Today the {{ExternalBlockHandler}} component of ESS exposes some useful 
metrics, but is lacking around metrics for the rate of block transfers. We have 
{{blockTransferRateBytes}} to tell us the rate of _bytes_, but no metric to 
tell us the rate of _blocks_, which is especially relevant when running the ESS 
on HDDs that are sensitive to random reads. Many small block transfers can have 
a negative impact on performance, but won't show up as a spike in 
{{blockTransferRateBytes}} since the sizes are small.

We can also enhance {{YarnShuffleServiceMetrics}} to expose histogram-style 
metrics from the {{Timer}} instances within {{ExternalBlockHandler}} -- today 
it is only exposing the count and rate, but not timing information from the 
{{Snapshot}}.

These two changes can make it easier to monitor the health of the ESS.
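
For concreteness, a small Scala sketch of both ideas (names are illustrative, not the actual ESS code): a block-count {{Meter}} next to the existing byte-rate one, and percentile values read from a {{Timer}} snapshot.
{code:scala}
import com.codahale.metrics.{Meter, Timer}

// Rate of blocks transferred, complementing the byte-rate meter.
val blockTransferRate = new Meter()
val blockTransferRateBytes = new Meter()

def onBlocksTransferred(numBlocks: Int, totalBytes: Long): Unit = {
  blockTransferRate.mark(numBlocks.toLong)
  blockTransferRateBytes.mark(totalBytes)
}

// Exporter side: surface timing percentiles from the Timer's Snapshot,
// not just the count and rate.
def timerPercentiles(name: String, timer: Timer): Map[String, Double] = {
  val s = timer.getSnapshot
  Map(
    s"${name}_p50" -> s.getMedian,
    s"${name}_p95" -> s.get95thPercentile,
    s"${name}_p99" -> s.get99thPercentile)
}
{code}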



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34887) Port/integrate Koalas dependencies into PySpark

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334918#comment-17334918
 ] 

Apache Spark commented on SPARK-34887:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32386

> Port/integrate Koalas dependencies into PySpark
> ---
>
> Key: SPARK-34887
> URL: https://issues.apache.org/jira/browse/SPARK-34887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port Koalas dependencies appropriately to PySpark 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34887:


Assignee: (was: Apache Spark)

> Port/integrate Koalas dependencies into PySpark
> ---
>
> Key: SPARK-34887
> URL: https://issues.apache.org/jira/browse/SPARK-34887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port Koalas dependencies appropriately to PySpark 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34887:


Assignee: Apache Spark

> Port/integrate Koalas dependencies into PySpark
> ---
>
> Key: SPARK-34887
> URL: https://issues.apache.org/jira/browse/SPARK-34887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port Koalas dependencies appropriately to PySpark 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34887) Port/integrate Koalas dependencies into PySpark

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34887:


Assignee: (was: Apache Spark)

> Port/integrate Koalas dependencies into PySpark
> ---
>
> Key: SPARK-34887
> URL: https://issues.apache.org/jira/browse/SPARK-34887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port Koalas dependencies appropriately to PySpark 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34887) Port/integrate Koalas dependencies into PySpark

2021-04-28 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334915#comment-17334915
 ] 

Xinrong Meng commented on SPARK-34887:
--

May I work on this ticket?

> Port/integrate Koalas dependencies into PySpark
> ---
>
> Key: SPARK-34887
> URL: https://issues.apache.org/jira/browse/SPARK-34887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port Koalas dependencies appropriately to PySpark 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34981:
---

Assignee: Chao Sun

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34981.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32082
[https://github.com/apache/spark/pull/32082]

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34705) Add code-gen for all join types of sort merge join

2021-04-28 Thread Xianjin YE (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334830#comment-17334830
 ] 

Xianjin YE commented on SPARK-34705:


[~chengsu] could you share some numbers for the CPU performance improvement?

> Add code-gen for all join types of sort merge join
> --
>
> Key: SPARK-34705
> URL: https://issues.apache.org/jira/browse/SPARK-34705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> Currently sort merge join supports code-gen only for the inner join type 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374]
>  ). We added code-gen for other join types internally in our fork and saw 
> obvious CPU performance improvement. Creating this Jira to propose merging 
> this back to upstream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18188) Add checksum for block of broadcast

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334826#comment-17334826
 ] 

Apache Spark commented on SPARK-18188:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32385

> Add checksum for block of broadcast
> ---
>
> Key: SPARK-18188
> URL: https://issues.apache.org/jira/browse/SPARK-18188
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
> Fix For: 2.1.0
>
>
> There has been a long-standing issue: 
> https://issues.apache.org/jira/browse/SPARK-4105. Without any checksum for 
> the blocks, it's very hard for us to identify where the bug came from.
> Shuffle blocks are compressed separately (and have a checksum in them), but 
> broadcast blocks are compressed together; we should add a checksum for each of 
> them separately.
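
For context, a minimal Scala sketch of the per-block checksum idea ({{Adler32}} is chosen purely for illustration; the algorithm and wire format used by the actual change are not shown here):
{code:scala}
import java.nio.ByteBuffer
import java.util.zip.Adler32

// Checksum one broadcast block so a corrupted block can be pinpointed on the
// receiving side instead of failing with an opaque decompression error.
def blockChecksum(block: ByteBuffer): Long = {
  val checksum = new Adler32()
  checksum.update(block.duplicate())  // duplicate() leaves the caller's position untouched
  checksum.getValue
}
{code}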



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35257) Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334798#comment-17334798
 ] 

Apache Spark commented on SPARK-35257:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32384

> Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
> --
>
> Key: SPARK-35257
> URL: https://issues.apache.org/jira/browse/SPARK-35257
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> HadoopVersionInfoSuite uses a separate ivyPath to download the jars. We can 
> use `HiveClientBuilder.ivyPath` and specify the environment variable 
> `SPARK_VERSIONS_SUITE_IVY_PATH`, like `VersionsSuite` does, to avoid 
> downloading jars repeatedly and speed up the test.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35257) Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35257:


Assignee: Apache Spark

> Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
> --
>
> Key: SPARK-35257
> URL: https://issues.apache.org/jira/browse/SPARK-35257
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> HadoopVersionInfoSuite uses a separate ivyPath to download the jars. We can 
> use `HiveClientBuilder.ivyPath` and specify the environment variable 
> `SPARK_VERSIONS_SUITE_IVY_PATH`, like `VersionsSuite` does, to avoid 
> downloading jars repeatedly and speed up the test.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35257) Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334797#comment-17334797
 ] 

Apache Spark commented on SPARK-35257:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32384

> Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
> --
>
> Key: SPARK-35257
> URL: https://issues.apache.org/jira/browse/SPARK-35257
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> HadoopVersionInfoSuite uses a separate ivyPath to download the jars. We can 
> use `HiveClientBuilder.ivyPath` and specify the environment variable 
> `SPARK_VERSIONS_SUITE_IVY_PATH`, like `VersionsSuite` does, to avoid 
> downloading jars repeatedly and speed up the test.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35257) Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35257:


Assignee: (was: Apache Spark)

> Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up
> --
>
> Key: SPARK-35257
> URL: https://issues.apache.org/jira/browse/SPARK-35257
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> HadoopVersionInfoSuite uses a separate ivyPath to download the jars. We can 
> use `HiveClientBuilder.ivyPath` and specify the environment variable 
> `SPARK_VERSIONS_SUITE_IVY_PATH`, like `VersionsSuite` does, to avoid 
> downloading jars repeatedly and speed up the test.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35257) Let `HadoopVersionInfoSuite` can use SPARK_VERSIONS_SUITE_IVY_PATH to speed up

2021-04-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-35257:


 Summary: Let `HadoopVersionInfoSuite` can use 
SPARK_VERSIONS_SUITE_IVY_PATH to speed up
 Key: SPARK-35257
 URL: https://issues.apache.org/jira/browse/SPARK-35257
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: Yang Jie


HadoopVersionInfoSuite uses a separate ivyPath to download the jars. We can use 
`HiveClientBuilder.ivyPath` and specify the environment variable 
`SPARK_VERSIONS_SUITE_IVY_PATH`, like `VersionsSuite` does, to avoid downloading 
jars repeatedly and speed up the test.
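
A minimal sketch of the pattern being proposed (Scala, illustrative only): prefer {{SPARK_VERSIONS_SUITE_IVY_PATH}} when it is set, and only fall back to a throwaway directory otherwise, so repeated runs can reuse the downloaded jars.
{code:scala}
import java.nio.file.Files

// Reuse a caller-provided ivy cache if the environment variable is set;
// otherwise fall back to a fresh temporary directory (the slow path).
val ivyPath: String = sys.env.getOrElse(
  "SPARK_VERSIONS_SUITE_IVY_PATH",
  Files.createTempDirectory("hive-ivy").toString)
{code}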

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35256) str_to_map + split performance regression

2021-04-28 Thread Ondrej Kokes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ondrej Kokes updated SPARK-35256:
-
Description: 
I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline 
that does mostly str_to_map, split and a few other operations - all 
projections, no joins or aggregations (it's here only to trigger the pipeline). 
I cut it down to the simplest reproducible example I could - anything I remove 
from this changes the runtime difference quite dramatically. (even moving those 
two expressions from f.when to standalone columns makes the difference 
disappear)
{code:python}
import time
import os

import pyspark
from pyspark.sql import SparkSession

import pyspark.sql.functions as f

if __name__ == '__main__':
    print(pyspark.__version__)
    spark = SparkSession.builder.getOrCreate()

    filename = 'regression.csv'
    if not os.path.isfile(filename):
        with open(filename, 'wt') as fw:
            fw.write('foo\n')
            for _ in range(10_000_000):
                fw.write('foo=bar=bak=f,o,1:2:3\n')

    df = spark.read.option('header', True).csv(filename)
    t = time.time()
    dd = (df
          .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
          .withColumn('extracted',
                      # without this top level split it is only 50% slower,
                      # with it the runtime almost doubles
                      f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0])
          .select(
              f.when(
                  f.col("extracted").startswith("foo"), f.col("extracted")
              ).otherwise(
                  f.concat(f.lit("foo"), f.col("extracted"))
              ).alias("foo")
          ))
    # dd.explain(True)
    _ = dd.groupby("foo").count().count()
    print("elapsed", time.time() - t)
{code}
Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my 
local macOS)
{code:java}
3.0.1
elapsed 21.262351036071777
3.1.1
elapsed 40.26582884788513
{code}
(Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)

Feel free to make the CSV smaller to get a quicker feedback loop - it scales 
linearly (I developed this with 2M rows).

It might be related to my previous issue - SPARK-32989 - there are similar 
operations, nesting etc. (splitting on the original column, not on a map, makes 
the difference disappear)

I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 
3.1.1 produced identical plans.

  was:
I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline 
that does mostly str_to_map, split and a few other operations - all 
projections, no joins or aggregations (it's here only to trigger the pipeline). 
I cut it down to the simplest reproducible example I could - anything I remove 
from this changes the runtime difference quite dramatically. (even moving those 
two expressions from f.when to standalone columns makes the difference 
disappear)
{code:python}
import time
import os

import pyspark
from pyspark.sql import SparkSession

import pyspark.sql.functions as f

if __name__ == '__main__':
    print(pyspark.__version__)
    spark = SparkSession.builder.getOrCreate()

    filename = 'regression.csv'
    if not os.path.isfile(filename):
        with open(filename, 'wt') as fw:
            fw.write('foo\n')
            for _ in range(10_000_000):
                fw.write('foo=bar=bak=f,o,1:2:3\n')

    df = spark.read.option('header', True).csv(filename)
    t = time.time()
    dd = (df
          .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
          .withColumn('extracted',
                      # without this top level split (so just
                      # `f.split(f.col("my_map")["bar"], ",")[2]`), it's only
                      # 50% slower, with it it's 100%
                      f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0])
          .select(
              f.when(
                  f.col("extracted").startswith("foo"), f.col("extracted")
              ).otherwise(
                  f.concat(f.lit("foo"), f.col("extracted"))
              ).alias("foo")
          ))
    # dd.explain(True)
    _ = dd.groupby("foo").count().count()
    print("elapsed", time.time() - t)
{code}
Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my 
local macOS)
{code:java}
3.0.1
elapsed 21.262351036071777
3.1.1
elapsed 40.26582884788513
{code}
(Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)

Feel free to make the CSV smaller to get a quicker feedback loop - it scales 
linearly (I developed this with 2M rows).

It might be related to my previous issue - SPARK-32989 - there are similar 
operations, nesting etc. (splitting on the original column, not on a map, makes 
the difference disappear)

[jira] [Created] (SPARK-35256) str_to_map + split performance regression

2021-04-28 Thread Ondrej Kokes (Jira)
Ondrej Kokes created SPARK-35256:


 Summary: str_to_map + split performance regression
 Key: SPARK-35256
 URL: https://issues.apache.org/jira/browse/SPARK-35256
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Ondrej Kokes


I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline 
that does mostly str_to_map, split and a few other operations - all 
projections, no joins or aggregations (it's here only to trigger the pipeline). 
I cut it down to the simplest reproducible example I could - anything I remove 
from this changes the runtime difference quite dramatically. (even moving those 
two expressions from f.when to standalone columns makes the difference 
disappear)
{code:python}
import time
import os

import pyspark
from pyspark.sql import SparkSession

import pyspark.sql.functions as f

if __name__ == '__main__':
    print(pyspark.__version__)
    spark = SparkSession.builder.getOrCreate()

    filename = 'regression.csv'
    if not os.path.isfile(filename):
        with open(filename, 'wt') as fw:
            fw.write('foo\n')
            for _ in range(10_000_000):
                fw.write('foo=bar=bak=f,o,1:2:3\n')

    df = spark.read.option('header', True).csv(filename)
    t = time.time()
    dd = (df
          .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
          .withColumn('extracted',
                      # without this top level split (so just
                      # `f.split(f.col("my_map")["bar"], ",")[2]`), it's only
                      # 50% slower, with it it's 100%
                      f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0])
          .select(
              f.when(
                  f.col("extracted").startswith("foo"), f.col("extracted")
              ).otherwise(
                  f.concat(f.lit("foo"), f.col("extracted"))
              ).alias("foo")
          ))
    # dd.explain(True)
    _ = dd.groupby("foo").count().count()
    print("elapsed", time.time() - t)
{code}
Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my 
local macOS)
{code:java}
3.0.1
elapsed 21.262351036071777
3.1.1
elapsed 40.26582884788513
{code}
(Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)

Feel free to make the CSV smaller to get a quicker feedback loop - it scales 
linearly (I developed this with 2M rows).

It might be related to my previous issue - SPARK-32989 - there are similar 
operations, nesting etc. (splitting on the original column, not on a map, makes 
the difference disappear)

I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 
3.1.1 produced identical plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334738#comment-17334738
 ] 

Apache Spark commented on SPARK-35255:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32383

> Automated formatting for Scala Code for Blank Lines.
> 
>
> Key: SPARK-35255
> URL: https://issues.apache.org/jira/browse/SPARK-35255
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Based on Databricks' Scala style guide for blank lines: 
> [https://github.com/databricks/scala-style-guide#blanklines]
> Add a configuration that controls whether to enforce a blank line before and/or 
> after a top-level statement spanning a certain number of lines.
> Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35255:


Assignee: (was: Apache Spark)

> Automated formatting for Scala Code for Blank Lines.
> 
>
> Key: SPARK-35255
> URL: https://issues.apache.org/jira/browse/SPARK-35255
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Based on Databricks' Scala style guide for blank lines: 
> [https://github.com/databricks/scala-style-guide#blanklines]
> Add a configuration that controls whether to enforce a blank line before and/or 
> after a top-level statement spanning a certain number of lines.
> Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35255:


Assignee: Apache Spark

> Automated formatting for Scala Code for Blank Lines.
> 
>
> Key: SPARK-35255
> URL: https://issues.apache.org/jira/browse/SPARK-35255
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> Based on Databricks' Scala style guide for blank lines: 
> [https://github.com/databricks/scala-style-guide#blanklines]
> Add a configuration that controls whether to enforce a blank line before and/or 
> after a top-level statement spanning a certain number of lines.
> Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334736#comment-17334736
 ] 

Apache Spark commented on SPARK-35255:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32383

> Automated formatting for Scala Code for Blank Lines.
> 
>
> Key: SPARK-35255
> URL: https://issues.apache.org/jira/browse/SPARK-35255
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Based on Databricks' Scala style guide for blank lines: 
> [https://github.com/databricks/scala-style-guide#blanklines]
> Add a configuration that controls whether to enforce a blank line before and/or 
> after a top-level statement spanning a certain number of lines.
> Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35255) Automated formatting for Scala Code for Blank Lines.

2021-04-28 Thread Zhu, Lipeng (Jira)
Zhu, Lipeng created SPARK-35255:
---

 Summary: Automated formatting for Scala Code for Blank Lines.
 Key: SPARK-35255
 URL: https://issues.apache.org/jira/browse/SPARK-35255
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Zhu, Lipeng


Based on Databricks' Scala style guide for blank lines: 
[https://github.com/databricks/scala-style-guide#blanklines]

Add a configuration that controls whether to enforce a blank line before and/or 
after a top-level statement spanning a certain number of lines.

Also upgrade the mvn-scalafmt plugin to 1.0.4 to enable this configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35254) Upgrade SBT to 1.5.1

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35254:


Assignee: (was: Apache Spark)

> Upgrade SBT to 1.5.1
> 
>
> Key: SPARK-35254
> URL: https://issues.apache.org/jira/browse/SPARK-35254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> https://github.com/sbt/sbt/releases/tag/v1.5.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35254) Upgrade SBT to 1.5.1

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334687#comment-17334687
 ] 

Apache Spark commented on SPARK-35254:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32382

> Upgrade SBT to 1.5.1
> 
>
> Key: SPARK-35254
> URL: https://issues.apache.org/jira/browse/SPARK-35254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> https://github.com/sbt/sbt/releases/tag/v1.5.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35254) Upgrade SBT to 1.5.1

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35254:


Assignee: Apache Spark

> Upgrade SBT to 1.5.1
> 
>
> Key: SPARK-35254
> URL: https://issues.apache.org/jira/browse/SPARK-35254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/sbt/sbt/releases/tag/v1.5.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35254) Upgrade SBT to 1.5.1

2021-04-28 Thread Zhu, Lipeng (Jira)
Zhu, Lipeng created SPARK-35254:
---

 Summary: Upgrade SBT to 1.5.1
 Key: SPARK-35254
 URL: https://issues.apache.org/jira/browse/SPARK-35254
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Zhu, Lipeng


https://github.com/sbt/sbt/releases/tag/v1.5.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35159) extract doc of hive format

2021-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35159:
-
Fix Version/s: 3.1.2
   3.0.3

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> extract doc of hive format
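
For readers unfamiliar with the feature, a hedged sketch of the Hive-format 
clauses such a page would document (table and column names are placeholders, 
and Hive support must be enabled):

{code:scala}
// Illustrative only: a Hive-format table definition of the kind the new doc
// page describes. Requires a SparkSession built with enableHiveSupport().
spark.sql("""
  CREATE TABLE hive_format_example (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
""")
{code}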



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334633#comment-17334633
 ] 

Apache Spark commented on SPARK-35229:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32381

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Priority: Blocker
>
> In a Spark Streaming application with 1000+ executors, more than 2000 events 
> (executor events, job events) are generated, and the jobs/job web pages of 
> Spark fail to render. The browser (Chrome, Firefox, Safari) freezes, and I 
> had to open another window and reach the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Some suggestions:
> 1) The page should not render the timeline while loading unless the user 
> clicks the link.
> 2) The executor group and the job group should be separated, and the user 
> should be able to show one, both, or neither.
> 3) The executor group should display executor events by time horizontally. 
> Currently executors are displayed one per line; with more than 100 executors 
> the page does not look good.
> 4) The vis-timeline library has not been maintained since 2017 and should be 
> replaced with a maintained fork such as [https://github.com/visjs/vis-timeline]
> 5) It would also be good to show only recent events, e.g. 500, and load more 
> if the user wants to see more; the data could, however, still all be loaded 
> at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334632#comment-17334632
 ] 

Apache Spark commented on SPARK-35229:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32381

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Priority: Blocker
>
> In a Spark Streaming application with 1000+ executors, more than 2000 events 
> (executor events, job events) are generated, and the jobs/job web pages of 
> Spark fail to render. The browser (Chrome, Firefox, Safari) freezes, and I 
> had to open another window and reach the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Some suggestions:
> 1) The page should not render the timeline while loading unless the user 
> clicks the link.
> 2) The executor group and the job group should be separated, and the user 
> should be able to show one, both, or neither.
> 3) The executor group should display executor events by time horizontally. 
> Currently executors are displayed one per line; with more than 100 executors 
> the page does not look good.
> 4) The vis-timeline library has not been maintained since 2017 and should be 
> replaced with a maintained fork such as [https://github.com/visjs/vis-timeline]
> 5) It would also be good to show only recent events, e.g. 500, and load more 
> if the user wants to see more; the data could, however, still all be loaded 
> at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35229:


Assignee: Apache Spark

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Assignee: Apache Spark
>Priority: Blocker
>
> In a Spark Streaming application with 1000+ executors, more than 2000 events 
> (executor events, job events) are generated, and the jobs/job web pages of 
> Spark fail to render. The browser (Chrome, Firefox, Safari) freezes, and I 
> had to open another window and reach the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Some suggestions:
> 1) The page should not render the timeline while loading unless the user 
> clicks the link.
> 2) The executor group and the job group should be separated, and the user 
> should be able to show one, both, or neither.
> 3) The executor group should display executor events by time horizontally. 
> Currently executors are displayed one per line; with more than 100 executors 
> the page does not look good.
> 4) The vis-timeline library has not been maintained since 2017 and should be 
> replaced with a maintained fork such as [https://github.com/visjs/vis-timeline]
> 5) It would also be good to show only recent events, e.g. 500, and load more 
> if the user wants to see more; the data could, however, still all be loaded 
> at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35229:


Assignee: (was: Apache Spark)

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Priority: Blocker
>
> In a Spark Streaming application with 1000+ executors, more than 2000 events 
> (executor events, job events) are generated, and the jobs/job web pages of 
> Spark fail to render. The browser (Chrome, Firefox, Safari) freezes, and I 
> had to open another window and reach the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Some suggestions:
> 1) The page should not render the timeline while loading unless the user 
> clicks the link.
> 2) The executor group and the job group should be separated, and the user 
> should be able to show one, both, or neither.
> 3) The executor group should display executor events by time horizontally. 
> Currently executors are displayed one per line; with more than 100 executors 
> the page does not look good.
> 4) The vis-timeline library has not been maintained since 2017 and should be 
> replaced with a maintained fork such as [https://github.com/visjs/vis-timeline]
> 5) It would also be good to show only recent events, e.g. 500, and load more 
> if the user wants to see more; the data could, however, still all be loaded 
> at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34781) Eliminate LEFT SEMI/ANTI join to its left child side with AQE

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334631#comment-17334631
 ] 

Apache Spark commented on SPARK-34781:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32380

> Eliminate LEFT SEMI/ANTI join to its left child side with AQE
> -
>
> Key: SPARK-34781
> URL: https://issues.apache.org/jira/browse/SPARK-34781
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.2.0
>
>
> In `EliminateJoinToEmptyRelation.scala`, we can extend the rule to cover more 
> cases for LEFT SEMI and LEFT ANTI joins:
>  # The join is a left semi join, the right side is non-empty, and the join 
> condition is empty. Eliminate the join to its left side.
>  # The join is a left anti join and the right side is empty. Eliminate the 
> join to its left side.
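
As a hedged illustration of the two query shapes above (table names are made 
up, and whether the rewrite fires depends on the statistics AQE observes at 
runtime):

{code:scala}
// Illustrative only: queries the extended rule could collapse to the left
// child when AQE observes the right side as described above.
val t1 = spark.range(100).toDF("id")
val t2 = spark.range(10).toDF("id")
t1.createOrReplaceTempView("t1")
t2.createOrReplaceTempView("t2")

// Case 1: LEFT SEMI join, non-empty right side, no join condition --
// every left row is kept, so the join is equivalent to the left child.
val semi = spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2")

// Case 2: LEFT ANTI join whose right side turns out to be empty --
// nothing is filtered out, so the join is again equivalent to the left child.
val t2Empty = spark.range(10).toDF("id").filter("id < 0")
t2Empty.createOrReplaceTempView("t2_empty")
val anti = spark.sql(
  "SELECT * FROM t1 LEFT ANTI JOIN t2_empty ON t1.id = t2_empty.id")
{code}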



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-28 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334577#comment-17334577
 ] 

Nick Hryhoriev commented on SPARK-11844:


[~Xu_Guang_Lv] My issue is also always reproducible because the file itself is 
damaged.
It fails with any Parquet reader, not only Spark, which means the damaged data 
was written that way in the first place.
In my case, I write with Spark, and Spark writes the corrupted files silently, 
without any exception or job failure.
Because the data itself is damaged, there is no workaround or patch on the read 
path: you need to rewrite the corrupted file.
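
For anyone hitting the same situation, a hedged sketch of one way to salvage 
the readable part and rewrite it (paths are placeholders; the corrupted files 
themselves are simply skipped and must be regenerated from the source data):

{code:scala}
// Sketch only: re-read a directory while skipping unreadable Parquet files,
// then rewrite the surviving rows to a fresh location.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rewrite-parquet").getOrCreate()

// Skip files Spark cannot read instead of failing the whole job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val readable = spark.read.parquet("/data/input_with_corrupt_files")
readable.write.mode("overwrite").parquet("/data/rewritten")
{code}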

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>   at 
> parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> {code}
> The next retry was good. Right now, seems not critical. But, let's still 

[jira] [Commented] (SPARK-35159) extract doc of hive format

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334569#comment-17334569
 ] 

Apache Spark commented on SPARK-35159:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32379

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> extract doc of hive format



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35159) extract doc of hive format

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334567#comment-17334567
 ] 

Apache Spark commented on SPARK-35159:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32378

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> extract doc of hive format



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35159) extract doc of hive format

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334566#comment-17334566
 ] 

Apache Spark commented on SPARK-35159:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32378

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> extract doc of hive format



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35021) Group exception messages in connector/catalog

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35021:


Assignee: Apache Spark

> Group exception messages in connector/catalog
> -
>
> Key: SPARK-35021
> URL: https://issues.apache.org/jira/browse/SPARK-35021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog'
> || Filename ||   Count ||
> | CatalogManager.scala |   2 |
> | CatalogV2Implicits.scala |   3 |
> | CatalogV2Util.scala  |   8 |
> | LookupCatalog.scala  |   2 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35021) Group exception messages in connector/catalog

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35021:


Assignee: (was: Apache Spark)

> Group exception messages in connector/catalog
> -
>
> Key: SPARK-35021
> URL: https://issues.apache.org/jira/browse/SPARK-35021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog'
> || Filename ||   Count ||
> | CatalogManager.scala |   2 |
> | CatalogV2Implicits.scala |   3 |
> | CatalogV2Util.scala  |   8 |
> | LookupCatalog.scala  |   2 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35021) Group exception messages in connector/catalog

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334551#comment-17334551
 ] 

Apache Spark commented on SPARK-35021:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32377

> Group exception messages in connector/catalog
> -
>
> Key: SPARK-35021
> URL: https://issues.apache.org/jira/browse/SPARK-35021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog'
> || Filename ||   Count ||
> | CatalogManager.scala |   2 |
> | CatalogV2Implicits.scala |   3 |
> | CatalogV2Util.scala  |   8 |
> | LookupCatalog.scala  |   2 |
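
For context on what "grouping" means here, a hedged sketch of the general 
pattern (the object and method names below are illustrative, not the ones the 
linked PR actually introduces):

{code:scala}
// Illustrative only: inline error messages scattered across catalog code are
// moved behind a shared error-helper object so the wording stays consistent.
object CatalogErrors {
  def catalogNotFoundError(name: String): Throwable =
    new IllegalArgumentException(s"Catalog '$name' was not found")
}

// Before: each call site formats its own message.
//   throw new IllegalArgumentException(s"Catalog '$name' was not found")
// After: call sites delegate to the helper.
//   throw CatalogErrors.catalogNotFoundError(name)
{code}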



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-35021) Group exception messages in connector/catalog

2021-04-28 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35021:
---
Comment: was deleted

(was: I'm working on.)

> Group exception messages in connector/catalog
> -
>
> Key: SPARK-35021
> URL: https://issues.apache.org/jira/browse/SPARK-35021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog'
> || Filename ||   Count ||
> | CatalogManager.scala |   2 |
> | CatalogV2Implicits.scala |   3 |
> | CatalogV2Util.scala  |   8 |
> | LookupCatalog.scala  |   2 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35214) OptimizeSkewedJoin support ShuffledHashJoinExec

2021-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35214.
--
Fix Version/s: 3.2.0
 Assignee: ulysses you
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32328

> OptimizeSkewedJoin support ShuffledHashJoinExec
> ---
>
> Key: SPARK-35214
> URL: https://issues.apache.org/jira/browse/SPARK-35214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently we already support every join type through hints, which makes it 
> easy to choose the join implementation.
> We would choose `ShuffledHashJoin` if one table is not big but is still over 
> the broadcast threshold. It would be better if we could also optimize such 
> joins in `OptimizeSkewedJoin`.
>  
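
A hedged sketch of how a shuffled hash join is requested via hints today, i.e. 
the plan shape the rule would additionally need to cover (DataFrame and table 
names are placeholders):

{code:scala}
// Illustrative only: forcing a shuffled hash join through the hint API and
// the equivalent SQL hint.
val largeDf  = spark.range(1000000L).withColumnRenamed("id", "key")
val mediumDf = spark.range(10000L).withColumnRenamed("id", "key")

val joined = largeDf.join(mediumDf.hint("shuffle_hash"), Seq("key"))

largeDf.createOrReplaceTempView("large")
mediumDf.createOrReplaceTempView("medium")
val joinedSql = spark.sql(
  "SELECT /*+ SHUFFLE_HASH(m) */ * FROM large l JOIN medium m ON l.key = m.key")
{code}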



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,

2021-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33976:
-
Fix Version/s: 3.1.2
   3.0.3

> Add a dedicated SQL document page for the TRANSFORM-related functionality,
> --
>
> Key: SPARK-33976
> URL: https://issues.apache.org/jira/browse/SPARK-33976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> Add doc about transform 
> https://github.com/apache/spark/pull/30973#issuecomment-753715318
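
For readers who haven't used the feature, a small hedged example of the 
TRANSFORM syntax that page documents ('cat' simply echoes its input; the table 
name is a placeholder and the script must be available on the executors):

{code:scala}
// Illustrative only: run each row of a table through an external script.
spark.range(3)
  .selectExpr("id", "concat('name_', cast(id AS string)) AS name")
  .createOrReplaceTempView("people")

val transformed = spark.sql("""
  SELECT TRANSFORM(id, name)
    USING 'cat'
    AS (id_out STRING, name_out STRING)
  FROM people
""")
{code}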



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35021) Group exception messages in connector/catalog

2021-04-28 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334524#comment-17334524
 ] 

jiaan.geng commented on SPARK-35021:


I'm working on it.

> Group exception messages in connector/catalog
> -
>
> Key: SPARK-35021
> URL: https://issues.apache.org/jira/browse/SPARK-35021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog'
> || Filename ||   Count ||
> | CatalogManager.scala |   2 |
> | CatalogV2Implicits.scala |   3 |
> | CatalogV2Util.scala  |   8 |
> | LookupCatalog.scala  |   2 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-04-28 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334511#comment-17334511
 ] 

Kousuke Saruta commented on SPARK-35229:


[~tiehexue] Thanks for the report.
I'll try to mitigate this issue by limiting the maximum number of 
jobs/executors shown.
The maximum number of tasks is already limited by 
spark.ui.timeline.tasks.maximum, so I'll try a similar approach.
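
A hedged sketch from the user side: spark.ui.timeline.tasks.maximum exists 
today, while the job/executor limits below are only prospective names used for 
illustration, not settings that exist at the time of writing.

{code:scala}
// Sketch only: capping how many events the timeline renders. The tasks limit
// is a real setting; the jobs/executors limits are hypothetical names here.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("timeline-limits")
  .config("spark.ui.timeline.tasks.maximum", "1000")     // existing setting
  .config("spark.ui.timeline.jobs.maximum", "500")       // hypothetical name
  .config("spark.ui.timeline.executors.maximum", "250")  // hypothetical name
  .getOrCreate()
{code}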


> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Priority: Blocker
>
> In a Spark Streaming application with 1000+ executors, more than 2000 events 
> (executor events, job events) are generated, and the jobs/job web pages of 
> Spark fail to render. The browser (Chrome, Firefox, Safari) freezes, and I 
> had to open another window and reach the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Some suggestions:
> 1) The page should not render the timeline while loading unless the user 
> clicks the link.
> 2) The executor group and the job group should be separated, and the user 
> should be able to show one, both, or neither.
> 3) The executor group should display executor events by time horizontally. 
> Currently executors are displayed one per line; with more than 100 executors 
> the page does not look good.
> 4) The vis-timeline library has not been maintained since 2017 and should be 
> replaced with a maintained fork such as [https://github.com/visjs/vis-timeline]
> 5) It would also be good to show only recent events, e.g. 500, and load more 
> if the user wants to see more; the data could, however, still all be loaded 
> at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334491#comment-17334491
 ] 

Apache Spark commented on SPARK-33976:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32376

> Add a dedicated SQL document page for the TRANSFORM-related functionality,
> --
>
> Key: SPARK-33976
> URL: https://issues.apache.org/jira/browse/SPARK-33976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Add doc about transform 
> https://github.com/apache/spark/pull/30973#issuecomment-753715318



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334488#comment-17334488
 ] 

Apache Spark commented on SPARK-35253:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32374

> Upgrade Janino from 3.0.x to 3.1.x
> --
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> According to the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35253:


Assignee: (was: Apache Spark)

> Upgrade Janino from 3.0.x to 3.1.x
> --
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> According to the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334486#comment-17334486
 ] 

Apache Spark commented on SPARK-35253:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32374

> Upgrade Janino from 3.0.x to 3.1.x
> --
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> According to the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x

2021-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35253:


Assignee: Apache Spark

> Upgrade Janino from 3.0.x to 3.1.x
> --
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> According to the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33976) Add a dedicated SQL document page for the TRANSFORM-related functionality,

2021-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334485#comment-17334485
 ] 

Apache Spark commented on SPARK-33976:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32375

> Add a dedicated SQL document page for the TRANSFORM-related functionality,
> --
>
> Key: SPARK-33976
> URL: https://issues.apache.org/jira/browse/SPARK-33976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Add doc about transform 
> https://github.com/apache/spark/pull/30973#issuecomment-753715318



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35253) Upgrade Janino from 3.0.x to 3.1.x

2021-04-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-35253:


 Summary: Upgrade Janino from 3.0.x to 3.1.x
 Key: SPARK-35253
 URL: https://issues.apache.org/jira/browse/SPARK-35253
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 3.2.0
Reporter: Yang Jie


According to the [change log|http://janino-compiler.github.io/janino/changelog.html], 
the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line instead.
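
As a hedged sketch of what the dependency change looks like on the consumer 
side (sbt shown here; the exact 3.1.x version is illustrative only, not the 
one ultimately chosen):

{code:scala}
// build.sbt fragment, illustrative only; use the current 3.1.x release.
libraryDependencies ++= Seq(
  "org.codehaus.janino" % "janino"           % "3.1.4",
  "org.codehaus.janino" % "commons-compiler" % "3.1.4"
)
{code}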



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35244) invoke should throw the original exception

2021-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-35244:

Fix Version/s: 3.1.2
   3.0.3

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35085) Get columns operation should handle ANSI interval column properly

2021-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35085.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32345
[https://github.com/apache/spark/pull/32345]

> Get columns operation should handle ANSI interval column properly
> -
>
> Key: SPARK-35085
> URL: https://issues.apache.org/jira/browse/SPARK-35085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> # Write tests for ANSI intervals similar to test("get columns operation 
> should handle interval column properly")
> # Views can contain ANSI interval columns, which should be handled properly 
> via SparkGetColumnsOperation
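
A hedged example of the kind of view the operation now has to describe (the 
view name is a placeholder):

{code:scala}
// Illustrative only: a view exposing ANSI interval columns, which
// SparkGetColumnsOperation must report correctly over JDBC/Thrift.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW interval_view AS
  SELECT
    INTERVAL '1-2' YEAR TO MONTH        AS year_month_col,
    INTERVAL '1 02:03:04' DAY TO SECOND AS day_time_col
""")
{code}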



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35085) Get columns operation should handle ANSI interval column properly

2021-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35085:


Assignee: jiaan.geng

> Get columns operation should handle ANSI interval column properly
> -
>
> Key: SPARK-35085
> URL: https://issues.apache.org/jira/browse/SPARK-35085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
>
> # Write tests for ANSI intervals similar to test("get columns operation 
> should handle interval column properly")
> # Views can contain ANSI interval columns, which should be handled properly 
> via SparkGetColumnsOperation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org