[jira] [Updated] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-35881:
--
Fix Version/s: 3.2.0

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  
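For illustration, a rough sketch of the reflection-based workaround described
above, assuming Spark 3.1.x internals (only the getFinalPhysicalPlan method name
comes from the description; everything else here is hypothetical):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Sketch: obtain the final (possibly columnar) plan via reflection, then
// dispatch to the matching execution path, bypassing AdaptiveSparkPlanExec's
// own doExecute, which always assumes row-based output.
def executeAdaptively(aqe: AdaptiveSparkPlanExec): RDD[_] = {
  val m = classOf[AdaptiveSparkPlanExec].getDeclaredMethod("getFinalPhysicalPlan")
  m.setAccessible(true)
  val finalPlan = m.invoke(aqe).asInstanceOf[SparkPlan]
  if (finalPlan.supportsColumnar) {
    finalPlan.executeColumnar() // RDD[ColumnarBatch]
  } else {
    finalPlan.execute()         // RDD[InternalRow]
  }
}
{code}

A supported mechanism would remove the need for the getDeclaredMethod and
setAccessible calls in a sketch like this.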






[jira] [Resolved] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35881.
---
Fix Version/s: 3.3.0
   3.2.0
   Resolution: Fixed

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  






[jira] [Assigned] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-35881:
-

Assignee: Andy Grove

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  






[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383529#comment-17383529
 ] 

Thomas Graves commented on SPARK-25075:
---

Just wanted to check the plans for Scala 2.13 in 3.2. It looks like Scala 2.12 
will still be the default, correct?

Are we planning on releasing the Spark tgz artifacts for both 2.13 and 2.12, or 
only 2.12?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.






[jira] [Commented] (SPARK-32333) Drop references to Master

2021-07-09 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378093#comment-17378093
 ] 

Thomas Graves commented on SPARK-32333:
---

Getting back to this now that the Spark 3.2 branch is cut; perhaps we can 
target this for 3.3.

From the discussion thread on the spark-dev mailing list, "Leader" was 
mentioned the most, with "Scheduler" second.


One reason against "Controller", "Coordinator", "Application Manager", or 
"Primary" is that those names imply the process is required, whereas if the 
standalone master goes down the applications are unaffected.

Based on that feedback, I propose "Leader", since it received the most support 
and is short.

> Drop references to Master
> -
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We have a lot of references to "master" in the code base. It will be 
> beneficial to remove references to problematic language that can alienate 
> potential community members. 
> SPARK-32004 removed references to slave
>  
> Here is a IETF draft to fix up some of the most egregious examples
> (master/slave, whitelist/backlist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1






[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2021-06-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371592#comment-17371592
 ] 

Thomas Graves commented on SPARK-33031:
---

I do not think it's resolved, but I haven't tried it lately. It sounds like 
it's just a UI issue, so if you want to try it out and still see the problem, 
feel free to work on it.

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting in standalone mode and all the 
> executors were initially blacklisted. Then one of the executors died and we 
> were allocated another one. The scheduler did not appear to pick up the new 
> one and try to schedule on it, though.
> You can reproduce this by starting a master and worker on a single node, then 
> launching a shell where you will get multiple executors (in this case I got 
> 3):
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>   val context = TaskContext.get()
>   if (context.attemptNumber() < 2) {
>     throw new Exception("test attempt num")
>   }
>   it
> }
> rdd.collect()
> {code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029






[jira] [Resolved] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-06-25 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35672.
---
Fix Version/s: 3.2.0
 Assignee: Erik Krogen
   Resolution: Fixed

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0
>
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.






[jira] [Assigned] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-06-21 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-35391:
-

Assignee: Vasily Kolpakov

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Assignee: Vasily Kolpakov
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in the totalRunningTasksPerResourceProfile() method: 
> getOrElseUpdate() is used instead of getOrElse(). If the 
> spark-dynamic-executor-allocation thread calls schedule() after the 
> SparkListenerTaskEnd event for the last task in a stage but before the 
> SparkListenerStageCompleted event for that stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken: if the 
> SparkListenerTaskEnd event for the last task in a stage was processed before 
> SparkListenerStageCompleted for that stage, then 
> resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
>   Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  
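A minimal stand-alone illustration (not the actual Spark code) of why the
getOrElseUpdate() vs getOrElse() distinction in the first bullet above leaks
entries: the lookup itself inserts a default value into the map.

{code:scala}
import scala.collection.mutable

// resourceProfileId -> number of running tasks (simplified stand-in)
val totalRunningTasks = new mutable.HashMap[Int, Int]()

// Buggy variant: reading a missing key also inserts it, leaving a stale
// entry behind even after the stage completes.
def totalWithBug(rpId: Int): Int = totalRunningTasks.getOrElseUpdate(rpId, 0)

// Fixed variant: a pure lookup with a default, no side effect on the map.
def totalFixed(rpId: Int): Int = totalRunningTasks.getOrElse(rpId, 0)

totalWithBug(1)
println(totalRunningTasks) // a stale entry (1 -> 0) was created by a read
{code}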






[jira] [Resolved] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-06-21 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35391.
---
Fix Version/s: 3.1.3
   3.2.0
   Resolution: Fixed

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in the totalRunningTasksPerResourceProfile() method: 
> getOrElseUpdate() is used instead of getOrElse(). If the 
> spark-dynamic-executor-allocation thread calls schedule() after the 
> SparkListenerTaskEnd event for the last task in a stage but before the 
> SparkListenerStageCompleted event for that stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken: if the 
> SparkListenerTaskEnd event for the last task in a stage was processed before 
> SparkListenerStageCompleted for that stage, then 
> resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
>   Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Assigned] (SPARK-35074) spark.jars.xxx configs should be moved to config/package.scala

2021-06-07 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-35074:
-

Assignee: dc-heros

> spark.jars.xxx configs should be moved to config/package.scala
> --
>
> Key: SPARK-35074
> URL: https://issues.apache.org/jira/browse/SPARK-35074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: dc-heros
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently {{spark.jars.xxx}} property keys (e.g. {{spark.jars.ivySettings}} 
> and {{spark.jars.packages}}) are hardcoded in multiple places within Spark 
> code across multiple modules. We should define them in 
> {{config/package.scala}} and reference them in all other places.
> This came up during reviews of SPARK-34472 at 
> https://github.com/apache/spark/pull/31591#discussion_r584848624
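For illustration, a hypothetical sketch of what such an entry could look like in
config/package.scala (the constant name, doc text, and version below are
assumptions, not the actual values from the PR):

{code:scala}
// Sketch of an entry inside the org.apache.spark.internal.config package
// object, where ConfigBuilder is already in scope.
private[spark] val JAR_IVY_SETTING_PATH =
  ConfigBuilder("spark.jars.ivySettings")
    .doc("Path to an Ivy settings file used when resolving spark.jars.packages.")
    .version("2.2.0")
    .stringConf
    .createOptional
{code}

Callers would then reference the constant (e.g. conf.get(JAR_IVY_SETTING_PATH))
instead of repeating the raw property string.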






[jira] [Resolved] (SPARK-35074) spark.jars.xxx configs should be moved to config/package.scala

2021-06-07 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35074.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

> spark.jars.xxx configs should be moved to config/package.scala
> --
>
> Key: SPARK-35074
> URL: https://issues.apache.org/jira/browse/SPARK-35074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently {{spark.jars.xxx}} property keys (e.g. {{spark.jars.ivySettings}} 
> and {{spark.jars.packages}}) are hardcoded in multiple places within Spark 
> code across multiple modules. We should define them in 
> {{config/package.scala}} and reference them in all other places.
> This came up during reviews of SPARK-34472 at 
> https://github.com/apache/spark/pull/31591#discussion_r584848624






[jira] [Resolved] (SPARK-35093) AQE columnar mismatch on exchange reuse

2021-05-19 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35093.
---
Fix Version/s: 3.2.0
   3.1.2
   3.0.3
 Assignee: Andy Grove
   Resolution: Fixed

> AQE columnar mismatch on exchange reuse
> ---
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  
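Conceptually, the reuse check needs to take the columnar nature of the two
exchanges into account in addition to plan equivalence. A minimal sketch of
that condition (not the merged fix):

{code:scala}
import org.apache.spark.sql.execution.SparkPlan

// Two exchanges are only interchangeable when both the canonicalized plan and
// the row/columnar output format match; canonicalization alone is not enough
// once a plugin has replaced one of them with a columnar version.
def sameResultAndFormat(a: SparkPlan, b: SparkPlan): Boolean =
  a.sameResult(b) && a.supportsColumnar == b.supportsColumnar
{code}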






[jira] [Resolved] (SPARK-34472) SparkContext.addJar with an ivy path fails in cluster mode with a custom ivySettings file

2021-04-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-34472.
---
Fix Version/s: 3.2.0
 Assignee: Shardul Mahadik
   Resolution: Fixed

> SparkContext.addJar with an ivy path fails in cluster mode with a custom 
> ivySettings file
> -
>
> Key: SPARK-34472
> URL: https://issues.apache.org/jira/browse/SPARK-34472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
> Fix For: 3.2.0
>
>
> SPARK-33084 introduced support for Ivy paths in {{sc.addJar}} or Spark SQL 
> {{ADD JAR}}. If we use a custom ivySettings file using 
> {{spark.jars.ivySettings}}, it is loaded at 
> [https://github.com/apache/spark/blob/b26e7b510bbaee63c4095ab47e75ff2a70e377d7/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1280.]
>  However, this file is only accessible on the client machine. In cluster 
> mode, this file is not available on the driver and so {{addJar}} fails.
> {code:sh}
> spark-submit --master yarn --deploy-mode cluster --class IvyAddJarExample 
> --conf spark.jars.ivySettings=/path/to/ivySettings.xml example.jar
> {code}
> {code}
> java.lang.IllegalArgumentException: requirement failed: Ivy settings file 
> /path/to/ivySettings.xml does not exist
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.deploy.SparkSubmitUtils$.loadIvySettings(SparkSubmit.scala:1331)
>   at 
> org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:176)
>   at 
> org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:156)
>   at 
> org.apache.spark.sql.internal.SessionResourceLoader.resolveJars(SessionState.scala:166)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:133)
>   at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
>  {code}
> We should ship the ivySettings file to the driver so that {{addJar}} is able 
> to find it.






[jira] [Resolved] (SPARK-34877) Add Spark AM Log link in case of master as yarn and deploy mode as client

2021-04-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-34877.
---
Fix Version/s: 3.2.0
 Assignee: Saurabh Chawla
   Resolution: Fixed

> Add Spark AM Log link in case of master as yarn and deploy mode as client
> -
>
> Key: SPARK-34877
> URL: https://issues.apache.org/jira/browse/SPARK-34877
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.1.1
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Minor
> Fix For: 3.2.0
>
>
> On Running Spark job with yarn and deployment mode as client, Spark Driver 
> and Spark Application master launch in two separate containers. In various 
> scenarios there is need to see Spark Application master logs to see the 
> resource allocation, Decommissioning status and other information shared 
> between yarn RM and Spark Application master.
> Till now the only way to check this by finding the container id of the AM and 
> check the logs either using Yarn utility or Yarn RM Application History 
> server. 
> This Jira is for adding the spark AM log link for spark job running in the 
> client mode for yarn. Instead of searching the container id and then find the 
> logs. We can directly check in the Spark UI






[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325252#comment-17325252
 ] 

Thomas Graves commented on SPARK-35108:
---

Seems like a correctness issue in some cases, so marking it as such until we 
can investigate.  [~hyukjin.kwon] [~cloud_fan]

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from Python.
> When you do a collect in Python, the Java side will do a collect and convert 
> the UnsafeRows into GenericRowWithSchema instances before it sends them to 
> the Pickler. The Pickler, by default, will try to dedupe objects using 
> hashCode and .equals for the object. But .equals and .hashCode for 
> GenericRowWithSchema only look at the data, not the schema, whereas when we 
> pickle the row the keys from the schema are written out.
> This can result in a form of data corruption in cases where a row has the 
> same number of elements as a struct within the row, or as a sub-struct 
> within another struct.
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.
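A small self-contained sketch of the equals/hashCode behaviour described above
(illustrative only; this is not the attached repro case):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schemaA = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
val schemaB = StructType(Seq(StructField("x", IntegerType), StructField("y", IntegerType)))

// Same values, different schemas (and therefore different pickled key names).
val rowA = new GenericRowWithSchema(Array[Any](1, 2), schemaA)
val rowB = new GenericRowWithSchema(Array[Any](1, 2), schemaB)

// equals/hashCode ignore the schema, so a pickler that dedupes on them can
// treat rowB as identical to rowA and emit the wrong field names.
println(rowA == rowB)                   // true
println(rowA.hashCode == rowB.hashCode) // true
{code}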






[jira] [Updated] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-19 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-35108:
--
Labels: correctness  (was: )

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from Python.
> When you do a collect in Python, the Java side will do a collect and convert 
> the UnsafeRows into GenericRowWithSchema instances before it sends them to 
> the Pickler. The Pickler, by default, will try to dedupe objects using 
> hashCode and .equals for the object. But .equals and .hashCode for 
> GenericRowWithSchema only look at the data, not the schema, whereas when we 
> pickle the row the keys from the schema are written out.
> This can result in a form of data corruption in cases where a row has the 
> same number of elements as a struct within the row, or as a sub-struct 
> within another struct.
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.






[jira] [Updated] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-19 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-35108:
--
Priority: Blocker  (was: Major)

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from Python.
> When you do a collect in Python, the Java side will do a collect and convert 
> the UnsafeRows into GenericRowWithSchema instances before it sends them to 
> the Pickler. The Pickler, by default, will try to dedupe objects using 
> hashCode and .equals for the object. But .equals and .hashCode for 
> GenericRowWithSchema only look at the data, not the schema, whereas when we 
> pickle the row the keys from the schema are written out.
> This can result in a form of data corruption in cases where a row has the 
> same number of elements as a struct within the row, or as a sub-struct 
> within another struct.
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.






[jira] [Commented] (SPARK-34989) Improve the performance of mapChildren and withNewChildren methods

2021-04-12 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319679#comment-17319679
 ] 

Thomas Graves commented on SPARK-34989:
---

I just saw this go in; all the performance numbers here are given as a % 
improvement. What kind of raw times are you seeing for query compilation?

Sounds like a nice improvement. I haven't had any issues with query compilation 
times, so I'm curious what the numbers are and whether people are actually 
seeing issues with this. It's a pretty major change to the base TreeNode class, 
so anyone extending those classes is now broken.

> Improve the performance of mapChildren and withNewChildren methods
> --
>
> Key: SPARK-34989
> URL: https://issues.apache.org/jira/browse/SPARK-34989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.2.0
>
>
> One of the main performance bottlenecks in query compilation is 
> overly-generic tree transformation methods, namely {{mapChildren}} and 
> {{withNewChildren}} (defined in {{TreeNode}}). These methods have an 
> overly-generic implementation to iterate over the children and rely on 
> reflection to create new instances. We have observed that, especially for 
> queries with large query plans, a significant amount of CPU cycles are wasted 
> in these methods. In this PR we make these methods more efficient, by 
> delegating the iteration and instantiation to concrete node types. The 
> benchmarks show that we can expect significant performance improvement in 
> total query compilation time in queries with large query plans (from 30-80%) 
> and about 20% on average.
> h4. Problem detail
> The {{mapChildren}} method in {{TreeNode}} is overly generic and costly. To 
> be more specific, this method:
>  * iterates over all the fields of a node using Scala’s product iterator. 
> While the iteration is not reflection-based, thanks to the Scala compiler 
> generating code for {{Product}}, we create many anonymous functions and visit 
> many nested structures (recursive calls).
>  The anonymous functions (presumably compiled to Java anonymous inner 
> classes) also show up quite high on the list in the object allocation 
> profiles, so we are putting unnecessary pressure on GC here.
>  * does a lot of comparisons. Basically for each element returned from the 
> product iterator, we check if it is a child (contained in the list of 
> children) and then transform it. We can avoid that by just iterating over 
> children, but in the current implementation, we need to gather all the fields 
> (only transform the children) so that we can instantiate the object using the 
> reflection.
>  * creates objects using reflection, by delegating to the {{makeCopy}} 
> method, which is several orders of magnitude slower than using the 
> constructor.
> h4. Solution
> The proposed solution in this PR is rather straightforward: we rewrite the 
> {{mapChildren}} method using the {{children}} and {{withNewChildren}} 
> methods. The default {{withNewChildren}} method suffers from the same 
> problems as {{mapChildren}} and we need to make it more efficient by 
> specializing it in concrete classes. Similar to how each concrete query plan 
> node already defines its children, it should also define how they can be 
> constructed given a new list of children. Actually, the implementation is 
> quite simple in most cases and is a one-liner thanks to the copy method 
> present in Scala case classes. Note that we cannot abstract over the copy 
> method, it’s generated by the compiler for case classes if no other type 
> higher in the hierarchy defines it. For most concrete nodes, the 
> implementation of {{withNewChildren}} looks like this:
>  
> {{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = 
> copy(children = newChildren)}}
> The current {{withNewChildren}} method has two properties that we should 
> preserve:
>  * It returns the same instance if the provided children are the same as its 
> children, i.e., it preserves referential equality.
>  * It copies tags and maintains the origin links when a new copy is created.
> These properties are hard to enforce in the concrete node type 
> implementation. Therefore, we propose a template method 
> {{withNewChildrenInternal}} that should be rewritten by the concrete classes 
> and let the {{withNewChildren}} method take care of referential equality and 
> copying:
> {{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = 
> {}}
>  {{  if (childrenFastEquals(children, newChildren)) {}}
>  {{    this}}
>  {{  } else {}}
>  {{    CurrentOrigin.withOrigin(origin) {}}
>  {{      val res = 

[jira] [Resolved] (SPARK-34828) YARN Shuffle Service: Support configurability of aux service name and service-specific config overrides

2021-03-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-34828.
---
Fix Version/s: 3.2.0
 Assignee: Erik Krogen
   Resolution: Fixed

> YARN Shuffle Service: Support configurability of aux service name and 
> service-specific config overrides
> ---
>
> Key: SPARK-34828
> URL: https://issues.apache.org/jira/browse/SPARK-34828
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0
>
>
> In some cases it may be desirable to run multiple instances of the Spark 
> Shuffle Service which are using different versions of Spark. This can be 
> helpful, for example, when running a YARN cluster with a mixed workload of 
> applications running multiple Spark versions, since a given version of the 
> shuffle service is not always compatible with other versions of Spark. (See 
> SPARK-27780 for more detail on this)
> YARN versions since 2.9.0 support the ability to run shuffle services within 
> an isolated classloader (see YARN-4577), meaning multiple Spark versions can 
> coexist within a single NodeManager.
> To support this from the Spark side, we need to make two enhancements:
> * Make the name of the shuffle service configurable. Currently it is 
> hard-coded to be {{spark_shuffle}} on both the client and server side. The 
> server-side name is not actually used anywhere, as the value within 
> {{yarn.nodemanager.aux-services}} is what the NodeManager considers to be the 
> definitive name. However, if you change this in the configs, the hard-coded 
> name within the client will no longer match. So, this needs to be 
> configurable.
> * Add a way to separately configure the two shuffle service instances. Since 
> the configurations such as the port number are taken from the NodeManager 
> config, they will both try to use the same port, which obviously won't work. 
> So, we need to provide a way to selectively configure the two shuffle service 
> instances. I will go into details on my proposal for how to achieve this 
> within the PR.
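As an illustration of the client side of the first enhancement, a hedged sketch
(the property name spark.shuffle.service.name is an assumption here, not
confirmed by this description):

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical: point the application at a renamed shuffle service instance.
// The name must match the aux-service key configured under
// yarn.nodemanager.aux-services for that NodeManager-side instance.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.name", "spark_shuffle_3_2") // assumed property name
{code}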






[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299894#comment-17299894
 ] 

Thomas Graves commented on ARROW-9019:
--

Note I was able to finally test this and on dataproc at least setting the 
classpath did work around the issue.  It must be a jar file order issue.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)





[jira] [Comment Edited] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299894#comment-17299894
 ] 

Thomas Graves edited comment on ARROW-9019 at 3/11/21, 9:55 PM:


Note I was able to finally test this and on dataproc at least setting the 
classpath did work around the issue.  It must be a jar file order issue.  In 
this case though I set it and manually started pyspark.


was (Author: tgraves):
Note I was able to finally test this and on dataproc at least setting the 
classpath did work around the issue.  It must be a jar file order issue.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)





[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299881#comment-17299881
 ] 

Thomas Graves commented on ARROW-9019:
--

[~bradmiro] I don't really understand how that fixes the issue; the Hadoop 
classpath is already included when a container launches on YARN. In this case I 
launched Spark on YARN, so the Hadoop classpath should already be there. The 
only thing I can think of is that this changed the order of things in the 
classpath.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)





[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296340#comment-17296340
 ] 

Thomas Graves commented on SPARK-34645:
---

[~dongjoon] [~holden], have you seen this at all?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-03-05 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296060#comment-17296060
 ] 

Thomas Graves commented on SPARK-33288:
---

It looks like the version you are trying to launch against is incompatible, 
i.e. your backend is using Spark 3.0 and your client is using Spark 3.1.1.

Use the same version for both.

 

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config, so we can add 
> support for stage level scheduling when that is turned on.
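A minimal sketch of enabling the combination referred to above:

{code:scala}
import org.apache.spark.SparkConf

// Dynamic allocation without an external shuffle service, tracked via shuffle
// files, which is the mode Kubernetes supports.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
{code}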






[jira] [Resolved] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2021-03-01 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27495.
---
   Fix Version/s: 3.1.1
Target Version/s:   (was: 3.2.0)
  Resolution: Fixed

The major functionality is all in Spark 3.1.1; the other linked issues are 
minor changes or improvements.

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
> Fix For: 3.1.1
>
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if its useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max 
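For illustration, a minimal sketch of the RDD-level API this SPIP describes, as it exists in Spark 3.1+ (the resource amounts, the GPU discovery script path, and the etlOutputRdd name are illustrative assumptions, not taken from this ticket):

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Stand-in for the output of the ETL stage (illustrative only).
val etlOutputRdd = sc.parallelize(1 to 1000000)

// Executor requirements for the ML stage: amounts and the discovery script
// path are made-up examples.
val execReqs = new ExecutorResourceRequests()
  .cores(16)
  .memory("32g")
  .resource("gpu", 2, "/opt/spark/getGpus.sh")

// Task requirements for the same stage.
val taskReqs = new TaskResourceRequests()
  .cpus(4)
  .resource("gpu", 1)

val mlProfile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build

// Attach the profile to the RDD feeding the ML algorithm; earlier ETL stages
// keep the default profile. Changing executor resources relies on dynamic
// allocation being enabled.
val mlInput = etlOutputRdd.withResources(mlProfile)
{code}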

[jira] [Commented] (SPARK-29329) maven incremental builds not working

2021-03-01 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293005#comment-17293005
 ] 

Thomas Graves commented on SPARK-29329:
---

Yeah, I know 4.3.0 made things better. It's still unfortunately not as good as it 
was, but I don't think there is anything we can do about it.

> maven incremental builds not working
> 
>
> Key: SPARK-29329
> URL: https://issues.apache.org/jira/browse/SPARK-29329
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like since we upgraded scala-maven-plugin to 4.2.0 
> (https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
> stopped working.  Every time you build, it recompiles all files, which takes 
> forever.
> It would be nice to fix this.
>  
> To reproduce, just build spark once ( I happened to be using the command 
> below):
> build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl 
> -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests
> Then build it again and you will see that it compiles all the files and takes 
> 15-30 minutes. With incremental it skips all unnecessary files and takes 
> closer to 5 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34461) Collect API feedback and maybe revise some APIs

2021-02-18 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286517#comment-17286517
 ] 

Thomas Graves commented on SPARK-34461:
---

So the discussions in the original SPIP and PRs went back and forth a bit as to 
whether everything should have task/executor in the name or not, i.e., we could 
have had one class with a ton of functions that were all setExecutorX, setTaskX, 
etc., but we decided against that in order to separate them and hopefully make 
things more extensible.

I don't have super strong feelings, but at the same time why require the extra 
characters to be typed?  The ResourceProfileBuilder is going to require a 
ResourceRequest either way. 

I think it would be great to get more feedback from users before changing this. 
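For reference, a small sketch of the current shape of the API, where the overloads of require are distinguished by the request type rather than by the method name (the amounts are illustrative only):

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// The argument type already says "executor" or "task", so require() itself
// does not need the distinction in its name.
val profile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(8).memory("16g"))
  .require(new TaskResourceRequests().cpus(2))
  .build
{code}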

> Collect API feedback and maybe revise some APIs
> ---
>
> Key: SPARK-34461
> URL: https://issues.apache.org/jira/browse/SPARK-34461
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> For example, 
> `ResourceProfileBuilder.require` uses the same API name for both task and 
> executor. Probably it's better to distinguish them by differentiating the names, 
> e.g., taskRequire / executorRequire.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33739) Jobs committed through the S3A Magic committer don't report the bytes written

2021-02-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-33739.
---
Fix Version/s: 3.2.0
 Assignee: Steve Loughran
   Resolution: Fixed

> Jobs committed through the S3A Magic committer don't report the bytes written
> -
>
> Key: SPARK-33739
> URL: https://issues.apache.org/jira/browse/SPARK-33739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.2.0
>
>
> The Spark statistics tracking doesn't correctly assess the size of the 
> uploaded files, as it only calls getFileStatus on the zero-byte marker objects, 
> not the yet-to-manifest files (which, given they don't exist yet, isn't easy 
> to do).
> HADOOP-17414 will attach the final length as a custom header to the marker 
> object, and implement getXAttr in the S3A FS to probe for it.
> BasicWriteStatsTracker can probe for this custom xattr if the size of the 
> generated file is 0 bytes; if found and parseable, use that as the declared 
> length of the output.
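A hedged sketch of that probing approach (the xattr name used below is an assumption based on HADOOP-17414, and the helper is illustrative, not the actual BasicWriteStatsTracker change):

{code:scala}
import java.nio.charset.StandardCharsets
import scala.util.Try
import org.apache.hadoop.fs.{FileSystem, Path}

// If the marker object reports zero bytes, probe its xattrs for the length
// attached by the committer; otherwise trust getFileStatus.
def declaredLength(fs: FileSystem, path: Path): Option[Long] = {
  val status = fs.getFileStatus(path)
  if (status.getLen > 0) {
    Some(status.getLen)
  } else {
    Try(fs.getXAttr(path, "header.x-hadoop-s3a-magic-data-length")) // name is an assumption
      .toOption
      .flatMap(Option(_))
      .flatMap(bytes => Try(new String(bytes, StandardCharsets.UTF_8).trim.toLong).toOption)
  }
}
{code}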



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27658) Catalog API to load functions

2021-02-09 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281984#comment-17281984
 ] 

Thomas Graves commented on SPARK-27658:
---

Can we link the SPIP to this?

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-22 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270199#comment-17270199
 ] 

Thomas Graves commented on SPARK-34167:
---

[~razajafri] can you finish detailing your findings as to the flow of what 
breaks this?  You state that Spark tries to read it as an INT, but the backtrace 
is in readLong.  You started to detail the schema differences, but can you 
follow up with how that leads to reading it as an INT even though it's calling 
readLong?

 

For example, you state "The *{{VectorizedParquetRecordReader}}* reads in the 
parquet file correctly", but that is where the NullPointerException is thrown. 
If you could detail it more, it would help in understanding it.
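For context, a sketch of my understanding of the precision-to-physical-type mapping the vectorized reader assumes (an illustration only, not taken from this ticket):

{code:scala}
import org.apache.spark.sql.types.{Decimal, DecimalType}

// Up to 9 digits is expected to fit an INT32, up to 18 digits an INT64, and
// anything larger a fixed-length byte array. A Decimal(8,2) written with an
// INT64 physical type therefore doesn't match what the reader expects.
def expectedPhysicalType(dt: DecimalType): String = {
  if (dt.precision <= Decimal.MAX_INT_DIGITS) "INT32"
  else if (dt.precision <= Decimal.MAX_LONG_DIGITS) "INT64"
  else "FIXED_LEN_BYTE_ARRAY"
}

expectedPhysicalType(DecimalType(8, 2))  // "INT32"
{code}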

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
> parquet file correctly because it's basing the read on the 
> 

[jira] [Resolved] (SPARK-33741) Add minimum threshold speculation config

2021-01-13 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-33741.
---
Fix Version/s: 3.2.0
 Assignee: Sanket Chintapalli
   Resolution: Fixed

> Add minimum threshold speculation config
> 
>
> Key: SPARK-33741
> URL: https://issues.apache.org/jira/browse/SPARK-33741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Sanket Reddy
>Assignee: Sanket Chintapalli
>Priority: Minor
> Fix For: 3.2.0
>
>
> When we turn on speculation with default configs, the last 10% of the tasks 
> are subject to speculation. There are a lot of stages that only run for a few 
> seconds to minutes. Also, in general we don't want to speculate tasks that 
> finish within a specific interval. Setting a minimum threshold for speculation 
> gives us better control.
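A minimal sketch of the knobs involved (the existing speculation configs plus the new minimum-runtime threshold; the name spark.speculation.minTaskRuntime is an assumption about what this change adds, the others are long-standing configs):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")      // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")     // how much slower than the median before speculating
  .set("spark.speculation.minTaskRuntime", "30s") // assumed name: don't speculate tasks younger than this
{code}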



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-01-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262875#comment-17262875
 ] 

Thomas Graves commented on ARROW-9019:
--

Pinging on this again: any information or ideas on working around or fixing this?

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-33741) Add minimum threshold speculation config

2021-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260852#comment-17260852
 ] 

Thomas Graves commented on SPARK-33741:
---

probably dup of https://issues.apache.org/jira/browse/SPARK-29910

> Add minimum threshold speculation config
> 
>
> Key: SPARK-33741
> URL: https://issues.apache.org/jira/browse/SPARK-33741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> When we turn on speculation with default configs, the last 10% of the tasks 
> are subject to speculation. There are a lot of stages that only run for a few 
> seconds to minutes. Also, in general we don't want to speculate tasks that 
> finish within a specific interval. Setting a minimum threshold for speculation 
> gives us better control.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2021-01-06 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259815#comment-17259815
 ] 

Thomas Graves commented on SPARK-32165:
---

In my opinion, I'd rather not have to go trolling through multiple JIRAs and PRs 
for a simple description of what it causes; for more details, references to those 
are fine.  I think each JIRA and PR should have a basic description.

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
>  
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2021-01-06 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259753#comment-17259753
 ] 

Thomas Graves commented on SPARK-32165:
---

Could we add a bit more detail here as to what the problem is or what this causes?  
I assume by leak you mean this causes a memory leak but no other functional issues?

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
>  
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2021-01-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258498#comment-17258498
 ] 

Thomas Graves commented on SPARK-33031:
---

Ah, that could be the case, but if that is true we probably need to fix 
something in the UI to indicate that.

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting in standalone mode and all the executors 
> were initially blacklisted.  Then one of the executors died and we got 
> allocated another one. The scheduler did not appear to pick up the new one 
> and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launching a shell where you will get multiple executors (in this case I got 
> 3):
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-12-09 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246678#comment-17246678
 ] 

Thomas Graves commented on SPARK-33614:
---

Can you add a description here with more details? Conditions and an example of 
when this is needed would be nice.
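A hedged illustration of the kind of case presumably meant here (assumed, not taken from this ticket): with ANSI mode on, constant folding would try to evaluate the literal 1 / 0 at optimization time and fail, even though that branch is unreachable at runtime.

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// The ELSE branch can never be hit (id is always >= 0 here), but the foldable
// constant 1 / 0 inside it would still be evaluated by the optimizer.
spark.range(5)
  .selectExpr("CASE WHEN id >= 0 THEN id ELSE 1 / 0 END AS v")
  .show()
{code}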

> Fix the constant folding rule to skip it if the expression fails to execute
> ---
>
> Key: SPARK-33614
> URL: https://issues.apache.org/jira/browse/SPARK-33614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33715) Error about cores being limiting resource confusing

2020-12-08 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33715:
-

 Summary: Error about cores being limiting resource confusing
 Key: SPARK-33715
 URL: https://issues.apache.org/jira/browse/SPARK-33715
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Thomas Graves


If you misconfigure your resources and cores are not the limiting resource, we 
print the following error:

20/12/08 20:23:37 ERROR Main: Failed to initialize Spark session.
java.lang.IllegalArgumentException: The number of slots on an executor has to 
be limited by the number of cores, otherwise you waste resources and dynamic 
allocation doesn't work properly. Your configuration has core/task cpu slots = 
8 and gpu = 1. Please adjust your configuration so that all resources require 
same number of executor slots.

I received reports that this was confusing to users, specifically the sentence 
"Your configuration has core/task cpu slots = 8 and gpu = 1", so I think we can 
improve that message.

Note this only affects 3.0.0 and 3.0.1, 3.1.0 changed this functionality.

To reproduce just run spark 3.0.1 with something like:

$SPARK_HOME/bin/spark-shell --master yarn --executor-cores 8 --conf 
spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1
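For reference, one way to make the slot counts line up for a setup like the repro above (a sketch; the values are illustrative): with 8 cores per executor and 1 cpu per task there are 8 CPU slots, so let 8 tasks share the single GPU by requesting 1/8 of a GPU per task.

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "8")
  .set("spark.task.cpus", "1")
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "0.125") // 8 tasks per GPU, matching the 8 CPU slots
{code}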



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2020-12-08 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245951#comment-17245951
 ] 

Thomas Graves commented on SPARK-27495:
---

The main functionality is all in; it probably makes more sense to resolve this 
in 3.1.0 and split the others off into follow-ups. If there are no objections, 
I'll split it.

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if its useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part 

[jira] [Updated] (SPARK-33504) The application log in the Spark history server contains sensitive attributes such as password that should be redacted instead of plain text

2020-12-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33504:
--
Fix Version/s: (was: 3.0.2)

> The application log in the Spark history server contains sensitive attributes 
> such as password that should be redacted instead of plain text
> ---
>
> Key: SPARK-33504
> URL: https://issues.apache.org/jira/browse/SPARK-33504
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: SparkListenerEnvironmentUpdate log shows ok.png, 
> SparkListenerStageSubmitted-log-wrong.png, SparkListernerJobStart-wrong.png
>
>
> We found that the sensitive attributes in SparkListenerJobStart and 
> SparkListenerStageSubmitted events were not redacted, so the sensitive 
> attributes can be viewed directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33504) The application log in the Spark history server contains sensitive attributes such as password that should be redacted instead of plain text

2020-12-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-33504.
---
Fix Version/s: 3.1.0
   3.0.2
 Assignee: akiyamaneko
   Resolution: Fixed

> The application log in the Spark history server contains sensitive attributes 
> such as password that should be redacted instead of plain text
> ---
>
> Key: SPARK-33504
> URL: https://issues.apache.org/jira/browse/SPARK-33504
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
> Attachments: SparkListenerEnvironmentUpdate log shows ok.png, 
> SparkListenerStageSubmitted-log-wrong.png, SparkListernerJobStart-wrong.png
>
>
> We found that the sensitive attributes in SparkListenerJobStart and 
> SparkListenerStageSubmitted events were not redacted, so the sensitive 
> attributes can be viewed directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238371#comment-17238371
 ] 

Thomas Graves commented on SPARK-33544:
---

I'm working on a patch for this.

> explode should not filter when used with CreateArray
> 
>
> Key: SPARK-33544
> URL: https://issues.apache.org/jira/browse/SPARK-33544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to 
> insert a filter for not null and size > 0 when using inner explode/inline. 
> This is fine in most cases, but the extra filter is not needed if the explode 
> is over a CreateArray that is not built from Literals (the Literal case is 
> already handled).  In that case you know the values aren't null and the array 
> has a size.  The empty array case is already handled.
> For instance:
> val df = someDF.selectExpr("number", "explode(array(word, col3))")
> So in this case we shouldn't be inserting the extra Filter; that filter can 
> also get pushed down into, e.g., a parquet reader, which just causes extra 
> overhead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33544:
-

 Summary: explode should not filter when used with CreateArray
 Key: SPARK-33544
 URL: https://issues.apache.org/jira/browse/SPARK-33544
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Thomas Graves


https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to 
insert a filter for not null and size > 0 when using inner explode/inline. This 
is fine in most cases, but the extra filter is not needed if the explode is over 
a CreateArray that is not built from Literals (the Literal case is already 
handled).  In that case you know the values aren't null and the array has a 
size.  The empty array case is already handled.

For instance:

val df = someDF.selectExpr("number", "explode(array(word, col3))")

So in this case we shouldn't be inserting the extra Filter; that filter can also 
get pushed down into, e.g., a parquet reader, which just causes extra 
overhead.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33462) ResourceProfile use Int for memory in ExecutorResourcesOrDefaults

2020-11-16 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33462:
-

 Summary: ResourceProfile use Int for memory in 
ExecutorResourcesOrDefaults
 Key: SPARK-33462
 URL: https://issues.apache.org/jira/browse/SPARK-33462
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


A follow-up from SPARK-33288: since memory is in MB, we can just store it as an 
Int rather than a Long in ExecutorResourcesOrDefaults.

 

See https://github.com/apache/spark/pull/30375#issuecomment-728270233



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-11-13 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-33288.
---
Fix Version/s: 3.1.0
 Assignee: Thomas Graves
   Resolution: Fixed

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33447) Stage level scheduling, allow specifying other spark configs via ResourceProfile

2020-11-13 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33447:
-

 Summary: Stage level scheduling, allow specifying other spark 
configs via ResourceProfile
 Key: SPARK-33447
 URL: https://issues.apache.org/jira/browse/SPARK-33447
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


With the addition of stage level scheduling and ResourceProfiles, we currently 
only allow certain resources to be specified.  There are many other configs that 
users may want to change between stages. We should look at perhaps adding 
SparkConf settings.

One very specific one that was brought up in review was the ability to change the 
YARN queue (if that is even possible in YARN) between stages, because someone 
might want to use one queue for ETL and a separate queue for ML where there are 
nodes with GPUs.  Or perhaps node labels separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33406) k8s pyspark shell doesn't honor pyspark memory setting

2020-11-09 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33406:
-

 Summary: k8s pyspark shell doesn't honor pyspark memory setting
 Key: SPARK-33406
 URL: https://issues.apache.org/jira/browse/SPARK-33406
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Thomas Graves


When running the interactive pyspark shell in client mode on k8s, it doesn't 
honor the spark.executor.pyspark.memory setting when requesting 
containers from k8s. 

If you run it in cluster mode and specify a python script, then it does work.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31711) Register the executor source with the metrics system when running in local mode.

2020-11-04 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-31711.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Register the executor source with the metrics system when running in local 
> mode.
> 
>
> Key: SPARK-31711
> URL: https://issues.apache.org/jira/browse/SPARK-31711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.1.0
>
>
> The Apache Spark metrics system provides many useful insights on the Spark 
> workload. In particular, the executor source metrics 
> (https://github.com/apache/spark/blob/master/docs/monitoring.md#component-instance--executor)
>  provide detailed info, including the number of active tasks, some I/O 
> metrics, and task metrics details. Executor source metrics, contrary to other 
> sources (for example ExecutorMetrics source), are not yet available when 
> running in local mode.
> This JIRA proposes to register the executor source with the Spark metrics 
> system when running in local mode, as this can be very useful when testing 
> and troubleshooting Spark workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31711) Register the executor source with the metrics system when running in local mode.

2020-11-04 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-31711:
-

Assignee: Luca Canali

> Register the executor source with the metrics system when running in local 
> mode.
> 
>
> Key: SPARK-31711
> URL: https://issues.apache.org/jira/browse/SPARK-31711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> The Apache Spark metrics system provides many useful insights on the Spark 
> workload. In particular, the executor source metrics 
> (https://github.com/apache/spark/blob/master/docs/monitoring.md#component-instance--executor)
>  provide detailed info, including the number of active tasks, some I/O 
> metrics, and task metrics details. Executor source metrics, contrary to other 
> sources (for example ExecutorMetrics source), are not yet available when 
> running in local mode.
> This JIRA proposes to register the executor source with the Spark metrics 
> system when running in local mode, as this can be very useful when testing 
> and troubleshooting Spark workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-32037.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Thomas Graves
>Priority: Minor
> Fix For: 3.1.0
>
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-32037:
-

Assignee: Thomas Graves

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Thomas Graves
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32591) Add better api docs for stage level scheduling Resources

2020-10-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32591:
--
Description: 
A question came up when we added offheap memory to be able to set in a 
ResourceProfile executor resources.

[https://github.com/apache/spark/pull/28972/]

Based on that discussion we should add better API docs to explain what each one 
does. Perhaps point to the corresponding configuration.  Also specify that the 
default config is used if not specified in the profiles, for those that fall back.

  was:
A question came up when we added offheap memory to be able to set in a 
ResourceProfile executor resources.

[https://github.com/apache/spark/pull/28972/]

Based on that discussion we should add better api docs to explain what each one 
does. Perhaps point to the corresponding configuration .


> Add better api docs for stage level scheduling Resources
> 
>
> Key: SPARK-32591
> URL: https://issues.apache.org/jira/browse/SPARK-32591
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> A question came up when we added offheap memory to be able to set in a 
> ResourceProfile executor resources.
> [https://github.com/apache/spark/pull/28972/]
> Based on that discussion we should add better API docs to explain what each 
> one does. Perhaps point to the corresponding configuration.  Also specify that 
> the default config is used if not specified in the profiles, for those that 
> fall back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222932#comment-17222932
 ] 

Thomas Graves commented on SPARK-33288:
---

Yes. Really, when I say hanging it means k8s won't be able to give you an 
executor that matches the resources.

The same thing can happen right now if the cluster doesn't have the resources I 
request.  Let's say I have a 1-node k8s cluster with 24 cores. If I ask for an 
executor with 64 cores, Spark hangs waiting to get that executor and k8s will 
never be able to give it to you.

It's just that in the case of stage level scheduling, the reason you might not 
get an executor would be that other executors in the same application are still 
running because they have shuffle data on them. And like you say, if you don't 
set the timeout (defaults to infinity) it will "hang".
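A minimal sketch of the configs involved in that behavior (values are illustrative; these are the standard dynamic allocation settings, including the shuffle tracking timeout mentioned above, which is infinite by default):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  // Bound how long executors holding shuffle data are kept; otherwise a
  // resource-profile change can wait indefinitely for capacity.
  .set("spark.dynamicAllocation.shuffleTracking.timeout", "1h")
{code}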

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222917#comment-17222917
 ] 

Thomas Graves commented on SPARK-33288:
---

Note there will be a couple of caveats here: if you have shuffle data on all the 
nodes from one stage and you go to change the resource profile for another stage, 
and your cluster doesn't have enough nodes, that could end up hanging.   With 
normal YARN dynamic allocation with the external shuffle service, the other 
executors will go away since there is no shuffle data on them, so you don't have 
this issue as much.  

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
> for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2020-10-29 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33288:
-

 Summary: Support k8s cluster manager with stage level scheduling
 Key: SPARK-33288
 URL: https://issues.apache.org/jira/browse/SPARK-33288
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Thomas Graves


Kubernetes supports dynamic allocation via the 
{{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add support 
for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events

2020-10-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-33088:
-

Assignee: Samuel Souza

> Enhance ExecutorPlugin API to include methods for task start and end events
> ---
>
> Key: SPARK-33088
> URL: https://issues.apache.org/jira/browse/SPARK-33088
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Samuel Souza
>Assignee: Samuel Souza
>Priority: Major
> Fix For: 3.1.0
>
>
> On [SPARK-24918|https://issues.apache.org/jira/browse/SPARK-24918]'s 
> [SIPP|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#],
>  it was raised to potentially add methods to ExecutorPlugin interface on task 
> start and end:
> {quote}The basic interface can just be a marker trait, as that allows a 
> plugin to monitor general characteristics of the JVM (eg. monitor memory or 
> take thread dumps).   Optionally, we could include methods for task start and 
> end events.   This would allow more control on monitoring – eg., you could 
> start polling thread dumps only if there was a task from a particular stage 
> that had been taking too long. But anything task related is a bit trickier to 
> decide the right api. Should the task end event also get the failure reason? 
> Should those events get called in the same thread as the task runner, or in 
> another thread?
> {quote}
> The ask is to add exactly that. I've put up a draft PR [in our fork of 
> spark|https://github.com/palantir/spark/pull/713] and I'm happy to push it 
> upstream. Also happy to receive comments on what's the right interface to 
> expose - not opinionated on that front, tried to expose the simplest 
> interface for now.
> The main reason for this ask is to propagate tracing information from the 
> driver to the executors 
> ([SPARK-21962|https://issues.apache.org/jira/browse/SPARK-21962] has some 
> context). On 
> [HADOOP-15566|https://issues.apache.org/jira/browse/HADOOP-15566] I see we're 
> discussing how to add tracing to the Apache ecosystem, but my problem is 
> slightly different: I want to use this interface to propagate tracing 
> information to my framework of choice. If the Hadoop issue gets solved we'll 
> have a framework to communicate tracing information inside the Apache 
> ecosystem, but it's highly unlikely that all Spark users will use the same 
> common framework. Therefore we should still provide plugin interfaces where 
> the tracing information can be propagated appropriately.
> To give more color, in our case the tracing information is [stored in a 
> thread 
> local|https://github.com/palantir/tracing-java/blob/4.9.0/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61],
>  therefore it needs to be set in the same thread which is executing the task. 
> [*]
> While our framework is specific, I imagine such an interface could be useful 
> in general. Happy to hear your thoughts about it.
> [*] Something I did not mention was how to propagate the tracing information 
> from the driver to the executors. For that I intend to use 1. the driver's 
> localProperties, which 2. will be eventually propagated to the executors' 
> TaskContext, which 3. I'll be able to access from the methods above.
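
To make the proposal concrete, here is a rough sketch of an executor plugin 
that uses task start/end hooks to install tracing state in the task thread. 
The TraceHolder object and the "trace.id" property name are made up for 
illustration; the hook shape follows the description above:
{code:java}
import org.apache.spark.{TaskContext, TaskFailedReason}
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// Hypothetical stand-in for a framework-specific thread-local trace store.
object TraceHolder {
  private val current = new ThreadLocal[String]
  def set(id: String): Unit = current.set(id)
  def clear(): Unit = current.remove()
}

class TracingPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null  // not needed for this sketch

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    // Runs in the task runner thread, so the thread local is visible to the task.
    override def onTaskStart(): Unit = {
      val traceId = TaskContext.get().getLocalProperty("trace.id")
      if (traceId != null) TraceHolder.set(traceId)
    }
    override def onTaskSucceeded(): Unit = TraceHolder.clear()
    override def onTaskFailed(reason: TaskFailedReason): Unit = TraceHolder.clear()
  }
}
{code}
The plugin would be registered through the spark.plugins conf, and the driver 
would call sc.setLocalProperty("trace.id", ...) so the value reaches the 
executors' TaskContext as described in [*] above.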



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204265#comment-17204265
 ] 

Thomas Graves commented on SPARK-33031:
---

I tried this again on YARN and now it seems to be working there, so the 
problem might be limited to standalone mode.

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting in standalone mode and all the executors 
> were initially blacklisted.  Then one of the executors died and we got 
> allocated another one. The scheduler did not appear to pick up the new one 
> and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33031:
--
Description: 
I was running a test with blacklisting in standalone mode and all the executors 
were initially blacklisted.  Then one of the executors died and we got 
allocated another one. The scheduler did not appear to pick up the new one and 
try to schedule on it though.

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}
rdd.collect(){code}
 

Note that I tried both with and without dynamic allocation enabled.

 

You can see screen shot related on 
https://issues.apache.org/jira/browse/SPARK-33029

  was:
I was running a test with blacklisting on yarn (and standalone mode) and all 
the executors were initially blacklisted.  Then one of the executors died and 
we got allocated another one. The scheduler did not appear to pick up the new 
one and try to schedule on it though.

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}
rdd.collect(){code}
 

Note that I tried both with and without dynamic allocation enabled.

 

You can see screen shot related on 
https://issues.apache.org/jira/browse/SPARK-33029


> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting in standalone mode and all the executors 
> were initially blacklisted.  Then one of the executors died and we got 
> allocated another one. The scheduler did not appear to pick up the new one 
> and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33031:
--
Description: 
I was running a test with blacklisting on yarn (and standalone mode) and all 
the executors were initially blacklisted.  Then one of the executors died and 
we got allocated another one. The scheduler did not appear to pick up the new 
one and try to schedule on it though.

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}
rdd.collect(){code}
 

Note that I tried both with and without dynamic allocation enabled.

 

You can see screen shot related on 
https://issues.apache.org/jira/browse/SPARK-33029

  was:
I was running a test with blacklisting on yarn (and standalone mode) and all 
the executors were initially blacklisted.  Then one of the executors died and 
we got allocated another one. The scheduler did not appear to pick up the new 
one and try to schedule on it though.

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}{code}
 

Note that I tried both with and without dynamic allocation enabled.

 

You can see screen shot related on 
https://issues.apache.org/jira/browse/SPARK-33029


> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting on yarn (and standalone mode) and all 
> the executors were initially blacklisted.  Then one of the executors died and 
> we got allocated another one. The scheduler did not appear to pick up the new 
> one and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33031:
--
Affects Version/s: 3.0.0

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting on yarn (and standalone mode) and all 
> the executors were initially blacklisted.  Then one of the executors died and 
> we got allocated another one. The scheduler did not appear to pick up the new 
> one and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }{code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33031:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting on yarn (and standalone mode) and all 
> the executors were initially blacklisted.  Then one of the executors died and 
> we got allocated another one. The scheduler did not appear to pick up the new 
> one and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }{code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-09-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204259#comment-17204259
 ] 

Thomas Graves commented on SPARK-33029:
---

Note: I filed https://issues.apache.org/jira/browse/SPARK-33031 for the issue 
of it not using the other executor, as it seems more critical.

> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot 
> 2020-09-29 at 1.53.37 PM.png
>
>
> I am running a spark shell on a 1 node standalone cluster.  I noticed that 
> the executors page ui was marking the driver as blacklisted for the stage 
> that is running.  Attached a screen shot.
> Also, in my case one of the executors died and it doesn't seem like the 
> scheduler picked up the new one.  It doesn't show up on the stages page and 
> just shows it as active but none of the tasks ran there.
>  
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
>  
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-09-29 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33031:
-

 Summary: scheduler with blacklisting doesn't appear to pick up new 
executor added
 Key: SPARK-33031
 URL: https://issues.apache.org/jira/browse/SPARK-33031
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Thomas Graves


I was running a test with blacklisting on yarn (and standalone mode) and all 
the executors were initially blacklisted.  Then one of the executors died and 
we got allocated another one. The scheduler did not appear to pick up the new 
one and try to schedule on it though.

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}{code}
 

Note that I tried both with and without dynamic allocation enabled.

 

You can see screen shot related on 
https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33029:
--
Description: 
I am running a spark shell on a 1 node standalone cluster.  I noticed that the 
executors page ui was marking the driver as blacklisted for the stage that is 
running.  Attached a screen shot.

Also, in my case one of the executors died and it doesn't seem like the 
scheduler picked up the new one.  It doesn't show up on the stages page and 
just shows it as active but none of the tasks ran there.

 

You can reproduce this by starting a master and slave on a single node, then 
launch a shell like where you will get multiple executors (in this case I got 3)

$SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
--conf spark.blacklist.enabled=true

 

From shell run:
{code:java}
import org.apache.spark.TaskContext
val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
 val context = TaskContext.get()
 if (context.attemptNumber() < 2) {
 throw new Exception("test attempt num")
 }
 it
}{code}

  was:
I am running a spark shell on a 1 node standalone cluster.  I noticed that the 
executors page ui was marking the driver as blacklisted for the stage that is 
running.  Attached a screen shot.

Also, in my case one of the executors died and it doesn't seem like the 
scheduler picked up the new one.  It doesn't show up on the stages page and 
just shows it as active but none of the tasks ran there.


> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot 
> 2020-09-29 at 1.53.37 PM.png
>
>
> I am running a spark shell on a 1 node standalone cluster.  I noticed that 
> the executors page ui was marking the driver as blacklisted for the stage 
> that is running.  Attached a screen shot.
> Also, in my case one of the executors died and it doesn't seem like the 
> scheduler picked up the new one.  It doesn't show up on the stages page and 
> just shows it as active but none of the tasks ran there.
>  
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
>  
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33029:
--
Attachment: Screen Shot 2020-09-29 at 1.53.37 PM.png

> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot 
> 2020-09-29 at 1.53.37 PM.png
>
>
> I am running a spark shell on a 1 node standalone cluster.  I noticed that 
> the executors page ui was marking the driver as blacklisted for the stage 
> that is running.  Attached a screen shot.
> Also, in my case one of the executors died and it doesn't seem like the 
> scheduler picked up the new one.  It doesn't show up on the stages page and 
> just shows it as active but none of the tasks ran there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-09-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-33029:
--
Attachment: Screen Shot 2020-09-29 at 1.52.09 PM.png

> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png
>
>
> I am running a spark shell on a 1 node standalone cluster.  I noticed that 
> the executors page ui was marking the driver as blacklisted for the stage 
> that is running.  Attached a screen shot.
> Also, in my case one of the executors died and it doesn't seem like the 
> scheduler picked up the new one.  It doesn't show up on the stages page and 
> just shows it as active but none of the tasks ran there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-09-29 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33029:
-

 Summary: Standalone mode blacklist executors page UI marks driver 
as blacklisted
 Key: SPARK-33029
 URL: https://issues.apache.org/jira/browse/SPARK-33029
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


I am running a spark shell on a 1 node standalone cluster.  I noticed that the 
executors page ui was marking the driver as blacklisted for the stage that is 
running.  Attached a screen shot.

Also, in my case one of the executors died and it doesn't seem like the 
scheduler picked up the new one.  It doesn't show up on the stages page and 
just shows it as active but none of the tasks ran there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198368#comment-17198368
 ] 

Thomas Graves commented on SPARK-32935:
---

Sorry, it looks like my update went in at the same time as yours and overwrote 
it; I fixed the description to cover both.  It should be both writing and 
reading, correct [~Gengliang.Wang]?

> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads similar to Datasource 
> V1 does.  See DatasourceScanExec and config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support writing file data source with bucketing 
> {code:java} 
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}
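
For context, this is what a bucketed write and a bucketed read look like with 
the V1 file source path today (table and column names are made up); the ask 
here is for the V2 file sources to support the same thing:
{code:java}
// Bucketed writes are only supported through saveAsTable today.
spark.range(0, 1000000)
  .selectExpr("id", "id % 100 as key")
  .write
  .bucketBy(8, "key")
  .sortBy("key")
  .saveAsTable("events_bucketed")

// With spark.sql.sources.bucketing.enabled=true (the default), joins and
// aggregations on the bucket column can avoid a shuffle on the bucketed side.
spark.table("events_bucketed").groupBy("key").count().explain()
{code}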



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads or writes similar to 
Datasource V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support for writing a file data source with bucketing looks like:
{code:java}
fileDf.write.bucketBy(...).sortBy(..)... 
{code}

  was:
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support writing file data source with bucketing 

{code:java} 
fileDf.write.bucketBy(...).sortBy(..)... 
{code}


> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads or writes similar to 
> Datasource V1 does.  See DatasourceScanExec and config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support for writing a file data source with bucketing looks like:
> {code:java}
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support writing file data source with bucketing 

{code:java} 
fileDf.write.bucketBy(...).sortBy(..)... 
{code}

  was:
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.


> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads similar to Datasource 
> V1 does.  See DatasourceScanExec and config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support writing file data source with bucketing 
> {code:java} 
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

  was:
Support writing file data source with bucketing

{code:java}
fileDf.write.bucketBy(...).sortBy(..)...
{code}



> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads similar to Datasource 
> V1 does.  See DatasourceScanExec and config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-18 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198339#comment-17198339
 ] 

Thomas Graves commented on SPARK-27589:
---

Thanks for confirming and filing the Jira; I wanted to make sure I wasn't 
missing something.

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-17 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197669#comment-17197669
 ] 

Thomas Graves commented on SPARK-27589:
---

I'm guessing my question got missed: does it currently support bucketing, or 
do we have a Jira for it?

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197143#comment-17197143
 ] 

Thomas Graves commented on SPARK-27589:
---

Somewhat related: I was looking through the V2 code for Parquet and I don't 
see anything for bucketing. Is bucketing supported with the V2 API?

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196972#comment-17196972
 ] 

Thomas Graves commented on SPARK-32898:
---

[~linhongliu-db] can you please provide more of a description? You say this 
was too big: did it cause an error for your job, or did you just notice that 
the time was too big?  Do you have a reproducible case?

You have some details there about what might be wrong with taskStartTimeNs 
possibly not being initialized; if you can give more details in general that 
would be great, as it's a bit hard to follow your description.  If you have 
spent the time to debug this and have a fix in mind, please feel free to put 
up a pull request.

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Linhong Liu
>Priority: Major
>
> This might be because of incorrectly calculating executorRunTimeMs in 
> Executor.scala
> The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (it is 0).
> As of now in master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> There is a throw exception before this line. The catch branch still updates 
> the metric.
> However the query shows as SUCCESSful in QPL. Maybe this task is speculative. 
> Not sure.
>  
> submissionTime in LiveExecutionData may also have similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  
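
If the suspicion about an uninitialized taskStartTimeNs is right, the shape of 
the fix would be a guard along these lines (a sketch only; the names mirror 
Executor.scala but this is not the actual patch):
{code:java}
import java.util.concurrent.TimeUnit

// Only account run time for tasks that actually started; otherwise
// "finish - 0" yields an absurdly large executorRunTimeMs.
def executorRunTimeMs(taskStartTimeNs: Long, taskFinishNs: Long): Long =
  if (taskStartTimeNs == 0L) 0L
  else TimeUnit.NANOSECONDS.toMillis(taskFinishNs - taskStartTimeNs)
{code}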



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32894) Timestamp cast in exernal orc table

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32894:
--
Summary: Timestamp cast in exernal orc table  (was: Timestamp cast in 
exernal ocr table)

> Timestamp cast in exernal orc table
> ---
>
> Key: SPARK-32894
> URL: https://issues.apache.org/jira/browse/SPARK-32894
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
> Java 1.8
> Hadoop 3.3.0
> Hive 3.1.2
> Python 3.7 (from pyspark)
>Reporter: Grigory Skvortsov
>Priority: Major
>
> I have the external hive table stored as orc. I want to work with timestamp 
> column in my table using pyspark.
> For example, I try this:
>  spark.sql('select id, time_ from mydb.table1').show()
>  
>  Py4JJavaError: An error occurred while calling o2877.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
> at 
> org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148)
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
> at 
> 

[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Labels: correct  (was: )

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>  Labels: correct
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Labels: correctness  (was: correct)

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>  Labels: correctness
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Priority: Blocker  (was: Major)

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-09-14 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195531#comment-17195531
 ] 

Thomas Graves commented on SPARK-32037:
---

It is a good point about "blocklist" being easy to typo (though I would hope 
that would be caught in reviews), but if you are looking at the amount of 
change it is only 1 character.  Also I don't really see how BlocklistTracker 
sounds any worse than BlacklistTracker; both might be a bit weird. 
HealthTracker might be better there, although it would be better if we could 
give context to whose health, and in this case it is either a node or an 
executor, which is hard to capture in a single name.  Like you pointed out, 
then you have TaskSetHealthTracker, which isn't really right because it's 
tracking the health of the node/executor for that taskset, not the taskset 
itself.

If you look at the description of the config, "denied" seems a bit weird to me:

_If set to "true", prevent Spark from scheduling tasks on executors that have 
been blacklisted due to too many task failures. The blacklisting algorithm can 
be further controlled by the other "spark.blacklist" configuration options._

If we look at the options in the context of this sentence:

executors that have been denied due to too many task failures

executors that have been blocked due to too many task failures

executors that have been excluded due to too many task failures

The last 2 definitely make more sense in that context.  You could certainly 
rewrite the sentence for denied, but the other thing is that executors can be 
removed from the list, so denied/allowed or removed-from-denied doesn't make 
as much sense to me in this context.  Block or exclude make more sense to me 
if they can go active again (blocked/unblocked or excluded/included).

Naming things is always a pain.  Based on all the feedback, if no one has 
strong objections I will go with "blocklist".  I'll start to make the changes 
and should quickly see in context if it doesn't make sense.  Perhaps we can do 
a mix of things where BlacklistTracker is renamed HealthTracker but other 
things internally are referred to as blocklist or blocked.

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32411) GPU Cluster Fail

2020-09-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194485#comment-17194485
 ] 

Thomas Graves commented on SPARK-32411:
---

[~chitralverma] if you are still having an issue please file an issue in 
[https://github.com/NVIDIA/spark-rapids/issues]

> GPU Cluster Fail
> 
>
> Key: SPARK-32411
> URL: https://issues.apache.org/jira/browse/SPARK-32411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 3.0.0
> Environment: Ihave a Apache Spark 3.0 cluster consisting of machines 
> with multiple nvidia-gpus and I connect my jupyter notebook to the cluster 
> using pyspark,
>Reporter: Vinh Tran
>Priority: Major
>
> I'm having a difficult time getting a GPU cluster started on Apache Spark 
> 3.0. It was hard to find documentation on this, but I stumbled on a NVIDIA 
> github page for Rapids which suggested the following additional edits to the 
> spark-defaults.conf:
> {code:java}
> spark.task.resource.gpu.amount 0.25
> spark.executor.resource.gpu.discoveryScript 
> ./usr/local/spark/getGpusResources.sh{code}
> I have an Apache Spark 3.0 cluster consisting of machines with multiple 
> NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using PySpark; 
> however, it results in the following error: 
> {code:java}
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: You must specify an amount for gpu
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
>   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
>   at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.(SparkContext.scala:528)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:238)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> After this, I tried adding another line to the conf per the instructions, 
> which results in no errors; however, when I log in to the Web UI at 
> localhost:8080, under Running Applications, the state remains WAITING.
> {code:java}
> spark.task.resource.gpu.amount  2
> spark.executor.resource.gpu.discoveryScript
> ./usr/local/spark/getGpusResources.sh
> spark.executor.resource.gpu.amount  1
> {code}
>  
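For anyone hitting the same two failure modes described above, a minimal spark-defaults.conf sketch for GPU scheduling on a standalone cluster follows. The amounts and script path are illustrative only: the executor amount must be set explicitly, the per-task amount should not exceed the per-executor amount, and the standalone workers themselves also need to advertise their GPUs.
{code}
# Worker side (standalone mode): advertise the GPUs each worker offers
spark.worker.resource.gpu.amount             4
spark.worker.resource.gpu.discoveryScript    /usr/local/spark/getGpusResources.sh

# Executor side: how many GPUs each executor requests and how to discover them
spark.executor.resource.gpu.amount           1
spark.executor.resource.gpu.discoveryScript  /usr/local/spark/getGpusResources.sh

# Task side: GPUs per task; must fit within a single executor's allocation
spark.task.resource.gpu.amount               0.25
{code}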



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32824:
-

 Summary: The error is confusing when resource .amount not provided 
 Key: SPARK-32824
 URL: https://issues.apache.org/jira/browse/SPARK-32824
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


If the user forgets to specify the .amount suffix when specifying a resource, 
the error that comes out is confusing; we should improve it.

 

$ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
spark.executor.resource.gpu=1

 
{code:java}
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.propertiesSetting default log level to 
"WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error initializing 
SparkContext.java.lang.StringIndexOutOfBoundsException: String index out of 
range: -1 at java.lang.String.substring(String.java:1967) at 
org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) 
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
 at 
org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
 at 
org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773) 
at 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
 at org.apache.spark.SparkContext.(SparkContext.scala:528) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
 at scala.Option.getOrElse(Option.scala:189) at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at 
$line3.$read$$iw$$iw.(:15) at 
$line3.$read$$iw.(:42) at $line3.$read.(:44){code}
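For reference, a sketch of the invocation that triggers the confusing error versus the intended form (the improved error should presumably point at the malformed config key rather than throwing a StringIndexOutOfBoundsException):
{code}
# Triggers the confusing StringIndexOutOfBoundsException: no .amount suffix
$SPARK_HOME/bin/spark-shell --master spark://host9:7077 \
  --conf spark.executor.resource.gpu=1

# Intended form: the amount is specified as a sub-key of the resource
$SPARK_HOME/bin/spark-shell --master spark://host9:7077 \
  --conf spark.executor.resource.gpu.amount=1
{code}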



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32823:
-

 Summary: Standalone Master UI resources in use wrong
 Key: SPARK-32823
 URL: https://issues.apache.org/jira/browse/SPARK-32823
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Thomas Graves


I was using the standalone deployment with workers with GPUs, and the master UI 
was wrong for:
 * *Resources in use:* 0 / 4 gpu

In this case I had 2 workers, each with 4 GPUs, so this total should have been 
8.  It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192256#comment-17192256
 ] 

Thomas Graves commented on SPARK-32823:
---

I'm looking into this.

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was using the standalone deployment with workers with GPUs, and the master 
> UI was wrong for:
> * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 GPUs, so this total should have 
> been 8.  It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-08-25 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184059#comment-17184059
 ] 

Thomas Graves commented on SPARK-32037:
---

I started a thread on dev to get feedback: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Renaming-blacklisting-feature-input-td29950.html]

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32333) Drop references to Master

2020-08-25 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184051#comment-17184051
 ] 

Thomas Graves commented on SPARK-32333:
---

I sent an email to the dev list to get feedback; there were some other 
suggestions in that thread:
A few name possibilities:
 - ApplicationManager
 - StandaloneClusterManager
 - Coordinator
 - Primary
 - Controller
 
That chain can be found here: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-references-to-Master-td29948.html]

> Drop references to Master
> -
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We have a lot of references to "master" in the code base. It will be 
> beneficial to remove references to problematic language that can alienate 
> potential community members. 
> SPARK-32004 removed references to slave
>  
> Here is an IETF draft to fix up some of the most egregious examples
> (master/slave, whitelist/blacklist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181867#comment-17181867
 ] 

Thomas Graves commented on SPARK-32672:
---

[~cltlfcjin]  Please do not change the priority just because the reporters are 
not committers. You should first evaluate what they are reporting.  If you 
don't think it's a blocker, then we should state the reason why.

I looked at this after it was filed and added the correctness label; it was 
already marked as Blocker, so I didn't need to change it.  As you can see from 
[https://spark.apache.org/contributing.html], correctness issues should be 
marked as a blocker at least until the issue is investigated and discussed. 

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}
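Since the description notes the corruption goes away with compression disabled, a possible session-level workaround while the root cause is investigated is sketched below (a mitigation only, not a fix; it trades cache memory for correctness):
{code}
// Workaround sketch: disable compression for the in-memory columnar cache
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.cache()
bad_order.groupBy("b").count.show()  // counts should now match the uncached result
{code}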



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181468#comment-17181468
 ] 

Thomas Graves commented on SPARK-32672:
---

[~cloud_fan] [~ruifengz]

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Affects Version/s: 3.0.1

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Labels: correctness  (was: )

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180772#comment-17180772
 ] 

Thomas Graves edited comment on SPARK-32640 at 8/19/20, 7:14 PM:
-

hmm, interesting, this is how my test was reproducing with pyspark:

 

 
{code:java}
special_cases = [0.0, -0.0, 1.0, -1.0]
special_cases.append(float('nan'))
from pyspark.sql.types import *
df = spark.createDataFrame(special_cases, DoubleType())
df.selectExpr('log(value)').show()
{code}
 

+---------------+
|LOG(E(), value)|
+---------------+
|           null|
+---------------+

>>> df.show()
+-----+
|value|
+-----+
|  0.0|
| -0.0|
|  1.0|
| -1.0|
|  NaN|
+-----+

>>> df.printSchema()
root
 |-- value: double (nullable = true)

 


was (Author: tgraves):
hmm, interesting, this is how my test was reproducing with pyspark:

 

'''special_cases = [0.0, -0.0, 1.0, -1.0]

special_cases.append(float('nan'))

from pyspark.sql.types import *

df = spark.createDataFrame(special_cases, DoubleType())

df.selectExpr('log(value)').show()

'''
+---------------+
|LOG(E(), value)|
+---------------+
|           null|
+---------------+

>>> df.show()
+-----+
|value|
+-----+
|  0.0|
| -0.0|
|  1.0|
| -1.0|
|  NaN|
+-----+

>>> df.printSchema()
root
 |-- value: double (nullable = true)

 

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +--+-+
> |    value|  LOG1P(value)|
> +--+-+
> |-3.4028235E38|  null|
> |3.4028235E38|88.72283906194683|
> |  0.0|   0.0|
> | -0.0|  -0.0|
> |  1.0|0.6931471805599453|
> | -1.0|  null|
> |  NaN|  null|
> +--+-+
>  
> Spark 3.0.0 example:
>  
> +-+--+
> | value| LOG1P(value)|
> +-+--+
> |-3.4028235E38| null|
> | 3.4028235E38| 88.72283906194683|
> | 0.0| 0.0|
> | -0.0| -0.0|
> | 1.0|0.6931471805599453|
> | -1.0| null|
> | NaN| NaN|
> +-+--+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180772#comment-17180772
 ] 

Thomas Graves edited comment on SPARK-32640 at 8/19/20, 7:14 PM:
-

hmm, interesting, this is how my test was reproducing with pyspark:

 

'''special_cases = [0.0, -0.0, 1.0, -1.0]

special_cases.append(float('nan'))

from pyspark.sql.types import *

df = spark.createDataFrame(special_cases, DoubleType())

df.selectExpr('log(value)').show()

'''
+---------------+
|LOG(E(), value)|
+---------------+
|           null|
+---------------+

>>> df.show()
+-----+
|value|
+-----+
|  0.0|
| -0.0|
|  1.0|
| -1.0|
|  NaN|
+-----+

>>> df.printSchema()
root
 |-- value: double (nullable = true)

 


was (Author: tgraves):
hmm, interesting, this is how my test was reproducing with pyspark:

special_cases = [0.0, -0.0, 1.0, -1.0]

special_cases.append(float('nan'))

from pyspark.sql.types import *

df = spark.createDataFrame(special_cases, DoubleType())

df.selectExpr('log(value)').show()
+---------------+
|LOG(E(), value)|
+---------------+
|           null|
+---------------+

>>> df.show()
+-----+
|value|
+-----+
|  0.0|
| -0.0|
|  1.0|
| -1.0|
|  NaN|
+-----+

>>> df.printSchema()
root
 |-- value: double (nullable = true)

 

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +--+-+
> |    value|  LOG1P(value)|
> +--+-+
> |-3.4028235E38|  null|
> |3.4028235E38|88.72283906194683|
> |  0.0|   0.0|
> | -0.0|  -0.0|
> |  1.0|0.6931471805599453|
> | -1.0|  null|
> |  NaN|  null|
> +--+-+
>  
> Spark 3.0.0 example:
>  
> +-+--+
> | value| LOG1P(value)|
> +-+--+
> |-3.4028235E38| null|
> | 3.4028235E38| 88.72283906194683|
> | 0.0| 0.0|
> | -0.0| -0.0|
> | 1.0|0.6931471805599453|
> | -1.0| null|
> | NaN| NaN|
> +-+--+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180772#comment-17180772
 ] 

Thomas Graves commented on SPARK-32640:
---

hmm, interesting, this is how my test was reproducing with pyspark:

special_cases = [0.0, -0.0, 1.0, -1.0]

special_cases.append(float('nan'))

from pyspark.sql.types import *

df = spark.createDataFrame(special_cases, DoubleType())

df.selectExpr('log(value)').show()
+---------------+
|LOG(E(), value)|
+---------------+
|           null|
+---------------+

>>> df.show()
+-----+
|value|
+-----+
|  0.0|
| -0.0|
|  1.0|
| -1.0|
|  NaN|
+-----+

>>> df.printSchema()
root
 |-- value: double (nullable = true)
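For reference on why NaN looks like the right answer here: the underlying JVM math functions propagate NaN rather than mapping it to a missing value, so returning null reads as a behavior change rather than a correctness fix. A quick spark-shell check of the JVM-level behavior (not of the Spark SQL expression itself):
{code}
scala> math.log(Double.NaN)
res0: Double = NaN

scala> math.log1p(Double.NaN)
res1: Double = NaN
{code}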

 

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +--+-+
> |    value|  LOG1P(value)|
> +--+-+
> |-3.4028235E38|  null|
> |3.4028235E38|88.72283906194683|
> |  0.0|   0.0|
> | -0.0|  -0.0|
> |  1.0|0.6931471805599453|
> | -1.0|  null|
> |  NaN|  null|
> +--+-+
>  
> Spark 3.0.0 example:
>  
> +-+--+
> | value| LOG1P(value)|
> +-+--+
> |-3.4028235E38| null|
> | 3.4028235E38| 88.72283906194683|
> | 0.0| 0.0|
> | -0.0| -0.0|
> | 1.0|0.6931471805599453|
> | -1.0| null|
> | NaN| NaN|
> +-+--+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180533#comment-17180533
 ] 

Thomas Graves commented on SPARK-32640:
---

The problem is with 3.1.0, not 3.0.0. The description shows the column's input 
data and output.

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +--+-+
> |    value|  LOG1P(value)|
> +--+-+
> |-3.4028235E38|  null|
> |3.4028235E38|88.72283906194683|
> |  0.0|   0.0|
> | -0.0|  -0.0|
> |  1.0|0.6931471805599453|
> | -1.0|  null|
> |  NaN|  null|
> +--+-+
>  
> Spark 3.0.0 example:
>  
> +-+--+
> | value| LOG1P(value)|
> +-+--+
> |-3.4028235E38| null|
> | 3.4028235E38| 88.72283906194683|
> | 0.0| 0.0|
> | -0.0| -0.0|
> | 1.0|0.6931471805599453|
> | -1.0| null|
> | NaN| NaN|
> +-+--+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-17 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32640:
--
Description: 
I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
this but I thought NaN was correct.

Spark 3.1.0 Example:

>>> df.selectExpr(["value", "log1p(value)"]).show()

+-------------+------------------+
|        value|      LOG1P(value)|
+-------------+------------------+
|-3.4028235E38|              null|
| 3.4028235E38| 88.72283906194683|
|          0.0|               0.0|
|         -0.0|              -0.0|
|          1.0|0.6931471805599453|
|         -1.0|              null|
|          NaN|              null|
+-------------+------------------+

 

Spark 3.0.0 example:

 

+-------------+------------------+
|        value|      LOG1P(value)|
+-------------+------------------+
|-3.4028235E38|              null|
| 3.4028235E38| 88.72283906194683|
|          0.0|               0.0|
|         -0.0|              -0.0|
|          1.0|0.6931471805599453|
|         -1.0|              null|
|          NaN|               NaN|
+-------------+------------------+

 

Note it also does the same for log1p, log2, log10

  was:
I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
this but I thought NaN was correct.

Spark 3.1.0 Example:

>>> df.selectExpr(["value", "log1p(value)"]).show()

+-------------+------------------+
|        value|      LOG1P(value)|
+-------------+------------------+
|-3.4028235E38|              null|
| 3.4028235E38| 88.72283906194683|
|          0.0|               0.0|
|         -0.0|              -0.0|
|          1.0|0.6931471805599453|
|         -1.0|              null|
|          NaN|              null|
+-------------+------------------+

 

Spark 3.0.0 example:

+-------------+-----------------+
|        value|  LOG(E(), value)|
+-------------+-----------------+
|-3.4028235E38|             null|
| 3.4028235E38|88.72283906194683|
|          0.0|             null|
|         -0.0|             null|
|          1.0|              0.0|
|         -1.0|             null|
|          NaN|              NaN|
+-------------+-----------------+

 

Note it also does the same for log1p, log2, log10


> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +--+-+
> |    value|  LOG1P(value)|
> +--+-+
> |-3.4028235E38|  null|
> |3.4028235E38|88.72283906194683|
> |  0.0|   0.0|
> | -0.0|  -0.0|
> |  1.0|0.6931471805599453|
> | -1.0|  null|
> |  NaN|  null|
> +--+-+
>  
> Spark 3.0.0 example:
>  
> +-+--+
> | value| LOG1P(value)|
> +-+--+
> |-3.4028235E38| null|
> | 3.4028235E38| 88.72283906194683|
> | 0.0| 0.0|
> | -0.0| -0.0|
> | 1.0|0.6931471805599453|
> | -1.0| null|
> | NaN| NaN|
> +-+--+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-17 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32640:
-

 Summary: Spark 3.1 log(NaN) returns null instead of NaN
 Key: SPARK-32640
 URL: https://issues.apache.org/jira/browse/SPARK-32640
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Thomas Graves


I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
this but I thought NaN was correct.

Spark 3.1.0 Example:

>>> df.selectExpr(["value", "log1p(value)"]).show()

+-------------+------------------+
|        value|      LOG1P(value)|
+-------------+------------------+
|-3.4028235E38|              null|
| 3.4028235E38| 88.72283906194683|
|          0.0|               0.0|
|         -0.0|              -0.0|
|          1.0|0.6931471805599453|
|         -1.0|              null|
|          NaN|              null|
+-------------+------------------+

 

Spark 3.0.0 example:

+-------------+-----------------+
|        value|  LOG(E(), value)|
+-------------+-----------------+
|-3.4028235E38|             null|
| 3.4028235E38|88.72283906194683|
|          0.0|             null|
|         -0.0|             null|
|          1.0|              0.0|
|         -1.0|             null|
|          NaN|              NaN|
+-------------+-----------------+

 

Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32591) Add better api docs for stage level scheduling Resources

2020-08-11 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32591:
-

 Summary: Add better api docs for stage level scheduling Resources
 Key: SPARK-32591
 URL: https://issues.apache.org/jira/browse/SPARK-32591
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Thomas Graves


A question came up when we added the ability to set offheap memory in a 
ResourceProfile's executor resources:

[https://github.com/apache/spark/pull/28972/]

Based on that discussion, we should add better API docs to explain what each 
one does, perhaps pointing to the corresponding configuration.
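As a concrete target for those docs, here is a sketch of the stage-level scheduling API in question (Scala, spark-shell). The config mappings in the comments are my reading of the discussion and should be verified against the code; treat them as assumptions rather than documentation.
{code}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Executor-side requests; each setter roughly corresponds to an executor config
val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("8g")           // heap memory, like spark.executor.memory
  .memoryOverhead("1g")   // overhead memory, like spark.executor.memoryOverhead
  .offHeapMemory("2g")    // offheap memory added in the PR above (spark.memory.offHeap.*)
  .resource("gpu", 1, "/usr/local/spark/getGpusResources.sh")

// Task-side requests
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.25)

val rprof = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

// Apply the profile to a stage via the RDD API (sc is the spark-shell SparkContext)
val rdd = sc.parallelize(1 to 100).withResources(rprof).map(_ * 2)
{code}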



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-08-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170817#comment-17170817
 ] 

Thomas Graves commented on SPARK-32037:
---

Allowlist and blocklist have been used by others; it seems we may only need 
blocklist.  I'm hesitant about HealthTracker as it could be used for other 
health checks, but it does sound better.

[https://github.com/golang/go/commit/608cdcaede1e7133dc994b5e8894272c2dce744b]

[https://9to5google.com/2020/06/12/google-android-chrome-blacklist-blocklist-more-inclusive/]

[https://bugzilla.mozilla.org/show_bug.cgi?id=1571734]

 

DenyList:

https://issues.apache.org/jira/browse/GEODE-5685

[https://github.com/nodejs/node/pull/33813]

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32333) Drop references to Master

2020-08-03 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170451#comment-17170451
 ] 

Thomas Graves commented on SPARK-32333:
---

What about renaming Master to be ApplicationManager or StandaloneClusterManager?

> Drop references to Master
> -
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We have a lot of references to "master" in the code base. It will be 
> beneficial to remove references to problematic language that can alienate 
> potential community members. 
> SPARK-32004 removed references to slave
>  
> Here is an IETF draft to fix up some of the most egregious examples
> (master/slave, whitelist/blacklist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension

2020-07-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32332:
--
Fix Version/s: 3.0.1

> AQE doesn't adequately allow for Columnar Processing extension 
> ---
>
> Key: SPARK-32332
> URL: https://issues.apache.org/jira/browse/SPARK-32332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> In SPARK-27396 we added support for extended Columnar Processing. We did the 
> initial work based on what we thought was sufficient, but adaptive query 
> execution was being developed at the same time.
> We have discovered that the changes made to AQE are not sufficient for users 
> to properly extend it for columnar processing, because AQE hardcodes checks 
> for specific classes/execs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


