[jira] [Created] (SPARK-21635) ACOS(2) and ASIN(2) should be null

2017-08-03 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-21635:
---

 Summary: ACOS(2) and ASIN(2) should be null
 Key: SPARK-21635
 URL: https://issues.apache.org/jira/browse/SPARK-21635
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yuming Wang


ACOS(2) and ASIN(2) should return null. I have created a patch for Hive.
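
For reference, a quick illustration in spark-shell (a hedged check of the current behaviour, not part of the patch):

{code}
scala> spark.sql("SELECT acos(2), asin(2)").show()
// Both columns currently come back as NaN, since java.lang.Math.acos/asin
// return NaN outside [-1, 1]; the proposal is to return null instead,
// matching the Hive change.
{code}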






[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-08-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113974#comment-16113974
 ] 

Felix Cheung commented on SPARK-15799:
--

We submitted the 2.2.0 release to CRAN and got some comments that we hope to 
resolve (or get an exception for, if we can).

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?






[jira] [Created] (SPARK-21634) Change OneRowRelation from a case object to case class

2017-08-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21634:
---

 Summary: Change OneRowRelation from a case object to case class
 Key: SPARK-21634
 URL: https://issues.apache.org/jira/browse/SPARK-21634
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin
Assignee: Reynold Xin


OneRowRelation is the only plan that is a case object, which causes some issues 
with makeCopy using a 0-arg constructor. This patch changes it from a case 
object to a case class.

This blocks SPARK-21619.
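
As a rough illustration of the difference (a hedged sketch with made-up names, not the actual patch), a case class with an empty parameter list gives tree-copying code a real constructor to call, while a case object has none:

{code}
// Illustrative only: these are not the Spark classes.
case object OneRowRelationAsObject   // no constructor for makeCopy-style reflection to invoke
case class OneRowRelationAsClass()   // a 0-field constructor, so copy()/reflection work as usual

val plan = OneRowRelationAsClass()
val copied = plan.copy()             // structural copy now behaves like any other plan node
{code}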







[jira] [Updated] (SPARK-21626) The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

2017-08-03 Thread Gu Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gu Chao updated SPARK-21626:

Summary: The short-circuit local reads feature cannot be used because 
libhadoop cannot be loaded.  (was: "WARN NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable")

> The short-circuit local reads feature cannot be used because libhadoop cannot 
> be loaded.
> 
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> {code:none}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/08/04 11:24:44 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus" is already 
> registered. Ensure you dont have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is 
> already registered, and you are trying to register an identical plugin 
> located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
> already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and 
> you are trying to register an identical plugin located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar."
> 17/08/04 11:24:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Spark context Web UI available at http://192.168.50.11:4040
> Spark context available as 'sc' (master = spark://hadoop:7077, app id = 
> app-20170804112442-0001).
> Spark session available as 'spark'.
> {code}






[jira] [Updated] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Gu Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gu Chao updated SPARK-21626:

Description: 
After starting spark-shell, it outputs:

{code:none}
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/08/04 11:24:44 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus" is already 
registered. Ensure you dont have multiple JAR versions of the same plugin in 
the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar."
17/08/04 11:24:51 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://192.168.50.11:4040
Spark context available as 'sc' (master = spark://hadoop:7077, app id = 
app-20170804112442-0001).
Spark session available as 'spark'.
{code}



  was:
After starting spark-shell, it outputs:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/08/04 11:24:44 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus" is already 
registered. Ensure you dont have multiple JAR versions of the same plugin in 
the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar."
17/08/04 11:24:51 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://192.168.50.11:4040
Spark context available as 'sc' (master = spark://hadoop:7077, app id = 
app-20170804112442-0001).
Spark session available as 'spark'.



> "WARN NativeCodeLoader: Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable"
> --
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> {code:none}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes 

[jira] [Updated] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Gu Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gu Chao updated SPARK-21626:

Description: 
After starting spark-shell, it outputs:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/08/04 11:24:44 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus" is already 
registered. Ensure you dont have multiple JAR versions of the same plugin in 
the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar."
17/08/04 11:24:51 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://192.168.50.11:4040
Spark context available as 'sc' (master = spark://hadoop:7077, app id = 
app-20170804112442-0001).
Spark session available as 'spark'.


  was:
After starting spark-shell, it outputs:
17/08/03 18:24:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable



> "WARN NativeCodeLoader: Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable"
> --
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/08/04 11:24:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/08/04 11:24:44 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus" is already 
> registered. Ensure you dont have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is 
> already registered, and you are trying to register an identical plugin 
> located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
> already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
> 17/08/04 11:24:48 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and 
> you are trying to register an identical plugin located at URL 
> "file:/opt/spark-2.2.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar."
> 17/08/04 11:24:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Spark context Web UI available at http://192.168.50.11:4040
> Spark context available as 'sc' (master = spark://hadoop:7077, app id = 
> app-20170804112442-0001).
> Spark session available as 'spark'.





[jira] [Created] (SPARK-21633) Unary Transformer in Python

2017-08-03 Thread Ajay Saini (JIRA)
Ajay Saini created SPARK-21633:
--

 Summary: Unary Transformer in Python
 Key: SPARK-21633
 URL: https://issues.apache.org/jira/browse/SPARK-21633
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Ajay Saini


Currently, the abstract class UnaryTransformer is only implemented in Scala. In 
order to make PySpark easier to extend with custom transformers, it would be 
helpful to have an implementation of UnaryTransformer in Python as well.

This task involves:
- implementing the class UnaryTransformer in Python
- testing the transform() functionality of the class to make sure it works
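
For context, here is a rough Scala sketch of how the existing Scala UnaryTransformer is extended today; the Python class would mirror this shape. The Doubler name is made up for illustration, and API details may differ slightly by Spark version.

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, DoubleType}

// A subclass only supplies the element-wise function and the output data type;
// the input/output column params and transform() come from the base class.
class Doubler(override val uid: String)
    extends UnaryTransformer[Double, Double, Doubler] {

  def this() = this(Identifiable.randomUID("doubler"))

  override protected def createTransformFunc: Double => Double = _ * 2.0

  override protected def outputDataType: DataType = DoubleType

  override def copy(extra: ParamMap): Doubler = defaultCopy(extra)
}
{code}

Something like {{new Doubler().setInputCol("x").setOutputCol("x2")}} would then behave like any other single-column transformer, which is the shape the Python port would need to reproduce.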






[jira] [Commented] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113806#comment-16113806
 ] 

Saisai Shao commented on SPARK-21618:
-

[~benmayne] If you try the master branch of Spark, which includes SPARK-21012, jars 
can now be downloaded from an http(s) URL. Please give it a try.

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}






[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-03 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113793#comment-16113793
 ] 

Peng Meng commented on SPARK-21624:
---

Thanks [~mlnick], using a Vector and compressing it is reasonable. I will submit a 
PR and share the performance data. Thanks.
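
A minimal sketch of the idea (illustrative only, not the eventual PR), assuming {{org.apache.spark.ml.linalg}}: {{compressed}} keeps whichever of the dense or sparse representation is smaller, so mostly-zero stats arrays serialize much smaller.

{code}
import org.apache.spark.ml.linalg.Vectors

// Hypothetical stats array: mostly zeros, as on nodes near the leaves of the tree.
val allStats: Array[Double] =
  Array.fill(1000)(0.0).updated(3, 2.0).updated(42, 5.0)

// compressed picks the smaller of the dense/sparse forms, so shipping the
// sufficient statistics costs far fewer bytes when they are sparse.
val packed = Vectors.dense(allStats).compressed
{code}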

> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either  the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>* Flat array of elements.
>* Index for start of stats for a (feature, bin) is:
>*   index = featureOffsets(featureIndex) + binIndex * statsSize
>*/
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> on the nodes near the leaves of the tree. 
> I have changed allStats from Array to SparseVector; my tests show the 
> communication is down by about 50%.






[jira] [Created] (SPARK-21632) There is no need to make attempts for createDirectory if the dir had existed

2017-08-03 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-21632:
--

 Summary: There is no need to make attempts for createDirectory if 
the dir had existed 
 Key: SPARK-21632
 URL: https://issues.apache.org/jira/browse/SPARK-21632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: liuzhaokun
Priority: Minor


There is no need to keep making attempts in createDirectory if the directory already 
exists. So I think we should log it and jump out of the loop.
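
A simplified sketch of the proposal (not the actual Utils.createDirectory source; the names and retry limit here are made up): when a candidate directory already exists, log it and leave the loop instead of spending the remaining attempts.

{code}
import java.io.File
import java.util.UUID

def createDirectory(root: File, namePrefix: String, maxAttempts: Int = 10): Option[File] = {
  var attempts = 0
  while (attempts < maxAttempts) {
    attempts += 1
    val dir = new File(root, s"$namePrefix-${UUID.randomUUID()}")
    if (dir.exists()) {
      // proposed change: log and jump out of the loop rather than retrying
      println(s"Directory $dir already exists, stopping after $attempts attempt(s)")
      return None
    }
    if (dir.mkdirs()) {
      return Some(dir)
    }
  }
  None
}
{code}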






[jira] [Commented] (SPARK-21627) analyze hive table compute stats for columns with mixed case exception

2017-08-03 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113778#comment-16113778
 ] 

Liang-Chi Hsieh commented on SPARK-21627:
-

I think it is just solved by SPARK-21599.

> analyze hive table compute stats for columns with mixed case exception
> --
>
> Key: SPARK-21627
> URL: https://issues.apache.org/jira/browse/SPARK-21627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> {code}
> sql("create table tabel1(b int) partitioned by (partColumn int)")
> sql("analyze table tabel1 compute statistics for columns partColumn, b")
> {code}
> {code}
> java.util.NoSuchElementException: key not found: partColumn
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
>   ... 39 elided
> {code}
> Looks like a regression introduced by https://github.com/apache/spark/pull/18248
> In {{HiveExternalCatalog.alterTableStats}}, {{colNameTypeMap}} contains 
> lower-case column names.






[jira] [Commented] (SPARK-21630) Pmod should not throw a divide by zero exception

2017-08-03 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113777#comment-16113777
 ] 

Liang-Chi Hsieh commented on SPARK-21630:
-

Maybe a duplicate of SPARK-21205?

> Pmod should not throw a divide by zero exception
> 
>
> Key: SPARK-21630
> URL: https://issues.apache.org/jira/browse/SPARK-21630
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Herman van Hovell
>
> Pmod currently throws a divide by zero exception when the right input is 0. 
> It should - like Divide or Remainder - probably return null.
> Here is a small reproducer:
> {noformat}
> scala> sql("select pmod(id, 0) from range(10)").show
> 17/08/03 22:36:43 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.ArithmeticException: / by zero
> {noformat}






[jira] [Commented] (SPARK-21631) Building Spark with SBT unsuccessful when source code in Mllib is modified, But with MVN is ok

2017-08-03 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113775#comment-16113775
 ] 

Liang-Chi Hsieh commented on SPARK-21631:
-

Looks like it is just not compliant with the Spark code style?
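
One plausible (unconfirmed) cause, assuming it is the import-ordering rule that fires: Spark's scalastyle configuration expects imports grouped and ordered (java/javax, scala, third-party, then org.apache.spark, separated by blank lines), so an import dropped at the very top of the file can fail SBT's scalastyle task even though a plain Maven compile succeeds. Illustrative grouping only, not taken from LinearRegression.scala:

{code}
import java.util.Locale

import scala.collection.mutable

import org.apache.hadoop.fs.Path

import org.apache.spark.internal.Logging
{code}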

> Building Spark with SBT unsuccessful when source code in Mllib is modified, 
> But with MVN is ok
> --
>
> Key: SPARK-21631
> URL: https://issues.apache.org/jira/browse/SPARK-21631
> Project: Spark
>  Issue Type: Bug
>  Components: Build, MLlib
>Affects Versions: 2.1.1
> Environment: ubuntu 14.04
> Spark 2.1.1
> MVN 3.3.9
> scala 2.11.8
>Reporter: Sean Wong
>
> I added 
> import org.apache.spark.internal.Logging
> at the head of the LinearRegression.scala file.
> Then I tried to build Spark using SBT.
> However, here is the error:
> *[info] Done packaging.
> java.lang.RuntimeException: errors exist
> at scala.sys.package$.error(package.scala:27)
> at org.scalastyle.sbt.Tasks$.onHasErrors$1(Plugin.scala:132)
> at 
> org.scalastyle.sbt.Tasks$.doScalastyleWithConfig$1(Plugin.scala:187)
> at org.scalastyle.sbt.Tasks$.doScalastyle(Plugin.scala:195)
> at 
> SparkBuild$$anonfun$cachedScalaStyle$1$$anonfun$17.apply(SparkBuild.scala:205)
> at 
> SparkBuild$$anonfun$cachedScalaStyle$1$$anonfun$17.apply(SparkBuild.scala:192)
> at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
> at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
> at 
> sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:249)
> at 
> sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:245)
> at sbt.Difference.apply(Tracked.scala:224)
> at sbt.Difference.apply(Tracked.scala:206)
> at 
> sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:245)
> at 
> sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:244)
> at sbt.Difference.apply(Tracked.scala:224)
> at sbt.Difference.apply(Tracked.scala:200)
> at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:244)
> at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:242)
> at SparkBuild$$anonfun$cachedScalaStyle$1.apply(SparkBuild.scala:212)
> at SparkBuild$$anonfun$cachedScalaStyle$1.apply(SparkBuild.scala:187)
> at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
> at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
> at sbt.std.Transform$$anon$4.work(System.scala:63)
> at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
> at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.Execute.work(Execute.scala:237)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
> at 
> sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
> at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> [error] (mllib/*:scalaStyleOnCompile) errors exist*
> After this, I switched to using MVN to build Spark; everything is OK and the 
> build succeeds.
> So is this a bug in the SBT build? 






[jira] [Commented] (SPARK-20870) Update the output of spark-sql -H

2017-08-03 Thread Bravo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113773#comment-16113773
 ] 

Bravo Zhang commented on SPARK-20870:
-

Hi [~smilegator],
I can't find the code handling the help message in Spark.
Is it managed in the Hive project? 
https://github.com/apache/hive/blob/master/cli/src/java/org/apache/hadoop/hive/cli/OptionsProcessor.java

> Update the output of spark-sql -H
> -
>
> Key: SPARK-20870
> URL: https://issues.apache.org/jira/browse/SPARK-20870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> When we input `./bin/spark-sql -H`, the output is still based on Hive. We 
> need to check whether all of the listed options actually work; any that are 
> not supported should be removed from the list. 
> Also, update the first line to `usage: spark-sql`
> {noformat}
> usage: hive
> -d,--define 

[jira] [Comment Edited] (SPARK-21629) OR nullability is incorrect

2017-08-03 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113752#comment-16113752
 ] 

Takeshi Yamamuro edited comment on SPARK-21629 at 8/4/17 12:58 AM:
---

What's a concrete query and result example?
Is it like the sequence below (I think this is a correct case, though...)?
{code}
scala> Seq((Some(1), 1), (None, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = true)


scala> Seq((1, 1), (0, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = false)
{code}


was (Author: maropu):
What's a concrete query and result example?
Is it like the sequence below?
{code}
scala> Seq((Some(1), 1), (None, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = true)


scala> Seq((1, 1), (0, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = false)
{code}

> OR nullability is incorrect
> ---
>
> Key: SPARK-21629
> URL: https://issues.apache.org/jira/browse/SPARK-21629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> The SQL {{OR}} expression's nullability is slightly incorrect. It should only 
> be nullable when both of the input expressions are nullable, and not when 
> either of them is nullable.






[jira] [Commented] (SPARK-21629) OR nullability is incorrect

2017-08-03 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113752#comment-16113752
 ] 

Takeshi Yamamuro commented on SPARK-21629:
--

What's a concrete query and result example?
Is it like the sequence below?
{code}
scala> Seq((Some(1), 1), (None, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = true)


scala> Seq((1, 1), (0, 2)).toDF("a", "b").selectExpr("a > 0 OR b > 
0").printSchema
root
 |-- ((a > 0) OR (b > 0)): boolean (nullable = false)
{code}

> OR nullability is incorrect
> ---
>
> Key: SPARK-21629
> URL: https://issues.apache.org/jira/browse/SPARK-21629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Herman van Hovell
>Priority: Minor
>
> The SQL {{OR}} expression's nullability is slightly incorrect. It should only 
> be nullable when both of the input expressions are nullable, and not when 
> either of them is nullable.






[jira] [Created] (SPARK-21631) Building Spark with SBT unsuccessful when source code in Mllib is modified, But with MVN is ok

2017-08-03 Thread Sean Wong (JIRA)
Sean Wong created SPARK-21631:
-

 Summary: Building Spark with SBT unsuccessful when source code in 
Mllib is modified, But with MVN is ok
 Key: SPARK-21631
 URL: https://issues.apache.org/jira/browse/SPARK-21631
 Project: Spark
  Issue Type: Bug
  Components: Build, MLlib
Affects Versions: 2.1.1
 Environment: ubuntu 14.04

Spark 2.1.1

MVN 3.3.9

scala 2.11.8
Reporter: Sean Wong


I added 
import org.apache.spark.internal.Logging
at the head of the LinearRegression.scala file.

Then I tried to build Spark using SBT.
However, here is the error:
*[info] Done packaging.
java.lang.RuntimeException: errors exist
at scala.sys.package$.error(package.scala:27)
at org.scalastyle.sbt.Tasks$.onHasErrors$1(Plugin.scala:132)
at org.scalastyle.sbt.Tasks$.doScalastyleWithConfig$1(Plugin.scala:187)
at org.scalastyle.sbt.Tasks$.doScalastyle(Plugin.scala:195)
at 
SparkBuild$$anonfun$cachedScalaStyle$1$$anonfun$17.apply(SparkBuild.scala:205)
at 
SparkBuild$$anonfun$cachedScalaStyle$1$$anonfun$17.apply(SparkBuild.scala:192)
at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
at 
sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:249)
at 
sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:245)
at sbt.Difference.apply(Tracked.scala:224)
at sbt.Difference.apply(Tracked.scala:206)
at 
sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:245)
at 
sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:244)
at sbt.Difference.apply(Tracked.scala:224)
at sbt.Difference.apply(Tracked.scala:200)
at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:244)
at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:242)
at SparkBuild$$anonfun$cachedScalaStyle$1.apply(SparkBuild.scala:212)
at SparkBuild$$anonfun$cachedScalaStyle$1.apply(SparkBuild.scala:187)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[error] (mllib/*:scalaStyleOnCompile) errors exist*

After this, I switched to using MVN to build Spark; everything is OK and the 
build succeeds.

So is this a bug in the SBT build? 






[jira] [Commented] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher

2017-08-03 Thread Arthur Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113731#comment-16113731
 ] 

Arthur Rand commented on SPARK-20812:
-

https://github.com/apache/spark/pull/18837

> Add Mesos Secrets support to the spark dispatcher
> -
>
> Key: SPARK-20812
> URL: https://issues.apache.org/jira/browse/SPARK-20812
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Michael Gummelt
>
> Mesos 1.4 will support secrets.  In order to support sending keytabs through 
> the Spark Dispatcher, or any other secret, we need to integrate this with the 
> Spark Dispatcher.
> The integration should include support for both file-based and env-based 
> secrets.






[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2017-08-03 Thread David Lewis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113727#comment-16113727
 ] 

David Lewis commented on SPARK-19870:
-

I think I'm hitting a similar bug. Here are two stack traces in the block 
manager, one waiting for read and one waiting for write:
{code}java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:236)
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1323)
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1314)
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1314)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1314)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617){code}

and 

{code}java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:236)
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1323)
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1314)
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1314)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1314)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:66)
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:748){code}

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
> Attachments: stack.txt
>
>
> I'm running what I believe to be a fairly vanilla Spark job, using the RDD API, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to Parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked 

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-08-03 Thread Taichi Sano (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113702#comment-16113702
 ] 

Taichi Sano commented on SPARK-18406:
-

Hello,
I am experiencing an issue very similar to this. I am currently trying to do a 
groupByKeyAndWindow() with batch size of 1, window size of 80, and shift size 
of 1 from data that is being streamed from Kafka (ver 0.10) with Direct 
Streaming. Every once in a while, I encounter an AssertionError like this:

17/08/03 22:32:19 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 20936.0 (TID 4409)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at 
org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at 
org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/08/03 22:32:19 ERROR org.apache.spark.executor.Executor: Exception in task 
0.1 in stage 20936.0 (TID 4410)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at 
org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at 
org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/08/03 22:32:19 ERROR org.apache.spark.util.Utils: Uncaught exception in 
thread stdout writer for /opt/conda/bin/python
java.lang.AssertionError: assertion failed: Block rdd_30291_0 is not locked for 
reading
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at 
org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at 
org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at 
org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at 

[jira] [Commented] (SPARK-20853) spark.ui.reverseProxy=true leads to hanging communication to master

2017-08-03 Thread Josh Bacon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113670#comment-16113670
 ] 

Josh Bacon commented on SPARK-20853:


For the record, I'm experiencing the exact same behavior as described by 
[~tmckay]. If the total number of workers + drivers exceeds 9 (each with 
spark.ui.reverseProxy enabled), then the Master UI becomes unresponsive. 
Reduce the number of workers or running drivers below that threshold and the 
Master UI becomes responsive again.

> spark.ui.reverseProxy=true leads to hanging communication to master
> ---
>
> Key: SPARK-20853
> URL: https://issues.apache.org/jira/browse/SPARK-20853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Benno Staebler
>  Labels: network, web-ui
>
> When *reverse proxy is enabled*
> {quote}
> spark.ui.reverseProxy=true
> spark.ui.reverseProxyUrl=/
> {quote}
> First of all, any invocation of the Spark master Web UI hangs forever, both 
> locally (e.g. http://192.168.10.16:25001) and via the external URL, without 
> any data received. 
> One, sometimes two, Spark applications succeed without error and then workers 
> start throwing exceptions:
> {quote}
> Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
> {quote}
> The application dies during creation of SparkContext:
> {quote}
> 2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
> connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
> bootstraps)
> 2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has 
> been killed. Reason: All masters are unresponsive! Giving up.
> 2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is 
> not initialized yet.
> 2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
> .
> Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
> call getServletHandlers on a running MetricsSystem
> {quote}
> *This definitively does not happen without reverse proxy enabled!*






[jira] [Commented] (SPARK-19112) add codec for ZStandard

2017-08-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113658#comment-16113658
 ] 

Marcelo Vanzin commented on SPARK-19112:


Yes, we can't merge the PR until Facebook re-licenses the code.

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.






[jira] [Commented] (SPARK-19112) add codec for ZStandard

2017-08-03 Thread Adam Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113653#comment-16113653
 ] 

Adam Kennedy commented on SPARK-19112:
--

Will this be impacted by LEGAL-303? zstd-jni embeds zstd, which has the Facebook 
PATENTS file in it.

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.






[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-08-03 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113635#comment-16113635
 ] 

Roberto Mirizzi commented on SPARK-21478:
-

Hi [~smilegator], is that documented somewhere?

> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that you will see that also the 2nd DF is gone.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}






[jira] [Resolved] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21618.
---
Resolution: Duplicate

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2017-08-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113622#comment-16113622
 ] 

Tejas Patil commented on SPARK-21595:
-

[~hvanhovell] : I am fine with either of the options you mentioned. 

One more option: right now the switch from in-memory to `UnsafeExternalSorter` 
and the `UnsafeExternalSorter` spill to disk are controlled by a single 
threshold. If we de-couple those two using separate thresholds, then the "spill 
on memory pressure" behavior will be achieved. The in-memory threshold can be 
kept small, and keeping the spill-to-disk threshold higher will avoid excessive 
disk spills. This is a fairly simple change to do. What do you think?
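
A hypothetical sketch of the decoupling in spark-shell; the first key name below 
is invented for illustration only, the second is the existing threshold this 
ticket is about:

{code:java}
// Keep the switch to UnsafeExternalSorter early, but only let the sorter itself spill much later.
spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", "4096")    // hypothetical name
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "2097152")
{code}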

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Priority: Minor
>  Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In spark 2.1 this works without problem, in spark 2.2 this fails either with 
> out of memory exception or too many open files exception, depending on memory 
> settings (which is what I tried first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but take a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113605#comment-16113605
 ] 

Mark Hamstra commented on SPARK-21619:
--

But part of the point of the split in my half-baked example is to fork the 
query execution pipeline before physical plan generation, allowing the cost of 
that generation to be parallelized with an instance per execution engine. Yes, 
maybe doing dispatch of physical plans via the CBO or other means is all that I 
should realistically hope for, but it doesn't mean that it isn't worth thinking 
about alternatives. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21617) ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables

2017-08-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21617:
---
Summary: ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables  (was: 
ALTER TABLE...ADD COLUMNS creates invalid metadata in Hive metastore for DS 
tables)

> ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables
> --
>
> Key: SPARK-21617
> URL: https://issues.apache.org/jira/browse/SPARK-21617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> When you have a data source table and you run a "ALTER TABLE...ADD COLUMNS" 
> query, Spark will save invalid metadata to the Hive metastore.
> Namely, it will overwrite the table's schema with the data frame's schema; 
> that is not desired for data source tables (where the schema is stored in a 
> table property instead).
> Moreover, if you use a newer metastore client where 
> METASTORE_DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES is on by default, you 
> actually get an exception:
> {noformat}
> InvalidOperationException(message:The following columns have types 
> incompatible with the existing columns in their respective positions :
> c1)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.throwExceptionIfIncompatibleColTypeChange(MetaStoreUtils.java:615)
>   at 
> org.apache.hadoop.hive.metastore.HiveAlterHandler.alterTable(HiveAlterHandler.java:133)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_core(HiveMetaStore.java:3704)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_with_environment_context(HiveMetaStore.java:3675)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
>   at com.sun.proxy.$Proxy26.alter_table_with_environment_context(Unknown 
> Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table_with_environmentContext(HiveMetaStoreClient.java:402)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table_with_environmentContext(SessionHiveMetaStoreClient.java:309)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
>   at com.sun.proxy.$Proxy27.alter_table_with_environmentContext(Unknown 
> Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:601)
> {noformat}
> That exception is handled by Spark in an odd way (see code in 
> {{HiveExternalCatalog.scala}}) which still stores invalid metadata.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21617) ALTER TABLE...ADD COLUMNS creates invalid metadata in Hive metastore for DS tables

2017-08-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113602#comment-16113602
 ] 

Marcelo Vanzin commented on SPARK-21617:


Here's the full test error from our internal build against 2.1:

{noformat}
15:11:29.602 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Could 
not alter schema of table  `default`.`t1` in a Hive compatible way. Updating 
Hive metastore in Spark SQL specific format.
[snip]
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter 
table. The following columns have types incompatible with the existing columns 
in their respective positions :
c1
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:624)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:602)
- alter datasource table add columns - partitioned - csv *** FAILED ***
  org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be 
specified for the table;
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:107)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableSchema(HiveExternalCatalog.scala:656)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableSchema(SessionCatalog.scala:372)
{noformat}

So the exception above is just a warning, and the problem seems to actually be 
in how Spark is recovering from that situation (the exception handler in 
{{HiveExternalCatalog.alterTableSchema}}).
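
For reference, a rough spark-shell reproduction of that path (the table and 
column names are made up, and this assumes a build talking to a Hive 2.1 
metastore client as above):

{code:java}
spark.sql("CREATE TABLE t1 (c1 INT, p1 INT) USING csv PARTITIONED BY (p1)")
// Goes through SessionCatalog.alterTableSchema -> HiveExternalCatalog.alterTableSchema,
// which is where the warning and the follow-up AnalysisException show up.
spark.sql("ALTER TABLE t1 ADD COLUMNS (c2 STRING)")
{code}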


> ALTER TABLE...ADD COLUMNS creates invalid metadata in Hive metastore for DS 
> tables
> --
>
> Key: SPARK-21617
> URL: https://issues.apache.org/jira/browse/SPARK-21617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> When you have a data source table and you run a "ALTER TABLE...ADD COLUMNS" 
> query, Spark will save invalid metadata to the Hive metastore.
> Namely, it will overwrite the table's schema with the data frame's schema; 
> that is not desired for data source tables (where the schema is stored in a 
> table property instead).
> Moreover, if you use a newer metastore client where 
> METASTORE_DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES is on by default, you 
> actually get an exception:
> {noformat}
> InvalidOperationException(message:The following columns have types 
> incompatible with the existing columns in their respective positions :
> c1)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.throwExceptionIfIncompatibleColTypeChange(MetaStoreUtils.java:615)
>   at 
> org.apache.hadoop.hive.metastore.HiveAlterHandler.alterTable(HiveAlterHandler.java:133)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_core(HiveMetaStore.java:3704)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_with_environment_context(HiveMetaStore.java:3675)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
>   at com.sun.proxy.$Proxy26.alter_table_with_environment_context(Unknown 
> Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table_with_environmentContext(HiveMetaStoreClient.java:402)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table_with_environmentContext(SessionHiveMetaStoreClient.java:309)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
>   at com.sun.proxy.$Proxy27.alter_table_with_environmentContext(Unknown 
> Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:601)
> {noformat}
> That exception is handled by Spark in an odd way (see code in 
> {{HiveExternalCatalog.scala}}) which still stores invalid metadata.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113598#comment-16113598
 ] 

Reynold Xin commented on SPARK-21619:
-

Just look at structured streaming. That would be one example.




> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113593#comment-16113593
 ] 

Reynold Xin commented on SPARK-21619:
-

Just generate a different physical plan?




> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113586#comment-16113586
 ] 

Mark Hamstra commented on SPARK-21619:
--

Or you can just enlighten me on how one should design a dispatch function for 
multiple expressions of semantically equivalent query plans under the current 
architecture. :) Dispatching based on a canonical form of a plan seems like an 
obvious solution to me, but maybe I'm missing something.  

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113579#comment-16113579
 ] 

Reynold Xin commented on SPARK-21619:
-

Ok so we are good with this one.

Sorry, I don't see why this issue blocks or has any impact on supporting
different execution engines. I have done many prototypes myself in the past
that do exactly what you were describing under the current design.
Maybe we just need to agree to disagree.




> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113555#comment-16113555
 ] 

Mark Hamstra edited comment on SPARK-21619 at 8/3/17 10:01 PM:
---

Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I'm more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  


was (Author: markhamstra):
Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113571#comment-16113571
 ] 

Mark Hamstra commented on SPARK-21619:
--

_"Why would you want to execute multiple semantically equivalent plans in 
different forms?" -> Because they can be executed in different times, using 
different aliases, etc?_

Right, so for separate executions of semantically equivalent plans you need to 
maintain a mapping between the aliases of a particular plan and their canonical 
form, but after doing that you can more easily recover data and metadata 
associated with a prior execution of an equivalent plan.
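
A hedged sketch of that kind of lookup under the existing APIs - the buffer and 
the String metadata below are purely illustrative, and Spark's own CacheManager 
does something similar internally via sameResult:

{code:java}
// Illustrative only: remember prior executions and find one whose plan is semantically
// equivalent to a new query, without ever executing a canonicalized plan.
import org.apache.spark.sql.DataFrame
import scala.collection.mutable.ArrayBuffer

val priorExecutions = ArrayBuffer.empty[(DataFrame, String)]   // (earlier query, associated metadata)

def findEquivalent(df: DataFrame): Option[String] =
  priorExecutions.collectFirst {
    case (prev, meta)
        if prev.queryExecution.optimizedPlan.sameResult(df.queryExecution.optimizedPlan) =>
      meta
  }
{code}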

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113555#comment-16113555
 ] 

Mark Hamstra commented on SPARK-21619:
--

Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I'm more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113538#comment-16113538
 ] 

Reynold Xin commented on SPARK-21619:
-

Also self-joins are very difficult to handle. They have different expression 
ids for resolution, even though on both sides of the join the plans (at least 
the subtrees) are semantically equivalent.


> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113536#comment-16113536
 ] 

Reynold Xin commented on SPARK-21619:
-

Mark, that's a great point, but you are going into the existential question of 
how we should design query execution, and potentially overthrowing the entire 
architecture here. The way canonicalization is currently defined in Spark, it 
is not meant for execution. This ticket simply enforces that with a few lines 
of change.

If we want to redesign how query execution should work (I don't see why we 
would want to, since I don't see much real practical benefit given we already 
have sameResult and semanticHash), we should do it separately.
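
For instance, sameResult already matches two plans that differ only in 
expression ids, without executing either canonicalized form; a minimal 
spark-shell sketch (the queries are arbitrary):

{code:java}
// Two identical queries built separately, so their plans carry different expression ids.
val q1 = spark.range(100).filter("id > 5")
val q2 = spark.range(100).filter("id > 5")
// Compares the canonicalized forms; the canonicalized plans themselves are never executed.
q1.queryExecution.optimizedPlan.sameResult(q2.queryExecution.optimizedPlan)   // true
{code}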

"Why would you want to execute multiple semantically equivalent plans in 
different forms?" -> Because they can be executed in different times, using 
different aliases, etc?

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113526#comment-16113526
 ] 

Mark Hamstra commented on SPARK-21619:
--

Two reasons, mostly: 1) to provide better guarantees that plans that are deemed 
to be semantically equivalent actually end up being expressed the same way 
before execution and thus go down the same code paths; 2) to simplify some 
downstream logic: instead of needing to maintain a mapping between multiple 
semantically equivalent plans and a single canonical form, after a certain 
canonicalization point the plans really are the same.

To perhaps clear up my confusion, maybe you can answer the question going the 
other way: Why would you want to execute multiple semantically equivalent plans 
in different forms?

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE

2017-08-03 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113519#comment-16113519
 ] 

Burak Yavuz commented on SPARK-21588:
-

that's what I was proposing. `null` seemed more familiar than `` 
before I looked at the code. 

> SQLContext.getConf(key, null) should return null, but it throws NPE
> ---
>
> Key: SPARK-21588
> URL: https://issues.apache.org/jira/browse/SPARK-21588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Priority: Minor
>
> SQLContext.get(key) for a key that is not defined in the conf, and doesn't 
> have a default value defined, throws a NoSuchElementException. In order to 
> avoid that, I used a null as the default value, which threw a NPE instead. If 
> it is null, it shouldn't try to parse the default value in `getConfString`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113516#comment-16113516
 ] 

Reynold Xin commented on SPARK-21619:
-

Sorry I don't understand your question or point at all. Why should a plan be 
canonicalized before execution?


> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE

2017-08-03 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113515#comment-16113515
 ] 

Anton Okolnychyi commented on SPARK-21588:
--

Sure, but the converter will not be called if the default value that you pass 
is "". However, the check can be extended to `defaultValue != null 
&& defaultValue != ""` in the SQLConf#getConfString.
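
A stand-alone sketch of that guard, with made-up stand-ins for the sentinel 
string and the converter registry (this is not the actual SQLConf code):

{code:java}
// NO_DEFAULT and `converters` are illustrative placeholders.
val NO_DEFAULT = "<no default>"
val converters = Map("spark.sql.shuffle.partitions" -> ((v: String) => v.toInt.toString))

def getConfString(settings: Map[String, String], key: String, defaultValue: String): String = {
  if (defaultValue != null && defaultValue != NO_DEFAULT) {
    // only validate a caller-supplied real default with the entry's converter
    converters.get(key).foreach(convert => convert(defaultValue))
  }
  settings.getOrElse(key, defaultValue)   // a null default now falls through instead of NPE-ing
}

getConfString(Map.empty, "spark.sql.shuffle.partitions", null)   // returns null, no NPE
{code}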

> SQLContext.getConf(key, null) should return null, but it throws NPE
> ---
>
> Key: SPARK-21588
> URL: https://issues.apache.org/jira/browse/SPARK-21588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Priority: Minor
>
> SQLContext.get(key) for a key that is not defined in the conf, and doesn't 
> have a default value defined, throws a NoSuchElementException. In order to 
> avoid that, I used a null as the default value, which threw a NPE instead. If 
> it is null, it shouldn't try to parse the default value in `getConfString`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113510#comment-16113510
 ] 

Mark Hamstra commented on SPARK-21619:
--

Ok, but my point is that if plans are to be canonicalized for some reasons, 
maybe they should also be canonicalized before execution. It seems odd both to 
execute plans that are not in a canonical form and to not be able to execute 
plans that are in a canonical form. That view makes failing the execution of 
canonical plans look more like a workaround/hack (maybe needed in the short 
term) than a solution to a deeper issue. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113498#comment-16113498
 ] 

Reynold Xin commented on SPARK-21619:
-

A canonicalized plan is used for semantic comparison. This has nothing to do 
with blocking the execution of query plans in general. It is to avoid buggy 
code accidentally executing a canonicalized plan that is meant only for 
comparison, not execution, which leads to silently incorrect results or 
weird exceptions at runtime.


> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113494#comment-16113494
 ] 

Mark Hamstra commented on SPARK-21619:
--

Can you provide a little more context, Reynold, since on its face it would seem 
that if plans are to be blocked from executing based on their form, then 
non-canonical plans would be the ones that should be blocked. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21630) Pmod should not throw a divide by zero exception

2017-08-03 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-21630:
-

 Summary: Pmod should not throw a divide by zero exception
 Key: SPARK-21630
 URL: https://issues.apache.org/jira/browse/SPARK-21630
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.0.2
Reporter: Herman van Hovell


Pmod currently throws a divide by zero exception when the right input is 0. It 
should - like Divide or Remainder - probably return null.

Here is a small reproducer:
{noformat}
scala> sql("select pmod(id, 0) from range(10)").show
17/08/03 22:36:43 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.ArithmeticException: / by zero
{noformat}
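
For comparison, Divide already degrades to null on a zero divisor, which is the 
behavior proposed here for pmod; a quick spark-shell check (the query is only 
illustrative):

{code:java}
sql("select id / 0 from range(10)").show()   // every row is null, no exception is thrown
{code}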



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21629) OR nullability is incorrect

2017-08-03 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-21629:
-

 Summary: OR nullability is incorrect
 Key: SPARK-21629
 URL: https://issues.apache.org/jira/browse/SPARK-21629
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.0.0
Reporter: Herman van Hovell
Priority: Minor


The SQL {{OR}} expression's nullability is slightly incorrect. It should only 
be nullable when both of the input expressions are nullable, and not when 
either of them is nullable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE

2017-08-03 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113422#comment-16113422
 ] 

Burak Yavuz commented on SPARK-21588:
-

[~vinodkc] [~aokolnychyi]

It happens when the config has a value converter, for example 
`spark.sql.shuffle.partitions` - basically any non-string SQL conf.

> SQLContext.getConf(key, null) should return null, but it throws NPE
> ---
>
> Key: SPARK-21588
> URL: https://issues.apache.org/jira/browse/SPARK-21588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Priority: Minor
>
> SQLContext.get(key) for a key that is not defined in the conf, and doesn't 
> have a default value defined, throws a NoSuchElementException. In order to 
> avoid that, I used a null as the default value, which threw a NPE instead. If 
> it is null, it shouldn't try to parse the default value in `getConfString`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE

2017-08-03 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113412#comment-16113412
 ] 

Anton Okolnychyi commented on SPARK-21588:
--

I did not manage to reproduce this. I tried:

{code}
spark.sqlContext.getConf("spark.sql.streaming.checkpointLocation", null) // null
spark.sqlContext.getConf("spark.sql.thriftserver.scheduler.pool", null) // null
spark.sqlContext.getConf("spark.sql.sources.outputCommitterClass", null) // null
spark.sqlContext.getConf("blabla", null) // null
spark.sqlContext.getConf("spark.sql.sources.outputCommitterClass") // 

{code}

I got a NPE only when I called getConf(key, null) for a parameter with a 
default value. For example, 
{code}
spark.sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", 
"") // 
spark.sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", null) // 
NPE
{code}


> SQLContext.getConf(key, null) should return null, but it throws NPE
> ---
>
> Key: SPARK-21588
> URL: https://issues.apache.org/jira/browse/SPARK-21588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Priority: Minor
>
> SQLContext.get(key) for a key that is not defined in the conf, and doesn't 
> have a default value defined, throws a NoSuchElementException. In order to 
> avoid that, I used a null as the default value, which threw a NPE instead. If 
> it is null, it shouldn't try to parse the default value in `getConfString`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-08-03 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113404#comment-16113404
 ] 

Brendan Dwyer commented on SPARK-15799:
---

Is there any update on this?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2017-08-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113388#comment-16113388
 ] 

Herman van Hovell commented on SPARK-21595:
---

The old and the new code are not exactly the same. The old code path would 
start using a disk-spilling buffer when a window became larger than 4096 
rows. The key difference is that the old code path would not start to spill at 
that point; that would only happen when Spark got pressed for memory and the 
memory manager started to force spills. The current version is overly eager 
and starts spilling at a much earlier stage. We have seen similar problems with 
customer workloads on our end.

We either need to set this to a more sensible default, or return to the old 
behavior.
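
Until that is settled, the workaround from the description can also be applied 
per session, e.g. in spark-shell or a notebook (2097152 is the value that 
restored Spark 2.1-like behavior for that workload, not a recommended default):

{code:java}
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "2097152")
{code}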

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Priority: Minor
>  Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In spark 2.1 this works without problem, in spark 2.2 this fails either with 
> out of memory exception or too many open files exception, depending on memory 
> settings (which is what I tried first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but take a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2017-08-03 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113264#comment-16113264
 ] 

Brad commented on SPARK-21097:
--

I'm still working on thoroughly benchmarking and testing this change. If anyone 
is interested in this, send me a message. Thanks

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 
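
For illustration, hedged usage of the proposed switch via SparkConf; the key is 
only the name proposed above and does not exist in any released Spark:

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")              // required by dynamic allocation
  .set("spark.dynamicAllocation.recoverCachedData", "true")  // proposed, not yet implemented
{code}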



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21453) Cached Kafka consumer may be closed too early

2017-08-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21453:
-
Summary: Cached Kafka consumer may be closed too early  (was: Streaming 
kafka source (structured spark))

> Cached Kafka consumer may be closed too early
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using built-in kafka source and sink (over SSL), with  I 
> am getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
> .option("kafka.connections.max.idle.ms", 360)
> //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.mechanism", "GSSAPI")
> .option("kafka.sasl.kerberos.service.name", "kafka")
> .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
> .option("kafka.ssl.truststore.password", "changeit")
> .start()
> {code}
> {code:java}
> 17/07/18 10:11:58 WARN SslTransportLayer: Failed to send SSL Close message 
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>   at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731)
>   at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54)
>   at org.apache.kafka.common.network.Selector.doClose(Selector.java:540)
>   at org.apache.kafka.common.network.Selector.close(Selector.java:531)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
>   at 
> 

[jira] [Assigned] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-08-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-20713:
-

Assignee: (was: Nuochen Lyu)

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
> Fix For: 2.3.0
>
>
> When running speculative tasks you can end up getting a task failure on a 
> speculative task (the other task succeeded) because that task got a 
> CommitDenied exception when really it was "killed" by the driver. It is a 
> race between when the driver kills and when the executor tries to commit.
> I think ideally we should fix up the task state on this to be killed, because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse the user and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks. Only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.

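An editor's sketch of the outcome classification the description argues for 
(illustrative only, with hypothetical names; this is not Spark's actual 
TaskState machinery):

{code}
object SpeculativeOutcome {
  sealed trait TaskOutcome
  case object Failed extends TaskOutcome
  case object Killed extends TaskOutcome

  // A commit-denied speculative copy whose sibling already committed should be
  // reported as killed rather than counted as a genuine task failure.
  def classify(commitDenied: Boolean, otherCopyCommitted: Boolean): TaskOutcome =
    if (commitDenied && otherCopyCommitted) Killed else Failed
}
{code}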


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-08-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-20713:
-

Assignee: Nuochen Lyu

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Nuochen Lyu
> Fix For: 2.3.0
>
>
> When running speculative tasks you can end up getting a task failure on a 
> speculative task (the other task succeeded) because that task got a 
> CommitDenied exception when really it was "killed" by the driver. It is a 
> race between when the driver kills and when the executor tries to commit.
> I think ideally we should fix up the task state on this to be killed, because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse the user and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks. Only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-08-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-20713.
---
   Resolution: Fixed
 Assignee: Nuochen Lyu
Fix Version/s: 2.3.0

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Nuochen Lyu
> Fix For: 2.3.0
>
>
> When running speculative tasks you can end up getting a task failure on a 
> speculative task (the other task succeeded) because that task got a 
> CommitDenied exception when really it was "killed" by the driver. It is a 
> race between when the driver kills and when the executor tries to commit.
> I think ideally we should fix up the task state on this to be killed, because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse the user and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks. Only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21453) Streaming kafka source (structured spark)

2017-08-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113224#comment-16113224
 ] 

Shixiong Zhu commented on SPARK-21453:
--

Reopened this one. There might be some bug in caching Kafka consumers.

[~ppanero] could you provide the logs, please?

> Streaming kafka source (structured spark)
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am 
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
> .option("kafka.connections.max.idle.ms", 360)
> //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.mechanism", "GSSAPI")
> .option("kafka.sasl.kerberos.service.name", "kafka")
> .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
> .option("kafka.ssl.truststore.password", "changeit")
> .start()
> {code}
> {code:java}
> 17/07/18 10:11:58 WARN SslTransportLayer: Failed to send SSL Close message 
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>   at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731)
>   at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54)
>   at org.apache.kafka.common.network.Selector.doClose(Selector.java:540)
>   at org.apache.kafka.common.network.Selector.close(Selector.java:531)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>   at 
> 

[jira] [Reopened] (SPARK-21453) Streaming kafka source (structured spark)

2017-08-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-21453:
--

> Streaming kafka source (structured spark)
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am 
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
> .option("kafka.connections.max.idle.ms", 360)
> //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.mechanism", "GSSAPI")
> .option("kafka.sasl.kerberos.service.name", "kafka")
> .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
> .option("kafka.ssl.truststore.password", "changeit")
> .start()
> {code}
> {code:java}
> 17/07/18 10:11:58 WARN SslTransportLayer: Failed to send SSL Close message 
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>   at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731)
>   at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54)
>   at org.apache.kafka.common.network.Selector.doClose(Selector.java:540)
>   at org.apache.kafka.common.network.Selector.close(Selector.java:531)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
>   at 

[jira] [Commented] (SPARK-21453) Streaming kafka source (structured spark)

2017-08-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113220#comment-16113220
 ] 

Shixiong Zhu commented on SPARK-21453:
--

[~ppanero] could you create a new ticket for the Kafka producer issue?

> Streaming kafka source (structured spark)
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am 
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
> .option("kafka.connections.max.idle.ms", 360)
> //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.mechanism", "GSSAPI")
> .option("kafka.sasl.kerberos.service.name", "kafka")
> .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
> .option("kafka.ssl.truststore.password", "changeit")
> .start()
> {code}
> {code:java}
> 17/07/18 10:11:58 WARN SslTransportLayer: Failed to send SSL Close message 
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>   at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731)
>   at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54)
>   at org.apache.kafka.common.network.Selector.doClose(Selector.java:540)
>   at org.apache.kafka.common.network.Selector.close(Selector.java:531)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
>   at 
> 

[jira] [Commented] (SPARK-21453) Streaming kafka source (structured spark)

2017-08-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113218#comment-16113218
 ] 

Shixiong Zhu commented on SPARK-21453:
--

I'm aware of the Kafka producer issue. Right now a workaround is increasing 
"spark.kafka.producer.cache.timeout" to a large enough value to avoid Spark 
closing an in-use Kafka producer.

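An editor's sketch of how that workaround could be applied (assuming the config 
is read from the Spark conf and accepts a duration string; the app name and the 
30-minute value are placeholders, not recommendations):

{code}
import org.apache.spark.sql.SparkSession

// Keep cached Kafka producers alive longer so Spark does not close a producer
// that a running task is still using.
val spark = SparkSession.builder()
  .appName("kafka-sink-job")  // hypothetical app name
  .config("spark.kafka.producer.cache.timeout", "30m")
  .getOrCreate()
{code}

The same setting can also be passed at submit time with 
--conf spark.kafka.producer.cache.timeout=30m.
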
> Streaming kafka source (structured spark)
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am 
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
> .option("kafka.connections.max.idle.ms", 360)
> //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.mechanism", "GSSAPI")
> .option("kafka.sasl.kerberos.service.name", "kafka")
> .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
> .option("kafka.ssl.truststore.password", "changeit")
> .start()
> {code}
> {code:java}
> 17/07/18 10:11:58 WARN SslTransportLayer: Failed to send SSL Close message 
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195)
>   at 
> org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163)
>   at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731)
>   at 
> org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54)
>   at org.apache.kafka.common.network.Selector.doClose(Selector.java:540)
>   at org.apache.kafka.common.network.Selector.close(Selector.java:531)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>   at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>   at 
> 

[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2017-08-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113145#comment-16113145
 ] 

Felix Cheung commented on SPARK-21367:
--

still seeing it

Warning messages:
1: In check_dep_version(pkg, version, compare) :
  Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
2: In check_dep_version(pkg, version, compare) :
  Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
* installing *source* package 'SparkR' ...

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80213/console

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
> Attachments: R.paks
>
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread John Zhuge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113133#comment-16113133
 ] 

John Zhuge commented on SPARK-21618:


We have not backported HADOOP-14383 to CDH5.

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> This is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18838) High latency of event processing for large jobs

2017-08-03 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113067#comment-16113067
 ] 

Miles Crawford edited comment on SPARK-18838 at 8/3/17 5:13 PM:


We have disabled our eventlog listener, which is unfortunate, but seemed to 
help a lot. Nevertheless, we still get dropped events, which causes the UI to 
screw up, jobs to hang, and so forth.

Can we do anything to identify which listener is backing up? 

Are there any workarounds for this issue? 


was (Author: milesc):
We have disabled our eventlog listener, which is unfortunate, but seemed to 
help alot. Nevertheless, we still get dropped events, which causes the UI to 
screw up, jobs to hang, and so forth.

Can we do anything to identify which listener is backing up? 

Are there any workarounds for this issue? 

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on the `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job.  For example, a significant delay in 
> receiving the `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not 
> idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> so that memory does not grow indefinitely. The downside of this approach is 
> that a separate event queue per listener will increase the driver memory 
> footprint.

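An editor's sketch of the per-listener dispatch described above (illustrative 
only; the Listener trait and all names are hypothetical stand-ins, not Spark's 
actual LiveListenerBus):

{code}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

trait Listener { def onEvent(event: Any): Unit }

// One bounded, single-threaded executor per listener: events are processed in
// order per listener, and a slow listener only fills (and drops from) its own
// queue instead of delaying every other listener.
class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {
  private val executors = listeners.map { l =>
    l -> new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy())  // drop only this listener's events when full
  }.toMap

  def post(event: Any): Unit = executors.foreach { case (listener, exec) =>
    exec.execute(new Runnable { def run(): Unit = listener.onEvent(event) })
  }

  def stop(): Unit = executors.values.foreach(_.shutdown())
}
{code}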


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17669) Strange behavior using Datasets

2017-08-03 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113083#comment-16113083
 ] 

Miles Crawford commented on SPARK-17669:


This UI behavior is caused by SPARK-18838 - events are being dropped so the UI 
cannot show accurate status.

> Strange behavior using Datasets
> ---
>
> Key: SPARK-17669
> URL: https://issues.apache.org/jira/browse/SPARK-17669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Miles Crawford
>
> I recently migrated my application to Spark 2.0, and everything worked well, 
> except for one function that uses "toDS" and the ML libraries.
> This stage used to complete in 15 minutes or so on 1.6.2, and now takes 
> almost two hours.
> The UI shows very strange behavior - completed stages still being worked on, 
> concurrent work on tons of stages, including ones from downstream jobs:
> https://dl.dropboxusercontent.com/u/231152/spark.png
> The only source change I made was changing "toDF" to "toDS()" before handing 
> my RDDs to the ML libraries.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-08-03 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113067#comment-16113067
 ] 

Miles Crawford commented on SPARK-18838:


We have disabled our eventlog listener, which is unfortunate, but seemed to 
help a lot. Nevertheless, we still get dropped events, which causes the UI to 
screw up, jobs to hang, and so forth.

Can we do anything to identify which listener is backing up? 


> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on the `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job.  For example, a significant delay in 
> receiving the `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not 
> idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> so that memory does not grow indefinitely. The downside of this approach is 
> that a separate event queue per listener will increase the driver memory 
> footprint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18838) High latency of event processing for large jobs

2017-08-03 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113067#comment-16113067
 ] 

Miles Crawford edited comment on SPARK-18838 at 8/3/17 4:51 PM:


We have disabled our eventlog listener, which is unfortunate, but seemed to 
help a lot. Nevertheless, we still get dropped events, which causes the UI to 
screw up, jobs to hang, and so forth.

Can we do anything to identify which listener is backing up? 

Are there any workarounds for this issue? 


was (Author: milesc):
We have disabled our eventlog listener, which is unfortunate, but seemed to 
help alot. Nevertheless, we still get dropped events, which causes the UI to 
screw up, jobs to hang, and so forth.

Can we do anything to identify which listener is backing up? 


> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on the `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job.  For example, a significant delay in 
> receiving the `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not 
> idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> so that memory does not grow indefinitely. The downside of this approach is 
> that a separate event queue per listener will increase the driver memory 
> footprint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21599) Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException

2017-08-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21599.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Collecting column statistics for datasource tables may fail with 
> java.util.NoSuchElementException
> -
>
> Key: SPARK-21599
> URL: https://issues.apache.org/jira/browse/SPARK-21599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
> Fix For: 2.3.0
>
>
> Collecting column-level statistics for non-compatible Hive tables using 
> {code}
> ANALYZE TABLE  FOR COLUMNS 
> {code}
> may fail with the following exception.
> {code}
> key not found: a
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:657)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:656)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:656)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21598) Collect usability/events information from Spark History Server

2017-08-03 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113040#comment-16113040
 ] 

Eric Vandenberg commented on SPARK-21598:
-

[~steve_l]  Do you have any input / thoughts here?  The goal is to collect 
more information than is available in typical metrics.  I would like to 
directly correlate the replay times with other replay activity attributes, like 
job size and user impact (i.e., was the user waiting for a response in real 
time?).  This is about usability more than operations: this information would 
make it easier to target and measure specific improvements to the Spark History 
Server user experience.  We often have internal users who complain about 
History Server performance, and we need a way to directly reference and 
understand their experience, since the History Server is critical for our 
internal debugging.  If there's a way to capture this information using metrics 
alone, I would like to learn more, but from my understanding they aren't 
designed to capture this level of information.

> Collect usability/events information from Spark History Server
> --
>
> Key: SPARK-21598
> URL: https://issues.apache.org/jira/browse/SPARK-21598
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.2
>Reporter: Eric Vandenberg
>Priority: Minor
>
> The Spark History Server doesn't currently have a way to collect 
> usability/performance data on its main activity, the loading/replay of 
> history files.  We'd like to collect this information to monitor, target, and 
> measure improvements in the Spark debugging experience (via History Server 
> usage).  Once available, these usability events could be analyzed using other 
> analytics tools.
> The event info to collect:
> SparkHistoryReplayEvent(
> logPath: String,
> logCompressionType: String,
> logReplayException: String // if an error
> logReplayAction: String // user replay, vs checkForLogs replay
> logCompleteFlag: Boolean,
> logFileSize: Long,
> logFileSizeUncompressed: Long,
> logLastModifiedTimestamp: Long,
> logCreationTimestamp: Long,
> logJobId: Long,
> logNumEvents: Int,
> logNumStages: Int,
> logNumTasks: Int
> logReplayDurationMillis: Long
> )
> The main Spark engine has a SparkListenerInterface through which all compute 
> engine events are broadcast.  It probably doesn't make sense to reuse this 
> abstraction for broadcasting Spark History Server events since the "events" 
> are not related or compatible with one another.  Also note that the metrics 
> registry collects history caching metrics but doesn't provide the type of 
> information described above.
> The proposal here would be to add some basic event listener infrastructure to 
> capture History Server activity events.  This would work similarly to how the 
> SparkListener infrastructure works.  It could be configured in a similar 
> manner, e.g. spark.history.listeners=MyHistoryListenerClass.
> Open to feedback / suggestions / comments on the approach or alternatives.
> cc: [~vanzin] [~cloud_fan] [~ajbozarth] [~jiangxb1987]

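An editor's sketch of the proposed event and listener shapes (the field names 
come from the description above; the trait, class names, and registration 
mechanism are part of the proposal, not an existing Spark API):

{code}
case class SparkHistoryReplayEvent(
    logPath: String,
    logCompressionType: String,
    logReplayException: String,   // set only if replay failed
    logReplayAction: String,      // user replay vs. checkForLogs replay
    logCompleteFlag: Boolean,
    logFileSize: Long,
    logFileSizeUncompressed: Long,
    logLastModifiedTimestamp: Long,
    logCreationTimestamp: Long,
    logJobId: Long,
    logNumEvents: Int,
    logNumStages: Int,
    logNumTasks: Int,
    logReplayDurationMillis: Long)

// A listener registered via something like
// spark.history.listeners=MyHistoryListenerClass would receive each replay
// event and could forward it to an external analytics sink.
trait SparkHistoryListener {
  def onReplayEvent(event: SparkHistoryReplayEvent): Unit
}

class LoggingHistoryListener extends SparkHistoryListener {
  override def onReplayEvent(event: SparkHistoryReplayEvent): Unit =
    println(s"replayed ${event.logPath} in ${event.logReplayDurationMillis} ms")
}
{code}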


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112974#comment-16112974
 ] 

Sean Owen commented on SPARK-21143:
---

[~baz33] please see the JIRA and make the change if you want to see it done.

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112973#comment-16112973
 ] 

Sean Owen commented on SPARK-19552:
---

You will still have to make Spark work with 4.1.x even if it's shaded, but 
you're welcome to do that. I think the linked PR above did that, and may still 
accomplish the necessary changes. We'd have to figure out whether it breaks any 
user code too. But yeah shading is probably the way to go, as with jetty.

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other big fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21628) Explicitly specify Java version in maven compiler plugin so IntelliJ imports project correctly

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21628.
---
Resolution: Duplicate

> Explicitly specify Java version in maven compiler plugin so IntelliJ imports 
> project correctly
> --
>
> Key: SPARK-21628
> URL: https://issues.apache.org/jira/browse/SPARK-21628
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Andrew Ray
>Priority: Minor
>
> see 
> https://stackoverflow.com/questions/27037657/stop-intellij-idea-to-switch-java-language-level-every-time-the-pom-is-reloaded



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21627) analyze hive table compute stats for columns with mixed case exception

2017-08-03 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21627:

Summary: analyze hive table compute stats for columns with mixed case 
exception  (was: hive compute stats for columns exception with column name 
camel case)

> analyze hive table compute stats for columns with mixed case exception
> --
>
> Key: SPARK-21627
> URL: https://issues.apache.org/jira/browse/SPARK-21627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> {code}
> sql("create table tabel1(b int) partitioned by (partColumn int)")
> sql("analyze table tabel1 compute statistics for columns partColumn, b")
> {code}
> {code}
> java.util.NoSuchElementException: key not found: partColumn
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
>   ... 39 elided
> {code}
> Looks like a regression introduced by https://github.com/apache/spark/pull/18248
> In {{HiveExternalCatalog.alterTableStats}}, {{colNameTypeMap}} contains 
> lower-case column names.

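An editor's sketch of the kind of case-insensitive column lookup the fix needs 
(illustrative only, not the actual Spark patch; the map contents mirror the 
repro above):

{code}
// The described failure is a raw map lookup with the user's mixed-case name,
// while the catalog map holds lower-cased keys.
val colNameTypeMap: Map[String, String] = Map("b" -> "int", "partcolumn" -> "int")

def lookupColumnType(requested: String): Option[String] =
  colNameTypeMap.collectFirst {
    case (name, dataType) if name.equalsIgnoreCase(requested) => dataType
  }

lookupColumnType("partColumn")  // Some("int") instead of NoSuchElementException
{code}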


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21628) Explicitly specify Java version in maven compiler plugin so IntelliJ imports project correctly

2017-08-03 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21628:
--

 Summary: Explicitly specify Java version in maven compiler plugin 
so IntelliJ imports project correctly
 Key: SPARK-21628
 URL: https://issues.apache.org/jira/browse/SPARK-21628
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.0
Reporter: Andrew Ray
Priority: Minor


see 
https://stackoverflow.com/questions/27037657/stop-intellij-idea-to-switch-java-language-level-every-time-the-pom-is-reloaded



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-08-03 Thread BDeus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112879#comment-16112879
 ] 

BDeus commented on SPARK-19552:
---

I have the same problem with gRPC too. If we don't want to upgrade to 4.1.x, can 
we at least discuss the possibility of shading it?

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other big fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-08-03 Thread Basile Deustua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112873#comment-16112873
 ] 

Basile Deustua commented on SPARK-21143:


I have the exact same issue with io.grpc, which heavily uses Netty 4.1.x.
It's very disappointing that the Spark community won't upgrade the Netty version, or 
at least shade the 4.0.x classes in the jar, to leave us the choice of the version we 
want to use.
Being constrained to the 4.0.x version by the Spark dependency is a bit 
frustrating.
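
In the meantime, one workaround on the application side is to shade and relocate the 
application's own Netty at build time. A minimal sbt sketch, assuming the sbt-assembly 
plugin (the relocated package name is illustrative only, and this is not part of Spark's 
own build):
{code}
// build.sbt fragment: relocate the application's Netty 4.1.x classes so they
// cannot collide with the Netty 4.0.x that Spark ships on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("io.netty.**" -> "shaded.io.netty.@1").inAll
)
{code}
This only relocates the application's copy of Netty; it does not change what Spark 
itself loads.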

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21570) File __spark_libs__XXX.zip does not exist on networked file system w/ yarn

2017-08-03 Thread Albert Chu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112868#comment-16112868
 ] 

Albert Chu commented on SPARK-21570:


There's no scheme.  Just using "file://" to treat it like a local file system.

> File __spark_libs__XXX.zip does not exist on networked file system w/ yarn
> --
>
> Key: SPARK-21570
> URL: https://issues.apache.org/jira/browse/SPARK-21570
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Albert Chu
>
> I have a set of scripts that run Spark with data in a networked file system.  
> One of my unit tests to make sure things don't break between Spark releases 
> is to simply run a word count (via org.apache.spark.examples.JavaWordCount) 
> on a file in the networked file system.  This test broke with Spark 2.2.0 
> when I use yarn to launch the job (using the spark standalone scheduler 
> things still work).  I'm currently using Hadoop 2.7.0.  I get the following 
> error:
> {noformat}
> Diagnostics: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> While debugging, I sat and watched the directory and did see that 
> /p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does show up at some point.
> Wondering if it's possible something racy was introduced.  Nothing in the 
> Spark 2.2.0 release notes suggests any type of configuration change that 
> needs to be done.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21627:
--
Affects Version/s: (was: 3.0.0)
   2.3.0

master = 2.3.0 right now

> hive compute stats for columns exception with column name camel case
> 
>
> Key: SPARK-21627
> URL: https://issues.apache.org/jira/browse/SPARK-21627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> {code}
> sql("create table tabel1(b int) partitioned by (partColumn int)")
> sql("analyze table tabel1 compute statistics for columns partColumn, b")
> {code}
> {code}
> java.util.NoSuchElementException: key not found: partColumn
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
>   ... 39 elided
> {code}
> Looks like regression introduced by https://github.com/apache/spark/pull/18248
> In {{HiveExternalCatalog.alterTableStats}} {{colNameTypeMap}} contains lower 
> case column names.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112851#comment-16112851
 ] 

Bogdan Raducanu commented on SPARK-21627:
-

I expect it fails only in the master branch. That's why it's 3.0.

> hive compute stats for columns exception with column name camel case
> 
>
> Key: SPARK-21627
> URL: https://issues.apache.org/jira/browse/SPARK-21627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bogdan Raducanu
>
> {code}
> sql("create table tabel1(b int) partitioned by (partColumn int)")
> sql("analyze table tabel1 compute statistics for columns partColumn, b")
> {code}
> {code}
> java.util.NoSuchElementException: key not found: partColumn
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
>   ... 39 elided
> {code}
> Looks like regression introduced by https://github.com/apache/spark/pull/18248
> In {{HiveExternalCatalog.alterTableStats}} {{colNameTypeMap}} contains lower 
> case column names.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112821#comment-16112821
 ] 

Hyukjin Kwon commented on SPARK-21627:
--

Would you mind fixing {{Affects Version/s:}}? I guess we don't have Spark 3.0.0 
yet.

> hive compute stats for columns exception with column name camel case
> 
>
> Key: SPARK-21627
> URL: https://issues.apache.org/jira/browse/SPARK-21627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bogdan Raducanu
>
> {code}
> sql("create table tabel1(b int) partitioned by (partColumn int)")
> sql("analyze table tabel1 compute statistics for columns partColumn, b")
> {code}
> {code}
> java.util.NoSuchElementException: key not found: partColumn
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
>   at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
>   at 
> org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
>   ... 39 elided
> {code}
> Looks like regression introduced by https://github.com/apache/spark/pull/18248
> In {{HiveExternalCatalog.alterTableStats}} {{colNameTypeMap}} contains lower 
> case column names.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21602) Add map_keys and map_values functions to R

2017-08-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21602:


Assignee: Hyukjin Kwon

> Add map_keys and map_values functions to R
> --
>
> Key: SPARK-21602
> URL: https://issues.apache.org/jira/browse/SPARK-21602
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> We have {{map_keys}} and {{map_values}} functions in other language APIs.
> It would be nicer to have both in the R API too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21602) Add map_keys and map_values functions to R

2017-08-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21602.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18809
[https://github.com/apache/spark/pull/18809]
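
For reference, the same functionality has been reachable from Scala through the SQL 
function registry for a while. A minimal sketch, assuming a spark-shell session (so 
{{spark}} and {{import spark.implicits._}} are available):
{code}
import spark.implicits._

// map_keys / map_values are registered SQL functions, so selectExpr works
// even without dedicated helpers in the language API.
val df = Seq(Map("a" -> 1, "b" -> 2)).toDF("m")
df.selectExpr("map_keys(m)", "map_values(m)").show()
{code}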

> Add map_keys and map_values functions to R
> --
>
> Key: SPARK-21602
> URL: https://issues.apache.org/jira/browse/SPARK-21602
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> We have {{map_keys}} and {{map_values}} functions in other language APIs.
> It would be nicer to have both in the R API too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21627:

Description: 
{code}
sql("create table tabel1(b int) partitioned by (partColumn int)")
sql("analyze table tabel1 compute statistics for columns partColumn, b")
{code}
{code}
java.util.NoSuchElementException: key not found: partColumn
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
  at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
  at org.apache.spark.sql.Dataset.(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
  ... 39 elided
{code}

Looks like regression introduced by https://github.com/apache/spark/pull/18248

In {{HiveExternalCatalog.alterTableStats}} {{colNameTypeMap}} contains lower 
case column names.
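
A minimal sketch of the kind of fix this suggests (an assumption for illustration, not 
the actual patch; {{lookupType}} is a hypothetical helper):
{code}
import java.util.Locale

// The analyzed column names keep their original (camel) case, while the
// catalog's map is keyed by lower-cased names, so the lookup must normalize.
def lookupType(colNameTypeMap: Map[String, String], col: String): Option[String] =
  colNameTypeMap.get(col).orElse(colNameTypeMap.get(col.toLowerCase(Locale.ROOT)))
{code}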



  was:
{code}
sql("create table tabel1(b int) partitioned by (partColumn int)")
sql("analyze table tabel1 compute statistics for columns partColumn, b")
{code}
{code}
java.util.NoSuchElementException: key not found: partColumn
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
  at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
  at 

[jira] [Updated] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21627:

Description: 
{code}
sql("create table tabel1(b int) partitioned by (partColumn int)")
sql("analyze table tabel1 compute statistics for columns partColumn, b")
{code}
{code}
java.util.NoSuchElementException: key not found: partColumn
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
  at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
  at org.apache.spark.sql.Dataset.(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
  ... 39 elided
{code}

Looks like regression introduced by https://github.com/apache/spark/pull/18248

In {code}HiveExternalCatalog.alterTableStats{code} {code}colNameTypeMap{code} 
contains lower case column names.



  was:
{code}
sql("create table tabel1(b int) partitioned by (partColumn int)")
sql("analyze table tabel1 compute statistics for columns partColumn, b")
{code}
{code}
java.util.NoSuchElementException: key not found: partColumn
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
  at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
  at 

[jira] [Created] (SPARK-21627) hive compute stats for columns exception with column name camel case

2017-08-03 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-21627:
---

 Summary: hive compute stats for columns exception with column name 
camel case
 Key: SPARK-21627
 URL: https://issues.apache.org/jira/browse/SPARK-21627
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bogdan Raducanu


{code}
sql("create table tabel1(b int) partitioned by (partColumn int)")
sql("analyze table tabel1 compute statistics for columns partColumn, b")
{code}
{code}
java.util.NoSuchElementException: key not found: partColumn
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:648)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:647)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:647)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
  at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$$anonfun$47.apply(Dataset.scala:3036)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3035)
  at org.apache.spark.sql.Dataset.(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
  ... 39 elided
{code}

Looks like regression introduced by https://github.com/apache/spark/pull/18248
in {code}HiveExternalCatalog.alterTable{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112623#comment-16112623
 ] 

Nick Pentreath commented on SPARK-21086:


I just want to understand _why_ folks want to keep all the models. Is it 
actually the models (and model data) they want, or a way (well, an easier 
"official API" way) to link the param permutations with the cross-val scores, to 
see which param combinations result in which scores? (In which case, 
https://issues.apache.org/jira/browse/SPARK-18704 is actually the solution.)
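
For context, pairing each parameter combination with its cross-validation metric is 
already possible without retaining the fitted sub-models. A minimal Scala sketch, 
assuming a fitted {{CrossValidatorModel}}:
{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidatorModel

// Each entry in avgMetrics lines up with the corresponding ParamMap that was tried.
def paramScores(cvModel: CrossValidatorModel): Array[(ParamMap, Double)] =
  cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)
{code}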

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112602#comment-16112602
 ] 

Sean Owen commented on SPARK-20922:
---

If you'd email a suggested CVE description to priv...@spark.apache.org, we can 
go through the motions of reporting it as one. The ASF process is: 
https://www.apache.org/security/ https://www.apache.org/security/projects.html

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-08-03 Thread Aditya Sharad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112600#comment-16112600
 ] 

Aditya Sharad commented on SPARK-20922:
---

Apologies for the delay in getting back to you. I believe we first got in touch 
privately to report this, but in future we'll discuss the details and fix on 
private@ first if that fits better into your workflow.

The scope is indeed limited to attacks from local users and the issue is now 
publicly disclosed. However, I would argue neither of these points disqualifies 
the vulnerability reported here for the purposes of getting a CVE assigned.

Depending on the configuration and the intentions of an attacker, the 
repercussions of this vulnerability are potentially extremely severe despite 
the limited scope:
- The worst case is obviously when Spark runs as an administrative user.
- In the more common case where Spark runs under a user account that is also 
responsible for other services (like Hadoop, HDFS), the repercussions can be 
very severe. This is the case in the default Cloudera setup, for example. In 
that particular scenario, an attacker can cause a widespread outage by simply 
wiping all data that belongs to the 'hdfs' user. The repercussions reach far 
beyond Spark itself.
- In the 'best' case, Spark is set up to use a dedicated user account. Here 
we're looking at a DoS to Spark specifically, with a severe risk for data loss. 
An attacker can stop the service and wipe all of Spark's data.

We have seen significantly less severe vulnerabilities for which a CVE is 
assigned. The prime reasons for doing so are to advise users and to maintain a 
visible record of the issue that isn't project-specific, which I think would be 
appropriate in this case.

Please let me know if there's anything I can help with. I am willing to file 
separately for the CVE if that is easier, but I do not wish to do so without 
first having your agreement and finding out if Spark has a preferred CVE route. 
If you'd like to discuss this further off-list, please feel free to contact me 
on adi...@semmle.com.

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21625) sqrt(negative number) should be null

2017-08-03 Thread panbingkun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

panbingkun updated SPARK-21625:
---
Comment: was deleted

(was: case class Sqrt(child: Expression) extends UnaryMathExpression(math.sqrt, "SQRT") {
  protected override def nullSafeEval(input: Any): Any = {
    if (input.asInstanceOf[Double] < 0) {
      null
    } else {
      f(input.asInstanceOf[Double])
    }
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    nullSafeCodeGen(ctx, ev, c => {
      s"""
        if ($c < 0) {
          ${ev.isNull} = true;
        } else {
          ${ev.value} = java.lang.Math.sqrt($c);
        }
      """
    })
  }
})

> sqrt(negative number) should be null
> 
>
> Key: SPARK-21625
> URL: https://issues.apache.org/jira/browse/SPARK-21625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> Both Hive and MySQL return null:
> {code:sql}
> hive> select SQRT(-10.0);
> OK
> NULL
> Time taken: 0.384 seconds, Fetched: 1 row(s)
> {code}
> {code:sql}
> mysql> select sqrt(-10.0);
> +---+
> | sqrt(-10.0) |
> +---+
> |  NULL |
> +---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21625) sqrt(negative number) should be null

2017-08-03 Thread panbingkun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112571#comment-16112571
 ] 

panbingkun commented on SPARK-21625:


case class Sqrt(child: Expression) extends UnaryMathExpression(math.sqrt, "SQRT") {
  // Return null instead of NaN when the input is negative.
  protected override def nullSafeEval(input: Any): Any = {
    if (input.asInstanceOf[Double] < 0) {
      null
    } else {
      f(input.asInstanceOf[Double])
    }
  }

  // Generated code mirrors nullSafeEval: mark the result null for negative inputs.
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    nullSafeCodeGen(ctx, ev, c => {
      s"""
        if ($c < 0) {
          ${ev.isNull} = true;
        } else {
          ${ev.value} = java.lang.Math.sqrt($c);
        }
      """
    })
  }
}
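
A quick way to see the proposed behavior, assuming a spark-shell session (today 
{{sqrt(-10.0)}} evaluates to NaN; with a change along these lines it would be NULL, 
matching Hive and MySQL):
{code}
// Before the change this prints NaN; afterwards it would print null.
spark.sql("SELECT sqrt(-10.0)").show()
{code}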

> sqrt(negative number) should be null
> 
>
> Key: SPARK-21625
> URL: https://issues.apache.org/jira/browse/SPARK-21625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> Both Hive and MySQL return null:
> {code:sql}
> hive> select SQRT(-10.0);
> OK
> NULL
> Time taken: 0.384 seconds, Fetched: 1 row(s)
> {code}
> {code:sql}
> mysql> select sqrt(-10.0);
> +---+
> | sqrt(-10.0) |
> +---+
> |  NULL |
> +---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21605) Let IntelliJ IDEA correctly detect Language level and Target byte code version

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21605:
-

Assignee: Chang chen

> Let IntelliJ IDEA correctly detect Language level and Target byte code version
> --
>
> Key: SPARK-21605
> URL: https://issues.apache.org/jira/browse/SPARK-21605
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Chang chen
>Assignee: Chang chen
>Priority: Minor
>  Labels: IDE, maven
> Fix For: 2.3.0
>
>
> With SPARK-21592, removing the source and target properties from 
> maven-compiler-plugin lets IntelliJ IDEA fall back to its default Language level 
> and Target bytecode version, which are 1.4.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21605) Let IntelliJ IDEA correctly detect Language level and Target byte code version

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21605.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18808
[https://github.com/apache/spark/pull/18808]

> Let IntelliJ IDEA correctly detect Language level and Target byte code version
> --
>
> Key: SPARK-21605
> URL: https://issues.apache.org/jira/browse/SPARK-21605
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Chang chen
>Priority: Minor
>  Labels: IDE, maven
> Fix For: 2.3.0
>
>
> With SPARK-21592, removing the source and target properties from 
> maven-compiler-plugin lets IntelliJ IDEA fall back to its default Language level 
> and Target bytecode version, which are 1.4.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Gu Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112564#comment-16112564
 ] 

Gu Chao commented on SPARK-21626:
-

[~srowen] This solves the problem for me, but I do not know why:
{code:shell}
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
{code}


> "WARN NativeCodeLoader: Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable"
> --
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> 17/08/03 18:24:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Gu Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112564#comment-16112564
 ] 

Gu Chao edited comment on SPARK-21626 at 8/3/17 10:57 AM:
--

[~srowen] This solves the problem for me, but I do not know why:
{code:none}
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
{code}



was (Author: gu chao):
[~srowen] I can solve this problem, but I do not know why.
{code:shell}
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
{code}


> "WARN NativeCodeLoader: Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable"
> --
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> 17/08/03 18:24:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21626.
---
Resolution: Not A Problem

Not a problem, not even specific to Spark. It means what it says, and is not an 
error. Search the internet.

> "WARN NativeCodeLoader: Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable"
> --
>
> Key: SPARK-21626
> URL: https://issues.apache.org/jira/browse/SPARK-21626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Gu Chao
>
> After starting spark-shell, it outputs:
> 17/08/03 18:24:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21626) "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"

2017-08-03 Thread Gu Chao (JIRA)
Gu Chao created SPARK-21626:
---

 Summary: "WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable"
 Key: SPARK-21626
 URL: https://issues.apache.org/jira/browse/SPARK-21626
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.2.0
Reporter: Gu Chao


After starting spark-shell, it outputs:
17/08/03 18:24:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112529#comment-16112529
 ] 

Steve Loughran edited comment on SPARK-21618 at 8/3/17 10:09 AM:
-

If you're relying on hadoop-common to provide the FS connection, no, not yet, 
and it's not something I'm in a rush to backport, given its unexpected 
consequences. Once I'm happy, it could go into 2.8.x, but I think it'd need more 
explicit Spark tests for that: something to bring up Jetty and serve over 
HTTPS, perhaps.

Actually, maybe a test could just use a JAR off Maven Central... the JAR classes 
don't actually need to be executed, and for security reasons you wouldn't (the 
artifact wouldn't have its checksums/signatures verified, after all).
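
Until then, one local workaround (a sketch under the assumption that pre-fetching the 
jar is acceptable; {{fetchJar}} is a hypothetical helper, not a Spark API) is to 
download the jar yourself and hand spark-submit a file: path:
{code}
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Download the application jar over http(s) ourselves, then submit the local copy.
def fetchJar(uri: String, dest: String): String = {
  val in = new URL(uri).openStream()
  try Files.copy(in, Paths.get(dest), StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
  dest
}
{code}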


was (Author: ste...@apache.org):
If you're relying on hadoop-common to provide the FS connection, no, not yet

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}
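The failure in the trace above comes from spark-submit resolving the jar URI through 
Hadoop's FileSystem API, which on the Hadoop 2.6/2.7 builds named above has no 
handler registered for the https scheme. A minimal sketch of that resolution step on 
its own (not the actual SparkSubmit code, just the same FileSystem.get call the 
trace passes through):

{noformat}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object HttpsSchemeRepro {
  def main(args: Array[String]): Unit = {
    // Resolving an https URI through Hadoop's FileSystem. On Hadoop 2.6/2.7
    // this throws "java.io.IOException: No FileSystem for scheme: https".
    val uri = new URI("https://test.com/path/to/jar.jar")
    val fs = FileSystem.get(uri, new Configuration())
    println(fs.getClass.getName)
  }
}
{noformat}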



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21611) Error class name for log in several classes.

2017-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21611.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18816
[https://github.com/apache/spark/pull/18816]

> Error class name for log in several classes.
> 
>
> Key: SPARK-21611
> URL: https://issues.apache.org/jira/browse/SPARK-21611
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The wrong class name is used in log output in several classes. For example:
> 2017-08-02 16:43:37,695 INFO CompositeService: Operation log root directory 
> is created: /tmp/mr/operation_logs
> The message "Operation log root directory is created" is actually logged from 
> SessionManager.java.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112529#comment-16112529
 ] 

Steve Loughran commented on SPARK-21618:


If you're relying on hadoop-common to provide the FS connection, no, not yet

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112524#comment-16112524
 ] 

Sean Owen commented on SPARK-21618:
---

I see, so this may really not work in general. At least we'd update the Spark 
docs then.

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21618) http(s) not accepted in spark-submit jar uri

2017-08-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112520#comment-16112520
 ] 

Steve Loughran commented on SPARK-21618:


BTW, we haven't backported HADOOP-14383 into HDP; I don't know about CDH (check 
with [~jzhuge]?), and I'm assuming EMR doesn't have it either, as S3 is their 
distribution mechanism.
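
One quick way to tell whether a given Hadoop build (HDP, CDH, EMR, or stock Apache) 
supports the https scheme is to ask hadoop-common to resolve a FileSystem class for 
it; this is the same lookup that fails in the stack trace quoted below. A small 
sketch, run against whatever hadoop-common is on the classpath:

{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object HttpsSchemeProbe {
  def main(args: Array[String]): Unit = {
    try {
      // Succeeds only on Hadoop builds that include HADOOP-14383.
      val cls = FileSystem.getFileSystemClass("https", new Configuration())
      println(s"https is handled by ${cls.getName}")
    } catch {
      case e: java.io.IOException =>
        println(s"No https support in this Hadoop build: ${e.getMessage}")
    }
  }
}
{noformat}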

> http(s) not accepted in spark-submit jar uri
> 
>
> Key: SPARK-21618
> URL: https://issues.apache.org/jira/browse/SPARK-21618
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.0
> Environment: pre-built for hadoop 2.6 and 2.7 on mac and ubuntu 
> 16.04. 
>Reporter: Ben Mayne
>Priority: Minor
>  Labels: documentation
>
> The documentation suggests I should be able to use an http(s) uri for a jar 
> in spark-submit, but I haven't been successful 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
> {noformat}
> benmayne@Benjamins-MacBook-Pro ~ $ spark-submit --deploy-mode client --master 
> local[2] --class class.name.Test https://test.com/path/to/jar.jar
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
> https
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at 
> org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:316)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:316)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> benmayne@Benjamins-MacBook-Pro ~ $
> {noformat}
> If I replace the path with a valid hdfs path 
> (hdfs:///user/benmayne/valid-jar.jar), it works as expected. I've seen the 
> same behavior across 2.2.0 (hadoop 2.6 & 2.7 on mac and ubuntu) and on 2.1.1 
> on ubuntu. 
> this is the example that I'm trying to replicate from 
> https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management:
>  
> > Spark uses the following URL scheme to allow different strategies for 
> > disseminating jars:
> > file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file 
> > server, and every executor pulls the file from the driver HTTP server.
> > hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as 
> > expected
> {noformat}
> # Run on a Mesos cluster in cluster deploy mode with supervise
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master mesos://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   http://path/to/examples.jar \
>   1000
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


