[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277716#comment-15277716 ] Hyukjin Kwon commented on SPARK-15245: -- (BTW, as you might already know, the reason I thought the message was wrong is that it should say {{path}}, not {{basePath}}, because the {{path}} option is what is exposed to users through the {{stream()}} API, e.g. {{option("path", path).stream()}}.) > stream API throws an exception with an incorrect message when the path is not > a directory > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. It would be > great if it had a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275238#comment-15275238 ] Weizhong edited comment on SPARK-14261 at 5/10/16 6:52 AM: --- I also face this issue. I found that each session adds one HiveConf on sun.misc.Launcher$AppClassLoader and one on org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1. None of these HiveConf instances can be released, which eventually leads to OOM. From the OOM dump (analyzed with Eclipse Memory Analyzer), all these HiveConf instances are still referenced; the GC root chain looks like below: {noformat} org.apache.hadoop.hive.conf.HiveConf conf org.apache.hadoop.hive.ql.session.SessionState$SessionStates value java.lang.ThreadLocal$ThreadLocalMap$Entry [19] java.lang.ThreadLocal$ThreadLocalMap$Entry[32] table java.lang.ThreadLocal$ThreadLocalMap threadLocals java.lang.Thread referent java.util.WeakHashMap$Entry conf org.apache.hadoop.hive.ql.session.SessionState state org.apache.spark.sql.hive.client.ClientWrapper metaHive, metadataHive, metaHive org.apache.spark.sql.hive.client.HiveContext $outer org.apache.spark.sql.SQLContext$$anon$4 [265] java.lang.Object[267] array java.util.concurrent.CopyOnWriteArrayList listeners org.apache.spark.scheduler.LiveListenerBus $outer org.apache.spark.util.AsynchronousListenerBus$$anon$1 {noformat} was (Author: sephiroth-lin): I also face this issue. I found that every session adds one metastore connection, and when I dumped the heap I found many Hive and HiveConf objects that are never removed > Memory leak in Spark Thrift Server > -- > > Key: SPARK-14261 > URL: https://issues.apache.org/jira/browse/SPARK-14261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xiaochun Liang > Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, > MemorySnapshot.PNG > > > I am running Spark Thrift server on Windows Server 2012. The Spark Thrift > server is launched in YARN client mode. Its memory usage increases gradually > as queries come in. I suspect there is a memory leak in the Spark Thrift > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
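For readers less familiar with this failure mode, the GC root chain above boils down to a ThreadLocal held by a long-lived thread. The following is a minimal, hypothetical Scala sketch of that retention pattern only (it is not Spark or Hive source; {{SessionState}} and the 10 MB buffer are stand-ins for HiveConf-sized state):

{code}
import java.util.concurrent.Executors

object ThreadLocalLeakSketch {
  final class SessionState(val conf: Array[Byte])       // stand-in for a HiveConf-sized object

  private val state = new ThreadLocal[SessionState]     // analogous to SessionState$SessionStates
  private val pool  = Executors.newFixedThreadPool(32)  // long-lived handler threads, never shut down

  def handleSession(): Unit = pool.execute(new Runnable {
    override def run(): Unit = {
      state.set(new SessionState(new Array[Byte](10 * 1024 * 1024)))
      // ... run queries ...
      // Without state.remove() here, the entry stays in the worker thread's
      // ThreadLocalMap, so the buffer stays reachable for as long as the thread lives.
    }
  })

  def main(args: Array[String]): Unit = (1 to 1000).foreach(_ => handleSession())
}
{code}

In the real Thrift Server the leak is compounded by per-session classloaders, but the sketch shows why java.lang.Thread shows up as the GC root keeping HiveConf alive.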
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277700#comment-15277700 ] Hyukjin Kwon commented on SPARK-15245: -- Thank you so much. Let me close my PR. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277697#comment-15277697 ] Sean Owen commented on SPARK-15245: --- The message seems correct. I agree it could possibly be checked earlier though, yes. I also agree it's not clear if it's worth a little extra code and extra calls to check it, since it's a rare failure mode and handled quickly and correctly anyway. It doesn't affect normal usage. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277696#comment-15277696 ] Hyukjin Kwon commented on SPARK-15245: -- Oh, the main reason is that I thought {{'basePath' must be a directory}} is not the correct message. Also, I realised this could be checked earlier, on the driver side. It seems the exception is not raised on the driver side, so the driver does not catch it; the code above returns 0 but prints the exception message in my local test. So I thought this could anyway be caught earlier, with a better message. But after thinking about it more, I started to worry that opening the given paths might add overhead... especially for S3. I am not sure about this one. I can close it if we think it is not appropriate. > stream API throws an exception with an incorrect message when the path is not > a directory > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. It would be > great if it had a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
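For illustration, here is a rough sketch of the kind of eager, driver-side validation being discussed, assuming the Hadoop FileSystem API; it is not the code from the linked PR, and the extra {{getFileStatus}} call is exactly the per-path overhead (e.g. against S3) mentioned above:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical eager check: fail fast on the driver with a 'path'-oriented message.
def assertIsDirectory(pathString: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathString)
  val fs: FileSystem = path.getFileSystem(hadoopConf)  // resolves local FS, HDFS, S3, ...
  val status = fs.getFileStatus(path)                  // one round trip per path
  require(status.isDirectory,
    s"Option 'path' must be a directory for streaming sources, but got: $pathString")
}
{code}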
[jira] [Resolved] (SPARK-15239) spark document conflict about mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15239. --- Resolution: Not A Problem If you take a look at the docs in the repo you can see this has been updated already. Mesos supports cluster mode. You can ask questions on the user list rather than here. > spark document conflict about mesos cluster > > > Key: SPARK-15239 > URL: https://issues.apache.org/jira/browse/SPARK-15239 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.1 >Reporter: JasonChang > > 1. http://spark.apache.org/docs/latest/submitting-applications.html > if your application is submitted from a machine far from the worker machines > (e.g. locally on your laptop), it is common to use cluster mode to minimize > network latency between the drivers and the executors. Note that cluster mode > is currently not supported for Mesos clusters. Currently only YARN supports > cluster mode for Python applications. > 2. http://spark.apache.org/docs/latest/running-on-mesos.html > Spark on Mesos also supports cluster mode, where the driver is launched in > the cluster and the client can find the results of the driver from the Mesos > Web UI. > I'm confused: does Mesos support cluster mode? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15228. --- Resolution: Not A Problem > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14623: - Description: It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 1 0, 1, 0, 0 0, 0, 1, 0 0, 1, 0, 0 1, 0 ,0, 0 Refer to http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html was: It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 1 0, 1, 0, 0 0, 0, 1, 0 0, 1, 0, 0 1, 0 ,0, 0 > add label binarizer > > > Key: SPARK-14623 > URL: https://issues.apache.org/jira/browse/SPARK-14623 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: hujiayin >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It relates to https://issues.apache.org/jira/browse/SPARK-7445 > Map the labels to 0/1. > For example, > Input: > "yellow,green,red,green,0" > The labels: "0, green, red, yellow" > Output: > 0, 0, 0, 1 > 0, 1, 0, 0 > 0, 0, 1, 0 > 0, 1, 0, 0 > 1, 0 ,0, 0 > Refer to > http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
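As a concrete illustration of the mapping described above (a plain Scala sketch only; this is not the proposed ML {{Transformer}} API and the function name is made up):

{code}
// Build the sorted label set and emit one 0/1 row per input label.
def binarizeLabels(input: Seq[String]): (Seq[String], Seq[Array[Int]]) = {
  val labels = input.distinct.sorted             // e.g. Seq("0", "green", "red", "yellow")
  val index  = labels.zipWithIndex.toMap
  val rows = input.map { label =>
    val row = Array.fill(labels.size)(0)
    row(index(label)) = 1                        // set the column for this label
    row
  }
  (labels, rows)
}

// binarizeLabels(Seq("yellow", "green", "red", "green", "0")) reproduces the example above:
// labels = (0, green, red, yellow); rows = 0001, 0100, 0010, 0100, 1000
{code}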
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277690#comment-15277690 ] Sean Owen commented on SPARK-15245: --- Hm, what's the issue here? it says the arg was not a directory, which is the problem. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277688#comment-15277688 ] Sandeep Singh edited comment on SPARK-15228 at 5/10/16 6:21 AM: Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After (behaviour on current master) !http://i.imgur.com/jSpIySj.png! was (Author: techaddict): Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After !http://i.imgur.com/jSpIySj.png! > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277688#comment-15277688 ] Sandeep Singh commented on SPARK-15228: --- Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After !http://i.imgur.com/jSpIySj.png! > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14162) java.lang.IllegalStateException: Did not find registered driver with class oracle.jdbc.OracleDriver
[ https://issues.apache.org/jira/browse/SPARK-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277682#comment-15277682 ] Sun Rui commented on SPARK-14162: - We met the same error. The cause is that in one worker node, mysql JDBC driver is not in the CLASSPATH. [~mchalek] It seems this is not a bug. It seems that in your case, for some reason, in one worker node, ojdbc6 driver is not automatically loaded and registered. > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > --- > > Key: SPARK-14162 > URL: https://issues.apache.org/jira/browse/SPARK-14162 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Zoltan Fedor > > This is an interesting one. > We are using JupyterHub with Python to connect to a Hadoop cluster to run > Spark jobs and as the new Spark versions come out I compile them and add as > new kernels to JupyterHub to be used. > There are also some libraries we are using, like ojdbc to connect to an > Oracle database. > Now the interesting thing, that ojdbc worked fine in Spark 1.6.0 but suddenly > "it cannot be found" in 1.6.1. > Everything, all settings are the same when starting pyspark 1.6.1 and 1.6.0, > so there is no reason for it not to work in 1.6.1 if it works in 1.6.0. > This is the pysparjk code I am running in both 1.6.1 and 1.6.0: > {quote} > df = > sqlContext.read.format('jdbc').options(url='jdbc:oracle:thin:'+connection_script+'', > dbtable='bi.contact').load() > print(df.count()){quote} > And it throws this error in 1.6.1 only: > {quote} > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:57) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:347) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){quote} > I know that this usually means that the 
ojdbc driver is not available on the > executor, but it is. Spark is being started the exact same way in 1.6.1 as in > 1.6.0 and it does find it on 1.6.0. > I can steadily reproduce this, so the only conclusion that something must > have changed between 1.6.0 and 1.6.1 causing this, but I have see no > "depreciation" notice of anything what could cause this. > Environment variables set when starting pyspark 1.6.1: > {quote} > "SPARK_HOME": "/usr/lib/spark-1.6.1-hive", > "SCALA_HOME": "/usr/lib/scala", > "HADOOP_CONF_DIR": "/etc/hadoop/venus-hadoop-conf", > "HADOOP_HOME": "/usr/bin/hadoop", > "HIVE_HOME": "/usr/bin/hive", > "LD_LIBRARY_PATH": "/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH", > "YARN_HOME": "", > "SPARK_DIST_CLASSPATH": > "/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*", > "SPARK_LIBRARY_PATH": "/usr/lib/hadoop/lib", > "PAT
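A common mitigation for this class of error (a sketch under the assumption that the driver class is simply not loaded and registered on some executors, not a confirmed root cause for this report) is to name the driver class explicitly through the JDBC {{driver}} option, and to make sure the jar reaches executors, e.g. via {{--jars}} or {{spark.executor.extraClassPath}}:

{code}
// Forces each executor to load and register the named driver before opening connections.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // illustrative URL only
  .option("dbtable", "bi.contact")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

println(df.count())
{code}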
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277681#comment-15277681 ] Sean Owen commented on SPARK-15228: --- Still not clear what that means. Can you make a pull request? > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15246: Assignee: (was: Apache Spark) > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15246: - Summary: Fix code style and improve volatile for SPARK-4452 (was: Fix code style and improve volatile for Spillable) > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277672#comment-15277672 ] Apache Spark commented on SPARK-15246: -- User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13020 > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15246: Assignee: Apache Spark > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang >Assignee: Apache Spark > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15246) Fix code style and improve volatile for Spillable
Lianhui Wang created SPARK-15246: Summary: Fix code style and improve volatile for Spillable Key: SPARK-15246 URL: https://issues.apache.org/jira/browse/SPARK-15246 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Lianhui Wang For SPARK-4452: 1. Fix code style. 2. Remove volatile from the elementsRead method because only one thread uses it. 3. Avoid volatile on _elementsRead because the collection increments _elementsRead on every element it inserts; a volatile write there is very expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
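A hypothetical before/after sketch of items 2 and 3 above (not the actual patch; the trait here is made up, though the member names mirror Spillable). The point is that {{_elementsRead}} is bumped once per inserted element by the single task thread that owns the collection, so a volatile write on that hot path buys nothing:

{code}
trait SpillableSketch {
  // before: every increment pays a volatile write
  // @volatile private[this] var _elementsRead = 0L

  // after: plain field, read and written only by the owning task thread
  private[this] var _elementsRead = 0L

  protected def addElementsRead(): Unit = { _elementsRead += 1 }
  protected def elementsRead: Long = _elementsRead
}
{code}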
[jira] [Assigned] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15245: Assignee: Apache Spark > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277647#comment-15277647 ] Apache Spark commented on SPARK-15245: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13021 > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15245: Assignee: (was: Apache Spark) > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
Hyukjin Kwon created SPARK-15245: Summary: stream API throws an exception with an incorrect message when the path is not a direcotry Key: SPARK-15245 URL: https://issues.apache.org/jira/browse/SPARK-15245 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Trivial {code} val path = "tmp.csv" // This is not a directory val cars = spark.read .format("csv") .stream(path) .write .option("checkpointLocation", "streaming.metadata") .startStream("tmp") {code} This throws an exception as below. {code} java.lang.IllegalArgumentException: Option 'basePath' must be a directory at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) at org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) at org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) {code} It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15207) Use Travis CI for Java Linter and JDK7/8 compilation test
[ https://issues.apache.org/jira/browse/SPARK-15207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15207: -- Description: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 However, as of today, Spark has 721 Java files with 97,362 lines of code (excluding blanks/comments), about 1/3 the size of the Scala code.
{code}
Language    files    blank    comment     code
Scala        2353    62819     124060   318747
Java          721    18617      23314    97362
{code}
This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. was: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. > Use Travis CI for Java Linter and JDK7/8 compilation test > - > > Key: SPARK-15207 > URL: https://issues.apache.org/jira/browse/SPARK-15207 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Dongjoon Hyun > > Currently, Java Linter is disabled in Jenkins tests. > https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 > However, as of today, Spark has 721 Java files with 97,362 lines of code > (excluding blanks/comments), about 1/3 the size of the Scala code.
> {code}
> Language    files    blank    comment     code
> Scala        2353    62819     124060   318747
> Java          721    18617      23314    97362
> {code}
> This issue aims to take advantage of Travis CI to handle the following static > analysis by adding a single file, `.travis.yml`, without any additional burden > on the existing servers. > - Java Linter > - JDK7/JDK8 maven compile > Note that this issue does not propose to remove any of the above work items > from Jenkins. That is possible, but we need to observe the Travis CI > stability for a while. The goal of this issue is to remove committers' > overhead on linter-related PRs (the original PR and the follow-up fix PR). > By the way, Spark used Travis CI in the past. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15207) Use Travis CI for Java Linter and JDK7/8 compilation test
[ https://issues.apache.org/jira/browse/SPARK-15207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15207: -- Description: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. was: Currently, Java Linter is disabled due to lack of testing resources. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Apache License Check - Python Linter - Scala Linter - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. Summary: Use Travis CI for Java Linter and JDK7/8 compilation test (was: Use Travis CI for Java/Scala/Python Linter and JDK7/8 compilation test) > Use Travis CI for Java Linter and JDK7/8 compilation test > - > > Key: SPARK-15207 > URL: https://issues.apache.org/jira/browse/SPARK-15207 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Dongjoon Hyun > > Currently, Java Linter is disabled in Jenkins tests. > https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 > This issue aims to take advantage of Travis CI to handle the following static > analysis by adding a single file, `.travis.yml`, without any additional burden > on the existing servers. > - Java Linter > - JDK7/JDK8 maven compile > Note that this issue does not propose to remove any of the above work items > from Jenkins. That is possible, but we need to observe the Travis CI > stability for a while. The goal of this issue is to remove committers' > overhead on linter-related PRs (the original PR and the follow-up fix PR). > By the way, Spark used Travis CI in the past. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'
[ https://issues.apache.org/jira/browse/SPARK-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277610#comment-15277610 ] Adam Cecile commented on SPARK-15218: - Hello, Good catch, the fix for this Mesos bug should solve my issue. I'm using the Mesosphere Debian package, kept up to date, so I should receive the fix soon. What about hacking around the SPARK_HOME variable within the shell wrapper to add backslashes? No luck? > Error: Could not find or load main class org.apache.spark.launcher.Main when > run from a directory containing colon ':' > -- > > Key: SPARK-15218 > URL: https://issues.apache.org/jira/browse/SPARK-15218 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Adam Cecile > Labels: mesos > > {noformat} > mkdir /tmp/qwe:rtz > cd /tmp/qwe:rtz > wget > http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz > tar xvzf spark-1.6.1-bin-without-hadoop.tgz > cd spark-1.6.1-bin-without-hadoop/ > bin/spark-submit > {noformat} > Returns "Error: Could not find or load main class > org.apache.spark.launcher.Main". > That would not be such an issue if the Mesos executor did not put colons in > the generated paths. It means that without hacking (defining a relative > SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job > container... > Best regards, Adam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
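The symptom is consistent with the JVM splitting {{-cp}} entries on {{File.pathSeparator}} (':' on Linux), so an install path containing a colon becomes two bogus classpath entries and {{org.apache.spark.launcher.Main}} cannot be found. A small sketch to illustrate the mechanism (the jar path is illustrative only):

{code}
object ColonClasspathSketch {
  def main(args: Array[String]): Unit = {
    val entry  = "/tmp/qwe:rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar"
    val pieces = entry.split(java.io.File.pathSeparatorChar)
    pieces.foreach(println)
    // On Linux this prints two entries, neither of which exists:
    //   /tmp/qwe
    //   rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar
  }
}
{code}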
[jira] [Commented] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277600#comment-15277600 ] Koert Kuipers commented on SPARK-15204: --- Makes sense > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > Notice how v1 has nullable set to true. The default (and expected) behavior > for Spark SQL is to give an int column nullable = false. For example, if I > had used a built-in aggregator like "sum" instead, it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15244) Type of column name created with sqlContext.createDataFrame() is not consistent.
Kazuki Yokoishi created SPARK-15244: --- Summary: Type of column name created with sqlContext.createDataFrame() is not consistent. Key: SPARK-15244 URL: https://issues.apache.org/jira/browse/SPARK-15244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: CentOS 7, Spark 1.6.0 Reporter: Kazuki Yokoishi Priority: Minor StructField() converts field name to str in __init__. But, when list of str/unicode is passed to sqlContext.createDataFrame() as a schema, the type of StructField.name is not converted. To reproduce: {noformat} >>> schema = StructType([StructField(u"col", StringType())]) >>> df1 = sqlContext.createDataFrame([("a",)], schema) >>> df1.columns # "col" is str ['col'] >>> df2 = sqlContext.createDataFrame([("a",)], [u"col"]) >>> df2.columns # "col" is unicode [u'col'] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError
Kazuki Yokoishi created SPARK-15243: --- Summary: Binarizer.explainParam(u"...") raises ValueError Key: SPARK-15243 URL: https://issues.apache.org/jira/browse/SPARK-15243 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: CentOS 7, Spark 1.6.0 Reporter: Kazuki Yokoishi Priority: Minor When unicode is passed to Binarizer.explainParam(), a ValueError occurs. To reproduce: {noformat} >>> binarizer = Binarizer(threshold=1.0, inputCol="values", >>> outputCol="features") >>> binarizer.explainParam("threshold") # str can be passed 'threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.0, current: 1.0)' >>> binarizer.explainParam(u"threshold") # unicode cannot be passed --- ValueErrorTraceback (most recent call last) in () > 1 binarizer.explainParam(u"threshold") /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self, param) 96 default value and user-supplied value in a string. 97 """ ---> 98 param = self._resolveParam(param) 99 values = [] 100 if self.isDefined(param): /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self, param) 231 return self.getParam(param) 232 else: --> 233 raise ValueError("Cannot resolve %r as a param." % param) 234 235 @staticmethod ValueError: Cannot resolve u'threshold' as a param. {noformat} The same errors occur in other methods. * Binarizer.hasDefault() * Binarizer.getOrDefault() * Binarizer.isSet() These errors are caused by the *isinstance(obj, str)* checks in pyspark.ml.param.Params._resolveParam(). basestring should be used instead of str in isinstance() for backward compatibility, as below. {noformat} if sys.version >= '3': basestring = str if isinstance(obj, basestring): # TODO {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-15204. --- Resolution: Not A Bug > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > Notice how v1 has nullable set to true. The default (and expected) behavior > for Spark SQL is to give an int column nullable = false. For example, if I > had used a built-in aggregator like "sum" instead, it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-15204: - > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15204: Summary: Improve nullability inference for Aggregator (was: Nullable is not correct for Aggregator) > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15204) Nullable is not correct for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15204: Issue Type: Improvement (was: Bug) > Nullable is not correct for Aggregator > -- > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15204) Nullable is not correct for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277592#comment-15277592 ] Reynold Xin commented on SPARK-15204: - I'm going to change the title because it is always correct to assume "nullable". It is just better if we can tighten the nullability when possible. This is not a "bug". > Nullable is not correct for Aggregator > -- > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
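For readers following along, the inferred nullability of the aggregated column can be inspected directly; a minimal sketch reusing the df1 from the example quoted above (the commented output reflects the current behavior described in this ticket, not a fix):
{code}
// Inspect the inferred nullability of the aggregation result column "v1"
val v1Field = df1.schema("v1")
println(v1Field.nullable) // true today; tighter inference would allow false here
{code}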
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277583#comment-15277583 ] Apache Spark commented on SPARK-4452: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13020 > Shuffle data structures can starve others on the same thread for memory > > > Key: SPARK-4452 > URL: https://issues.apache.org/jira/browse/SPARK-4452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tianshuo Deng >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > When an Aggregator is used with ExternalSorter in a task, Spark will create > many small files and could cause a "too many open files" error during merging. > Currently, ShuffleMemoryManager does not work well when there are 2 spillable > objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used > by Aggregator) in this case. Here is an example: Due to the usage of map-side > aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may > ask as much memory as it can, which is totalMem/numberOfThreads. Then later > on when ExternalSorter is created in the same thread, the > ShuffleMemoryManager could refuse to allocate more memory to it, since the > memory is already given to the previously requested > object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling > small files (due to the lack of memory). > I'm currently working on a PR to address these two issues. It will include the > following changes: > 1. The ShuffleMemoryManager should not only track the memory usage for each > thread, but also the object that holds the memory > 2. The ShuffleMemoryManager should be able to trigger the spilling of a > spillable object. In this way, if a new object in a thread is requesting > memory, the old occupant could be evicted/spilled. Previously the spillable > objects trigger spilling by themselves, so one may not trigger spilling even > if another object in the same thread needs more memory. After this change the > ShuffleMemoryManager could trigger the spilling of an object if it needs to. > 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously > ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled > after the iterator is returned. This should be changed so that even after the > iterator is returned, the ShuffleMemoryManager can still spill it. > Currently, I have a working branch in progress: > https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made > change 3 and have a prototype of changes 1 and 2 to evict spillables from the > memory manager, still in progress. I will send a PR when it's done. > Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15187) Disallow Dropping Default Database
[ https://issues.apache.org/jira/browse/SPARK-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15187: Assignee: Xiao Li > Disallow Dropping Default Database > -- > > Key: SPARK-15187 > URL: https://issues.apache.org/jira/browse/SPARK-15187 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > We should disallow users to drop default database, like what hive metastore > did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15187) Disallow Dropping Default Database
[ https://issues.apache.org/jira/browse/SPARK-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15187. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12962 [https://github.com/apache/spark/pull/12962] > Disallow Dropping Default Database > -- > > Key: SPARK-15187 > URL: https://issues.apache.org/jira/browse/SPARK-15187 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > We should disallow users to drop default database, like what hive metastore > did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15241: Assignee: Apache Spark (was: Wenchen Fan) > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15242: Assignee: Wenchen Fan (was: Apache Spark) > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277565#comment-15277565 ] Apache Spark commented on SPARK-15241: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13019 > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277566#comment-15277566 ] Apache Spark commented on SPARK-15242: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13019 > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15241: Assignee: Wenchen Fan (was: Apache Spark) > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15242: Assignee: Apache Spark (was: Wenchen Fan) > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
Wenchen Fan created SPARK-15242: --- Summary: keep decimal precision and scale when convert external decimal to catalyst decimal Key: SPARK-15242 URL: https://issues.apache.org/jira/browse/SPARK-15242 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15241) support scala decimal in external row
Wenchen Fan created SPARK-15241: --- Summary: support scala decimal in external row Key: SPARK-15241 URL: https://issues.apache.org/jira/browse/SPARK-15241 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15240) Use buffer variables to improve buffer serialization/deserialization in TungstenAggregate
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-15240: Summary: Use buffer variables to improve buffer serialization/deserialization in TungstenAggregate (was: Use buffer variables for update/merge expressions instead duplicate serialization/deserialization) > Use buffer variables to improve buffer serialization/deserialization in > TungstenAggregate > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15240: Assignee: (was: Apache Spark) > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277536#comment-15277536 ] Apache Spark commented on SPARK-15240: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13018 > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15240: Assignee: Apache Spark > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
Liang-Chi Hsieh created SPARK-15240: --- Summary: Use buffer variables for update/merge expressions instead duplicate serialization/deserialization Key: SPARK-15240 URL: https://issues.apache.org/jira/browse/SPARK-15240 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh We do serialization/deserialization on aggregation buffer in TungstenAggregate for each aggregation function. It wastes time on duplicate serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15229) Make case sensitivity setting internal
[ https://issues.apache.org/jira/browse/SPARK-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15229. - Resolution: Fixed Fix Version/s: 2.0.0 > Make case sensitivity setting internal > -- > > Key: SPARK-15229 > URL: https://issues.apache.org/jira/browse/SPARK-15229 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Our case sensitivity support is different from what ANSI SQL standards > support. Postgres' behavior is that if an identifier is quoted, then it is > treated as case sensitive, otherwise it is folded to lower case. We will > likely need to revisit this in the future and change our behavior. For now, > the safest change to do for Spark 2.0 is to make the case sensitive option > internal and discourage users from turning it on, effectively making Spark > always case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15234. - Resolution: Fixed Fix Version/s: 2.0.0 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277473#comment-15277473 ] Shivaram Venkataraman commented on SPARK-12661: --- Yes - We will do a full round of AMI updates as well to get Python 2.7, Scala 2.11 and Java 8 on the EC2 images. I plan to do that once we have 2.0 RC1 > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15238: Assignee: Apache Spark > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15238: Assignee: (was: Apache Spark) > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277468#comment-15277468 ] Apache Spark commented on SPARK-15238: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/13017 > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15239) spark document conflict about mesos cluster
JasonChang created SPARK-15239: -- Summary: spark document conflict about mesos cluster Key: SPARK-15239 URL: https://issues.apache.org/jira/browse/SPARK-15239 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.6.1 Reporter: JasonChang 1. http://spark.apache.org/docs/latest/submitting-applications.html if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications. 2. http://spark.apache.org/docs/latest/running-on-mesos.html Spark on Mesos also supports cluster mode, where the driver is launched in the cluster and the client can find the results of the driver from the Mesos Web UI. I am confused: does Mesos support cluster mode? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15238) Clarify Python 3 support in docs
Nicholas Chammas created SPARK-15238: Summary: Clarify Python 3 support in docs Key: SPARK-15238 URL: https://issues.apache.org/jira/browse/SPARK-15238 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Trivial The [current doc|http://spark.apache.org/docs/1.6.1/] reads: {quote} Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). {quote} Projects that support Python 3 generally mention that explicitly. A casual Python user might assume from this line that Spark supports Python 2.6 and 2.7 but not 3+. More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15237) SparkR corr function documentation
[ https://issues.apache.org/jira/browse/SPARK-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277454#comment-15277454 ] Sun Rui commented on SPARK-15237: - SparkR supports two types of corr(), something like corr(SparkDataFrame, "col1", "col2") and corr(df$col1, df$col2). But the documentation of corr seems to contain only the example for the latter. [~felixcheung] how should we fix the documentation? > SparkR corr function documentation > -- > > Key: SPARK-15237 > URL: https://issues.apache.org/jira/browse/SPARK-15237 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 1.6.0, 1.6.1 >Reporter: Shaul >Priority: Minor > Labels: corr, sparkr > > Please review the documentation of the corr function in SparkR, the example > given: corr(df$c, df$d) won't run. The correct usage seems to be > corr(dataFrame,"someColumn","OtherColumn"), is this correct? > Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277450#comment-15277450 ] Nicholas Chammas commented on SPARK-12661: -- [~davies] / [~joshrosen] - Has this been settled on? The dev list discussion from January seemed to converge almost unanimously on dropping Python 2.6 support in Spark 2.0, but I don't think an official decision was ever announced. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277448#comment-15277448 ] Nicholas Chammas commented on SPARK-12661: -- [~shivaram] - Can you confirm that spark-ec2 will drop support for Python 2.6 starting with Spark 2.0? For the record, I don't think it will affect things much either way anymore since spark-ec2 is now a separate project. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277427#comment-15277427 ] Kai Chen commented on SPARK-13634: -- [~Rahul Palamuttam] and [~chrismattmann] Try {code} @transient val newSC = sc {code} in the REPL to prevent SparkContext from being dragged into the serialization graph. Cheers! > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam >Priority: Minor > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job - > via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is being pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
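For illustration, a minimal sketch of the workaround suggested in the comment above, applied to the snippet from the description (names mirror the report; whether this fully avoids the capture depends on how the REPL wraps the lines):
{code}
@transient val newSC = sc   // keep the SparkContext out of the closure's serialization graph
val temp = 10
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
newRDD.count()
{code}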
[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-13485: - Labels: releasenotes (was: ) > (Dataset-oriented) API evolution in Spark 2.0 > - > > Key: SPARK-13485 > URL: https://issues.apache.org/jira/browse/SPARK-13485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > Labels: releasenotes > Fix For: 2.0.0 > > Attachments: API Evolution in Spark 2.0.pdf > > > As part of Spark 2.0, we want to create a stable API foundation for Dataset > to become the main user-facing API in Spark. This ticket tracks various tasks > related to that. > The main high level changes are: > 1. Merge Dataset/DataFrame > 2. Create a more natural entry point for Dataset (SQLContext/HiveContext are > not ideal because of the name "SQL"/"Hive", and "SparkContext" is not ideal > because of its heavy dependency on RDDs) > 3. First class support for sessions > 4. First class support for some system catalog > See the design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
[ https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15025: - Assignee: Xin Wu > creating datasource table with option (PATH) results in duplicate path key in > serdeProperties > - > > Key: SPARK-15025 > URL: https://issues.apache.org/jira/browse/SPARK-15025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Assignee: Xin Wu > Fix For: 2.0.0 > > > Repro: > {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as > a, 2 as b{code} > This will create a hive external table whose dataLocation is > "/someDefaultPath", which is not the same as the provided one. Yet, > serdeInfo.parameters contain following key value pairs: > PATH, "/tmp/t1" > path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
[ https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15025. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12804 [https://github.com/apache/spark/pull/12804] > creating datasource table with option (PATH) results in duplicate path key in > serdeProperties > - > > Key: SPARK-15025 > URL: https://issues.apache.org/jira/browse/SPARK-15025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > Fix For: 2.0.0 > > > Repro: > {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as > a, 2 as b{code} > This will create a hive external table whose dataLocation is > "/someDefaultPath", which is not the same as the provided one. Yet, > serdeInfo.parameters contain following key value pairs: > PATH, "/tmp/t1" > path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15237) SparkR corr function documentation
Shaul created SPARK-15237: - Summary: SparkR corr function documentation Key: SPARK-15237 URL: https://issues.apache.org/jira/browse/SPARK-15237 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 1.6.1, 1.6.0 Reporter: Shaul Priority: Minor Please review the documentation of the corr function in SparkR, the example given: corr(df$c, df$d) won't run. The correct usage seems to be corr(dataFrame,"someColumn","OtherColumn"), is this correct? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Component/s: Spark Shell > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15019) Propagate all Spark Confs to HiveConf created in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15019: - Labels: release_notes releasenotes (was: ) > Propagate all Spark Confs to HiveConf created in HiveClientImpl > --- > > Key: SPARK-15019 > URL: https://issues.apache.org/jira/browse/SPARK-15019 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Right now, the HiveConf created in HiveClientImpl only takes conf set at > runtime or set in hive-site.xml. We should also propagate Spark confs to it. > So, users do not have to use hive-site.xml to set warehouse location and > metastore url. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15019) Propagate all Spark Confs to HiveConf created in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277312#comment-15277312 ] Yin Huai commented on SPARK-15019: -- This change also drops the support of hive-site.xml. Users should set all hive related confs to Spark conf. > Propagate all Spark Confs to HiveConf created in HiveClientImpl > --- > > Key: SPARK-15019 > URL: https://issues.apache.org/jira/browse/SPARK-15019 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Right now, the HiveConf created in HiveClientImpl only takes conf set at > runtime or set in hive-site.xml. We should also propagate Spark confs to it. > So, users do not have to use hive-site.xml to set warehouse location and > metastore url. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
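As a rough sketch of what this implies for users (the keys shown are only illustrative: spark.sql.warehouse.dir on the Spark side and the standard Hive setting hive.metastore.uris; exact behavior may differ):
{code}
import org.apache.spark.sql.SparkSession

// With Spark confs propagated into HiveConf, warehouse location and metastore URI
// can be supplied via the Spark conf instead of hive-site.xml (illustrative values).
val spark = SparkSession.builder()
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()
{code}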
[jira] [Resolved] (SPARK-15209) Web UI's timeline visualizations fails to render if descriptions contain single quotes
[ https://issues.apache.org/jira/browse/SPARK-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-15209. Resolution: Fixed > Web UI's timeline visualizations fails to render if descriptions contain > single quotes > -- > > Key: SPARK-15209 > URL: https://issues.apache.org/jira/browse/SPARK-15209 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If a Spark job's job description contains a single quote (') then the driver > UI's job event timeline will fail to render due to Javascript errors. To > reproduce these symptoms, run > {code} > sc.setJobDescription("double quote: \" ") > sc.parallelize(1 to 10).count() > sc.setJobDescription("single quote: ' ") > sc.parallelize(1 to 10).count() > {code} > and browse to the driver UI. This will currently result in an "Uncaught > SyntaxError" because the single quote is not escaped and ends up closing a > Javascript string literal too early. > I think that a simple fix may be to change the relevant JS to use double > quotes and then to use the existing XML escaping logic to escape the string's > contents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
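A rough sketch of the direction suggested in the description, assuming the "existing XML escaping logic" refers to scala.xml.Utility (the actual fix may differ; jobDescription is a placeholder variable):
{code}
import scala.xml.Utility

// Escape the description before embedding it in the generated Javascript, and emit
// double-quoted JS string literals so a stray single quote cannot close the literal early.
val safeDescription = Utility.escape(jobDescription)
val js = s"""var content = "$safeDescription";"""
{code}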
[jira] [Updated] (SPARK-15209) Web UI's timeline visualizations fails to render if descriptions contain single quotes
[ https://issues.apache.org/jira/browse/SPARK-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-15209: --- Fix Version/s: 2.0.0 1.6.2 > Web UI's timeline visualizations fails to render if descriptions contain > single quotes > -- > > Key: SPARK-15209 > URL: https://issues.apache.org/jira/browse/SPARK-15209 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If a Spark job's job description contains a single quote (') then the driver > UI's job event timeline will fail to render due to Javascript errors. To > reproduce these symptoms, run > {code} > sc.setJobDescription("double quote: \" ") > sc.parallelize(1 to 10).count() > sc.setJobDescription("single quote: ' ") > sc.parallelize(1 to 10).count() > {code} > and browse to the driver UI. This will currently result in an "Uncaught > SyntaxError" because the single quote is not escaped and ends up closing a > Javascript string literal too early. > I think that a simple fix may be to change the relevant JS to use double > quotes and then to use the existing XML escaping logic to escape the string's > contents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: (was: Andrew Or) > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15236) No way to disable Hive support in REPL
Andrew Or created SPARK-15236: - Summary: No way to disable Hive support in REPL Key: SPARK-15236 URL: https://issues.apache.org/jira/browse/SPARK-15236 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or If you built Spark with Hive classes, there's no switch to flip to start a new `spark-shell` using the InMemoryCatalog. The only thing you can do now is to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15235: Assignee: (was: Apache Spark) > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15186) Add user guide for Generalized Linear Regression.
[ https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277293#comment-15277293 ] Seth Hendrickson commented on SPARK-15186: -- I have a PR for this once [SPARK-14979|https://issues.apache.org/jira/browse/SPARK-14979] is merged. > Add user guide for Generalized Linear Regression. > - > > Key: SPARK-15186 > URL: https://issues.apache.org/jira/browse/SPARK-15186 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Reporter: Seth Hendrickson >Priority: Minor > > We should add a user guide for the new GLR interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
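A minimal sketch of the sort of Scala snippet such a guide would likely include (assuming a training DataFrame with the usual label/features columns; parameter values are arbitrary):
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)
  .setRegParam(0.3)

val model = glr.fit(training)
println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")
{code}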
[jira] [Assigned] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15235: Assignee: Apache Spark > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277295#comment-15277295 ] Apache Spark commented on SPARK-15235: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/13016 > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
Kousuke Saruta created SPARK-15235: -- Summary: Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline Key: SPARK-15235 URL: https://issues.apache.org/jira/browse/SPARK-15235 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.1, 2.0.0 Reporter: Kousuke Saruta Priority: Trivial To extract job descriptions and stage names, there are the following regular expressions in timeline-view.js {code} var jobIdText = $($(baseElem).find(".application-timeline-content")[0]).text(); var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; ... var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); var stageIdAndAttempt = stageIdText.match("\\(Stage (\\d+\\.\\d+)\\)")[1].split("."); {code} But if job descriptions include patterns like "(Job x)" or stage names include patterns like "(Stage x.y)", the regular expressions cannot match as expected, so the corresponding row cannot be highlighted even though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277263#comment-15277263 ] Apache Spark commented on SPARK-15234: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/13014 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods
[ https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277260#comment-15277260 ] Felix Cheung commented on SPARK-14995: -- should we have Experimental for new API like in Python? note:: Experimental > Add "since" tag in Roxygen documentation for SparkR API methods > --- > > Key: SPARK-14995 > URL: https://issues.apache.org/jira/browse/SPARK-14995 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is request adding something in SparkR API like "versionadded" in PySpark > API and "@since" in Scala/Java API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277243#comment-15277243 ] Sandeep Singh commented on SPARK-15228: --- The Code is formatted properly at the end. > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-15234: - Assignee: Andrew Or > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277237#comment-15277237 ] Apache Spark commented on SPARK-15234: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13015 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14983) Getting CompileException when feed an UDF with an array and another paramter
[ https://issues.apache.org/jira/browse/SPARK-14983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277221#comment-15277221 ] Shixiong Zhu commented on SPARK-14983: -- In addition, the Scala type for `ArrayType[String]` is `Seq[String]`. You should use it instead of `Array[String]` as the UDF parameter type. > Getting CompileException when feed an UDF with an array and another paramter > > > Key: SPARK-14983 > URL: https://issues.apache.org/jira/browse/SPARK-14983 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.0 > Environment: Spark 1.5.0 on EMR >Reporter: Sky Yin > > I'm trying to apply an UDF to an array column (generated from the {{array}} > funciton in SparkSQL) and another string column and got this error: > (some long source code before this...) > {code} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:392) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:412) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:409) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 31 more > Caused by: org.codehaus.commons.compiler.CompileException: Line 1038, Column > 28: Redefinition of local variable "values" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2554) > at org.codehaus.janino.UnitCompiler.access$4300(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:2434) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:2508) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2437) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2395) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2250) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:822) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:794) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:507) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:347) > at > org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at 
org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:322) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:383) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:315) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... 35 more > {code} > I tried to create another UDF with an array as the only parameter and that > worked well as the array got passed into the Python function as a list > without any problem. > Also I tried to define an UDF with an array and a string as parameters in > Scala and got the same {{org.codehaus.commons.compiler.CompileException: Line 1038, Column 28: Redefinition of local variable "values"}} error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
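A minimal sketch of the workaround suggested in the comment above (the DataFrame and column names are made up): declare the array parameter as Seq[String], the Scala type Spark passes for an array column, rather than Array[String].

{code}
import org.apache.spark.sql.functions.{array, col, udf}

// Assumes a DataFrame `df` with string columns "a", "b" and "sep".
val joinParts = udf((parts: Seq[String], sep: String) => parts.mkString(sep))

val result = df.withColumn("joined", joinParts(array(col("a"), col("b")), col("sep")))
result.show()
{code}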
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15234: Assignee: (was: Apache Spark) > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15234: Assignee: Apache Spark > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277205#comment-15277205 ] Apache Spark commented on SPARK-15234: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/13014 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14983) Getting CompileException when feed an UDF with an array and another paramter
[ https://issues.apache.org/jira/browse/SPARK-14983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14983. -- Resolution: Duplicate This has been fixed in SPARK-10461 > Getting CompileException when feed an UDF with an array and another paramter > > > Key: SPARK-14983 > URL: https://issues.apache.org/jira/browse/SPARK-14983 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.0 > Environment: Spark 1.5.0 on EMR >Reporter: Sky Yin > > I'm trying to apply an UDF to an array column (generated from the {{array}} > funciton in SparkSQL) and another string column and got this error: > (some long source code before this...) > {code} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:392) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:412) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:409) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 31 more > Caused by: org.codehaus.commons.compiler.CompileException: Line 1038, Column > 28: Redefinition of local variable "values" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2554) > at org.codehaus.janino.UnitCompiler.access$4300(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:2434) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:2508) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2437) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2395) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2250) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:822) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:794) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:507) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:347) > at > org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:322) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:383) > at > 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:315) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... 35 more > {code} > I tried to create another UDF with an array as the only parameter and that > worked well as the array got passed into the Python function as a list > without any problem. > Also I tried to define an UDF with an array and a string as parameters in > Scala and got the same {{org.codehaus.commons.compiler.CompileException: Line > 1038, Column 28: Redefinition of local variable "values"}} error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
Andrew Or created SPARK-15234: - Summary: spark.catalog.listDatabases.show() is not formatted correctly Key: SPARK-15234 URL: https://issues.apache.org/jira/browse/SPARK-15234 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15234: -- Description: {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} It's because org.apache.spark.sql.catalog.Database is not a case class! was: {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15233) Spark task metrics should include hdfs read write latency
[ https://issues.apache.org/jira/browse/SPARK-15233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-15233: Affects Version/s: 1.6.1 Priority: Minor (was: Major) Description: Currently, the Spark task metrics do not include HDFS read/write latency. It would be very useful to have these to find bottlenecks in a query. Issue Type: Improvement (was: Bug) Summary: Spark task metrics should include hdfs read write latency (was: Spark UI should show metrics for hdfs read write latency) > Spark task metrics should include hdfs read write latency > - > > Key: SPARK-15233 > URL: https://issues.apache.org/jira/browse/SPARK-15233 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Priority: Minor > > Currently, the Spark task metrics do not include HDFS read/write latency. It > would be very useful to have these to find bottlenecks in a query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15233) Spark UI should show metrics for hdfs read write latency
Sital Kedia created SPARK-15233: --- Summary: Spark UI should show metrics for hdfs read write latency Key: SPARK-15233 URL: https://issues.apache.org/jira/browse/SPARK-15233 Project: Spark Issue Type: Bug Reporter: Sital Kedia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15229) Make case sensitivity setting internal
[ https://issues.apache.org/jira/browse/SPARK-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15229: Description: Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive, otherwise it is folded to lower case. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive. was: Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive, otherwise it is folded to lower case. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and recommend users from not turning it on, effectively making Spark always case insensitive. > Make case sensitivity setting internal > -- > > Key: SPARK-15229 > URL: https://issues.apache.org/jira/browse/SPARK-15229 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Our case sensitivity support is different from what ANSI SQL standards > support. Postgres' behavior is that if an identifier is quoted, then it is > treated as case sensitive, otherwise it is folded to lower case. We will > likely need to revisit this in the future and change our behavior. For now, > the safest change to do for Spark 2.0 is to make the case sensitive option > internal and discourage users from turning it on, effectively making Spark > always case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
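A minimal sketch of what "always case insensitive" means for users (column names are made up, and it is assumed here that the flag being made internal is spark.sql.caseSensitive); the default shown is the behavior users should rely on.

{code}
// Default (and recommended) setting: analysis is case insensitive.
spark.conf.set("spark.sql.caseSensitive", "false")

val df = spark.range(1).toDF("ID")
df.select("id").show()  // "id" resolves against column "ID" despite the case mismatch
{code}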
[jira] [Resolved] (SPARK-14898) MultivariateGaussian could use Cholesky in calculateCovarianceConstants
[ https://issues.apache.org/jira/browse/SPARK-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14898. --- Resolution: Not A Problem OK someone reopen this if we've misunderstood, but I think this isn't an issue. > MultivariateGaussian could use Cholesky in calculateCovarianceConstants > --- > > Key: SPARK-14898 > URL: https://issues.apache.org/jira/browse/SPARK-14898 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml.stat.distribution.MultivariateGaussian, > calculateCovarianceConstants uses SVD. It might be more efficient to use > Cholesky. We should check other numerical libraries and see if we should > switch to Cholesky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
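A minimal Breeze sketch (toy covariance matrix) of the alternative that was being suggested, assuming the covariance is positive definite: a Cholesky factor yields the log-determinant and cheap solves, which are the kinds of quantities calculateCovarianceConstants currently derives from the SVD. Note that the SVD path also covers singular matrices, which a plain Cholesky factorization would not.

{code}
import breeze.linalg.{DenseMatrix, cholesky}

// Toy 2x2 positive-definite covariance matrix.
val cov = DenseMatrix((2.0, 0.3), (0.3, 1.0))

val L = cholesky(cov)  // cov = L * L.t with L lower-triangular
val logDet = 2.0 * (0 until L.rows).map(i => math.log(L(i, i))).sum

println(logDet)
{code}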
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277073#comment-15277073 ] Sean Owen commented on SPARK-15228: --- (What's the bug?) > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277003#comment-15277003 ] Apache Spark commented on SPARK-15231: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13013 > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
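A minimal sketch of the two resolution modes being documented (the table and column names are made up; it assumes an existing table t with columns a and b, in that order):

{code}
import spark.implicits._

// DataFrame whose column order differs from the table's (b first, a second).
val df = Seq((1, 2)).toDF("b", "a")

// Position-based: value 1 lands in table column a, value 2 in column b.
df.write.insertInto("t")

// Name-based: column a goes to a and column b goes to b, regardless of position.
df.write.mode("append").saveAsTable("t")
{code}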
[jira] [Assigned] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15231: Assignee: Shixiong Zhu (was: Apache Spark) > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15232) Add subquery SQL building tests to LogicalPlanToSQLSuite
Herman van Hovell created SPARK-15232: - Summary: Add subquery SQL building tests to LogicalPlanToSQLSuite Key: SPARK-15232 URL: https://issues.apache.org/jira/browse/SPARK-15232 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor We currently test subquery SQL building using the {{HiveCompatibilitySuite}}. This is not desirable, since SQL building is actually part of sql/core and because we are slowly reducing our dependency on Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15231: Assignee: Apache Spark (was: Shixiong Zhu) > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trystan Leftwich updated SPARK-14318: - Description: TPCDS Q14 parses successfully, and plans created successfully. Spark tries to run (I used only 1GB text file), but "hangs". Tasks are extremely slow to process AND all CPUs are used 100% by the executor JVMs. It is very easy to reproduce: 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB text file (assuming you know how to generate the csv data). My command is like this: {noformat} /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 8g --num-executors 4 --executor-cores 4 --conf spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.o{noformat} The Spark console output: {noformat} 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on executor id: 2 hostname: bigaperf137.svl.ibm.com. 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) {noformat} Notice that time durations between tasks are unusually long: 2~5 minutes. 
When looking at the Linux 'perf' tool, two top CPU consumers are: 86.48%java [unknown] 12.41%libjvm.so Using the Java hotspot profiling tools, I am able to show what hotspot methods are (top 5): {noformat} org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 46.845276 9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms 9,654,179 ms org.apache.spark.unsafe.Platform.copyMemory() 18.631157 3,848,442 ms (18.6%)3,848,442 ms3,848,442 ms3,848,442 ms org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185 1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 4.6126328 955,495 ms (4.6%) 955,495 ms 2,153,910 ms 2,153,910 ms org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write() 4.581077949,930 ms (4.6%) 949,930 ms 19,967,510 ms 19,967,510 ms {noformat} So as you can see, the test has been running for 1.5 hours...with 46% CPU spent in the org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. The stacks for top two are: {noformat} Marshalling I java/io/DataOutputStream.writeInt() line 197 org.apache.spark.sql I org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60 org.apache.spark.storage I org/apache/spark/storage/DiskBlockObjectWriter.write() line 185 org.apache.spark.shuffle I org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150 org.apache.spark.scheduler I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78 I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46 I org/apache/spark/scheduler/Task.run() line 82 org.apache.spark.executor I org/apache/spark/executor/Executor$TaskRunner.run() line 231 Dispatching Overhead, Standard Library Worker Dispatching I java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142 I java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617 I java/lang/Thread.run() line 745 {noformat} and {noformat} org.apache.spark.unsafe I org/apache/spark
[jira] [Created] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
Shixiong Zhu created SPARK-15231: Summary: Document the semantic of saveAsTable and insertInto and don't drop columns silently Key: SPARK-15231 URL: https://issues.apache.org/jira/browse/SPARK-15231 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu We should document the difference between "insertInto" and "saveAsTable": "insertInto" uses position-based column resolution, while "saveAsTable with append" uses name-based resolution. In addition, for "saveAsTable with append", when trying to add more columns, we just drop the new columns silently. The correct behavior should be to throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org