[jira] [Commented] (SPARK-3276) Provide an API to specify whether the old files need to be ignored in file input text DStream

2015-01-20 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283662#comment-14283662
 ] 

Jack Hu commented on SPARK-3276:


In some cases the old files (older than the current Spark system time) are
needed: for example, if you have a fixed list in HDFS that you want to correlate
with the input stream, you need to load it from the file system.

As for the newFilesOnly option, it is broken in Spark 1.2 (it works in 1.1).




 Provide an API to specify whether the old files need to be ignored in file 
 input text DStream
 

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Priority: Minor

 Currently there is only one API, textFileStream in StreamingContext, to create 
 a text file DStream, and it always ignores old files. Sometimes the old files 
 are still useful.
 We need an API that lets the user choose whether old files should be ignored or not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
   fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
 }
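For context, StreamingContext also exposes a more general fileStream overload with a newFilesOnly flag. A minimal sketch of using it to also pick up files already present in the directory (assuming the Spark 1.x fileStream(directory, filter, newFilesOnly) signature; note the comment above reports this behaving differently in 1.2):
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// newFilesOnly = false asks the DStream to also process files already present
// in the directory, instead of only files newer than the stream's start time.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///data/incoming",   // example directory, not from the original report
  (path: Path) => true,      // accept every file
  newFilesOnly = false
).map(_._2.toString)
{code}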






[jira] [Created] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)

2015-01-20 Thread Kevin (Sangwoo) Kim (JIRA)
Kevin (Sangwoo) Kim created SPARK-5334:
--

 Summary: NullPointerException when getting files from S3 (hadoop 
2.3+)
 Key: SPARK-5334
 URL: https://issues.apache.org/jira/browse/SPARK-5334
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Spark 1.2 built with Hadoop 2.3+
Reporter: Kevin (Sangwoo) Kim


In Spark 1.2 built with Hadoop 2.3+, I am unable to get files from AWS S3.
The same code works fine with the same setup in Spark built with Hadoop 2.2 or earlier.
I saw that the jets3t version changed in the Hadoop 2.3+ profile, so I guess there
might be an issue with it.

===

scala> sc.textFile("s3n://logs/log.2014-12-05.gz").count
15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with 
curMem=0, maxMem=27783541555
15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 102.1 KB, free 25.9 GB)
java.lang.NullPointerException
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157)
at org.apache.spark.rdd.RDD.count(RDD.scala:904)
at $iwC$$iwC$$iwC$$iwC.init(console:13)
at $iwC$$iwC$$iwC.init(console:18)
at $iwC$$iwC.init(console:20)
at $iwC.init(console:22)
at init(console:24)
at .init(console:28)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

[jira] [Created] (SPARK-5332) Efficient way to deal with ExecutorLost

2015-01-20 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5332:
--

 Summary: Efficient way to deal with ExecutorLost
 Key: SPARK-5332
 URL: https://issues.apache.org/jira/browse/SPARK-5332
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh


Currently, the handler for the case when an executor is lost in DAGScheduler 
(handleExecutorLost) does not look efficient. This PR tries to add a bit of extra 
information to the Stage class to improve that.






[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5333:

Summary: [Mesos] MesosTaskLaunchData occurs BufferUnderflowException  (was: 
[Mesos]MesosTaskLaunchData occurs BufferUnderflowException)

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Priority: Blocker

 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread "Thread-6" java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug in fine-grained mode. It happens because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data.






[jira] [Resolved] (SPARK-4803) Duplicate RegisterReceiver messages sent from ReceiverSupervisor

2015-01-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4803.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

 Duplicate RegisterReceiver messages sent from ReceiverSupervisor
 

 Key: SPARK-4803
 URL: https://issues.apache.org/jira/browse/SPARK-4803
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Ilayaperumal Gopinathan
Priority: Trivial
 Fix For: 1.3.0, 1.2.1


 The ReceiverTracker receives `RegisterReceiver` messages two times:
  1) when the actor's preStart at `ReceiverSupervisorImpl` is invoked
  2) after the receiver is started at the executor, in `onReceiverStart()` at `ReceiverSupervisorImpl`
 Though the 'RegisterReceiver' message uses the same streamId and the receiverInfo 
 gets updated every time the message is processed at the `ReceiverTracker`, it 
 makes sense to register the receiver only after the receiver is started.
 Or am I missing something here?






[jira] [Commented] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283732#comment-14283732
 ] 

Apache Spark commented on SPARK-5311:
-

User 'ganonp' has created a pull request for this issue:
https://github.com/apache/spark/pull/4120

 EventLoggingListener throws exception if log directory does not exist
 -

 Key: SPARK-5311
 URL: https://issues.apache.org/jira/browse/SPARK-5311
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Josh Rosen
Priority: Blocker

 If the log directory does not exist, EventLoggingListener throws an 
 IllegalArgumentException.  Here's a simple reproduction (using the master 
 branch (1.3.0)):
 {code}
 ./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
 spark.eventLog.dir=/tmp/nonexistent-dir
 {code}
 where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. 
  This results in the following exception:
 {code}
 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server
 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' 
 on port 62729.
 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
 Attempting port 4041.
 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 
 4041.
 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at 
 http://joshs-mbp.att.net:4041
 15/01/18 17:10:45 INFO Executor: Using REPL class URI: 
 http://192.168.1.248:62726
 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
 akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver
 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730
 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager
 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager 
 localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730)
 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager
 java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does 
 not exist.
   at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90)
   at org.apache.spark.SparkContext.init(SparkContext.scala:363)
   at 
 org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
   at 
 

[jira] [Commented] (SPARK-5332) Efficient way to deal with ExecutorLost

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283648#comment-14283648
 ] 

Apache Spark commented on SPARK-5332:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4118

 Efficient way to deal with ExecutorLost
 ---

 Key: SPARK-5332
 URL: https://issues.apache.org/jira/browse/SPARK-5332
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh

 Currently, the handler for the case when an executor is lost in DAGScheduler 
 (handleExecutorLost) does not look efficient. This PR tries to add a bit of 
 extra information to the Stage class to improve that.






[jira] [Created] (SPARK-5333) [Mesos]MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Jongyoul Lee (JIRA)
Jongyoul Lee created SPARK-5333:
---

 Summary: [Mesos]MesosTaskLaunchData occurs BufferUnderflowException
 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Priority: Blocker


MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a task 
because serializedTask.remaining is 0.

{code}
Exception in thread "Thread-6" java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Buffer.java:498)
at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
{code}

I've checked this bug in fine-grained mode. It happens because 
MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data.
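The underlying mistake is easy to reproduce with a plain java.nio.ByteBuffer; a minimal sketch, independent of the Mesos code, of why the reader sees remaining == 0 unless the buffer is rewound after writing:
{code}
import java.nio.ByteBuffer

val buf = ByteBuffer.allocate(4)
buf.putInt(42)            // position is now 4, so buf.remaining() == 0
// buf.getInt()           // would throw java.nio.BufferUnderflowException here
buf.rewind()              // reset position to 0 before handing the buffer to the reader
val value = buf.getInt()  // reads 42 as expected
{code}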






[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283669#comment-14283669
 ] 

Apache Spark commented on SPARK-5333:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/4119

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Priority: Blocker

 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread "Thread-6" java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug in fine-grained mode. It happens because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data.






[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5333:

Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Priority: Blocker

 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread "Thread-6" java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug in fine-grained mode. It happens because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data.






[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions

2015-01-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283809#comment-14283809
 ] 

Lianhui Wang edited comment on SPARK-4630 at 1/20/15 1:28 PM:
--

I think it is better to use a stage's output size to decide the number of 
partitions for the next stage. Because an RDD is only one of a stage's operators 
and the number of partitions is only related to the shuffle, I think stage-level 
statistics are better than RDD-level ones.
Also, in SQL there are many optimizations that choose different physical plans 
based on statistics, for example hash join versus sort-merge join, but that is 
another topic.
[~sandyr] What you said before depends on the number of partitions in the map 
writer being very large, so that reducers can fetch data using range partitions. 
If the initial number of partitions is small, we need to repartition the data, 
and it is very expensive to scan the data twice. I don't know whether there is a 
better way in this situation.
So I currently use the input size of the parent stage to determine the number of 
a stage's partitions.



was (Author: lianhuiwang):
I think it is better to use a stage's output size to decide the number of 
partitions for the next stage. Because an RDD is only one of a stage's operators 
and the number of partitions is only related to the shuffle, I think stage-level 
statistics are better than RDD-level ones.
Also, in SQL there are many optimizations that choose different physical plans 
based on statistics, for example hash join versus sort-merge join, but that is 
another topic.
[~sandyr] What you said before depends on the number of partitions in the map 
writer being very large, so that reducers can fetch data using range partitions. 
If the initial number of partitions is small, we need to repartition the data, 
and it is very expensive to scan the data twice. I don't know whether there is a 
better way in this situation.


 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions to the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase the performance of jobs, users often hand-optimize the number (size) 
 of partitions that the next stage gets. Factors that come into play are:
 - incoming partition sizes from the previous stage
 - number of available executors
 - available memory per executor (taking into account spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to do the partition sizing 
 for the user automatically. This feature can be turned off/on with a 
 configuration option.
 To make this happen, we propose modifying the DAGScheduler to take partition 
 sizes into account upon stage completion. Before scheduling the next stage, the 
 scheduler can examine the sizes of the partitions and determine the appropriate 
 number of tasks to create. Since this change requires non-trivial modifications 
 to the DAGScheduler, a detailed design doc will be attached before proceeding 
 with the work.
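As a rough illustration of the proposal (a purely hypothetical helper, not the actual DAGScheduler change), the next stage's partition count could be derived from the completed stage's output size and a target partition size:
{code}
// Hypothetical sketch: choose a partition count for the next stage from the
// previous stage's total output bytes and a configured target partition size.
def suggestNumPartitions(totalOutputBytes: Long,
                         targetPartitionBytes: Long = 128L * 1024 * 1024,
                         minPartitions: Int = 1): Int = {
  val n = math.ceil(totalOutputBytes.toDouble / targetPartitionBytes).toInt
  math.max(n, minPartitions)
}

// e.g. 10 GB of map output with a 128 MB target gives 80 partitions:
// suggestNumPartitions(10L * 1024 * 1024 * 1024) == 80
{code}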






[jira] [Commented] (SPARK-4017) Progress bar in console

2015-01-20 Thread Paul Wolfe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283848#comment-14283848
 ] 

Paul Wolfe commented on SPARK-4017:
---

Hello, I was wondering if there is a way to turn this feature off? It clutters the 
log files of Java Spark applications.

 Progress bar in console
 ---

 Key: SPARK-4017
 URL: https://issues.apache.org/jira/browse/SPARK-4017
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 It would be nice to have a progress bar in the console; then we could change the 
 default logging level to WARN.
 The progress bar should be on one line, and could also be shown in the terminal's 
 title.






[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions

2015-01-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283809#comment-14283809
 ] 

Lianhui Wang commented on SPARK-4630:
-

I think it is better to use a stage's output size to decide the number of 
partitions for the next stage. Because an RDD is only one of a stage's operators 
and the number of partitions is only related to the shuffle, I think stage-level 
statistics are better than RDD-level ones.
Also, in SQL there are many optimizations that choose different physical plans 
based on statistics, for example hash join versus sort-merge join, but that is 
another topic.
[~sandyr] What you said before depends on the number of partitions in the map 
writer being very large, so that reducers can fetch data using range partitions. 
If the initial number of partitions is small, we need to repartition the data, 
and it is very expensive to scan the data twice. I don't know whether there is a 
better way in this situation.


 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions to the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase the performance of jobs, users often hand-optimize the number (size) 
 of partitions that the next stage gets. Factors that come into play are:
 - incoming partition sizes from the previous stage
 - number of available executors
 - available memory per executor (taking into account spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to do the partition sizing 
 for the user automatically. This feature can be turned off/on with a 
 configuration option.
 To make this happen, we propose modifying the DAGScheduler to take partition 
 sizes into account upon stage completion. Before scheduling the next stage, the 
 scheduler can examine the sizes of the partitions and determine the appropriate 
 number of tasks to create. Since this change requires non-trivial modifications 
 to the DAGScheduler, a detailed design doc will be attached before proceeding 
 with the work.






[jira] [Commented] (SPARK-5328) Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit

2015-01-20 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283887#comment-14283887
 ] 

RJ Nowling commented on SPARK-5328:
---

The Python API for Naive Bayes is located in 
python/pyspark/mllib/classification.py .  The Python implementation calls the 
Scala implementation for training through the interface in 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala .

The classes in classification.py will need to be updated (with additional pydoc 
tests), a new method will need to be added to PythonMLLibAPI.scala, and the 
Python portion of docs/mllib-naive-bayes.md will need to be updated.



 Update PySpark MLlib NaiveBayes API to take model type parameter for 
 Bernoulli fit
 --

 Key: SPARK-5328
 URL: https://issues.apache.org/jira/browse/SPARK-5328
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Leah McGuire
Priority: Minor
  Labels: mllib

 [SPARK-4894] (Add Bernoulli-variant of Naive Bayes) adds Bernoulli fitting to 
 NaiveBayes.scala; the Python API needs to be updated to accept a model type parameter.






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283941#comment-14283941
 ] 

Sean Owen commented on SPARK-4442:
--

[~matthewcornell] Normally I'd say you don't need to build any JARs yourself, 
and shouldn't bother manually managing JARs; just use Maven or SBT and write in 
the dependencies you want. But I see that Spark doesn't actually publish test 
artifacts. (Which, to be fair, would be unusual. But [~joshrosen], is that not the 
simplest way to expose this?)

You can run mvn package as shown in the Building Spark documentation, and you'll 
end up with a bunch of artifacts in core/target, including the test JAR file 
containing Spark's test code and thus any utility code you want from there.

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Commented] (SPARK-4017) Progress bar in console

2015-01-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284014#comment-14284014
 ] 

Davies Liu commented on SPARK-4017:
---

It can be turned off by setting spark.ui.showConsoleProgress = false.

BTW, what are your log4j configs? The progress bar should already be turned off 
if the logging level is INFO or below.
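For reference, a minimal sketch of setting that flag from application code instead of the command line (assuming you construct your own SparkConf; the same setting can be passed with --conf):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-progress-bar")
  .set("spark.ui.showConsoleProgress", "false")  // the setting mentioned above
val sc = new SparkContext(conf)
{code}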

 Progress bar in console
 ---

 Key: SPARK-4017
 URL: https://issues.apache.org/jira/browse/SPARK-4017
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 It would be nice to have a progress bar in the console; then we could change the 
 default logging level to WARN.
 The progress bar should be on one line, and could also be shown in the terminal's 
 title.






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283993#comment-14283993
 ] 

Matthew Cornell commented on SPARK-4442:


[~srowen] Thanks for the tip. I tried compiling 1.2.0 using this command:
$ mvn package -DskipTests

But I could not find 'LocalSparkContext' in any jar:
$ find . -iname '*.jar' | xargs grep -i 'LocalSparkContext'

I'm recompiling without -DskipTests (it's taking a while) - would that cause 
anything to be added? Once the build is done I'll paste the output. Until then - 
am I missing something that would cause the tests to be excluded?

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Created] (SPARK-5336) spark.executor.cores must not be less than spark.task.cpus

2015-01-20 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-5336:
--

 Summary: spark.executor.cores must not be less than spark.task.cpus
 Key: SPARK-5336
 URL: https://issues.apache.org/jira/browse/SPARK-5336
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: WangTaoTheTonic


If the user sets spark.executor.cores to be less than spark.task.cpus, the task 
scheduler will fall into an infinite loop; we should throw an exception in that case.

In standalone and Mesos mode, we should respect spark.task.cpus too, and I will 
file another JIRA to solve that.
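A minimal sketch of the kind of validation this implies (illustrative only, not the actual patch): fail fast instead of letting the scheduler loop forever.
{code}
import org.apache.spark.SparkConf

def validateCores(conf: SparkConf): Unit = {
  val executorCores = conf.getInt("spark.executor.cores", 1)
  val taskCpus = conf.getInt("spark.task.cpus", 1)
  require(executorCores >= taskCpus,
    s"spark.executor.cores ($executorCores) must be at least spark.task.cpus ($taskCpus), " +
      "otherwise no executor can ever run a task")
}
{code}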






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284054#comment-14284054
 ] 

Sean Owen commented on SPARK-4442:
--

Hm, no, it works for me. Maybe {{mvn -DskipTests install}} the entire project 
first? Although I wouldn't think that's necessary. Also, I'm working off 
{{master}}, although again it should be the same from any release.

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Commented] (SPARK-5336) spark.executor.cores must not be less than spark.task.cpus

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284067#comment-14284067
 ] 

Apache Spark commented on SPARK-5336:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/4123

 spark.executor.cores must not be less than spark.task.cpus
 --

 Key: SPARK-5336
 URL: https://issues.apache.org/jira/browse/SPARK-5336
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: WangTaoTheTonic

 If the user sets spark.executor.cores to be less than spark.task.cpus, the task 
 scheduler will fall into an infinite loop; we should throw an exception in that 
 case.
 In standalone and Mesos mode, we should respect spark.task.cpus too, and I will 
 file another JIRA to solve that.






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284069#comment-14284069
 ] 

Matthew Cornell commented on SPARK-4442:


Thanks for sticking with me on this, Sean! I tried again from scratch with no 
luck. Maybe the downloaded sources are missing something crucial that is present 
on master? Here's what I did:

# start with extracting 
http://apache.spinellicreations.com/spark/spark-1.2.0/spark-1.2.0.tgz
$ cd /Users/cornell/Downloads/spark-1.2.0/
$ mvn -DskipTests install
$ cd core
$ mvn jar:test-jar

- same warning:
[WARNING] JAR will be empty - no content was marked for inclusion!


 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284013#comment-14284013
 ] 

Sean Owen commented on SPARK-4442:
--

[~matthewcornell] Oops, I missed again. The test JARs aren't configured to be 
generated by the build as-is. But you can simply do this in {{core/}}:

{code}
 mvn jar:test-jar
...
 jar tf target/spark-core_2.10-1.3.0-SNAPSHOT-tests.jar | grep 
 LocalSparkContext
org/apache/spark/LocalSparkContext$.class
org/apache/spark/LocalSparkContext$class.class
org/apache/spark/LocalSparkContext.class
{code}
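Once the tests JAR is on the test classpath, usage is straightforward; a minimal sketch of a ScalaTest suite mixing in LocalSparkContext (assuming the Spark 1.x test trait, which stops the context after each test):
{code}
import org.apache.spark.{LocalSparkContext, SparkContext}
import org.scalatest.FunSuite

class MyJobSuite extends FunSuite with LocalSparkContext {
  test("counts elements") {
    sc = new SparkContext("local", "MyJobSuite")  // LocalSparkContext cleans this up afterwards
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
{code}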

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2015-01-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284012#comment-14284012
 ] 

Yin Huai commented on SPARK-2890:
-

[~btiernay] Oh, it seems the comment thread of this JIRA is not quite clear on 
whether this issue has been resolved. Actually, we have relaxed this restriction 
(https://github.com/apache/spark/pull/2209/files is the change). 

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it is caused by duplicated columns in my select 
 clause.
 I made the duplication on purpose so that my code parses correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi
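For versions where the restriction still applies, the usual workaround is to alias the duplicated columns; a hedged sketch (table and column names are made up):
{code}
// Illustrative only: the kind of query that used to fail with
// "Found fields with the same name", and an aliased variant whose
// result schema has unique field names.
val failing = sqlContext.sql("SELECT name, name FROM people")
val aliased = sqlContext.sql("SELECT name AS name_1, name AS name_2 FROM people")
{code}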






[jira] [Created] (SPARK-5337) respect spark.task.cpus when launch executors

2015-01-20 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-5337:
--

 Summary: respect spark.task.cpus when launch executors
 Key: SPARK-5337
 URL: https://issues.apache.org/jira/browse/SPARK-5337
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: WangTaoTheTonic


In standalone mode, we do not respect spark.task.cpus when launching executors. 
Some executors may not have enough cores to launch even a single task.
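A minimal sketch of the kind of check this implies (a hypothetical helper, not the actual standalone scheduler code): only treat an executor as usable if it was granted at least spark.task.cpus cores.
{code}
// Hypothetical illustration: an executor is only useful if its core count
// is at least spark.task.cpus, otherwise it can never run a task.
case class ExecutorOffer(id: String, cores: Int)

def usableOffers(offers: Seq[ExecutorOffer], taskCpus: Int): Seq[ExecutorOffer] =
  offers.filter(_.cores >= taskCpus)

// usableOffers(Seq(ExecutorOffer("a", 1), ExecutorOffer("b", 4)), taskCpus = 2)
// keeps only executor "b" when spark.task.cpus = 2
{code}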






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284052#comment-14284052
 ] 

Matthew Cornell commented on SPARK-4442:


[~srowen] I might have misunderstood. I tried this:

$ cd dir/spark-1.2.0/core/
$ mvn jar:test-jar

But it says it created an empty jar (see output below). Any ideas re: what I'm 
doing wrong?


[INFO] Scanning for projects...
[INFO] 
[INFO] 
[INFO] Building Spark Project Core 1.2.0
[INFO] 
[INFO] 
[INFO] --- maven-jar-plugin:2.4:test-jar (default-cli) @ spark-core_2.10 ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 1.585 s
[INFO] Finished at: 2015-01-20T12:13:57-05:00
[INFO] Final Memory: 10M/81M
[INFO] 


 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Comment Edited] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2015-01-20 Thread Bob Tiernay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283958#comment-14283958
 ] 

Bob Tiernay edited comment on SPARK-2890 at 1/20/15 4:02 PM:
-

What if you request {{SELECT x.\*, y.\*}}? If there are 20 columns on each 
side, is the user required to specify them all?


was (Author: btiernay):
What if you request {{SELECT x.*, y.*}}? If there are 20 columns on each side, 
is the user required to specify them all?

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it is caused by duplicated columns in my select 
 clause.
 I made the duplication on purpose so that my code parses correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi






[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2015-01-20 Thread Bob Tiernay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283958#comment-14283958
 ] 

Bob Tiernay commented on SPARK-2890:


What if you request {{SELECT x.*, y.*}}? If there are 20 columns on each side, 
is the user required to specify them all?

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it is caused by duplicated columns in my select 
 clause.
 I made the duplication on purpose so that my code parses correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi






[jira] [Commented] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283936#comment-14283936
 ] 

Apache Spark commented on SPARK-5335:
-

User 'voukka' has created a pull request for this issue:
https://github.com/apache/spark/pull/4122

 Destroying cluster in VPC with --delete-groups fails to remove security 
 groups
 

 Key: SPARK-5335
 URL: https://issues.apache.org/jira/browse/SPARK-5335
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor

 When I try to remove the security groups using the --delete-groups option of the 
 script, it fails because in a VPC one must remove security groups by id, not by 
 name as the script does now.
 {code}
 $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups 
 destroy SparkByScript
 Are you sure you want to destroy the cluster SparkByScript?
 The following instances will be terminated:
 Searching for existing cluster SparkByScript...
 ALL DATA ON ALL NODES WILL BE LOST!!
 Destroy cluster SparkByScript (y/N): y
 Terminating master...
 Terminating slaves...
 Deleting security groups (this will take some time)...
 Waiting for cluster to enter 'terminated' state.
 Cluster is now in 'terminated' state. Waited 0 seconds.
 Attempt 1
 Deleting rules in security group SparkByScript-slaves
 Deleting rules in security group SparkByScript-master
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response>
 Failed to delete security group SparkByScript-slaves
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response>
 Failed to delete security group SparkByScript-master
 Attempt 2
 
 {code}






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283920#comment-14283920
 ] 

Matthew Cornell commented on SPARK-4442:


Please, as a new Spark (and Maven and SBT) user, having a jar I could simply 
drop into my IntelliJ project would be a life saver. Until then, would someone 
please sketch a little detail on how I could build the jar using the 1.2.0 
sources? Thanks!

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.






[jira] [Commented] (SPARK-750) LocalSparkContext should be included in Spark JAR

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283919#comment-14283919
 ] 

Matthew Cornell commented on SPARK-750:
---

Please, as a new Spark (and Maven and SBT) user, having a jar I could simply 
drop into my IntelliJ project would be a life saver. Until then, would someone 
please sketch a little detail on how I could build the jar using the 1.2.0 
sources? Thanks!

 LocalSparkContext should be included in Spark JAR
 -

 Key: SPARK-750
 URL: https://issues.apache.org/jira/browse/SPARK-750
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Josh Rosen
Priority: Minor

 To aid third-party developers in writing unit tests with Spark, 
 LocalSparkContext should be included in the Spark JAR.  Right now, it appears 
 to be excluded because it is located in one of the Spark test directories.






[jira] [Resolved] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5333.
---
Resolution: Fixed

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Blocker
 Fix For: 1.3.0


 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread "Thread-6" java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug in fine-grained mode. It happens because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data.






[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type

2015-01-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5287:

Summary: Add defaultSizeOf to every data type  (was: 
NativeType.defaultSizeOf should have default sizes of all NativeTypes.)

 Add defaultSizeOf to every data type
 

 Key: SPARK-5287
 URL: https://issues.apache.org/jira/browse/SPARK-5287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 Otherwise, we will failed to do stats estimation. 






[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type

2015-01-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5287:

Description: Right now, in NativeType, we defined some defaultSizes (it is 
actually missing some types) and for complex types, we calculate the default 
size at the place where we use the default size. We should add defaultSize to 
every data type.  (was: Otherwise, we will failed to do stats estimation. )

 Add defaultSizeOf to every data type
 

 Key: SPARK-5287
 URL: https://issues.apache.org/jira/browse/SPARK-5287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 Right now, in NativeType, we define some defaultSizes (it is actually missing 
 some types), and for complex types we calculate the default size at the place 
 where we use it. We should add defaultSize to every data type.
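A toy sketch of the idea (illustrative stand-ins, not Spark's actual DataType hierarchy): every type, including complex ones, reports its own default size instead of callers recomputing it.
{code}
// Hypothetical stand-ins for the real DataType classes, just to show the shape of the change.
sealed trait DataTypeLike { def defaultSize: Int }
case object IntTypeLike    extends DataTypeLike { val defaultSize = 4 }
case object DoubleTypeLike extends DataTypeLike { val defaultSize = 8 }
case class ArrayTypeLike(element: DataTypeLike) extends DataTypeLike {
  // a complex type derives its estimate from its element type, so size
  // estimation no longer has to special-case it at every call site
  def defaultSize: Int = 100 * element.defaultSize
}
{code}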






[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Matthew Cornell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284210#comment-14284210
 ] 

Matthew Cornell commented on SPARK-4442:


OK, progress: I cloned master and re-ran the commands and did end up getting 
spark-core_2.10-1.3.0-SNAPSHOT-tests.jar, which does contain LocalSparkContext. 
So I guess I've upgraded to 1.3.0 :-) Question, please: there is a second 
LocalSparkContext defined in:

graphx/src/test/scala/org/apache/spark/graphx/LocalSparkContext.scala

that did not get included in the mvn jar:test-jar command. I looked at pom.xml 
to try to figure out what that argument does, but all I found was a profile 
called 'java8-tests'. I couldn't find anywhere that mentioned:

dirspark/core/src/test/scala/org/apache/spark/LocalSparkContext.scala 

Any pointers would be appreciated!

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5333:
--
Fix Version/s: 1.3.0

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Blocker
 Fix For: 1.3.0


 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread Thread-6 java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at 
 org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug with fine-grained mode. This is because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after it puts 
 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284155#comment-14284155
 ] 

Josh Rosen commented on SPARK-5333:
---

Fixed by https://github.com/apache/spark/pull/4119

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Blocker
 Fix For: 1.3.0


 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread Thread-6 java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at 
 org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug with fine-grained mode. This is because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after it puts 
 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5333:
--
Assignee: Jongyoul Lee

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Blocker
 Fix For: 1.3.0


 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread Thread-6 java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at 
 org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug with fine-grained mode. This is because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after it puts 
 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module

2015-01-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284216#comment-14284216
 ] 

Sean Owen commented on SPARK-4442:
--

I don't think 1.3.0 should be different in this regard, or any release. The 
command is going to JAR up compiled test classes, so it's necessary for tests 
to be compiled first, but install should have done that. You're referring to 
another class in the graphx module, so that won't be part of core's test code. 
java8-tests is not related. I'm not sure what you are looking for in the POM?

 Move common unit test utilities into their own package / module
 ---

 Key: SPARK-4442
 URL: https://issues.apache.org/jira/browse/SPARK-4442
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen
Priority: Minor

 We should move generally-useful unit test fixtures / utility methods to their 
 own test utilities set package / module to make them easier to find / use.
 See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for 
 one example of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-750) LocalSparkContext should be included in Spark JAR

2015-01-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-750.
-
Resolution: Duplicate

I'm going to boldly fold this into SPARK-4442 as a more general, related 
request to expose test utilities explicitly.

 LocalSparkContext should be included in Spark JAR
 -

 Key: SPARK-750
 URL: https://issues.apache.org/jira/browse/SPARK-750
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Josh Rosen
Priority: Minor

 To aid third-party developers in writing unit tests with Spark, 
 LocalSparkContext should be included in the Spark JAR.  Right now, it appears 
 to be excluded because it is located in one of the Spark test directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException

2015-01-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284135#comment-14284135
 ] 

Josh Rosen commented on SPARK-5333:
---

Good catch.  I've created a link to the SPARK-4014 JIRA so that we don't forget 
to backport this patch, too, when porting that patch to earlier branches.

 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
 ---

 Key: SPARK-5333
 URL: https://issues.apache.org/jira/browse/SPARK-5333
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Jongyoul Lee
Priority: Blocker

 MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a 
 task because serializedTask.remaining is 0.
 {code}
 Exception in thread Thread-6 java.nio.BufferUnderflowException
   at java.nio.Buffer.nextGetIndex(Buffer.java:498)
   at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46)
   at 
 org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81)
 {code}
 I've checked this bug with fine-grained mode. This is because 
 MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after it puts 
 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4014) TaskContext.attemptId returns taskId

2015-01-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284137#comment-14284137
 ] 

Josh Rosen commented on SPARK-4014:
---

Note to self: when backporting this to any branches, also backport SPARK-4014 
(since that fixes a bug introduced here).

 TaskContext.attemptId returns taskId
 

 Key: SPARK-4014
 URL: https://issues.apache.org/jira/browse/SPARK-4014
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Yin Huai
Assignee: Josh Rosen
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0


 In TaskRunner, we assign the taskId of a task to the attemptId of the 
 corresponding TaskContext. Should we rename attemptId to taskId to avoid 
 confusion?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4014) TaskContext.attemptId returns taskId

2015-01-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284137#comment-14284137
 ] 

Josh Rosen edited comment on SPARK-4014 at 1/20/15 6:12 PM:


Note to self: when backporting this to any branches, also backport SPARK-5333 
(since that fixes a bug introduced here).


was (Author: joshrosen):
Note to self: when backporting this to any branches, also backport SPARK-4014 
(since that fixes a bug introduced here).

 TaskContext.attemptId returns taskId
 

 Key: SPARK-4014
 URL: https://issues.apache.org/jira/browse/SPARK-4014
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Yin Huai
Assignee: Josh Rosen
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0


 In TaskRunner, we assign the taskId of a task to the attemptId of the 
 corresponding TaskContext. Should we rename attemptId to taskId to avoid 
 confusion?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-01-20 Thread Manoj Samel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284335#comment-14284335
 ] 

Manoj Samel commented on SPARK-2243:


Is there a target release for this?

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at 

[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions

2015-01-20 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284341#comment-14284341
 ] 

Kostas Sakellis commented on SPARK-4630:


I agree that this should be built without assuming SchemaRDD. Like Dryad and Tez 
(which is basically Dryad) we should be able to use runtime statistics (as 
opposed to metastore stats) to compute the optimal partition numbers. 

I'm no Dryad expert but simply read through their papers:
1) Dryad: http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
2) DryadLinq: 
http://research.microsoft.com/en-us/projects/DryadLINQ/DryadLINQ.pdf
3) Optimus: http://research.microsoft.com/pubs/185714/Optimus.pdf

The DryadLinq paper has a small section on dynamic partitioning. Section 4.2.2:
{quote}
Dynamic data partitioning sets the number of vertices in each stage (i.e., 
the number of partitions of each dataset) at run time based on the size of its 
input data. Traditional databases usually estimate dataset sizes statically, 
but these estimates can be very inaccurate, for example in the presence of 
correlated queries. DryadLINQ supports dynamic hash and range partitions—for 
range partitions both the number of partitions and the partitioning key 
ranges are determined at run time by sampling the input dataset.
{quote}

The Optimus paper talks about more optimizations they did in their system that 
runs on top of Dryad. There are a lot of optimizations but dynamic partitioning 
is talked about in Section 3.1. They describe creating a set of sampled 
histograms, one for each dependent partition, and then depending on the 
operation choose a combining strategy for the statistics. For example, for 
joins they do the product of the histograms. Using the stats from the 
histograms they determine how many vertices (partitions) to add to the graph 
processing. The paper they reference for creating the sampling histogram is 
http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Histogram/1998-Chaudhuri-Histo.pdf
 - I haven't read it yet. They don't really get into how they bootstrap this - 
sampling the original datasources stored in the filesystem.

From what I can tell, [~lianhuiwang]'s patch assumes that all records are the 
same size since it solely looks at the map status and hadoop input sizes. I 
don't think this is good enough to make intelligent decisions as you also need 
to look at the record sizes to be able to prevent skew.

The partial DAG execution described in the Shark paper is similar to what Dryad 
does. [~rxin], why was this not pushed down to core Spark? Partial DAG 
execution could allow us to have a number of runtime optimizations that are 
currently not possible. 

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions and the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase performance of jobs, users often hand-optimize the number (size) 
 of partitions that the next stage gets. Factors that come into play are:
 Incoming partition sizes from previous stage
 number of available executors
 available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to automatically do the 
 partition sizing for the user. This feature can be turned off/on with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take into 
 account partition sizes upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number of tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.
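 A back-of-the-envelope sketch of the sizing decision described above; the target 
 partition size and the per-map output statistics are assumptions for illustration, 
 not the actual DAGScheduler API:
 {code}
 // Given the bytes produced by each map task of the previous stage, pick a
 // task count for the next stage so that partitions land near a target size.
 def choosePartitionCount(mapOutputBytes: Seq[Long],
                          targetPartitionBytes: Long = 128L * 1024 * 1024,
                          maxPartitions: Int = 10000): Int = {
   val totalBytes = mapOutputBytes.sum
   val byData = math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt
   math.min(math.max(byData, 1), maxPartitions)
 }

 // e.g. 50 map tasks of ~1 GB each => roughly 400 reduce partitions of ~128 MB:
 // choosePartitionCount(Seq.fill(50)(1024L * 1024 * 1024))
 {code}
 As the comment above notes, a refinement based only on byte counts still assumes 
 uniform record sizes; sampled per-record statistics would be needed to handle skew.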



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5019:
-
Fix Version/s: 1.3.0

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Travis Galoppo
Priority: Minor
 Fix For: 1.3.0


 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.
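 A sketch of the shape the exposed model could take; the class and field names 
 here are guesses for illustration, not the final MLlib API:
 {code}
 // Illustrative only: expose full distributions rather than parallel arrays of
 // means and covariance matrices.
 case class MultivariateGaussianSketch(mu: Array[Double], sigma: Array[Array[Double]])

 class GaussianMixtureModelSketch(
     val weights: Array[Double],
     val gaussians: Array[MultivariateGaussianSketch]) {
   require(weights.length == gaussians.length)
   def k: Int = weights.length
 }
 {code}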



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5019:
-
Priority: Minor  (was: Blocker)

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor
 Fix For: 1.3.0


 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5019:
-
Assignee: Travis Galoppo

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Travis Galoppo
Priority: Minor
 Fix For: 1.3.0


 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5019.
--
Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/4088

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Travis Galoppo
Priority: Minor
 Fix For: 1.3.0


 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5186.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3997
[https://github.com/apache/spark/pull/3997]

 Vector.equals  and Vector.hashCode are very inefficient and fail on 
 SparseVectors with large size
 -

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
 Fix For: 1.3.0

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.
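 A rough sketch of the kind of sparse-aware comparison being asked for; this is 
 not the MLlib implementation, and the index/value parameters are illustrative:
 {code}
 // Compare two sparse vectors by walking only their non-zero entries.
 def sparseEquals(size1: Int, idx1: Array[Int], vals1: Array[Double],
                  size2: Int, idx2: Array[Int], vals2: Array[Double]): Boolean = {
   if (size1 != size2) return false
   var i = 0
   var j = 0
   while (i < idx1.length && j < idx2.length) {
     // skip explicitly stored zeros so that equivalent vectors compare equal
     if (vals1(i) == 0.0) { i += 1 }
     else if (vals2(j) == 0.0) { j += 1 }
     else if (idx1(i) != idx2(j) || vals1(i) != vals2(j)) return false
     else { i += 1; j += 1 }
   }
   vals1.drop(i).forall(_ == 0.0) && vals2.drop(j).forall(_ == 0.0)
 }
 {code}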



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284908#comment-14284908
 ] 

Hari Shreedharan commented on SPARK-5342:
-

[~pwendell], [~tgraves], [~vanzin], [~andrewor14] - Please take a look.

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, spark streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated SPARK-5342:

Attachment: SparkYARN.pdf

Attached a design doc with the proposed design. Original design doc with comment access: 
https://docs.google.com/document/d/1ECBZTprOEHPueXcG-w3GibpoWgLccHJwU62pNxYM5oU/edit?usp=sharing

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, spark streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4923:
---
Summary: Add Developer API to REPL to allow re-publishing the REPL jar  
(was: Maven build should keep publishing spark-repl)

 Add Developer API to REPL to allow re-publishing the REPL jar
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment has been discontinued (see 
 SPARK-3452), but it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4923:
---
Assignee: Chip Senkbeil

 Add Developer API to REPL to allow re-publishing the REPL jar
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Assignee: Chip Senkbeil
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment has been discontinued (see 
 SPARK-3452), but it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5323) Row shouldn't extend Seq

2015-01-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5323.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4115
[https://github.com/apache/spark/pull/4115]

 Row shouldn't extend Seq
 

 Key: SPARK-5323
 URL: https://issues.apache.org/jira/browse/SPARK-5323
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.3.0


 Extending Seq comes at a huge cost:
 1. Bytecode bloat (the Row constructor now has to make about 20 static calls 
 to the init methods of various constructors).
 2. Documentation bloat (hundreds of added methods, most of them irrelevant).
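 A sketch of the alternative, in which Row exposes a conversion rather than 
 inheriting the whole Seq surface (names are illustrative, not the actual Row trait):
 {code}
 // Instead of Row extends Seq[Any], keep the interface narrow and convert on demand.
 trait RowSketch {
   def length: Int
   def apply(i: Int): Any
   def toSeq: Seq[Any] = (0 until length).map(apply)
 }
 {code}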



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5289:
---
Fix Version/s: 1.2.1

 Backport publishing of repl, yarn into branch-1.2
 -

 Key: SPARK-5289
 URL: https://issues.apache.org/jira/browse/SPARK-5289
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.1


 In SPARK-3452 we did some clean-up of published artifacts that turned out to 
 adversely affect some users. This has been mostly patched up in master via 
 SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
 modules, they were fixed in SPARK-4048 as part of a larger change that only 
 went into master.
 Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 
 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2015-01-20 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284470#comment-14284470
 ] 

Stephen Boesch commented on SPARK-4259:
---

Xiangrui has provided valuable feedback. His latest recommendation points out 
that the Gaussian similarities will result in a small proportion of the input 
vertices having non-zero (or nearly zero) values. That ratio may then represent 
the out-degree of each vertex of the graph. The graph edges will represent the 
sparse (non-zero) matrix entries of the normalized affinity matrix W, i.e. the 
W_ij that have non-zero entries. The algorithm thus bears similarities to 
PageRank.

We are using the Power Iteration Clustering algorithm. In each iteration of the 
PIC the components of the estimated Eigenvector - represented by vertices in 
the Graph - are updated via Graph.aggregateMessages execution. 

Further input from Xiangrui: 

The graph is sparse, we don’t need to store edges with 0 similarity. We can 
assume that the average degree is D and then the number of edges is D N, where 
N is the number of vertices. It should be much less than N^2.
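
For concreteness, a minimal sketch of one such power-iteration update on a GraphX 
graph whose edge attributes hold the affinities w_ij (stored as directed edges in 
both directions) and whose vertex attributes hold the current eigenvector estimate. 
This is an illustration of the step described above, not the code under development:
{code}
import org.apache.spark.graphx._

// One PIC step: v_{t+1}(i) = sum_j w_ij * v_t(j), then L1-normalize.
def picStep(g: Graph[Double, Double]): Graph[Double, Double] = {
  val unnormalized: VertexRDD[Double] = g.aggregateMessages[Double](
    ctx => ctx.sendToSrc(ctx.attr * ctx.dstAttr),  // w_ij * v_t(j)
    _ + _                                          // sum contributions per vertex
  )
  val norm = unnormalized.map { case (_, v) => math.abs(v) }.reduce(_ + _)
  g.outerJoinVertices(unnormalized) { (_, old, updated) =>
    updated.map(_ / norm).getOrElse(old)
  }
}
{code}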



 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix by Gaussian similarity 
 function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct: 
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u1, . . . , uk of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u1, . . . , uk as columns.
 • For i = 1, . . . , n, let yi ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (yi)i=1,...,n in R^k with the k-means algorithm into 
 clusters C1, . . . , Ck.
 Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }.
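 A tiny sketch of the Gaussian similarity referred to in the design above; sigma 
 is a free bandwidth parameter and is an assumption of this illustration:
 {code}
 // s_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
 def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
   require(x.length == y.length)
   val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
   math.exp(-sqDist / (2.0 * sigma * sigma))
 }
 {code}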



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit

2015-01-20 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-5341:
--

 Summary: Support maven coordinates in spark-shell and spark-submit
 Key: SPARK-5341
 URL: https://issues.apache.org/jira/browse/SPARK-5341
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Shell
Reporter: Burak Yavuz


This feature will allow users to provide the maven coordinates of jars they 
wish to use in their spark application. Coordinates can be a comma-delimited 
list and be supplied like:
```spark-submit --maven org.apache.example.a,org.apache.example.b```
This feature will also be added to spark-shell (where it is more critical to 
have this feature)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5186:
-
Assignee: yuhao yang

 Vector.equals  and Vector.hashCode are very inefficient and fail on 
 SparseVectors with large size
 -

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Assignee: yuhao yang
 Fix For: 1.3.0

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5287) Add defaultSizeOf to every data type

2015-01-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5287.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Add defaultSizeOf to every data type
 

 Key: SPARK-5287
 URL: https://issues.apache.org/jira/browse/SPARK-5287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.3.0


 Right now, in NativeType, we defined some defaultSizes (it is actually 
 missing some types) and for complex types, we calculate the default size at 
 the place where we use the default size. We should add defaultSize to every 
 data type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type

2015-01-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5287:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5166

 Add defaultSizeOf to every data type
 

 Key: SPARK-5287
 URL: https://issues.apache.org/jira/browse/SPARK-5287
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.3.0


 Right now, in NativeType, we defined some defaultSizes (it is actually 
 missing some types) and for complex types, we calculate the default size at 
 the place where we use the default size. We should add defaultSize to every 
 data type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type

2015-01-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5287:
---
Assignee: Yin Huai

 Add defaultSizeOf to every data type
 

 Key: SPARK-5287
 URL: https://issues.apache.org/jira/browse/SPARK-5287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.3.0


 Right now, in NativeType, we defined some defaultSizes (it is actually 
 missing some types) and for complex types, we calculate the default size at 
 the place where we use the default size. We should add defaultSize to every 
 data type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284914#comment-14284914
 ] 

Hari Shreedharan commented on SPARK-5342:
-

Thanks [~adhoot] for helping with investigating the solution on the YARN side.

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, spark streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284915#comment-14284915
 ] 

Apache Spark commented on SPARK-5135:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4127

 Add support for describe [extended] table to DDL in SQLContext
 --

 Key: SPARK-5135
 URL: https://issues.apache.org/jira/browse/SPARK-5135
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 72h
  Remaining Estimate: 72h

 Support Describe Table Command.
 describe [extended] tableName.
 This also supports external datasource tables.
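 Usage would look roughly like the following from a SQLContext, assuming a table 
 called logs has been registered; the exact output columns are part of the design 
 and are only indicated in the comment:
 {code}
 // assuming a table called "logs" has already been registered in the SQLContext
 val rows = sqlContext.sql("DESCRIBE EXTENDED logs").collect()
 rows.foreach(println)   // e.g. column name, data type, comment, plus extended details
 {code}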



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2015-01-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-4296.
---
Resolution: Fixed

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Shixiong Zhu
Assignee: Cheng Lian
Priority: Blocker

 When the input data has a complex structure, using the same expression in the 
 group by clause and the select clause will throw Expression not in GROUP BY.
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person(John, Birthday(1990-01-22)), 
 Person(Jim, Birthday(1980-02-28
 people.registerTempTable(people)
 val year = sqlContext.sql(select count(*), upper(birthday.date) from people 
 group by upper(birthday.date))
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is the equality test for `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias 
 expressions.
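 As a sketch of the comparison being suggested, using a toy expression tree rather 
 than the actual Catalyst classes:
 {code}
 // Toy expression tree to illustrate alias-insensitive comparison.
 sealed trait Expr
 case class Attr(name: String) extends Expr
 case class Upper(child: Expr) extends Expr
 case class Alias(child: Expr, name: String) extends Expr

 // Strip aliases before testing equality, so Upper(a) matches Upper(a) AS c1.
 def stripAlias(e: Expr): Expr = e match {
   case Alias(child, _) => stripAlias(child)
   case Upper(child)    => Upper(stripAlias(child))
   case other           => other
 }

 def semanticallyEqual(a: Expr, b: Expr): Boolean = stripAlias(a) == stripAlias(b)

 // semanticallyEqual(Upper(Attr("birthday.date")), Alias(Upper(Attr("birthday.date")), "c1"))  // true
 {code}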



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm

2015-01-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284584#comment-14284584
 ] 

Xiangrui Meng commented on SPARK-3439:
--

[~angellandros] Are you interested in contributing canopy clustering to MLlib? 
It would be nice if you could describe the proposed API first (input type, output 
type, and parameters) and the complexity.

[~yuu.ishik...@gmail.com] I've assigned this ticket to [~angellandros]. Please 
let me know if you are working on it.

 Add Canopy Clustering Algorithm
 ---

 Key: SPARK-3439
 URL: https://issues.apache.org/jira/browse/SPARK-3439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
 It is often used as a preprocessing step for the K-means algorithm or the 
 Hierarchical clustering algorithm. It is intended to speed up clustering 
 operations on large data sets, where using another algorithm directly may be 
 impractical due to the size of the data set.
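 For illustration, a minimal single-machine sketch of the canopy assignment step 
 with loose/tight thresholds T1 > T2; the distance function and thresholds are 
 assumptions of this sketch, not a proposed MLlib API:
 {code}
 // Classic canopy construction: points within T1 of a canopy center join that
 // canopy; points within T2 are removed from further consideration as centers.
 def canopies(points: Seq[Array[Double]], t1: Double, t2: Double,
              dist: (Array[Double], Array[Double]) => Double): Seq[(Array[Double], Seq[Array[Double]])] = {
   require(t1 > t2)
   var remaining = points
   val result = scala.collection.mutable.ArrayBuffer.empty[(Array[Double], Seq[Array[Double]])]
   while (remaining.nonEmpty) {
     val center = remaining.head
     val members = remaining.filter(p => dist(center, p) < t1)
     result += ((center, members))
     remaining = remaining.filterNot(p => dist(center, p) < t2)
   }
   result.toSeq
 }
 {code}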



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3439) Add Canopy Clustering Algorithm

2015-01-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3439:
-
Assignee: Muhammad-Ali A'rabi

 Add Canopy Clustering Algorithm
 ---

 Key: SPARK-3439
 URL: https://issues.apache.org/jira/browse/SPARK-3439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Assignee: Muhammad-Ali A'rabi
Priority: Minor

 The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
 It is often used as a preprocessing step for the K-means algorithm or the 
 Hierarchical clustering algorithm. It is intended to speed up clustering 
 operations on large data sets, where using another algorithm directly may be 
 impractical due to the size of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-5342:
---

 Summary: Allow long running Spark apps to run on secure YARN/HDFS
 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan


Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
their expiry, which maxes out at 7 days. We must find a way to ensure that we 
can run applications for longer - for example, spark streaming apps are 
expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284908#comment-14284908
 ] 

Hari Shreedharan edited comment on SPARK-5342 at 1/21/15 12:46 AM:
---

[~pwendell], [~tgraves], [~sandyr], [~vanzin], [~andrewor14] - Please take a 
look.


was (Author: hshreedharan):
[~pwendell], [~tgraves], [~vanzin], [~andrewor14] - Please take a look.

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, spark streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5275) pyspark.streaming is not included in assembly jar

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284924#comment-14284924
 ] 

Apache Spark commented on SPARK-5275:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4128

 pyspark.streaming is not included in assembly jar
 -

 Key: SPARK-5275
 URL: https://issues.apache.org/jira/browse/SPARK-5275
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker

 The pyspark.streaming is not included in assembly jar of spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5294) Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty

2015-01-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5294:
--
Assignee: Kousuke Saruta

 Hide tables in AllStagePages for Active Stages, Completed Stages and Failed 
 Stages when they are empty
 

 Key: SPARK-5294
 URL: https://issues.apache.org/jira/browse/SPARK-5294
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.3.0


 Related to SPARK-5228, AllStagesPage should also hide the tables for 
 Active Stages, Completed Stages and Failed Stages when they are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5294) Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty

2015-01-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5294.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4083
[https://github.com/apache/spark/pull/4083]

 Hide tables in AllStagePages for Active Stages, Completed Stages and Failed 
 Stages when they are empty
 

 Key: SPARK-5294
 URL: https://issues.apache.org/jira/browse/SPARK-5294
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
 Fix For: 1.3.0


 Related to SPARK-5228, AllStagesPage should also hide the tables for 
 Active Stages, Completed Stages and Failed Stages when they are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4923.

  Resolution: Fixed
Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

I updated the title of this to reflect the work that actually happened in 
Chip's patch. And SPARK-5289 is tracking publishing of the artifacts.

 Add Developer API to REPL to allow re-publishing the REPL jar
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Assignee: Chip Senkbeil
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment has been discontinued (see 
 SPARK-3452), but it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2015-01-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284561#comment-14284561
 ] 

Xiangrui Meng commented on SPARK-4259:
--

Note: [~javadba]'s update is from an offline discussion we had. The algorithm 
we plan to implement is described in the paper Power Iteration Clustering (PIC) 
(http://www.icml2010.org/papers/387.pdf) and the notation is adapted from there.

 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix by Gaussian similarity 
 function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct: 
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u1, . . . , uk of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u1, . . . , uk as columns.
 • For i = 1, . . . , n, let yi ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (yi)i=1,...,n in R^k with the k-means algorithm into 
 clusters C1, . . . , Ck.
 Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5144) spark-yarn module should be published

2015-01-20 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284662#comment-14284662
 ] 

David McWhorter commented on SPARK-5144:


Similar problem here: building an uber-jar to submit a Spark job 
programmatically and getting Error: Could not load YARN classes. This copy of 
Spark may not have been compiled with YARN support. Downgrading to a previous 
version of Spark for now...

 spark-yarn module should be published
 -

 Key: SPARK-5144
 URL: https://issues.apache.org/jira/browse/SPARK-5144
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar

 We disabled publishing of certain modules in SPARK-3452. One such module 
 is spark-yarn. This breaks applications that submit Spark jobs 
 programmatically with master set to yarn-client, because SparkContext 
 depends on classes from the yarn-client module to submit the YARN 
 application. 
 Here is the stack trace that you get if you submit the spark job without 
 yarn-client dependency:
 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - 
 MemoryStore started with capacity 731.7 MB
 Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError
 at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
 at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105)
 at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
 at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
 at org.apache.spark.SparkContext.init(SparkContext.scala:232)
 at com.myimpl.Server:23)
 at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
 at scala.util.Try$.apply(Try.scala:191)
 at scala.util.Success.map(Try.scala:236)
 at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
 at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
 at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
 at scala.util.Try$.apply(Try.scala:191)
 at scala.util.Success.map(Try.scala:236)
 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.SparkException: Unable to load YARN support
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
 at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194)
 at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 ... 27 more
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
 ... 29 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-20 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284659#comment-14284659
 ] 

David McWhorter commented on SPARK-3452:


Same problem here -- if spark-yarn is not available, what is the correct way to 
submit YARN jobs programmatically?

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-20 Thread Ameet Talwalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284903#comment-14284903
 ] 

Ameet Talwalkar commented on SPARK-3789:


Great.  I hope this can make it into 1.3.




 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5337) respect spark.task.cpus when launch executors

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284986#comment-14284986
 ] 

Apache Spark commented on SPARK-5337:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/4129

 respect spark.task.cpus when launch executors
 -

 Key: SPARK-5337
 URL: https://issues.apache.org/jira/browse/SPARK-5337
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: WangTaoTheTonic

 In standalone mode, we do not respect spark.task.cpus when launching executors. 
 Some executors may not have enough cores to launch even a single task.
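A minimal configuration sketch (assumed standalone setup, hypothetical master URL) of the mismatch described above: with spark.task.cpus set to 2, an executor that is granted only one core can never schedule a task.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")   // hypothetical standalone master URL
  .setAppName("task-cpus-demo")
  .set("spark.task.cpus", "2")        // every task claims 2 cores
  .set("spark.cores.max", "3")        // odd total: one executor can end up with a single leftover core
val sc = new SparkContext(conf)
// The executor holding the leftover core sits idle, because no 2-core task fits on it.
{code}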



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2015-01-20 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285058#comment-14285058
 ] 

Derrick Burns commented on SPARK-2620:
--

Thanks for the info!   It would seem to me that the latter is a bug in the
Scala compiler.  Specifically, if one wanted an isInstanceOf check that
ignored the outer class, it would seem natural to encode that as:

{code}
x.isInstanceOf[a#B]
{code}






 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.0.0, 1.1.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Assignee: Tobias Schlatter
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2015-01-20 Thread Tobias Schlatter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284969#comment-14284969
 ] 

Tobias Schlatter commented on SPARK-2620:
-

I am currently looking into the various issues in the REPL. This one is caused 
by the fact that the Spark REPL (unlike the Scala REPL) uses classes instead of 
objects to wrap user code. This leads to serialized case classes having 
different outer pointers and therefore not comparing equal.

Fun fact:

Given:
{code}
class A {
  class B
}

val a = new A
{code}

{code}
x match {
  case _: a.B = true
  case _ = false
}
{code}

and

{code}
x.isInstanceOf[a.B]
{code}

are not equivalent (the former checks the outer pointer, the latter does not).
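A small self-contained sketch (my illustration of the behaviour described above, with assumed names) contrasting the two checks:

{code}
object OuterCheckDemo extends App {
  class A { class B }

  val a1 = new A
  val a2 = new A
  val b2: Any = new a2.B   // an instance of B whose outer pointer is a2

  // The pattern match against a1.B includes the outer-pointer check, so this is expected to be false.
  val viaMatch = b2 match {
    case _: a1.B => true
    case _       => false
  }

  // isInstanceOf erases to a plain class test on B, ignoring the outer instance: expected true.
  val viaIsInstanceOf = b2.isInstanceOf[a1.B]

  println(s"pattern match: $viaMatch, isInstanceOf: $viaIsInstanceOf")
}
{code}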

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.0.0, 1.1.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Assignee: Tobias Schlatter
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5343) ShortestPaths traverses backwards

2015-01-20 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5343:


 Summary: ShortestPaths traverses backwards
 Key: SPARK-5343
 URL: https://issues.apache.org/jira/browse/SPARK-5343
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
Reporter: Michael Malak


GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), 
sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))

lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, 
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 
-> 0)), (2,Map()))

lib.ShortestPaths.run(g,Array(1)).vertices.collect

res2: Array[(org.apache.spark.graphx.VertexId, 
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), 
(3,Map(1 -> 2)), (2,Map(1 -> 1)))

The following changes may be what is needed to make it traverse forward:

Change one occurrence of src to dst in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64

Change three occurrences of dst to src in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
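As a possible workaround sketch (my assumption, not something stated in this ticket), reversing the edges before running should make the current implementation follow the original edge directions:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// g is the graph built above; g.reverse flips every edge, compensating for the
// backwards traversal reported here.
val forward = ShortestPaths.run(g.reverse, Seq(3L)).vertices.collect()
{code}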




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated SPARK-5342:

Attachment: SparkYARN.pdf

Minor updates.

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, Spark Streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS

2015-01-20 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated SPARK-5342:

Attachment: (was: SparkYARN.pdf)

 Allow long running Spark apps to run on secure YARN/HDFS
 

 Key: SPARK-5342
 URL: https://issues.apache.org/jira/browse/SPARK-5342
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Hari Shreedharan
 Attachments: SparkYARN.pdf


 Currently, Spark apps cannot write to HDFS after the delegation tokens reach 
 their expiry, which maxes out at 7 days. We must find a way to ensure that we 
 can run applications for longer - for example, Spark Streaming apps are 
 expected to run forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2015-01-20 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-5331:
---
Component/s: EC2
Description: 
ps -ef | grep Tachyon 
shows Tachyon running on the master (and the slave) node with the correct setting:
-Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com

However from stderr log on worker running the SparkTachyonPi example:

15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
after 5 attempts
at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
at 
org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
at 
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
at 
org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
localhost/127.0.0.1:19998 after 5 attempts
at tachyon.master.MasterClient.connect(MasterClient.java:178)
at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException: 
java.net.ConnectException: Connection refused
at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at 
tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at tachyon.master.MasterClient.connect(MasterClient.java:156)
... 29 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
   

[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-20 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285179#comment-14285179
 ] 

Adrian Wang commented on SPARK-5262:


Currently, if you try coalesce in HiveContext, it will use the Hive UDF instead of 
the Scala built-in method.

 coalesce should allow NullType and 1 another type in parameters
 ---

 Key: SPARK-5262
 URL: https://issues.apache.org/jira/browse/SPARK-5262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 Currently Coalesce(null, 1, null) would throw exceptions.
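A minimal repro sketch (assumed setup, not from this ticket; t is a hypothetical registered table): the constant 1 should be enough to fix the result type, but the NullType literals currently make the expression fail.

{code}
// Hypothetical table "t"; the point is only the COALESCE type resolution.
val rows = sqlContext.sql("SELECT COALESCE(null, 1, null) FROM t")
rows.collect()   // currently throws instead of returning 1 for every row
{code}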



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4959:

Labels: backport-needed  (was: )

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Priority: Critical
  Labels: backport-needed

 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5257) SparseVector indices must be non-negative

2015-01-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5257.
--
Resolution: Won't Fix

[~MechCoder] I will resolve as WontFix. With ~1100 open JIRAs, unfortunately I 
don't think you can assume that a JIRA has been reviewed by someone with 
authority to commit. Almost all of them are merely submitted. If in doubt, ask 
for comments first before beginning work.

 SparseVector indices must be non-negative
 -

 Key: SPARK-5257
 URL: https://issues.apache.org/jira/browse/SPARK-5257
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Priority: Minor
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The description of SparseVector suggests only that the indices have to be 
 distinct integers.  However, the code for the constructor that takes an array 
 of (index, value) tuples assumes that the indices are non-negative.
 Either the code or the description should be changed.  
 This arose when I generated indices by hashing and converting the hash 
 values to (signed) integers.
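A small sketch (not from this ticket) of the usual workaround for hashed indices: fold the possibly negative hash code into [0, numFeatures) before building the SparseVector.

{code}
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 1 << 20
def nonNegativeIndex(term: Any): Int = {
  val h = term.hashCode % numFeatures
  if (h < 0) h + numFeatures else h   // fold negative hashes into [0, numFeatures)
}

val terms = Seq("spark", "sparse", "vector")
val counts = terms.groupBy(nonNegativeIndex).mapValues(_.size.toDouble).toSeq
val v = Vectors.sparse(numFeatures, counts)   // indices are now guaranteed non-negative
{code}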



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4959:
---
Priority: Blocker  (was: Critical)

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Priority: Blocker
  Labels: backport-needed

 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4959:
---
Assignee: Cheng Hao

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4959:
---
Fix Version/s: 1.3.0

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4959:
---
Fix Version/s: (was: 1.2.1)

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285262#comment-14285262
 ] 

Patrick Wendell commented on SPARK-4959:


Excuse my last comment, it was on the wrong JIRA.

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file

2015-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285261#comment-14285261
 ] 

Apache Spark commented on SPARK-5344:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/4132

 HistoryServer cannot recognize that inprogress file was renamed to completed 
 file
 -

 Key: SPARK-5344
 URL: https://issues.apache.org/jira/browse/SPARK-5344
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 FsHistoryProvider tries to update the application status, but if checkForLogs is 
 called before the .inprogress file is renamed to the completed file, the file is not 
 recognized as completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285258#comment-14285258
 ] 

Patrick Wendell edited comment on SPARK-4959 at 1/21/15 6:47 AM:
-

Note that in the 1.2 branch this was fixed by 
https://github.com/apache/spark/pull/3987 (per discussion with [~lian cheng]).


was (Author: pwendell):
Note that in the 1.2 branch this was fixed by 
https://github.com/apache/spark/pull/3987 (per discussion with @cheng lian).

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0, 1.2.1


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select hi from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
 "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection:_*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found: 
 casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at 

[jira] [Updated] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5021:
-
Assignee: Manoj Kumar

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5276) pyspark.streaming is not included in assembly jar

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5276.

Resolution: Duplicate

 pyspark.streaming is not included in assembly jar
 -

 Key: SPARK-5276
 URL: https://issues.apache.org/jira/browse/SPARK-5276
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker

 The pyspark.streaming is not included in assembly jar of spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285187#comment-14285187
 ] 

Yin Huai commented on SPARK-5262:
-

OK, I see. In HiveContext, we are still using Hive's UDF. Actually, it would be 
good to do the work of this JIRA and SPARK-5244 together. 

 coalesce should allow NullType and 1 another type in parameters
 ---

 Key: SPARK-5262
 URL: https://issues.apache.org/jira/browse/SPARK-5262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 Currently, Coalesce(null, 1, null) throws an exception instead of resolving to 
 the type of the non-null argument.
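 To make the report concrete, a sketch of the inputs in question at the 
 Catalyst expression level (internal API, shown for illustration only; the 
 exact failure point may differ between SQLContext and HiveContext, per the 
 comment above):
 {code}
 // The two NULL literals carry NullType while 1 carries IntegerType.
 // Per this report, mixing NullType with one other type makes Coalesce
 // throw instead of resolving to the non-null argument's type.
 import org.apache.spark.sql.catalyst.expressions.{Coalesce, Literal}

 val expr = Coalesce(Seq(Literal(null), Literal(1), Literal(null)))
 // desired: expr resolves with dataType == IntegerType
 {code}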



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285258#comment-14285258
 ] 

Patrick Wendell commented on SPARK-4959:


Note that in the 1.2 branch this was fixed by 
https://github.com/apache/spark/pull/3987 (per discussion with @cheng lian).

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0, 1.2.1


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select 'hi' from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) ::
   "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection: _*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found:
 // casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}
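 A toy sketch of why the attribute-map suggestion above sidesteps the problem: 
 a lookup keyed by a stable id is unaffected by how the column name was 
 spelled, while a lookup keyed by the name string is not (Attr is a stand-in 
 here, not Catalyst's AttributeReference):
 {code}
 case class Attr(name: String, exprId: Long)

 val col = Attr("CaseSensitiveColName", 23046L)

 // Keyed by name: a lower-cased reference misses the entry.
 val byName = Map(col.name -> col)
 byName.get("casesensitivecolname")   // None -> "key not found"

 // Keyed by id: however the reference was spelled, resolution has already
 // mapped it to the same exprId, so the lookup succeeds.
 val byId = Map(col.exprId -> col)
 byId.get(col.exprId)                 // Some(Attr(CaseSensitiveColName,23046))
 {code}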



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file

2015-01-20 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-5344:
--
Description: FsHistoryProvider tries to update application status but if 
checkForLogs is called before .inprogress file is renamed to completed file, 
the file is not recognized as completed.  (was: FsHistoryProvider tries to 
updates application status but if checkForLogs is called before .inprogress 
file is renamed to completed file, the file is not recognized as completed.)

 HistoryServer cannot recognize that inprogress file was renamed to completed 
 file
 -

 Key: SPARK-5344
 URL: https://issues.apache.org/jira/browse/SPARK-5344
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 FsHistoryProvider tries to update application status, but if checkForLogs is 
 called before the .inprogress file is renamed to the completed file, the file 
 is not recognized as completed.
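 One hypothetical way a scan can keep missing such a rename (illustrative 
 only; the real FsHistoryProvider logic may differ): if newness is judged by 
 modification time, renaming app-1234.inprogress to app-1234 usually leaves 
 the modification time unchanged, so later scans see nothing new to pick up.
 {code}
 // Hypothetical sketch, not FsHistoryProvider code.
 final case class LogFile(name: String, mtime: Long)

 def newSince(lastScanMTime: Long, files: Seq[LogFile]): Seq[LogFile] =
   files.filter(_.mtime > lastScanMTime)   // a rename alone won't pass this

 // scan 1: sees LogFile("app-1234.inprogress", 1000L); lastScanMTime = 1000
 // rename: app-1234.inprogress -> app-1234, mtime still 1000
 // scan 2: newSince(1000, Seq(LogFile("app-1234", 1000L))) == Nil
 {code}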



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4959:
---
Fix Version/s: 1.2.1

 Attributes are case sensitive when using a select query from a projection
 -

 Key: SPARK-4959
 URL: https://issues.apache.org/jira/browse/SPARK-4959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Andy Konwinski
Assignee: Cheng Hao
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0, 1.2.1


 Per [~marmbrus], see this line of code, where we should be using an attribute 
 map
  
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
 To reproduce, I ran the following in the Spark shell:
 {code}
 import sqlContext._
 sql("drop table if exists test")
 sql("create table test (col1 string)")
 sql("insert into table test select 'hi' from prejoined limit 1")
 val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) ::
   "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
 sqlContext.table("test").select(projection: _*).registerTempTable("test2")
 // This succeeds.
 sql("select CaseSensitiveColName from test2").first()
 // This fails with java.util.NoSuchElementException: key not found:
 // casesensitivecolname#23046
 sql("select casesensitivecolname from test2").first()
 {code}
 The full stack trace printed for the final command that is failing: 
 {code}
 java.util.NoSuchElementException: key not found: casesensitivecolname#23046
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file

2015-01-20 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-5344:
--
Description: FsHistoryProvider tries to updates application status but if 
checkForLogs is called before .inprogress file is renamed to completed file, 
the file is not recognized as completed.  (was: FsHistoryProvider, tries to 
updates application status but if checkForLogs is called before .inprogress 
file is renamed to completed file, the file is not recognized as completed.)

 HistoryServer cannot recognize that inprogress file was renamed to completed 
 file
 -

 Key: SPARK-5344
 URL: https://issues.apache.org/jira/browse/SPARK-5344
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 FsHistoryProvider tries to update application status, but if checkForLogs is 
 called before the .inprogress file is renamed to the completed file, the file 
 is not recognized as completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


