[jira] [Commented] (SPARK-3276) Provide an API to specify whether the old files need to be ignored in file input text DStream
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283662#comment-14283662 ] Jack Hu commented on SPARK-3276: In some cases, the old files (older than the current Spark system time) are needed: if you have a fixed list in HDFS that you want to correlate with the input stream, you need to load it from the file system. As for the newFilesOnly option, it breaks on Spark 1.2 (it works on 1.1). Provide an API to specify whether the old files need to be ignored in file input text DStream Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently, StreamingContext offers only one API, textFileStream, to create a text file DStream, and it always ignores old files. Sometimes the old files are still useful, so we need an API that lets the user choose whether old files are ignored or not. The API currently in StreamingContext: {code} def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
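For illustration, here is a minimal sketch of the kind of overload being requested, assuming the existing fileStream(directory, filter, newFilesOnly) variant of StreamingContext; the method name and parameter below are hypothetical, not part of the current API:

{code}
// Hypothetical overload: let the caller decide whether files older than the
// streaming context's start time are processed (imports as in StreamingContext).
def textFileStream(directory: String, newFilesOnly: Boolean): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](
    directory,
    (path: Path) => true,  // accept every file name
    newFilesOnly           // false => also pick up pre-existing ("old") files
  ).map(_._2.toString)
}
{code}

Called with newFilesOnly = false, this would restore the 1.1-style behaviour of also processing files already present in the directory.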
[jira] [Created] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)
Kevin (Sangwoo) Kim created SPARK-5334: -- Summary: NullPointerException when getting files from S3 (hadoop 2.3+) Key: SPARK-5334 URL: https://issues.apache.org/jira/browse/SPARK-5334 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: Spark 1.2 built with Hadoop 2.3+ Reporter: Kevin (Sangwoo) Kim In Spark 1.2 built with Hadoop 2.3+, unable to get files from AWS S3. Same codes works well with same setup in Spark built with Hadoop 2.2-. I saw that jets3t version changed in profile with Hadoop 2.3+, I guess there might be an issue with it. === scala sc.textFile(s3n://logs/log.2014-12-05.gz).count 15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with curMem=0, maxMem=27783541555 15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 102.1 KB, free 25.9 GB) java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
[jira] [Created] (SPARK-5332) Efficient way to deal with ExecutorLost
Liang-Chi Hsieh created SPARK-5332: -- Summary: Efficient way to deal with ExecutorLost Key: SPARK-5332 URL: https://issues.apache.org/jira/browse/SPARK-5332 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Currently, the handling of a lost executor in DAGScheduler (handleExecutorLost) does not look efficient. This PR adds a bit of extra information to the Stage class to improve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5333: Summary: [Mesos] MesosTaskLaunchData occurs BufferUnderflowException (was: [Mesos]MesosTaskLaunchData occurs BufferUnderflowException) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Priority: Blocker MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a task, because serializedTask.remaining is 0. {code} Exception in thread "Thread-6" java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This happens because MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data into it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
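For context, a minimal sketch of the kind of fix the description points at; the surrounding details (an attemptNumber field, the protobuf ByteString wrapper) are assumptions here, not taken from the actual patch:

{code}
// Sketch only: rewind the buffer before wrapping it, so a reader created from the
// resulting ByteString starts at position 0 instead of seeing remaining() == 0.
def toByteString: ByteString = {
  val dataBuffer = ByteBuffer.allocate(4 + serializedTask.limit)
  dataBuffer.putInt(attemptNumber)  // assumed extra field serialized with the task
  dataBuffer.put(serializedTask)
  dataBuffer.rewind()               // the missing step described above
  ByteString.copyFrom(dataBuffer)
}
{code}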
[jira] [Resolved] (SPARK-4803) Duplicate RegisterReceiver messages sent from ReceiverSupervisor
[ https://issues.apache.org/jira/browse/SPARK-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4803. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Duplicate RegisterReceiver messages sent from ReceiverSupervisor Key: SPARK-4803 URL: https://issues.apache.org/jira/browse/SPARK-4803 Project: Spark Issue Type: Bug Components: Streaming Reporter: Ilayaperumal Gopinathan Priority: Trivial Fix For: 1.3.0, 1.2.1 The ReceiverTracker receives `RegisterReceiver` messages two times: 1) when the actor at `ReceiverSupervisorImpl`'s preStart is invoked, and 2) after the receiver is started at the executor, in `onReceiverStart()` at `ReceiverSupervisorImpl`. Though the 'RegisterReceiver' message uses the same streamId and the receiverInfo gets updated every time the message is processed at the `ReceiverTracker`, it makes sense to register the receiver only after it is started. Or am I missing something here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist
[ https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283732#comment-14283732 ] Apache Spark commented on SPARK-5311: - User 'ganonp' has created a pull request for this issue: https://github.com/apache/spark/pull/4120 EventLoggingListener throws exception if log directory does not exist - Key: SPARK-5311 URL: https://issues.apache.org/jira/browse/SPARK-5311 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Josh Rosen Priority: Blocker If the log directory does not exist, EventLoggingListener throws an IllegalArgumentException. Here's a simple reproduction (using the master branch (1.3.0)): {code} ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir {code} where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. This results in the following exception: {code} 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' on port 62729. 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 4041. 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at http://joshs-mbp.att.net:4041 15/01/18 17:10:45 INFO Executor: Using REPL class URI: http://192.168.1.248:62726 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730) 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does not exist. 
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90) at org.apache.spark.SparkContext.init(SparkContext.scala:363) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at
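Until EventLoggingListener handles a missing directory more gracefully, a minimal pre-flight sketch (assuming a Hadoop FileSystem is reachable with the default configuration; this is not the fix in the linked pull request) is to create the event-log directory before starting the application:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Directory taken from spark.eventLog.dir; create it up front so
// EventLoggingListener.start() does not throw IllegalArgumentException.
val logDir = new Path("/tmp/nonexistent-dir")
val fs: FileSystem = logDir.getFileSystem(new Configuration())
if (!fs.exists(logDir)) {
  fs.mkdirs(logDir)
}
{code}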
[jira] [Commented] (SPARK-5332) Efficient way to deal with ExecutorLost
[ https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283648#comment-14283648 ] Apache Spark commented on SPARK-5332: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4118 Efficient way to deal with ExecutorLost --- Key: SPARK-5332 URL: https://issues.apache.org/jira/browse/SPARK-5332 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Currently, the handling of a lost executor in DAGScheduler (handleExecutorLost) does not look efficient. This PR adds a bit of extra information to the Stage class to improve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5333) [Mesos]MesosTaskLaunchData occurs BufferUnderflowException
Jongyoul Lee created SPARK-5333: --- Summary: [Mesos]MesosTaskLaunchData occurs BufferUnderflowException Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Priority: Blocker MesosTaskLaunchData throws an exception when MesosExecutorBackend launches a task, because serializedTask.remaining is 0. {code} Exception in thread "Thread-6" java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This happens because MesosTaskLaunchData.toByteString doesn't rewind the byteBuffer after putting data into it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283669#comment-14283669 ] Apache Spark commented on SPARK-5333: - User 'jongyoul' has created a pull request for this issue: https://github.com/apache/spark/pull/4119 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Priority: Blocker MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5333: Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Priority: Blocker MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283809#comment-14283809 ] Lianhui Wang edited comment on SPARK-4630 at 1/20/15 1:28 PM: -- I think it is better that we use stage's output size to decide number of next stage's partiton. Because RDD is one of stage's operator and number of partition is only related to shuffle,i think stage's statistics is better than RDD. also in SQL,there are many optimizations to choose different physical plan based on statistics. example: hash join or sort merge join.but this is another thing. [~sandyr] what you said before depends on number of partitions in Map Writer is very large, so reducer can fetch data using range partition. if initial number of parititons is small, we need to repartition data and it is very expensive to scan data twice. I donot know is there a better way in this situation. so I currently use input size of parent stage to determine on number of a stage's partition. was (Author: lianhuiwang): I think it is better that we use stage's output size to decide number of next stage's partiton. Because RDD is one of stage's operator and number of partition is only related to shuffle,i think stage's statistics is better than RDD. also in SQL,there are many optimizations to choose different physical plan based on statistics. example: hash join or sort merge join.but this is another thing. [~sandyr] what you said before depends on number of partitions in Map Writer is very large, so reducer can fetch data using range partition. if initial number of parititons is small, we need to repartition data and it is very expensive to scan data twice. I donot know is there a better way in this situation. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions to the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large partitions should be that get executed by a task. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. To increase performance of jobs, users often hand optimize the number(size) of partitions that the next stage gets. Factors that come into play are: Incoming partition sizes from previous stage number of available executors available memory per executor (taking into account spark.shuffle.memoryFraction) Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. 
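To make the size-based proposal concrete, here is an illustrative heuristic only (not Spark code; the target size and clamp are assumed): derive the next stage's partition count from the finished parent stage's map output size.

{code}
// Illustrative only: aim for roughly 128 MB per partition, clamped to a sane range.
def suggestNumPartitions(
    parentOutputBytes: Long,
    targetPartitionBytes: Long = 128L * 1024 * 1024,
    maxPartitions: Int = 10000): Int = {
  val wanted = math.ceil(parentOutputBytes.toDouble / targetPartitionBytes).toInt
  math.min(math.max(wanted, 1), maxPartitions)
}
{code}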
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4017) Progress bar in console
[ https://issues.apache.org/jira/browse/SPARK-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283848#comment-14283848 ] Paul Wolfe commented on SPARK-4017: --- Hello, I was wondering if there is a way to turn this feature off? It clutters log files in Java Spark applications. Progress bar in console --- Key: SPARK-4017 URL: https://issues.apache.org/jira/browse/SPARK-4017 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 It would be nice to have a progress bar in the console; then we could change the default logging level to WARN. The progress bar should be on one line, and could also be shown in the terminal title. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283809#comment-14283809 ] Lianhui Wang commented on SPARK-4630: - I think it is better to use a stage's output size to decide the number of partitions for the next stage. Because an RDD is just one of a stage's operators and the number of partitions is only related to the shuffle, I think stage-level statistics are better than RDD-level ones. Also, in SQL there are many optimizations that choose a different physical plan based on statistics, for example hash join versus sort-merge join, but that is another topic. [~sandyr] What you said before depends on the number of partitions in the map writer being very large, so that the reducer can fetch data using a range partition. If the initial number of partitions is small, we need to repartition the data, and it is very expensive to scan the data twice. I don't know whether there is a better way in this situation. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, task execution slows down due to GC pressure and spilling to disk. To increase job performance, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5328) Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit
[ https://issues.apache.org/jira/browse/SPARK-5328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283887#comment-14283887 ] RJ Nowling commented on SPARK-5328: --- The Python API for Naive Bayes is located in python/pyspark/mllib/classification.py . The Python implementation calls the Scala implementation for training through the interface in mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala . The classes in classification.py will need to be updated (with additional pydoc tests), a new method will need to be added to PythonMLLibAPI.scala, and the Python portion of docs/mllib-naive-bayes.md will need to be updated. Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit -- Key: SPARK-5328 URL: https://issues.apache.org/jira/browse/SPARK-5328 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Leah McGuire Priority: Minor Labels: mllib [SPARK-4894] Adds Bernoulli-variant of Naive Bayes adds Bernoulli fitting to NaiveBayes.scala need to update python API to accept model type parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
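As a point of reference, the Scala-side capability the Python wrapper needs to expose looks roughly like the call below; the training RDD name is a placeholder, and the exact bridge method to be added to PythonMLLibAPI.scala is not shown:

{code}
import org.apache.spark.mllib.classification.NaiveBayes
// `training` is an assumed RDD[LabeledPoint]; SPARK-4894 adds the model-type argument
// so that "bernoulli" can be requested in addition to the default "multinomial".
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "bernoulli")
{code}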
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283941#comment-14283941 ] Sean Owen commented on SPARK-4442: -- [~matthewcornell] Normally I'd say you don't need to build any JARs yourself, and shouldn't bother manually managing JARs; just use Maven or SBT and write in the dependencies you want. But I see that Spark doesn't actually publish test artifacts. (Which to be fair would be unusual. But [~joshrosen] is that not the simplest way to expose this?). You can mvn package as shown on the Building Spark documentation, and you'll end up with a bunch of artifacts in core/target, including the test JAR file containing Spark's test code and thus any utility code you want from there. Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4017) Progress bar in console
[ https://issues.apache.org/jira/browse/SPARK-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284014#comment-14284014 ] Davies Liu commented on SPARK-4017: --- It can be turned off with spark.ui.showConsoleProgress = false. BTW, what are your log4j configs? It should be turned off automatically if the logging level is INFO or below. Progress bar in console --- Key: SPARK-4017 URL: https://issues.apache.org/jira/browse/SPARK-4017 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 It would be nice to have a progress bar in the console; then we could change the default logging level to WARN. The progress bar should be on one line, and could also be shown in the terminal title. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
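For anyone hitting the same log clutter, a minimal sketch of setting that property programmatically (the same value can also be passed with --conf spark.ui.showConsoleProgress=false on the command line; the application name here is a placeholder):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.ui.showConsoleProgress", "false")  // disable the console progress bar
val sc = new SparkContext(conf)
{code}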
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283993#comment-14283993 ] Matthew Cornell commented on SPARK-4442: [~srowen] Thanks for the tip. I tried compiling 1.2.0 using this command: {code} $ mvn package -DskipTests {code} But I could not find 'LocalSparkContext' in any jar: {code} $ find . -iname '*.jar' | xargs grep -i 'LocalSparkContext' {code} I'm recompiling without -DskipTests (it's taking a while) - would that cause anything to be added? Once the build is done I'll paste the output. Until then - am I missing something that would cause the tests to be excluded? Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5336) spark.executor.cores must not be less than spark.task.cpus
WangTaoTheTonic created SPARK-5336: -- Summary: spark.executor.cores must not be less than spark.task.cpus Key: SPARK-5336 URL: https://issues.apache.org/jira/browse/SPARK-5336 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic If a user sets spark.executor.cores to be less than spark.task.cpus, the task scheduler will fall into an infinite loop, so we should throw an exception in that case. In standalone and Mesos mode we should respect spark.task.cpus too, and I will file another JIRA to solve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
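For illustration, a minimal sketch of the fail-fast check being proposed (assuming a SparkConf in scope as conf; this is not the submitted patch):

{code}
// Validate the relationship up front instead of letting the scheduler loop forever.
val executorCores = conf.getInt("spark.executor.cores", 1)
val taskCpus = conf.getInt("spark.task.cpus", 1)
require(executorCores >= taskCpus,
  s"spark.executor.cores ($executorCores) must not be less than spark.task.cpus ($taskCpus)")
{code}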
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284054#comment-14284054 ] Sean Owen commented on SPARK-4442: -- Hm, no, it works for me. Maybe {{mvn -DskipTests install}} the entire project first? Although I wouldn't think that's necessary. Also I'm working off {{master}}, although again it should be the same from any release. Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5336) spark.executor.cores must not be less than spark.task.cpus
[ https://issues.apache.org/jira/browse/SPARK-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284067#comment-14284067 ] Apache Spark commented on SPARK-5336: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/4123 spark.executor.cores must not be less than spark.task.cpus -- Key: SPARK-5336 URL: https://issues.apache.org/jira/browse/SPARK-5336 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic If user set spark.executor.cores to be less than spark.task.cpus, task scheduler will fall in infinite loop, we should throw an exception.in that case. In standalone and mesos mode, we should respect spark.task.cpus too, and I will file another JIRA to solve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284069#comment-14284069 ] Matthew Cornell commented on SPARK-4442: Thanks for sticking with me on this, Sean! I tried again from scratch with no luck. Maybe the downloaded sources have something crucial missing from master? Here's what I did: # start with extracting http://apache.spinellicreations.com/spark/spark-1.2.0/spark-1.2.0.tgz $ cd /Users/cornell/Downloads/spark-1.2.0/ $ mvn -DskipTests install $ cd core $ mvn jar:test-jar - same warning: [WARNING] JAR will be empty - no content was marked for inclusion! Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284013#comment-14284013 ] Sean Owen commented on SPARK-4442: -- [~matthewcornell] Oops, I missed again. The test JARs aren't configured to be generated by the build as-is. But you can simply do this in {{core/}}: {code} mvn jar:test-jar ... jar tf target/spark-core_2.10-1.3.0-SNAPSHOT-tests.jar | grep LocalSparkContext org/apache/spark/LocalSparkContext$.class org/apache/spark/LocalSparkContext$class.class org/apache/spark/LocalSparkContext.class {code} Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284012#comment-14284012 ] Yin Huai commented on SPARK-2890: - [~btiernay] Oh, it seems the comment thread of this JIRA is not quite clear on whether this issue has been resolved. Actually, we have relaxed this restriction (https://github.com/apache/spark/pull/2209/files is the change). Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Assignee: Michael Armbrust Fix For: 1.2.0 Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
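For readers still on a version with the restriction, one hypothetical workaround (table and column names invented, assuming a SQLContext named sqlContext) is to alias the clashing columns so the output schema has unique field names:

{code}
// Aliasing duplicated output columns avoids "Found fields with the same name"
// when the result schema is converted, e.g. for a Parquet scan or write.
val result = sqlContext.sql(
  "SELECT a.id AS a_id, b.id AS b_id FROM t1 a JOIN t2 b ON a.key = b.key")
{code}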
[jira] [Created] (SPARK-5337) respect spark.task.cpus when launching executors
WangTaoTheTonic created SPARK-5337: -- Summary: respect spark.task.cpus when launching executors Key: SPARK-5337 URL: https://issues.apache.org/jira/browse/SPARK-5337 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic In standalone mode, we do not respect spark.task.cpus when launching executors. Some executors may not have enough cores to launch a single task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284052#comment-14284052 ] Matthew Cornell commented on SPARK-4442: [~srowen] I might have misunderstood. I tried this: $ cd dir/spark-1.2.0/core/ $ mvn jar:test-jar But it says it created an empty jar (see output below). Any ideas re: what I'm doing wrong? [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Core 1.2.0 [INFO] [INFO] [INFO] --- maven-jar-plugin:2.4:test-jar (default-cli) @ spark-core_2.10 --- [WARNING] JAR will be empty - no content was marked for inclusion! [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 1.585 s [INFO] Finished at: 2015-01-20T12:13:57-05:00 [INFO] Final Memory: 10M/81M [INFO] Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283958#comment-14283958 ] Bob Tiernay edited comment on SPARK-2890 at 1/20/15 4:02 PM: - What if you request {{SELECT x.\*, y.\*}}? If there are 20 columns on each side, is the user required to specify them all? was (Author: btiernay): What if you request {{SELECT x.*, y.*}}? If there are 20 columns on each side, is the user required to specify them all? Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Assignee: Michael Armbrust Fix For: 1.2.0 Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283958#comment-14283958 ] Bob Tiernay commented on SPARK-2890: What if you request {{SELECT x.*, y.*}}? If there are 20 columns on each side, is the user required to specify them all? Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Assignee: Michael Armbrust Fix For: 1.2.0 Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283936#comment-14283936 ] Apache Spark commented on SPARK-5335: - User 'voukka' has created a pull request for this issue: https://github.com/apache/spark/pull/4122 Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor When I try to remove security groups using the --delete-groups option of the script, it fails because in a VPC one must remove security groups by id, not by name as the script does now. {code} $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript Are you sure you want to destroy the cluster SparkByScript? The following instances will be terminated: Searching for existing cluster SparkByScript... ALL DATA ON ALL NODES WILL BE LOST!! Destroy cluster SparkByScript (y/N): y Terminating master... Terminating slaves... Deleting security groups (this will take some time)... Waiting for cluster to enter 'terminated' state. Cluster is now in 'terminated' state. Waited 0 seconds. Attempt 1 Deleting rules in security group SparkByScript-slaves Deleting rules in security group SparkByScript-master ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response> Failed to delete security group SparkByScript-slaves ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response> Failed to delete security group SparkByScript-master Attempt 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283920#comment-14283920 ] Matthew Cornell commented on SPARK-4442: Please, as a new Spark (and Maven and SBT) user, having a jar I could simply drop into my IntelliJ project would be a life saver. Until then, would someone please sketch a little detail on how I could build the jar using the 1.2.0 sources? Thanks! Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-750) LocalSparkContext should be included in Spark JAR
[ https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283919#comment-14283919 ] Matthew Cornell commented on SPARK-750: --- Please, as a new Spark (and Maven and SBT) user, having a jar I could simply drop into my IntelliJ project would be a life saver. Until then, would someone please sketch a little detail on how I could build the jar using the 1.2.0 sources? Thanks! LocalSparkContext should be included in Spark JAR - Key: SPARK-750 URL: https://issues.apache.org/jira/browse/SPARK-750 Project: Spark Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Josh Rosen Priority: Minor To aid third-party developers in writing unit tests with Spark, LocalSparkContext should be included in the Spark JAR. Right now, it appears to be excluded because it is located in one of the Spark test directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5333. --- Resolution: Fixed [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Blocker Fix For: 1.3.0 MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Summary: Add defaultSizeOf to every data type (was: NativeType.defaultSizeOf should have default sizes of all NativeTypes.) Add defaultSizeOf to every data type Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Otherwise, we will failed to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Description: Right now, in NativeType, we define some defaultSizes (some types are actually missing), and for complex types we calculate the default size at each place where it is used. We should add defaultSize to every data type. (was: Otherwise, we will failed to do stats estimation. ) Add defaultSizeOf to every data type Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Right now, in NativeType, we define some defaultSizes (some types are actually missing), and for complex types we calculate the default size at each place where it is used. We should add defaultSize to every data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
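For illustration, a minimal sketch of the idea (type names and size constants here are assumed, not the real Spark SQL type hierarchy): every data type reports its own default size, so estimation code never has to special-case complex types at the call site.

{code}
// Sketch only: a self-describing default size on each type.
sealed abstract class DataType {
  def defaultSize: Int
}
case object IntType extends DataType { val defaultSize = 4 }
case object StringType extends DataType { val defaultSize = 4096 } // assumed constant
case class ArrayType(elementType: DataType) extends DataType {
  val defaultSize = 100 * elementType.defaultSize // assumed heuristic for containers
}
{code}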
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284210#comment-14284210 ] Matthew Cornell commented on SPARK-4442: OK, progress: I cloned master and re-ran the commands and did end up getting spark-core_2.10-1.3.0-SNAPSHOT-tests.jar, which does contain LocalSparkContext. So I guess I've upgraded to 1.3.0 :-) Question, please: There is a second LocalSparkContext defined in graphx/src/test/scala/org/apache/spark/graphx/LocalSparkContext.scala that did not get included by the mvn jar:test-jar command. I looked at pom.xml to try to figure out what that argument does, but all I found was a profile called 'java8-tests'. I couldn't find anywhere that mentioned: dirspark/core/src/test/scala/org/apache/spark/LocalSparkContext.scala Any pointers would be appreciated! Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5333: -- Fix Version/s: 1.3.0 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Blocker Fix For: 1.3.0 MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284155#comment-14284155 ] Josh Rosen commented on SPARK-5333: --- Fixed by https://github.com/apache/spark/pull/4119 [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Blocker Fix For: 1.3.0 MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5333: -- Assignee: Jongyoul Lee [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Blocker Fix For: 1.3.0 MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284216#comment-14284216 ] Sean Owen commented on SPARK-4442: -- I don't think 1.3.0 should be different in this regard, or any release. The command is going to JAR up compiled test classes, so it's necessary for tests to be compiled first, but, install should have done that. You're referring to another class in the graphx module, so that won't be part of core's test code. java8-tests is not related. I'm not sure what you are looking for in the POM? Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test utilities set package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-750) LocalSparkContext should be included in Spark JAR
[ https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-750. - Resolution: Duplicate I'm going to boldly fold this into SPARK-4442 as a more general, related request to expose test utilities explicitly. LocalSparkContext should be included in Spark JAR - Key: SPARK-750 URL: https://issues.apache.org/jira/browse/SPARK-750 Project: Spark Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Josh Rosen Priority: Minor To aid third-party developers in writing unit tests with Spark, LocalSparkContext should be included in the Spark JAR. Right now, it appears to be excluded because it is located in one of the Spark test directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
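Until such a test-utilities artifact exists, projects that want the convenience of a LocalSparkContext can write a small equivalent themselves. The following is a sketch of what such a trait typically looks like, assuming ScalaTest; it is an illustrative stand-in, not Spark's actual LocalSparkContext class:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, Suite}

// Illustrative stand-in for a shared test fixture: each test gets a fresh local
// SparkContext, which is stopped afterwards so tests do not leak contexts.
trait LocalSparkContextLike extends BeforeAndAfterEach { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterEach(): Unit = {
    try {
      if (sc != null) sc.stop()
      sc = null
    } finally {
      super.afterEach()
    }
  }
}
{code}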
[jira] [Commented] (SPARK-5333) [Mesos] MesosTaskLaunchData occurs BufferUnderflowException
[ https://issues.apache.org/jira/browse/SPARK-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284135#comment-14284135 ] Josh Rosen commented on SPARK-5333: --- Good catch. I've created a link to the SPARK-4014 JIRA so that we don't forget to backport this patch, too, when porting that patch to earlier branches. [Mesos] MesosTaskLaunchData occurs BufferUnderflowException --- Key: SPARK-5333 URL: https://issues.apache.org/jira/browse/SPARK-5333 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Priority: Blocker MesosTaskLaunchData occurs exception when MesosExecutorBackend launches task because serializedTask.remaining is 0. {code} Exception in thread Thread-6 java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.scheduler.cluster.mesos.MesosTaskLaunchData$.fromByteString(MesosTaskLaunchData.scala:46) at org.apache.spark.executor.MesosExecutorBackend.launchTask(MesosExecutorBackend.scala:81) {code} I've checked this bug with fine-grained mode. This is because MesosTaskLaunchData.toByteString doesn't rewind byteBuffer after they put data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284137#comment-14284137 ] Josh Rosen commented on SPARK-4014: --- Note to self: when backporting this to any branches, also backport SPARK-4014 (since that fixes a bug introduced here). TaskContext.attemptId returns taskId Key: SPARK-4014 URL: https://issues.apache.org/jira/browse/SPARK-4014 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Yin Huai Assignee: Josh Rosen Priority: Minor Labels: backport-needed Fix For: 1.3.0 In TaskRunner, we assign the taskId of a task to the attempId of the corresponding TaskContext. Should we rename attemptId to taskId to avoid confusion? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284137#comment-14284137 ] Josh Rosen edited comment on SPARK-4014 at 1/20/15 6:12 PM: Note to self: when backporting this to any branches, also backport SPARK-5333 (since that fixes a bug introduced here). was (Author: joshrosen): Note to self: when backporting this to any branches, also backport SPARK-4014 (since that fixes a bug introduced here). TaskContext.attemptId returns taskId Key: SPARK-4014 URL: https://issues.apache.org/jira/browse/SPARK-4014 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Yin Huai Assignee: Josh Rosen Priority: Minor Labels: backport-needed Fix For: 1.3.0 In TaskRunner, we assign the taskId of a task to the attempId of the corresponding TaskContext. Should we rename attemptId to taskId to avoid confusion? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284335#comment-14284335 ] Manoj Samel commented on SPARK-2243: Is there a target release for this ? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284341#comment-14284341 ] Kostas Sakellis commented on SPARK-4630: I agree that this should be built not assuming SchemaRDD. Like Dryad and Tez (which is basically Dryad) we should be able to use runtime statistics (as opposed to metastore stats) to compute the optimal partition numbers. I'm no Dryad expert but simply read through their papers: 1) Dryad: http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf 2) DryadLinq: http://research.microsoft.com/en-us/projects/DryadLINQ/DryadLINQ.pdf 3) Optimus: http://research.microsoft.com/pubs/185714/Optimus.pdf The DryadLinq paper has a small section on dynamic partitioning. Section 4.2.2: {quote} Dynamic data partitioning sets the number of vertices in each stage (i.e., the number of partitions of each dataset) at run time based on the size of its input data. Traditional databases usually estimate dataset sizes statically, but these estimates can be very inaccurate, for example in the presence of correlated queries. DryadLINQ supports dynamic hash and range partitions—for range partitions both the number of partitions and the partitioning key ranges are determined at run time by sampling the input dataset. {quote} The Optimus paper talks about more optimizations they did in their system that runs on top of Dryad. There are a lot of optimizations but dynamic partitioning is talked about in Section 3.1. They describe creating a set of sampled histograms, one for each dependent partition, and then depending on the operation choose a combining strategy for the statistics. For example, for joins they do the product of the histograms. Using the stats from the histograms they determine how many vertices (partitions) to add to the graph processing. The paper they reference for creating the sampling histogram is http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Histogram/1998-Chaudhuri-Histo.pdf - I haven't read it yet. They don't really get into how they bootstrap this - sampling the original datasources stored in the filesystem. From what I can tell, [~lianhuiwang]'s patch assumes that all records are the same size since it solely looks at the map status and hadoop input sizes. I don't think this is good enough to make intelligent decisions as you also need to look at the record sizes to be able to prevent skew. The partial DAG execution described in the Shark paper is similar to what Dryad does. [~rxin], why was this not pushed down to core Spark? Partial DAG execution could allow us to have a number of runtime optimizations that are currently not possible. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large partitions should be that get executed by a task. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. 
To increase performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are: the incoming partition sizes from the previous stage, the number of available executors, and the available memory per executor (taking into account spark.shuffle.memoryFraction). Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
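The sizing heuristic at the heart of this proposal can be sketched independently of the DAGScheduler: pick a task count from the observed output bytes of the previous stage and a target partition size. The helper below is an illustration under assumed names and thresholds (choosePartitionCount, the 128 MB target, the cap), not Spark scheduler code:
{code}
// Given the byte sizes of the map outputs feeding the next stage and a target
// partition size, pick a partition/task count for that stage.
object PartitionSizing {
  def choosePartitionCount(
      mapOutputBytes: Seq[Long],
      targetBytesPerPartition: Long = 128L * 1024 * 1024,
      maxPartitions: Int = 10000): Int = {
    val totalBytes = mapOutputBytes.sum
    val byBytes = math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt
    math.min(math.max(byBytes, 1), maxPartitions)
  }
}

// Example: 50 map outputs of ~1 GB each with a 128 MB target gives 400 partitions.
// PartitionSizing.choosePartitionCount(Seq.fill(50)(1L << 30))
{code}
As the comment above points out, byte totals alone say nothing about record sizes, which is why the histogram-based statistics from Optimus are attractive for handling skew.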
[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5019: - Fix Version/s: 1.3.0 Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Minor Fix For: 1.3.0 The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5019: - Priority: Minor (was: Blocker) Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5019: - Assignee: Travis Galoppo Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Minor Fix For: 1.3.0 The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5019. -- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/4088 Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Minor Fix For: 1.3.0 The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5186. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3997 [https://github.com/apache/spark/pull/3997] Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size - Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Fix For: 1.3.0 Original Estimate: 0.25h Remaining Estimate: 0.25h The implementation of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
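The inefficiency comes from comparing element by element across the full vector dimension; for a truly sparse vector only the stored non-zero (index, value) pairs matter. The sketch below illustrates that idea on plain arrays, assuming indices are stored in increasing order as MLlib's SparseVector requires; it is not the actual MLlib implementation:
{code}
object SparseVectorEquality {
  // Assumes indices are stored in increasing order, as MLlib's SparseVector requires.
  def sparseEquals(
      size1: Int, indices1: Array[Int], values1: Array[Double],
      size2: Int, indices2: Array[Int], values2: Array[Double]): Boolean = {
    if (size1 != size2) return false
    // Drop explicitly stored zeros so logically equal vectors compare equal.
    val nz1 = indices1.zip(values1).filter(_._2 != 0.0)
    val nz2 = indices2.zip(values2).filter(_._2 != 0.0)
    nz1.sameElements(nz2)
  }
}
{code}
A hashCode with the same property can likewise be computed from only the non-zero entries, so both operations stay proportional to the number of stored values rather than the vector size.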
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284908#comment-14284908 ] Hari Shreedharan commented on SPARK-5342: - [~pwendell], [~tgraves], [~vanzin], [~andrewor14] - Please take a look. Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-5342: Attachment: SparkYARN.pdf Design doc with proposed design. Original design doc with comments access: https://docs.google.com/document/d/1ECBZTprOEHPueXcG-w3GibpoWgLccHJwU62pNxYM5oU/edit?usp=sharing Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4923: --- Summary: Add Developer API to REPL to allow re-publishing the REPL jar (was: Maven build should keep publishing spark-repl) Add Developer API to REPL to allow re-publishing the REPL jar - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4923: --- Assignee: Chip Senkbeil Add Developer API to REPL to allow re-publishing the REPL jar - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Assignee: Chip Senkbeil Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5323) Row shouldn't extend Seq
[ https://issues.apache.org/jira/browse/SPARK-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5323. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4115 [https://github.com/apache/spark/pull/4115] Row shouldn't extend Seq Key: SPARK-5323 URL: https://issues.apache.org/jira/browse/SPARK-5323 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 Extending Seq comes at a huge cost: 1. Bytecode bloat (the Row constructor now has to make about 20 static calls to the init method of various constructors. 2. Documentation bloat (added hundreds of methods most of them are irrelevant). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Fix Version/s: 1.2.1 Backport publishing of repl, yarn into branch-1.2 - Key: SPARK-5289 URL: https://issues.apache.org/jira/browse/SPARK-5289 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.1 In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thritserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284470#comment-14284470 ] Stephen Boesch commented on SPARK-4259: --- Xiangrui has provided valuable feedback. His latest recommendation points out that the Gaussian similarities will result in a small proportion of the input vertices having non-zero (or nearly zero) value. That ratio may then represent the out-degree of each vertex of the graph. The graph edges will represent the sparse (non-zero) matrix entries of the Normalized Affinity matrix W - i.e., the W_ij entries that are non-zero. The algorithm thus bears similarities to PageRank. We are using the Power Iteration Clustering algorithm. In each iteration of PIC, the components of the estimated eigenvector - represented by vertices in the Graph - are updated via Graph.aggregateMessages execution. Further input from Xiangrui: The graph is sparse, we don’t need to store edges with 0 similarity. We can assume that the average degree is D and then the number of edges is D * N, where N is the number of vertices. It should be much less than N^2. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix using a Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Compute the similarity matrix S ∈ Rn×n. • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal matrix. • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
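For reference, the Gaussian similarity function referred to above is the usual RBF affinity s(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), with the bandwidth sigma as a free parameter. A minimal sketch follows; the threshold-based sparsification mirrors the sparsity assumption in the comment, and all names and defaults are illustrative:
{code}
object GaussianAffinity {
  // s(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is a user-chosen bandwidth.
  def similarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    require(x.length == y.length, "points must have the same dimension")
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2.0 * sigma * sigma))
  }

  // Entries below a small threshold are dropped, keeping the affinity graph sparse.
  def edgeWeight(x: Array[Double], y: Array[Double], sigma: Double,
                 threshold: Double = 1e-6): Option[Double] = {
    val s = similarity(x, y, sigma)
    if (s >= threshold) Some(s) else None
  }
}
{code}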
[jira] [Created] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
Burak Yavuz created SPARK-5341: -- Summary: Support maven coordinates in spark-shell and spark-submit Key: SPARK-5341 URL: https://issues.apache.org/jira/browse/SPARK-5341 Project: Spark Issue Type: New Feature Components: Deploy, Spark Shell Reporter: Burak Yavuz This feature will allow users to provide the maven coordinates of jars they wish to use in their spark application. Coordinates can be a comma-delimited list and be supplied like: ```spark-submit --maven org.apache.example.a,org.apache.example.b``` This feature will also be added to spark-shell (where it is more critical to have this feature) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5186: - Assignee: yuhao yang Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size - Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Assignee: yuhao yang Fix For: 1.3.0 Original Estimate: 0.25h Remaining Estimate: 0.25h The implementation of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5287) Add defaultSizeOf to every data type
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5287. Resolution: Fixed Fix Version/s: 1.3.0 Add defaultSizeOf to every data type Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 Right now, in NativeType, we defined some defaultSizes (it is actually missing some types) and for complex types, we calculate the default size at the place where we use the default size. We should add defaultSize to every data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
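The shape of the change described above (each data type knows its own default size estimate, and complex types derive theirs from their children) can be illustrated with a toy hierarchy. This is a sketch for illustration only; the names and size constants are assumptions, not the actual Catalyst DataType classes:
{code}
// Every type carries a defaultSize; complex types compute theirs from their children,
// so callers no longer need ad-hoc size estimates at the point of use.
sealed trait ToyDataType { def defaultSize: Int }
case object ToyIntType extends ToyDataType { val defaultSize = 4 }
case object ToyLongType extends ToyDataType { val defaultSize = 8 }
case object ToyStringType extends ToyDataType { val defaultSize = 4096 }
case class ToyArrayType(element: ToyDataType) extends ToyDataType {
  def defaultSize: Int = 100 * element.defaultSize   // assume ~100 elements by default
}
case class ToyStructType(fields: Seq[ToyDataType]) extends ToyDataType {
  def defaultSize: Int = fields.map(_.defaultSize).sum
}

// A default row-size estimate then falls out of the schema:
// ToyStructType(Seq(ToyIntType, ToyStringType, ToyArrayType(ToyLongType))).defaultSize
{code}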
[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5287: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5166 Add defaultSizeOf to every data type Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 Right now, in NativeType, we defined some defaultSizes (it is actually missing some types) and for complex types, we calculate the default size at the place where we use the default size. We should add defaultSize to every data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) Add defaultSizeOf to every data type
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5287: --- Assignee: Yin Huai Add defaultSizeOf to every data type Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 Right now, in NativeType, we defined some defaultSizes (it is actually missing some types) and for complex types, we calculate the default size at the place where we use the default size. We should add defaultSize to every data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284914#comment-14284914 ] Hari Shreedharan commented on SPARK-5342: - Thanks [~adhoot] for helping with investigating the solution on the YARN side. Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284915#comment-14284915 ] Apache Spark commented on SPARK-5135: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4127 Add support for describe [extended] table to DDL in SQLContext -- Key: SPARK-5135 URL: https://issues.apache.org/jira/browse/SPARK-5135 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h Support Describe Table Command. describe [extended] tableName. This also support external datasource table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-4296. --- Resolution: Fixed Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker When the input data has a complex structure, using same expression in group by clause and select clause will throw Expression not in GROUP BY. {code:java} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Birthday(date: String) case class Person(name: String, birthday: Birthday) val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28")))) people.registerTempTable("people") val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)") year.collect {code} Here is the plan of year: {code:java} SchemaRDD[3] at RDD at SchemaRDD.scala:105 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3] Subquery people LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36 {code} The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284584#comment-14284584 ] Xiangrui Meng commented on SPARK-3439: -- [~angellandros] Are you interested in contributing canopy clustering to MLlib? It would be nice if you can describe the proposed API first (input type, output type, and parameters) and the complexity. [~yuu.ishik...@gmail.com] I've assigned this ticket to [~angellandros]. Please let me know if you are working on it. Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
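As background for the API discussion, canopy clustering works with two distance thresholds t1 > t2: each remaining point in turn becomes a canopy center, every point within the tight threshold t2 of that center is removed from further consideration, and points within the loose threshold t1 belong to that canopy (a point may belong to several). The sketch below shows only a sequential, single-machine version of the center selection and the t1 assignment; a distributed formulation and the proposed MLlib API are out of scope, and all names are illustrative:
{code}
object CanopySketch {
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Pick canopy centers: each new center claims (removes) all points within t2 of it.
  def canopyCenters(points: Seq[Array[Double]], t1: Double, t2: Double): Seq[Array[Double]] = {
    require(t1 > t2, "the loose threshold t1 must be larger than the tight threshold t2")
    var remaining = points.toList
    var centers = List.empty[Array[Double]]
    while (remaining.nonEmpty) {
      val center = remaining.head
      centers ::= center
      remaining = remaining.tail.filter(p => dist(p, center) > t2)
    }
    centers.reverse
  }

  // Assign every point to the (indices of) canopies whose center lies within t1.
  def assign(points: Seq[Array[Double]], centers: Seq[Array[Double]], t1: Double): Seq[Seq[Int]] =
    points.map(p => centers.zipWithIndex.collect { case (c, i) if dist(p, c) <= t1 => i })
}
{code}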
[jira] [Updated] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3439: - Assignee: Muhammad-Ali A'rabi Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
Hari Shreedharan created SPARK-5342: --- Summary: Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284908#comment-14284908 ] Hari Shreedharan edited comment on SPARK-5342 at 1/21/15 12:46 AM: --- [~pwendell], [~tgraves], [~sandyr], [~vanzin], [~andrewor14] - Please take a look. was (Author: hshreedharan): [~pwendell], [~tgraves], [~vanzin], [~andrewor14] - Please take a look. Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5275) pyspark.streaming is not included in assembly jar
[ https://issues.apache.org/jira/browse/SPARK-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284924#comment-14284924 ] Apache Spark commented on SPARK-5275: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4128 pyspark.streaming is not included in assembly jar - Key: SPARK-5275 URL: https://issues.apache.org/jira/browse/SPARK-5275 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker The pyspark.streaming is not included in assembly jar of spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5294) Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5294: -- Assignee: Kousuke Saruta Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty Key: SPARK-5294 URL: https://issues.apache.org/jira/browse/SPARK-5294 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.3.0 Related to SPARK-5228, AllStagesPage also should hide the table for ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5294) Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5294. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4083 [https://github.com/apache/spark/pull/4083] Hide tables in AllStagePages for Active Stages, Completed Stages and Failed Stages when they are empty Key: SPARK-5294 URL: https://issues.apache.org/jira/browse/SPARK-5294 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Fix For: 1.3.0 Related to SPARK-5228, AllStagesPage also should hide the table for ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4923) Add Developer API to REPL to allow re-publishing the REPL jar
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4923. Resolution: Fixed Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) I updated the title of this to reflect the work that actually happened in Chip's patch. And SPARK-5289 is tracking publishing of the artifacts. Add Developer API to REPL to allow re-publishing the REPL jar - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Assignee: Chip Senkbeil Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284561#comment-14284561 ] Xiangrui Meng commented on SPARK-4259: -- Note: [~javadba]'s update is from an offline discussion we had. The algorithm we plan to implement is described in the paper Power Iteration Clustering (PIC) (http://www.icml2010.org/papers/387.pdf) and the notation is adapted from there. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix using a Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Compute the similarity matrix S ∈ Rn×n. • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal matrix. • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5144) spark-yarn module should be published
[ https://issues.apache.org/jira/browse/SPARK-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284662#comment-14284662 ] David McWhorter commented on SPARK-5144: Similar problem here, building an uber-jar to submit a spark job programmatically and get: Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support. Downgrading to a previous version of spark for now... spark-yarn module should be published - Key: SPARK-5144 URL: https://issues.apache.org/jira/browse/SPARK-5144 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar We disabled publishing of certain modules in SPARK-3452. One of such modules is spark-yarn. This breaks applications that submit spark jobs programmatically with master set as yarn-client. This is because SparkContext is dependent on classes from yarn-client module to submit the YARN application. Here is the stack trace that you get if you submit the spark job without yarn-client dependency: 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryStore started with capacity 731.7 MB Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:232) at com.myimpl.Server:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) ... 27 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:190) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195) ... 
29 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
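For context, the failing path above is simply constructing a SparkContext with master yarn-client from application code. A minimal sketch of that kind of programmatic submission follows; it assumes the YARN support classes (the spark-yarn module or an assembly built with YARN) are on the classpath and that HADOOP_CONF_DIR points at the cluster configuration, otherwise initialization fails with the ClassNotFoundException shown in the stack trace:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object ProgrammaticYarnSubmit {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("yarn-client")           // requires the YARN classes at runtime
      .setAppName("programmatic-yarn-submit")
    val sc = new SparkContext(conf)       // this is the call that fails without spark-yarn
    try {
      println(sc.parallelize(1 to 100).count())
    } finally {
      sc.stop()
    }
  }
}
{code}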
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284659#comment-14284659 ] David McWhorter commented on SPARK-3452: Same problem here -- if spark-yarn is not available, what is the correct way to submit yarn jobs programatically? Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284903#comment-14284903 ] Ameet Talwalkar commented on SPARK-3789: Great. I hope this can make it into 1.3. Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5337) respect spark.task.cpus when launch executors
[ https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284986#comment-14284986 ] Apache Spark commented on SPARK-5337: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/4129 respect spark.task.cpus when launch executors - Key: SPARK-5337 URL: https://issues.apache.org/jira/browse/SPARK-5337 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic In standalone mode, we did not respect spark.task.cpus when launching executors. Some executors would not have enough cores to launch a single task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
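The invariant the report asks for is that a launched executor always receives at least spark.task.cpus cores, since anything smaller can never run a task. A hedged sketch of that check follows; coresFree, coresPerExecutor and the helper name are illustrative, not the standalone Master's actual fields:
{code}
// Grant cores to an executor only if the grant can fit at least one task.
object ExecutorCoreCheck {
  def usableExecutorCores(coresFree: Int, coresPerExecutor: Int, cpusPerTask: Int): Option[Int] = {
    val granted = math.min(coresFree, coresPerExecutor)
    if (granted >= cpusPerTask) Some(granted) else None   // skip workers that cannot fit a task
  }
}

// e.g. with spark.task.cpus = 2, a worker offering only 1 free core is skipped:
// ExecutorCoreCheck.usableExecutorCores(coresFree = 1, coresPerExecutor = 4, cpusPerTask = 2) == None
{code}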
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285058#comment-14285058 ] Derrick Burns commented on SPARK-2620: -- Thanks for the info! It would seem to me that the latter is a bug in the Scala compiler. Specifically, if one wanted an isInstanceOf check that ignored the outer class, it would seem natural to encode that as: {code} x.isInstanceOf[a#B] {code} case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Assignee: Tobias Schlatter Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name:String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284969#comment-14284969 ] Tobias Schlatter commented on SPARK-2620: - I am currently looking into the various issues in the REPL. This one is caused by the fact that the Spark REPL (unlike the Scala REPL) uses classes instead of objects to wrap user code. This leads to serialized case classes having different outer pointers and therefore not being equal. Fun fact: Given: {code} class A { class B } val a = new A {code} {code} x match { case _: a.B => true case _ => false } {code} and {code} x.isInstanceOf[a.B] {code} are not equivalent (the former checks the outer pointer, the latter does not).
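A small self-contained example of the "fun fact" described in the comment above (plain Scala, independent of Spark); the expected results follow directly from that comment: the type pattern checks the outer pointer, while isInstanceOf does not.
{code}
// Reproduces the pattern-match vs. isInstanceOf difference described above.
object OuterPointerDemo {
  class A { class B }

  def main(args: Array[String]): Unit = {
    val a = new A
    val other = new A
    val x: Any = new other.B   // inner-class instance whose outer pointer is NOT `a`

    val viaMatch = x match {   // type pattern: also compares x's outer pointer against `a`
      case _: a.B => true
      case _      => false
    }
    val viaIsInstanceOf = x.isInstanceOf[a.B]  // ignores the outer pointer

    println(viaMatch)          // false: outer pointers differ
    println(viaIsInstanceOf)   // true: only the class is checked
  }
}
{code}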
[jira] [Created] (SPARK-5343) ShortestPaths traverses backwards
Michael Malak created SPARK-5343: Summary: ShortestPaths traverses backwards Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Michael Malak GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"")))) lib.ShortestPaths.run(g,Array(3)).vertices.collect res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map())) lib.ShortestPaths.run(g,Array(1)).vertices.collect res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1))) The following changes may be what will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-5342: Attachment: SparkYARN.pdf Minor updates. Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-5342: Attachment: (was: SparkYARN.pdf) Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, spark streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url
[ https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-5331: --- Component/s: EC2 Description: ps -ef | grep Tachyon shows Tachyon running on the master (and the slave) node with correct setting: -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com However from stderr log on worker running the SparkTachyonPi example: 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998 15/01/20 06:00:56 ERROR : Failed to connect (1) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:57 ERROR : Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:58 ERROR : Failed to connect (3) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:59 ERROR : Failed to connect (4) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:01:00 ERROR : Failed to connect (5) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir null failed java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts at tachyon.client.TachyonFS.connect(TachyonFS.java:293) at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011) at tachyon.client.TachyonFS.exist(TachyonFS.java:633) at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117) at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106) at org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57) at org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94) at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts at tachyon.master.MasterClient.connect(MasterClient.java:178) at tachyon.client.TachyonFS.connect(TachyonFS.java:290) ... 28 more Caused by: tachyon.org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185) at tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) at tachyon.master.MasterClient.connect(MasterClient.java:156) ... 29 more Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
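Until spark-ec2 sets it, a workaround consistent with the issue title is to point workers at the Tachyon master explicitly. A hedged sketch (spark.tachyonStore.url is the property named in the title; the hostname is taken from the log above, and the app body is illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative workaround sketch: set the Tachyon master URL explicitly instead of
// relying on the missing spark-ec2 default, which falls back to localhost:19998.
object TachyonUrlWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkTachyonPi-with-explicit-url")
      .set("spark.tachyonStore.url",
           "tachyon://ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com:19998")
    val sc = new SparkContext(conf)
    // OFF_HEAP persistence is what goes through the Tachyon store in this Spark version.
    val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.OFF_HEAP)
    println(rdd.count())
    sc.stop()
  }
}
{code}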
[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters
[ https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285179#comment-14285179 ] Adrian Wang commented on SPARK-5262: Currently if you try coalesce in HiveContext, it will use the Hive UDF instead of the Scala built-in method. coalesce should allow NullType and 1 another type in parameters --- Key: SPARK-5262 URL: https://issues.apache.org/jira/browse/SPARK-5262 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Currently Coalesce(null, 1, null) would throw exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4959: Labels: backport-needed (was: ) Attributes are case sensitive when using a select query from a projection - Key: SPARK-4959 URL: https://issues.apache.org/jira/browse/SPARK-4959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andy Konwinski Priority: Critical Labels: backport-needed Per [~marmbrus], see this line of code, where we should be using an attribute map https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147 To reproduce, I ran the following in the Spark shell: {code} import sqlContext._ sql("drop table if exists test") sql("create table test (col1 string)") sql("insert into table test select 'hi' from prejoined limit 1") val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil sqlContext.table("test").select(projection:_*).registerTempTable("test2") # This succeeds. sql("select CaseSensitiveColName from test2").first() # This fails with java.util.NoSuchElementException: key not found: casesensitivecolname#23046 sql("select casesensitivecolname from test2").first() {code} The full stack trace printed for the final command that is failing: {code} java.util.NoSuchElementException: key not found: casesensitivecolname#23046 at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:57) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446) at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108) at org.apache.spark.rdd.RDD.first(RDD.scala:1093) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5257) SparseVector indices must be non-negative
[ https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5257. -- Resolution: Won't Fix [~MechCoder] I will resolve as WontFix. With ~1100 open JIRAs unfortunately I don't think you can assume that a JIRA has been reviewed by someone with authority to commit. Almost all of them are merely submitted. If in doubt, ask for comments first before beginning work. SparseVector indices must be non-negative - Key: SPARK-5257 URL: https://issues.apache.org/jira/browse/SPARK-5257 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Priority: Minor Original Estimate: 0.25h Remaining Estimate: 0.25h The description of SparseVector suggests only that the indices have to be distinct integers. However the code for the constructor that takes an array of (index, value) tuples assumes that the indices are non-negative. Either the code must be changed or the description should be changed. This arose when I generated indices via hashing and converting the hash values to (signed) integers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
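For anyone hitting the same thing when hashing feature names, a hedged sketch of the documented constraint: fold the (possibly negative) hash into [0, size) before building the vector. Vectors.sparse and the distinct, non-negative index requirement are the MLlib behavior described above; the hashing scheme itself is illustrative.
{code}
import org.apache.spark.mllib.linalg.Vectors

// Illustrative only: map hashed feature names to valid SparseVector indices.
object HashedSparseVector {
  def main(args: Array[String]): Unit = {
    val size = 1 << 10
    val features = Seq("alpha" -> 1.0, "beta" -> 2.0, "gamma" -> 3.0)

    // hashCode can be negative; ((h % size) + size) % size is always in [0, size).
    val indexed = features
      .map { case (name, value) => (((name.hashCode % size) + size) % size, value) }
      .groupBy(_._1).map { case (i, vs) => (i, vs.map(_._2).sum) } // merge hash collisions
      .toSeq

    val v = Vectors.sparse(size, indexed) // indices are distinct and non-negative
    println(v)
  }
}
{code}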
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4959: --- Priority: Blocker (was: Critical)
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4959: --- Assignee: Cheng Hao
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4959: --- Fix Version/s: 1.3.0
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4959: --- Fix Version/s: (was: 1.2.1)
[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285262#comment-14285262 ] Patrick Wendell commented on SPARK-4959: Excuse my last comment, it was on the wrong JIRA.
[jira] [Commented] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file
[ https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285261#comment-14285261 ] Apache Spark commented on SPARK-5344: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4132 HistoryServer cannot recognize that inprogress file was renamed to completed file - Key: SPARK-5344 URL: https://issues.apache.org/jira/browse/SPARK-5344 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta FsHistoryProvider tries to update application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
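A local-filesystem sketch of the rename the history server has to notice (plain java.nio, not the actual FsHistoryProvider/HDFS code; file names are illustrative): the event log is written with an .inprogress suffix and the suffix is dropped when the application completes, so a scan that only ran before the rename keeps treating the application as in-progress.
{code}
import java.nio.file.Files

// Illustrative only: shows the ".inprogress" -> completed rename that checkForLogs must pick up.
object InProgressRenameDemo {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("eventlogs")
    val inProgress = dir.resolve("app-20150121-0001.inprogress")
    Files.write(inProgress, "events...".getBytes("UTF-8"))

    def looksInProgress: Boolean = dir.toFile.list().exists(_.endsWith(".inprogress"))
    println(looksInProgress)  // true: a scan before the rename treats the app as running

    // The application finishes: the suffix is dropped.
    Files.move(inProgress, dir.resolve("app-20150121-0001"))
    println(looksInProgress)  // false: only a scan after the rename sees it as completed
  }
}
{code}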
[jira] [Comment Edited] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285258#comment-14285258 ] Patrick Wendell edited comment on SPARK-4959 at 1/21/15 6:47 AM: - Note that in the 1.2 branch this was fixed by https://github.com/apache/spark/pull/3987 (per discussion with [~lian cheng]). was (Author: pwendell): Note that in the 1.2 branch this was fixed by https://github.com/apache/spark/pull/3987 (per discussion with @cheng lian).
[jira] [Updated] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5021: - Assignee: Manoj Kumar GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
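A hedged sketch of what "linear in the number of non-zero values" means here: operate on the SparseVector's indices/values arrays directly instead of densifying first. This is illustrative, not the GaussianMixtureEM code.
{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Illustrative only: a dot product that costs O(nnz) for a SparseVector,
// versus O(dim) after converting it to a dense vector first.
object SparseDotSketch {
  def sparseDot(v: SparseVector, dense: Array[Double]): Double = {
    var i = 0
    var sum = 0.0
    while (i < v.indices.length) {        // touches only the non-zero entries
      sum += v.values(i) * dense(v.indices(i))
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val v = Vectors.sparse(1000000, Seq((3, 1.5), (999999, 2.0))).asInstanceOf[SparseVector]
    val mean = Array.fill(1000000)(0.5)
    println(sparseDot(v, mean))           // 1.75, computed from 2 entries rather than 1,000,000
  }
}
{code}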
[jira] [Resolved] (SPARK-5276) pyspark.streaming is not included in assembly jar
[ https://issues.apache.org/jira/browse/SPARK-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5276. Resolution: Duplicate pyspark.streaming is not included in assembly jar - Key: SPARK-5276 URL: https://issues.apache.org/jira/browse/SPARK-5276 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker The pyspark.streaming is not included in assembly jar of spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters
[ https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285187#comment-14285187 ] Yin Huai commented on SPARK-5262: - OK, I see. In HiveContext, we are still using Hive's UDF. Actually, it will be good to do the work of this JIRA and SPARK-5244 together. coalesce should allow NullType and 1 another type in parameters --- Key: SPARK-5262 URL: https://issues.apache.org/jira/browse/SPARK-5262 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Currently Coalesce(null, 1, null) would throw exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285258#comment-14285258 ] Patrick Wendell commented on SPARK-4959: Note that in the 1.2 branch this was fixed by https://github.com/apache/spark/pull/3987 (per discussion with @cheng lian).
[jira] [Updated] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file
[ https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5344: -- Description: FsHistoryProvider tries to update application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed. (was: FsHistoryProvider tries to updates application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed.) HistoryServer cannot recognize that inprogress file was renamed to completed file - Key: SPARK-5344 URL: https://issues.apache.org/jira/browse/SPARK-5344 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta FsHistoryProvider tries to update application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4959: --- Fix Version/s: 1.2.1
[jira] [Updated] (SPARK-5344) HistoryServer cannot recognize that inprogress file was renamed to completed file
[ https://issues.apache.org/jira/browse/SPARK-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5344: -- Description: FsHistoryProvider tries to updates application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed. (was: FsHistoryProvider, tries to updates application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed.) HistoryServer cannot recognize that inprogress file was renamed to completed file - Key: SPARK-5344 URL: https://issues.apache.org/jira/browse/SPARK-5344 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta FsHistoryProvider tries to updates application status but if checkForLogs is called before .inprogress file is renamed to completed file, the file is not recognized as completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org