[jira] [Updated] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-1836: - Description: Anand Avati partially traced the cause to the REPL wrapping classes in $outer classes. There are at least two major symptoms:

1. equals()

In the REPL, equals() (required in custom classes used as a key for groupByKey) seems to have to be written using isInstanceOf[] instead of the canonical match{}.

Spark Shell (equals uses match{}):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false
{noformat}

Spark Shell (equals uses isInstanceOf[]):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) =
    if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true
{noformat}

Scala Shell (equals uses match{}):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true
{noformat}

2. lookup()
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  override def hashCode = s.hashCode
  override def toString = s
}

val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r.lookup(new C("a"))

<console>:17: error: type mismatch;
 found   : C
 required: C
              r.lookup(new C("a"))
                       ^
{noformat}

See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E
[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993903#comment-13993903 ] Erik Erlandson commented on SPARK-1781: --- Ideally, pre-fab predicates could be defined on a per-parameter basis; they could live somewhere, either serialized in a file, or in the code, etc. For example, from SPARK-1779, memoryFraction should always be checked for being in [0.0, 1.0]. A caller of getDouble() should not have to specify that if they're asking for memoryFraction (and perhaps should not be allowed to override that requirement either). Generalized validity checking for configuration parameters -- Key: SPARK-1781 URL: https://issues.apache.org/jira/browse/SPARK-1781 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Issues like SPARK-1779 could be handled easily by a general mechanism for specifying whether or not a configuration parameter value is valid (and then excepting, or warning and switching to a default value, if it is not). I think it's possible to do this in a fairly lightweight fashion. -- This message was sent by Atlassian JIRA (v6.2#6252)
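As an illustration (not from the ticket), here is a minimal sketch of what such per-parameter checking might look like on top of SparkConf.getDouble; the validators map and validatedDouble helper are hypothetical names, not a proposed API:
{code}
import org.apache.spark.SparkConf

// Hypothetical registry of per-parameter validity predicates.
// memoryFraction-style parameters must fall in [0.0, 1.0].
val validators: Map[String, Double => Boolean] = Map(
  "spark.shuffle.memoryFraction" -> ((v: Double) => v >= 0.0 && v <= 1.0),
  "spark.storage.memoryFraction" -> ((v: Double) => v >= 0.0 && v <= 1.0)
)

// Fall back to the default (with a warning) when a configured value fails its
// predicate, instead of silently accepting it.
def validatedDouble(conf: SparkConf, key: String, default: Double): Double = {
  val value = conf.getDouble(key, default)
  validators.get(key) match {
    case Some(ok) if !ok(value) =>
      println(s"Invalid value $value for $key; falling back to default $default")
      default
    case _ => value
  }
}
{code}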
[jira] [Updated] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1647: - Fix Version/s: 1.1.0 Prevent data loss when Streaming driver goes down - Key: SPARK-1647 URL: https://issues.apache.org/jira/browse/SPARK-1647 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Fix For: 1.1.0 Currently, when the driver goes down, any uncheckpointed data is lost from within Spark. If the system from which messages are pulled can replay messages, the data may be available - but for some systems, like Flume, this is not the case. Also, all windowing information is lost for windowing functions. We must persist raw data somehow, and be able to replay this data if required. We also must persist windowing information with the data itself. This will likely require quite a bit of work to complete and probably will have to be split into several sub-jiras. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1850: - Description: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {code} *Explanation.* It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. was: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. 
Bad exception if multiple jars exist when running PySpark - Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1
{code}
Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
Traceback (most recent call last):
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
{code}
*Explanation.* It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1860) Standalone Worker cleanup should not clean up running applications by default
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-1860: The PR actually just disabled it, it didn't fix this. Standalone Worker cleanup should not clean up running applications by default - Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1839) PySpark take() does not launch a Spark job when it has to
[ https://issues.apache.org/jira/browse/SPARK-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1839: --- Fix Version/s: 1.1.0 PySpark take() does not launch a Spark job when it has to - Key: SPARK-1839 URL: https://issues.apache.org/jira/browse/SPARK-1839 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Hossein Falaki Fix For: 1.1.0 If you call take() or first() on a large FilteredRDD, the driver attempts to scan all partitions to find the first valid item. If the RDD is large this would fail or hang. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
Andrew Or created SPARK-1850: Summary: Bad exception if multiple jars exist when running PySpark Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1
Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
Traceback (most recent call last):
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1359: - Affects Version/s: 1.0.0 SGD implementation is not efficient --- Key: SPARK-1359 URL: https://issues.apache.org/jira/browse/SPARK-1359 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng The SGD implementation samples a mini-batch to compute the stochastic gradient. This is not efficient because examples are provided via an iterator interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1110) Clean up and clarify use of SPARK_HOME
[ https://issues.apache.org/jira/browse/SPARK-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1110. Resolution: Fixed This was subsumed by the other configuration clean-up. Clean up and clarify use of SPARK_HOME -- Key: SPARK-1110 URL: https://issues.apache.org/jira/browse/SPARK-1110 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Cogan Fix For: 1.0.0 In the spirit of SPARK-929 we should clean up the use of SPARK_HOME and, if possible, remove it entirely. We need to look through what this is used for. One use was allowing applications to run different versions of Spark in standalone mode. For instance, someone could submit an application with a custom SPARK_HOME and the Worker would launch an Executor using a different path for Spark. This use case is not widely used and maybe should just be removed. The existing constructors that take SPARK_HOME for this purpose should be deprecated and we should explain that SPARK_HOME is no longer used for this purpose. If there are other legitimate reasons for SPARK_HOME, we can keep it around... we need to audit the uses of it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-1849: -- Attachment: encoding_test Here's the windows-encoded file I was using to test with, if you'd like to play around. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think it's a use case Spark should fully support. So, my suggested fix, which I'd like some guidance on, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
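As an illustration (not from the ticket), a hedged sketch of the workaround implied above: pull the raw bytes out yourself and decode with the encoding you believe the file uses (ISO-8859-1 is assumed here, and the path is a made-up placeholder). Hadoop reuses Text objects and getBytes returns an over-sized backing array, so the valid range is copied first:
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val path = "hdfs:///logs/access.log"   // illustrative path

val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    // copy only the valid bytes out of the reused, over-allocated Text buffer
    val bytes = java.util.Arrays.copyOfRange(text.getBytes, 0, text.getLength)
    new String(bytes, "ISO-8859-1")    // assumed source encoding
  }
{code}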
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-911: -- Fix Version/s: (was: 1.0.0) Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell \[Tentatively assigned to me, but anyone can do this if they'd like!\] If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
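As an illustration (not from the ticket), a rough sketch of the idea, assuming the (K, V) RDD was sorted ascending with a RangePartitioner; the rangeScan helper is hypothetical, not a proposed API. Only partitions whose key range can overlap [lo, hi] are scanned, via PartitionPruningRDD:
{code}
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

def rangeScan[V](sorted: RDD[(Long, V)], part: RangePartitioner[Long, V],
                 lo: Long, hi: Long): RDD[(Long, V)] = {
  // with an ascending range partitioning, matching keys can only live in the
  // contiguous run of partitions between these two indices
  val first = part.getPartition(lo)
  val last  = part.getPartition(hi)
  PartitionPruningRDD.create(sorted, i => i >= first && i <= last)
    .filter { case (k, _) => k >= lo && k <= hi }
}
{code}
For the time-range example above, lo and hi would be the query's start and end timestamps.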
[jira] [Created] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
Tathagata Das created SPARK-1853: Summary: Show Streaming application code context (file, line number) in Spark Stages UI Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1741: --- Fix Version/s: 1.0.1 Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0, 1.0.1 `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
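As an illustration (not from the ticket), one possible shape for the Java-friendly overload, shown as a hedged sketch rather than the final API; it simply delegates to the existing Scala predict and boxes the result so Java callers see a JavaRDD of java.lang.Double instead of an RDD of Object:
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait JavaPredictable {
  def predict(testData: RDD[Vector]): RDD[Double]   // existing Scala method

  // Java-friendly overload: box the primitive doubles for Java callers
  def predict(testData: JavaRDD[Vector]): JavaRDD[java.lang.Double] =
    JavaRDD.fromRDD(predict(testData.rdd).map(d => java.lang.Double.valueOf(d)))
}
{code}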
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Fix Version/s: (was: 1.0.0) 1.1.0 Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Priority: Critical (was: Major) Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
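For reference (not from the ticket), a naive driver-based version that matches the signature above, assuming the result carries the reduced value once per partition; this is exactly the driver bottleneck the ticket wants an efficient AllReduce to avoid:
{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def allReduce[T: ClassTag](data: RDD[T], f: (T, T) => T): RDD[T] = {
  val reduced = data.reduce(f)                       // all partial results funnel to the driver
  val bcast   = data.sparkContext.broadcast(reduced) // driver redistributes the result
  data.mapPartitions(_ => Iterator(bcast.value), preservesPartitioning = true)
}
{code}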
[jira] [Updated] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1487: Fix Version/s: (was: 1.1.0) Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1704: --- Fix Version/s: (was: 1.0.0) 1.1.0 java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Labels: sql Fix For: 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1820) Make GenerateMimaIgnore @DeveloperApi annotation aware.
[ https://issues.apache.org/jira/browse/SPARK-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1820: --- Assignee: Prashant Sharma Make GenerateMimaIgnore @DeveloperApi annotation aware. --- Key: SPARK-1820 URL: https://issues.apache.org/jira/browse/SPARK-1820 Project: Spark Issue Type: Improvement Components: Build Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.1.0 Ignore all the classes with DeveloperApi annotation, while doing Mima checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1826) Some bad head notations in sparksql
[ https://issues.apache.org/jira/browse/SPARK-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998988#comment-13998988 ] Michael Armbrust commented on SPARK-1826: - Fixed in: https://github.com/apache/spark/pull/765 Some bad head notations in sparksql Key: SPARK-1826 URL: https://issues.apache.org/jira/browse/SPARK-1826 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.0.1 There are some obvious bad notations, such as sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1669) SQLContext.cacheTable() should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1669: -- Description: Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true was: Calling {{cacheTable()}} on some table {{t} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table is matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true SQLContext.cacheTable() should be idempotent Key: SPARK-1669 URL: https://issues.apache.org/jira/browse/SPARK-1669 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Labels: cache, column Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true -- This message was sent by Atlassian JIRA (v6.2#6252)
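As an illustration (not from the ticket), a sketch of the check described above, reusing the class names from the description verbatim (imports of the internal Catalyst/execution classes are omitted); this is not the final implementation:
{code}
// true if the table's plan already wraps an in-memory columnar scan that is
// materialized in memory, per the two conditions listed in the description
def isAlreadyCached(plan: LogicalPlan): Boolean = plan match {
  case Subquery(_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _))) =>
    inMem.cachedColumnBuffers.getStorageLevel.useMemory
  case _ => false
}
{code}
cacheTable() could then return early when this predicate holds, making repeated calls no-ops.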
[jira] [Resolved] (SPARK-1633) Various examples for Scala and Java custom receiver, etc.
[ https://issues.apache.org/jira/browse/SPARK-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1633. -- Resolution: Fixed Fix Version/s: 1.0.0 Various examples for Scala and Java custom receiver, etc. -- Key: SPARK-1633 URL: https://issues.apache.org/jira/browse/SPARK-1633 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-732: -- Fix Version/s: (was: 1.0.0) Recomputation of RDDs may result in duplicated accumulator updates -- Key: SPARK-732 URL: https://issues.apache.org/jira/browse/SPARK-732 Project: Spark Issue Type: Bug Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 0.8.2 Reporter: Josh Rosen Assignee: Nan Zhu Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
{code}
val acc = sc.accumulator(0)
data.map(x => { acc += 1; f(x) })
data.count()
// acc should equal data.count() here
data.foreach{...}
// Now, acc = 2 * data.count() because the map() was recomputed.
{code}
I think that this behavior is incorrect, especially because this behavior allows the addition or removal of a cache() call to affect the outcome of a computation. There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494]. I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates. Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this. Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map { x => accum += x; ... } and rdd.map { x => accum += x; ... }.count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates). -- This message was sent by Atlassian JIRA (v6.2#6252)
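To make the cache() point above concrete, here is a small sketch with illustrative stand-in names (not from the ticket): caching the mapped RDD keeps the accumulator at data.count(), while removing the cache() call lets the second action recompute the map and double it:
{code}
val data = sc.parallelize(1 to 100)   // stand-in dataset
def f(x: Int) = x * 2                 // stand-in map function

val acc = sc.accumulator(0)
val mapped = data.map { x => acc += 1; f(x) }

mapped.cache()            // with this line, acc stays at data.count()
mapped.count()            // first action: increments acc once per element
mapped.foreach(_ => ())   // cached, so map() is not re-run and acc is unchanged
// drop the cache() call and this second action recomputes map(),
// leaving acc at 2 * data.count()
{code}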
[jira] [Updated] (SPARK-1389) Make numPartitions in Exchange configurable
[ https://issues.apache.org/jira/browse/SPARK-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1389: --- Fix Version/s: (was: 1.0.0) 1.1.0 Make numPartitions in Exchange configurable --- Key: SPARK-1389 URL: https://issues.apache.org/jira/browse/SPARK-1389 Project: Spark Issue Type: Improvement Reporter: Michael Armbrust Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1603) flaky test case in StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999241#comment-13999241 ] Tathagata Das commented on SPARK-1603: -- I think we havent seen the flakiness since then. So I am marking this as resolved. flaky test case in StreamingContextSuite Key: SPARK-1603 URL: https://issues.apache.org/jira/browse/SPARK-1603 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Assignee: Nan Zhu When Jenkins was testing 5 PRs at the same time, the test results in my PR shows that stop gracefully in StreamingContextSuite failed, the stacktrace is as {quote} stop gracefully *** FAILED *** (8 seconds, 350 milliseconds) [info] akka.actor.InvalidActorNameException: actor name [JobScheduler] is not unique! [info] at akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) [info] at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) [info] at akka.actor.ActorCell.reserveChild(ActorCell.scala:338) [info] at akka.actor.dungeon.Children$class.makeChild(Children.scala:186) [info] at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) [info] at akka.actor.ActorCell.attachChild(ActorCell.scala:338) [info] at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518) [info] at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57) [info] at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171) [info] at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) [info] at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) [info] at org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34) [info] at org.scalatest.Suite$class.run(Suite.scala:2303) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:362) [info] at org.scalatest.FunSuite$class.run(FunSuite.scala:1310) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208) [info] at org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34) [info] at
[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999562#comment-13999562 ] Mingyu Kim commented on SPARK-1154: --- I looked at the commit, and it seems like it wipes out app-* based on the last modification time of the directory itself. Because the modification time of a directory only changes when a child is added or removed, this may wipe out the app-* directory of a running Spark application if it has been running for more than TTL (unless new files/jars are added to the app directory once every while). I believe it should check the latest modification time of all the descendents of the app-* directory to decide whether to delete it or not. Am I mistaken? Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 
app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045
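As an illustration (not from the ticket), a sketch of the check suggested in the comment above: compare the TTL against the newest modification time anywhere under the app-* directory, not just the directory entry itself, so long-running applications are not wiped:
{code}
import java.io.File

// newest modification time of the directory and all of its descendants
def newestMtime(dir: File): Long = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  (dir.lastModified() +: children.map(newestMtime)).max
}

def shouldCleanUp(appDir: File, ttlMillis: Long): Boolean =
  System.currentTimeMillis() - newestMtime(appDir) > ttlMillis
{code}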
[jira] [Resolved] (SPARK-1826) Some bad head notations in sparksql
[ https://issues.apache.org/jira/browse/SPARK-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1826. - Resolution: Fixed Some bad head notations in sparksql Key: SPARK-1826 URL: https://issues.apache.org/jira/browse/SPARK-1826 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.0.1 There are some obvious bad notations, such as sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
Harry Brundage created SPARK-1849: - Summary: Broken UTF-8 encoded data gets character replacements and thus can't be fixed Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think it's a use case Spark should fully support. So, my suggested fix, which I'd like some guidance on, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999499#comment-13999499 ] Xiangrui Meng commented on SPARK-1782: -- Btw, this approach only gives us \Sigma and V. If we compute U via A V \Sigma^{-1} (current implementation in MLlib), very likely we lose orthogonality. For sparse SVD, the best package is PROPACK, which implements Lanczos bidiagonalization with partial reorthogonalization (http://sun.stanford.edu/~rmunk/PROPACK/). But let us use ARPACK now since we can call it from Breeze. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be propositional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. The overall model is master machine runs ARPACK, and distribute matrix-vector multiplication onto working executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On Spark/milib side, just need to implement the sparsematrix-vector multiplication. It might take some time to optimize and fully test this implementation, so set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
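Restating the math the comment relies on (standard SVD identities, not code from a PR): with the SVD of M, the Gram matrix M^T M has V as its eigenvectors and the squared singular values as its eigenvalues, which is exactly what ARPACK returns; U is then reconstructed by the step the comment warns can lose orthogonality numerically:
{noformat}
M = U \Sigma V^{T},        M^{T} M = V \Sigma^{2} V^{T}
\sigma_i = \sqrt{\lambda_i}   where \lambda_i is the i-th eigenvalue of M^{T} M
U = M V \Sigma^{-1}           (reconstruction step; may lose orthogonality)
{noformat}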
[jira] [Updated] (SPARK-1752) Standardize input/output format for vectors and labeled points
[ https://issues.apache.org/jira/browse/SPARK-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1752: - Fix Version/s: 1.1.0 Standardize input/output format for vectors and labeled points -- Key: SPARK-1752 URL: https://issues.apache.org/jira/browse/SPARK-1752 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: [v0,v1,..] 2. sparse vector: (size,[i0,i1],[v0,v1]) 3. labeled point: (label,vector) where (..) indicates a tuple and [...] indicates an array. Those are compatible with Python's syntax and can be easily parsed using `eval`. -- This message was sent by Atlassian JIRA (v6.2#6252)
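For illustration (the values below are made up, not from the ticket), concrete example strings matching the three proposed formats:
{code}
val dense   = "[1.0,0.0,3.0]"        // dense vector
val sparse  = "(3,[0,2],[1.0,3.0])"  // sparse vector: (size, indices, values)
val labeled = "(1.0,[1.0,0.0,3.0])"  // labeled point: (label, vector)
{code}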
[jira] [Commented] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
[ https://issues.apache.org/jira/browse/SPARK-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998680#comment-13998680 ] Takuya Ueshin commented on SPARK-1845: -- Pull-requested: https://github.com/apache/spark/pull/790 Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections. -- Key: SPARK-1845 URL: https://issues.apache.org/jira/browse/SPARK-1845 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When I execute {{orderBy}} or {{limit}} for {{SchemaRDD}} including {{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following exception: {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap {quote} and so on. This is because registrations of serializers for each concrete collections are missing in {{SparkSqlSerializer}}. I believe it should use {{AllScalaRegistrar}}. {{AllScalaRegistrar}} covers a lot of serializers for concrete classes of {{Seq}}, {{Map}} for {{ArrayType}}, {{MapType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
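As an illustration (not the actual SparkSqlSerializer code), a sketch of what the proposed change amounts to: apply chill's AllScalaRegistrar instead of registering individual classes, so the concrete Scala collection classes (::, Vector, HashMap$HashTrieMap, ...) get serializers; shown here against a bare Kryo instance:
{code}
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar

def newKryo(): Kryo = {
  val kryo = new Kryo()
  new AllScalaRegistrar().apply(kryo)  // registers serializers for Scala collection classes
  kryo
}
{code}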
[jira] [Resolved] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
[ https://issues.apache.org/jira/browse/SPARK-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1845. - Resolution: Fixed Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections. -- Key: SPARK-1845 URL: https://issues.apache.org/jira/browse/SPARK-1845 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin When I execute {{orderBy}} or {{limit}} for {{SchemaRDD}} including {{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following exception: {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap {quote} and so on. This is because registrations of serializers for each concrete collections are missing in {{SparkSqlSerializer}}. I believe it should use {{AllScalaRegistrar}}. {{AllScalaRegistrar}} covers a lot of serializers for concrete classes of {{Seq}}, {{Map}} for {{ArrayType}}, {{MapType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1850: - Description: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. was: code Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' /code It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. 
Bad exception if multiple jars exist when running PySpark - Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1 {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
[ https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998549#comment-13998549 ] Prashant Sharma commented on SPARK-1830: PR at https://github.com/apache/spark/pull/771 Deploy failover, Make Persistence engine and LeaderAgent Pluggable. --- Key: SPARK-1830 URL: https://issues.apache.org/jira/browse/SPARK-1830 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Prashant Sharma Fix For: 1.1.0, 1.0.1 With current code base it is difficult to plugin an external user specified Persistence Engine or Election Agent. It would be good to expose this as a pluggable API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999226#comment-13999226 ] Tathagata Das commented on SPARK-944: - Hi Kanwal, the usual process of contributing to Spark is through pull requests on GitHub. That way your name comes up in the change list as a contributor. Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1022) Add unit tests for kafka streaming
[ https://issues.apache.org/jira/browse/SPARK-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1022: --- Fix Version/s: (was: 1.0.0) Add unit tests for kafka streaming -- Key: SPARK-1022 URL: https://issues.apache.org/jira/browse/SPARK-1022 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Saisai Shao It would be nice if we could add unit tests to verify elements of kafka's stream. Right now we do integration tests only which makes it hard to upgrade versions of kafka. The place to start here would be to look at how kafka tests itself and see if the functionality can be exposed to third party users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1749) DAGScheduler supervisor strategy broken with Mesos
[ https://issues.apache.org/jira/browse/SPARK-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1749: Fix Version/s: 1.0.1 DAGScheduler supervisor strategy broken with Mesos -- Key: SPARK-1749 URL: https://issues.apache.org/jira/browse/SPARK-1749 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Assignee: Mark Hamstra Priority: Blocker Labels: mesos, scheduler, scheduling Fix For: 1.0.1 Any bad Python code will trigger this bug, for example `sc.parallelize(range(100)).map(lambda n: undefined_variable * 2).collect()` will cause a `undefined_variable isn't defined`, which will cause spark to try to kill the task, resulting in the following stacktrace: java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:184) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:175) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:175) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147) at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295) at akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253) at akka.actor.ActorCell.handleFailure(ActorCell.scala:338) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447) at 
akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262) at akka.dispatch.Mailbox.run(Mailbox.scala:218) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This is because killTask isn't implemented for the MesosSchedulerBackend. I assume this isn't pyspark-specific, as there will be other instances where you might want to kill the task -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1603) flaky test case in StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1603. -- Resolution: Fixed flaky test case in StreamingContextSuite Key: SPARK-1603 URL: https://issues.apache.org/jira/browse/SPARK-1603 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Assignee: Nan Zhu When Jenkins was testing 5 PRs at the same time, the test results in my PR shows that stop gracefully in StreamingContextSuite failed, the stacktrace is as {quote} stop gracefully *** FAILED *** (8 seconds, 350 milliseconds) [info] akka.actor.InvalidActorNameException: actor name [JobScheduler] is not unique! [info] at akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) [info] at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) [info] at akka.actor.ActorCell.reserveChild(ActorCell.scala:338) [info] at akka.actor.dungeon.Children$class.makeChild(Children.scala:186) [info] at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) [info] at akka.actor.ActorCell.attachChild(ActorCell.scala:338) [info] at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518) [info] at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57) [info] at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171) [info] at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) [info] at 
org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) [info] at org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34) [info] at org.scalatest.Suite$class.run(Suite.scala:2303) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:362) [info] at org.scalatest.FunSuite$class.run(FunSuite.scala:1310) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208) [info] at org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34) [info] at org.scalatest.tools.ScalaTestFramework$ScalaTestRunner.run(ScalaTestFramework.scala:214) [info] at
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1442: Fix Version/s: 1.1.0 Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Fix For: 1.1.0 similiar to Hive, add window function support for catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1741. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 670 [https://github.com/apache/spark/pull/670] Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 `model.predict` returns a RDD of Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Assignee: (was: Aaron Davidson) Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1824) Python examples still take in master
[ https://issues.apache.org/jira/browse/SPARK-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1824: - Issue Type: Improvement (was: Bug) Python examples still take in master -- Key: SPARK-1824 URL: https://issues.apache.org/jira/browse/SPARK-1824 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1 A recent commit https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113 changed existing Spark examples in Scala and Java such that they no longer take in master as an argument. We forgot to do the same for Python. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-911: -- Description: If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. (was: \[Tentatively assigned to me, but anyone can do this if they'd like!\] If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range.) Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
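As an illustration of the proposed map pruning, a sketch built on the existing {{PartitionPruningRDD}} developer API (the object and method names here are made up, not part of the proposal):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object RangePruningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "range-pruning-sketch")

    // A sorted (K, V) RDD, e.g. events keyed by timestamp.
    val sorted: RDD[(Long, String)] =
      sc.parallelize((1L to 1000L).map(t => (t, s"event-$t")), 10).sortByKey()

    // Small driver-side range index: (partitionId, minKey, maxKey).
    val index: Array[(Int, Long, Long)] = sorted.mapPartitionsWithIndex { (pid, iter) =>
      if (iter.hasNext) {
        val keys = iter.map(_._1).toArray
        Iterator((pid, keys.min, keys.max))
      } else Iterator.empty
    }.collect()

    // Answer a range query by skipping partitions that cannot overlap [lo, hi].
    def rangeQuery(lo: Long, hi: Long): RDD[(Long, String)] = {
      val wanted = index.collect { case (pid, mn, mx) if mx >= lo && mn <= hi => pid }.toSet
      PartitionPruningRDD.create(sorted, wanted.contains)
        .filter { case (k, _) => k >= lo && k <= hi }
    }

    println(rangeQuery(100L, 120L).count())  // only the partitions covering [100, 120] are scanned
    sc.stop()
  }
}
{code}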
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Assignee: (was: Patrick Cogan) Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Description: When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. (was: This would make things nicer for writing automated tests of Spark.) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999602#comment-13999602 ] Patrick Wendell commented on SPARK-1860: I think it would be better to only start the TTL once an executor has finished and to only delete the specific folder used by the executor. Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Labels: starter (was: ) Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-944: -- Assignee: (was: Patrick Cogan) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1792) Missing Spark-Shell Configure Options
[ https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1792: --- Fix Version/s: 1.1.0 Missing Spark-Shell Configure Options - Key: SPARK-1792 URL: https://issues.apache.org/jira/browse/SPARK-1792 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Reporter: Joseph E. Gonzalez Fix For: 1.1.0 The `conf/spark-env.sh.template` does not have configure options for the spark shell. For example to enable Kryo for GraphX when using the spark shell in stand alone mode it appears you must add: {code} SPARK_SUBMIT_OPTS=-Dspark.serializer=org.apache.spark.serializer.KryoSerializer SPARK_SUBMIT_OPTS+=-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator {code} However SPARK_SUBMIT_OPTS is not documented anywhere. Perhaps the spark-shell should have its own options (e.g., SPARK_SHELL_OPTS). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1478: - Fix Version/s: 1.1.0 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor Fix For: 1.1.0 Flume-1915 added support for compression over the wire from avro sink to avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1820) Make GenerateMimaIgnore @DeveloperApi annotation aware.
[ https://issues.apache.org/jira/browse/SPARK-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1820: --- Affects Version/s: (was: 1.1.0) Make GenerateMimaIgnore @DeveloperApi annotation aware. --- Key: SPARK-1820 URL: https://issues.apache.org/jira/browse/SPARK-1820 Project: Spark Issue Type: Improvement Components: Build Reporter: Prashant Sharma Fix For: 1.1.0 Ignore all the classes with DeveloperApi annotation, while doing Mima checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999601#comment-13999601 ] Patrick Wendell commented on SPARK-1154: [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 
app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045 app-20140212173310-0626 app-20140218153442-0656 app-20140225232516-0016 app-20140225233540-0046 This problem is particularly bad if you have a whole bunch of fast jobs. Also what makes the problem worse is that any jars for jobs is downloaded into the app-* folder, so that fills up the disk particularly fast. I would like to propose two things: 1) Spark should have a cleanup thread (or actor) which periodically removes old app-* folders; This should not be the
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Description: When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 was:When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
[ https://issues.apache.org/jira/browse/SPARK-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1851: --- Assignee: Sandy Ryza Upgrade Avro dependency to 1.7.6 so Spark can read Avro files - Key: SPARK-1851 URL: https://issues.apache.org/jira/browse/SPARK-1851 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.0.0 I tried to set up a basic example getting a Spark job to read an Avro container file with Avro specifics. This results in a ClassNotFoundException: can't convert GenericData.Record to com.cloudera.sparkavro.User. The reason is: * When creating records, to decide whether to be specific or generic, Avro tries to load a class with the name specified in the schema. * Initially, executors just have the system jars (which include Avro), and load the app jars dynamically with a URLClassLoader that's set as the context classloader for the task threads. * Avro tries to load the generated classes with SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and goes up to the AppClassLoader. Avro 1.7.6 has a change (AVRO-987) that falls back to the Thread's context classloader when the SpecificData.class.getClassLoader() fails. I tested with Avro 1.7.6 and did not observe the problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
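For reference, a minimal sketch of the classloader fallback pattern that AVRO-987 introduces (illustrative only, not Avro's actual code): try the defining classloader first, then fall back to the thread context classloader, which on executors is the URLClassLoader holding the dynamically added application jars.
{code}
object ClassLoaderFallbackSketch {
  def loadClass(name: String): Class[_] = {
    try {
      Class.forName(name, true, getClass.getClassLoader)
    } catch {
      case _: ClassNotFoundException =>
        // Fall back to the context classloader, which can see the app jars.
        Class.forName(name, true, Thread.currentThread().getContextClassLoader)
    }
  }
}
{code}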
[jira] [Updated] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
[ https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-1830: --- Fix Version/s: 1.0.1 1.1.0 Deploy failover, Make Persistence engine and LeaderAgent Pluggable. --- Key: SPARK-1830 URL: https://issues.apache.org/jira/browse/SPARK-1830 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Prashant Sharma Fix For: 1.1.0, 1.0.1 With current code base it is difficult to plugin an external user specified Persistence Engine or Election Agent. It would be good to expose this as a pluggable API. -- This message was sent by Atlassian JIRA (v6.2#6252)
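To make the shape of the proposal concrete, a rough sketch of what a pluggable API could look like (these trait names and method signatures are illustrative only, not the actual interfaces from the PR):
{code}
// Illustrative only: a user-specified implementation could be selected via config.
trait PersistenceEngine {
  def persist(name: String, obj: AnyRef): Unit
  def unpersist(name: String): Unit
  def read[T](prefix: String): Seq[T]
}

trait LeaderElectionAgent {
  def stop(): Unit
}

// Example plug-in that persists nothing (the single-master default behavior).
class NoOpPersistenceEngine extends PersistenceEngine {
  override def persist(name: String, obj: AnyRef): Unit = {}
  override def unpersist(name: String): Unit = {}
  override def read[T](prefix: String): Seq[T] = Nil
}
{code}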
[jira] [Created] (SPARK-1846) RAT checks should exclude logs/ directory
Andrew Ash created SPARK-1846: - Summary: RAT checks should exclude logs/ directory Key: SPARK-1846 URL: https://issues.apache.org/jira/browse/SPARK-1846 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Andrew Ash When there are logs in the logs/ directory, the rat check from ./dev/check-license fails. ``` aash@aash-mbp ~/git/spark$ find logs -type f logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ./dev/check-license Could not find Apache license headers in the following files: !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out !? 
/Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ``` -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998807#comment-13998807 ] Michael Malak commented on SPARK-1836: -- Michael Ambrust: Indeed. Do you think I should add my additional case of equals() (and its workaround) as a comment to SPARK-1199 and mark this one as a duplicate? REPL $outer type mismatch causes lookup() and equals() problems --- Key: SPARK-1836 URL: https://issues.apache.org/jira/browse/SPARK-1836 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Michael Malak Anand Avati partially traced the cause to REPL wrapping classes in $outer classes. There are at least two major symptoms: 1. equals() = In REPL equals() (required in custom classes used as a key for groupByKey) seems to have to be written using instanceOf[] instead of the canonical match{} Spark Shell (equals uses match{}): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = o match { case that: C = that.s == s case _ = false } } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new ava.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = false {noformat} Spark Shell (equals uses isInstanceOf[]): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s = s) else false } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new ava.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = true {noformat} Scala Shell (equals uses match{}): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = o match { case that: C = that.s == s case _ = false } } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = true {noformat} 2. lookup() = {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false override def hashCode = s.hashCode override def toString = s } val r = sc.parallelize(Array((new C(a),11),(new C(a),12))) r.lookup(new C(a)) console:17: error: type mismatch; found : C required: C r.lookup(new C(a)) ^ {noformat} See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Fix Version/s: 1.1.0 Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1858) Update third-party Hadoop distros doc to list more distros
Matei Zaharia created SPARK-1858: Summary: Update third-party Hadoop distros doc to list more distros Key: SPARK-1858 URL: https://issues.apache.org/jira/browse/SPARK-1858 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares
[ https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1553: - Fix Version/s: 1.1.0 Support alternating nonnegative least-squares - Key: SPARK-1553 URL: https://issues.apache.org/jira/browse/SPARK-1553 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0 Reporter: Tor Myklebust Assignee: Tor Myklebust Fix For: 1.1.0 There's already an ALS implementation. It can be tweaked to support nonnegative least-squares by conditionally running a nonnegative least-squares solve instead of a least-squares solver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1854) Add a version of StreamingContext.fileStream that take hadoop conf object
Tathagata Das created SPARK-1854: Summary: Add a version of StreamingContext.fileStream that take hadoop conf object Key: SPARK-1854 URL: https://issues.apache.org/jira/browse/SPARK-1854 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000232#comment-14000232 ] Li Pu commented on SPARK-1782: -- [~mengxr] thank you for the comments! You are right, (A^T A) x can be done in a single pass. What we need is actually a eigs solver. Seems that mllib depends on Breeze 0.7 (though I cannot find this release version in scalanlp/breeze). I think the code would be cleaner if we call eigs in Breeze? or do you prefer to have more control in mllib by calling ARPACK directly? Also thank you for the pointer to PROPACK. I looked into the svd routine. It calls lapack dbdsqr for actual svd computation. I will try to find a better way to incorporate Fortran routines in a distributed way. For now we can use ARPACK. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be propositional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. The overall model is master machine runs ARPACK, and distribute matrix-vector multiplication onto working executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On Spark/milib side, just need to implement the sparsematrix-vector multiplication. It might take some time to optimize and fully test this implementation, so set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
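To illustrate the multiply the reverse-communication loop needs, a sketch (assuming the rows of A are stored as an RDD of Breeze sparse vectors; this is not the mllib implementation) that computes A^t*(A*x) in a single distributed pass, so the driver only ever handles length-n vectors:
{code}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
import org.apache.spark.rdd.RDD

object GramMultiplySketch {
  // Compute A^t * (A * x) in one distributed pass over the rows of A.
  def multiplyGramianBy(rows: RDD[BSV[Double]], x: BDV[Double]): BDV[Double] = {
    val n = x.length
    rows.map { row =>
      // scale = row . x, using only the row's non-zero entries
      var scale = 0.0
      row.activeIterator.foreach { case (i, v) => scale += v * x(i) }
      // this row's contribution to A^t (A x)
      val contrib = BDV.zeros[Double](n)
      row.activeIterator.foreach { case (i, v) => contrib(i) = v * scale }
      contrib
    }.reduce(_ + _)
  }
}
{code}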
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Component/s: Deploy Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-944: -- Fix Version/s: (was: 1.0.0) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Patrick Cogan Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1368) HiveTableScan is slow
[ https://issues.apache.org/jira/browse/SPARK-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998619#comment-13998619 ] Cheng Lian commented on SPARK-1368: --- Corresponding PR: https://github.com/apache/spark/pull/758 HiveTableScan is slow - Key: SPARK-1368 URL: https://issues.apache.org/jira/browse/SPARK-1368 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.1.0 The major issues here are the use of functional programming (.map, .foreach) and the creation of a new Row object for each output tuple. We should switch to while loops in the critical path and a single MutableRow per partition. -- This message was sent by Atlassian JIRA (v6.2#6252)
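A generic sketch of the proposed style (not the HiveTableScan code itself): walk each partition's records while reusing a single mutable buffer, use a while loop over the columns, and copy only when a row is handed downstream.
{code}
import org.apache.spark.SparkContext

object ReusedRowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "reused-row-sketch")
    val input = sc.parallelize(Seq(Array("1", "a"), Array("2", "b"), Array("3", "c")))

    val converted = input.mapPartitions { iter =>
      val row = new Array[Any](2)            // one reusable buffer per partition
      iter.map { fields =>
        var i = 0
        while (i < fields.length) {          // while loop in the critical path
          row(i) = if (i == 0) fields(i).toInt else fields(i)
          i += 1
        }
        row.clone()                          // copy only when emitting downstream
      }
    }
    converted.collect().foreach(r => println(r.mkString(",")))
    sc.stop()
  }
}
{code}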
[jira] [Commented] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses
[ https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999123#comment-13999123 ] Xiangrui Meng commented on SPARK-1585: -- I think the gradient should pull the weights back. If I'm wrong, could you create an example code to demonstrate the problem? -Xiangrui Not robust Lasso causes Infinity on weights and losses -- Key: SPARK-1585 URL: https://issues.apache.org/jira/browse/SPARK-1585 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Reporter: Xusen Yin Assignee: Xusen Yin Fix For: 1.1.0 Lasso uses LeastSquaresGradient and L1Updater, but diff = brzWeights.dot(brzData) - label in LeastSquaresGradient would cause too big diff, then will affect the L1Updater, which increases weights exponentially. Small shrinkage value cannot lasso weights back to zero then. Finally, the weights and losses reach Infinity. For example, data = (0.5 repeats 10k times), weights = (0.6 repeats 10k times), then data.dot(weights) approximates 300+, the diff will be 300. Then L1Updater sets weights to approximate 300. In the next iteration, the weights will be set to approximate 3, and so on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
[ https://issues.apache.org/jira/browse/SPARK-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1851. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 795 [https://github.com/apache/spark/pull/795] Upgrade Avro dependency to 1.7.6 so Spark can read Avro files - Key: SPARK-1851 URL: https://issues.apache.org/jira/browse/SPARK-1851 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Priority: Critical Fix For: 1.0.0 I tried to set up a basic example getting a Spark job to read an Avro container file with Avro specifics. This results in a ClassNotFoundException: can't convert GenericData.Record to com.cloudera.sparkavro.User. The reason is: * When creating records, to decide whether to be specific or generic, Avro tries to load a class with the name specified in the schema. * Initially, executors just have the system jars (which include Avro), and load the app jars dynamically with a URLClassLoader that's set as the context classloader for the task threads. * Avro tries to load the generated classes with SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and goes up to the AppClassLoader. Avro 1.7.6 has a change (AVRO-987) that falls back to the Thread's context classloader when the SpecificData.class.getClassLoader() fails. I tested with Avro 1.7.6 and did not observe the problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
Xiangrui Meng created SPARK-1855: Summary: Provide memory-and-local-disk RDD checkpointing Key: SPARK-1855 URL: https://issues.apache.org/jira/browse/SPARK-1855 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution but are faster. It can help applications that require many iterations. -- This message was sent by Atlassian JIRA (v6.2#6252)
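A rough sketch of the effect that can be approximated today with replicated storage (this is not the proposed BlockRDD mechanism, and unlike a true checkpoint it does not truncate the lineage):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object LocalReplicatedCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "local-checkpoint-sketch")
    var current = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 10) {
      current = current.map(_ * 0.99)
      if (i % 5 == 0) {
        // Keep two replicas in memory / local disk so a single executor loss
        // does not force recomputing the whole lineage.
        current.persist(StorageLevel.MEMORY_AND_DISK_2)
        current.count()  // materialize eagerly
      }
    }
    println(current.sum())
    sc.stop()
  }
}
{code}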
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Fix Version/s: 1.1.0 Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
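For context, a sketch of the semantics being proposed, emulated with today's primitives (a driver-side reduce followed by a broadcast), which is precisely the driver bottleneck an efficient AllReduce would remove; the object and method names are made up:
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object AllReduceEmulation {
  // Every partition of the result holds the fully combined value.
  def allReduce[T: ClassTag](rdd: RDD[T], op: (T, T) => T): RDD[T] = {
    val combined = rdd.sparkContext.broadcast(rdd.reduce(op))
    rdd.mapPartitions(_ => Iterator.single(combined.value), preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "allreduce-sketch")
    val data = sc.parallelize(1 to 100, 4)
    println(allReduce[Int](data, _ + _).collect().mkString(","))  // prints 5050 four times
    sc.stop()
  }
}
{code}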
[jira] [Created] (SPARK-1852) SparkSQL Queries with Sorts run before the user asks them to
Michael Armbrust created SPARK-1852: --- Summary: SparkSQL Queries with Sorts run before the user asks them to Key: SPARK-1852 URL: https://issues.apache.org/jira/browse/SPARK-1852 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is related to [SPARK-1021] but will not be fixed by that since we do our own partitioning. Part of the problem here is that we calculate the range partitioning too eagerly. Though this could also be alleviated by avoiding the call to toRdd for non DDL queries. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1729) Make Flume pull data from source, rather than the current push model
[ https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1729: - Fix Version/s: 1.1.0 Make Flume pull data from source, rather than the current push model Key: SPARK-1729 URL: https://issues.apache.org/jira/browse/SPARK-1729 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan Fix For: 1.1.0 This makes sure that the if the Spark executor running the receiver goes down, the new receiver on a new node can still get data from Flume. This is not possible in the current model, as Flume is configured to push data to a executor/worker and if that worker is down, Flume cant push data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1230) Enable SparkContext.addJars() to load classes not in CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1230. Resolution: Incomplete I forget what this actually means (hah) so I'm gonna close it for now. Enable SparkContext.addJars() to load classes not in CLASSPATH -- Key: SPARK-1230 URL: https://issues.apache.org/jira/browse/SPARK-1230 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1638. Resolution: Duplicate Executors fail to come up if spark.executor.extraJavaOptions is set -- Key: SPARK-1638 URL: https://issues.apache.org/jira/browse/SPARK-1638 Project: Spark Issue Type: Bug Components: Spark Core Environment: Bring up a cluster in EC2 using spark-ec2 scripts Reporter: Kalpit Shah Fix For: 1.0.0 If you try to launch a PySpark shell with spark.executor.extraJavaOptions set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any of the workers. I see the following error in log file : Spark Executor Command: /usr/lib/jvm/java/bin/java -cp /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar: -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker app-20140423224526- Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
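For illustration only (a naive sketch, not the actual fix): the error arises because the whole option string reaches the JVM as one argument, so it has to be split into individual tokens before the executor command is built. A real implementation must also respect quoting.
{code}
object SplitJavaOptsSketch {
  def main(args: Array[String]): Unit = {
    val extraJavaOpts = "-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails"
    // Naive whitespace split, shown only to illustrate the failure mode.
    val command = Seq("java") ++ extraJavaOpts.split("\\s+").filter(_.nonEmpty) ++
      Seq("-cp", "spark-assembly.jar", "org.apache.spark.executor.CoarseGrainedExecutorBackend")
    println(command.mkString(" "))
  }
}
{code}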
[jira] [Updated] (SPARK-1368) HiveTableScan is slow
[ https://issues.apache.org/jira/browse/SPARK-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1368: Assignee: Cheng Lian HiveTableScan is slow - Key: SPARK-1368 URL: https://issues.apache.org/jira/browse/SPARK-1368 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.1.0 The major issues here are the use of functional programming (.map, .foreach) and the creation of a new Row object for each output tuple. We should switch to while loops in the critical path and a single MutableRow per partition. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1847) Pushdown filters on non-required parquet columns
Michael Armbrust created SPARK-1847: --- Summary: Pushdown filters on non-required parquet columns Key: SPARK-1847 URL: https://issues.apache.org/jira/browse/SPARK-1847 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust From Andre: TODO: we currently only filter on non-nullable (Parquet REQUIRED) attributes until https://github.com/Parquet/parquet-mr/issues/371 has been resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1768) History Server enhancements
[ https://issues.apache.org/jira/browse/SPARK-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1768: --- Fix Version/s: 1.1.0 History Server enhancements --- Key: SPARK-1768 URL: https://issues.apache.org/jira/browse/SPARK-1768 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.1.0 The history server currently has some limitations; the one that currently concerns me the most is that it limits the number of applications it will show, to avoid having to hold all applications in memory. It would be better if the code were smarter and able to show any application available in the history storage. Also, thinking forward a little bit (I'm thinking SPARK-1537), it would be nice to separate the serving logic from the logic to access app log data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1862) Add build support for MapR
Patrick Wendell created SPARK-1862: -- Summary: Add build support for MapR Key: SPARK-1862 URL: https://issues.apache.org/jira/browse/SPARK-1862 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.0.0 It would be nice to add support for some of the other distros in the build. MapR is one that I've done before, so it's a starting point. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1853: - Fix Version/s: 1.1.0 Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Fix For: 1.1.0 Right now, the code context (file and line number) shown for streaming jobs in the stages UI is meaningless, as it refers to a random line inside the internal DStream code rather than the user's application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1682) Add gradient descent w/o sampling and RDA L1 updater
[ https://issues.apache.org/jira/browse/SPARK-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1682: --- Fix Version/s: (was: 1.0.0) Add gradient descent w/o sampling and RDA L1 updater Key: SPARK-1682 URL: https://issues.apache.org/jira/browse/SPARK-1682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Dong Wang The GradientDescent optimizer does sampling before a gradient step. When the input data is already shuffled beforehand, it is possible to scan the data and take a gradient step for each data instance. This could potentially be more efficient. Also add an enhanced RDA L1 updater, which can produce even sparser solutions of comparable quality compared with the standard L1 updater. Reference: Lin Xiao, Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization, Journal of Machine Learning Research 11 (2010) 2543-2596. Small fix: add options to the BinaryClassification example to read and write a model file -- This message was sent by Atlassian JIRA (v6.2#6252)
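For reference on SPARK-1682, a rough sketch of an RDA-style L1 update following the cited Xiao (2010) paper; the parameters and averaging are simplified here, and the actual updater proposed in the issue may differ.
{code}
// gBar is the running average of all past gradients, t the iteration count,
// lambda the L1 regularization strength, gamma a step-size parameter.
// Coordinates whose average gradient is below lambda are truncated to zero,
// which is what yields the sparse solutions mentioned above.
def rdaL1Update(gBar: Array[Double], t: Int, lambda: Double, gamma: Double): Array[Double] =
  gBar.map { g =>
    if (math.abs(g) <= lambda) 0.0
    else -(math.sqrt(t) / gamma) * (g - lambda * math.signum(g))
  }
{code}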
[jira] [Updated] (SPARK-1272) Don't fail job if some local directories are buggy
[ https://issues.apache.org/jira/browse/SPARK-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1272: --- Fix Version/s: (was: 1.0.0) 1.1.0 Don't fail job if some local directories are buggy -- Key: SPARK-1272 URL: https://issues.apache.org/jira/browse/SPARK-1272 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Patrick Wendell Assignee: OuyangJin Fix For: 1.1.0 If Spark cannot create shuffle directories inside a local directory, it might make sense to just log an error and continue, provided that at least one valid shuffle directory exists. Otherwise, if a single disk is wonky, the entire job can fail. The downside is that this might mask failures if the person actually misconfigures the local directories to point to the wrong disk(s). -- This message was sent by Atlassian JIRA (v6.2#6252)
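A minimal sketch of the tolerance proposed in SPARK-1272 (directory names and structure are illustrative): try every configured local directory, keep the usable ones, and only fail when none work.
{code}
import java.io.File

// Keep directories where a shuffle subdirectory exists or can be created;
// a bad disk is skipped (the real code would log an error) instead of
// failing the whole job, unless no directory is usable at all.
def usableShuffleDirs(localDirs: Seq[String]): Seq[File] = {
  val usable = localDirs.map(d => new File(d, "shuffle")).filter(dir => dir.isDirectory || dir.mkdirs())
  require(usable.nonEmpty, "No usable local directories for shuffle files")
  usable
}
{code}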
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999493#comment-13999493 ] Xiangrui Meng commented on SPARK-1782: -- This sounds good to me. Let's assume that A is m-by-n where n is small (1e6). Note that (A^T A) x = (A^T A)^T x can be computed in a single pass, which is sum_i (a_i^T x) a_i. So we don't need to implement A^T y, which simplifies the task. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in MLlib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply the square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine. The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
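To make the single-pass product from the SPARK-1782 comment concrete: (A^T A) x = sum_i (a_i^T x) a_i, so each row contributes independently and the sum is one map plus one reduce over the row RDD. A rough sketch with dense rows (the real implementation would use sparse rows):
{code}
import org.apache.spark.rdd.RDD

// One distributed pass: for each row a_i compute the scalar a_i^T x, scale
// the row by it, and sum the scaled rows across the cluster.
def multiplyGram(rows: RDD[Array[Double]], x: Array[Double]): Array[Double] =
  rows.map { a =>
    val aDotX = a.zip(x).map { case (ai, xi) => ai * xi }.sum   // a_i^T x
    a.map(_ * aDotX)                                            // (a_i^T x) a_i
  }.reduce((u, v) => u.zip(v).map { case (ui, vi) => ui + vi })
{code}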
[jira] [Updated] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1580: - Fix Version/s: 1.1.0 ALS: Estimate communication and computation costs given a partitioner - Key: SPARK-1580 URL: https://issues.apache.org/jira/browse/SPARK-1580 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Priority: Minor Fix For: 1.1.0 It would be nice to be able to estimate the amount of work needed to solve an ALS problem. The chief components of this work are computation time---time spent forming and solving the least squares problems---and communication cost---the number of bytes sent across the network. Communication cost depends heavily on how the users and products are partitioned. We currently do not try to cluster users or products so that fewer feature vectors need to be communicated. This is intended as a first step toward that end---we ought to be able to tell whether one partitioning is better than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy
[ https://issues.apache.org/jira/browse/SPARK-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998557#comment-13998557 ] Sandy Ryza commented on SPARK-1813: --- https://github.com/apache/spark/pull/789 is what I mean. [~mrid...@yahoo-inc.com], are there common uses of Kryo you're saying this misses? Add a utility to SparkConf that makes using Kryo really easy Key: SPARK-1813 URL: https://issues.apache.org/jira/browse/SPARK-1813 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza It would be nice to have a method in SparkConf that makes it really easy to turn on Kryo serialization and register a set of classes. Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}
It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyClass1], classOf[MyClass2]))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Summary: Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished (was: Have version of `sc.stop()` that blocks until all executors are cleaned up.) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1856) Standardize MLlib interfaces
Xiangrui Meng created SPARK-1856: Summary: Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Critical Fix For: 1.1.0 Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1860) Standalone Worker cleanup should not clean up running applications by default
Aaron Davidson created SPARK-1860: - Summary: Standalone Worker cleanup should not clean up running applications by default Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files
Xiangrui Meng created SPARK-1861: Summary: ArrayIndexOutOfBoundsException when reading bzip2 files Key: SPARK-1861 URL: https://issues.apache.org/jira/browse/SPARK-1861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not thread-safe, and Spark may run multiple tasks in the same JVM, which leads to this error. This is not a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses
[ https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1585: --- Fix Version/s: (was: 1.0.0) 1.1.0 Not robust Lasso causes Infinity on weights and losses -- Key: SPARK-1585 URL: https://issues.apache.org/jira/browse/SPARK-1585 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Reporter: Xusen Yin Assignee: Xusen Yin Fix For: 1.1.0 Lasso uses LeastSquaresGradient and L1Updater, but diff = brzWeights.dot(brzData) - label in LeastSquaresGradient can produce a very large diff, which then affects the L1Updater and increases the weights exponentially. A small shrinkage value then cannot lasso the weights back to zero. Finally, the weights and losses reach Infinity. For example, with data = (0.5 repeated 10k times) and weights = (0.6 repeated 10k times), data.dot(weights) approximates 300+, so the diff will be about 300. The L1Updater then sets the weights to approximately 300. In the next iteration, the weights will be set to approximately 3, and so on. -- This message was sent by Atlassian JIRA (v6.2#6252)
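To illustrate the diff described in SPARK-1585, a plain-Scala sketch using the values from the description (the label is assumed for illustration): with unscaled features, the residual dwarfs the label, and the following weight update is proportional to it.
{code}
// Illustration only; feature values taken from the description above, label assumed.
val data    = Array.fill(10000)(0.5)
val weights = Array.fill(10000)(0.6)
val label   = 1.0
val diff    = data.zip(weights).map { case (x, w) => x * w }.sum - label
// diff is far larger than the label, so an unscaled L1 step overshoots.
{code}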
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Priority: Critical (was: Major) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1765) Modify a typo in monitoring.md
[ https://issues.apache.org/jira/browse/SPARK-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1765. Resolution: Fixed Fix Version/s: 1.0.0 Modify a typo in monitoring.md -- Key: SPARK-1765 URL: https://issues.apache.org/jira/browse/SPARK-1765 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta Priority: Minor Fix For: 1.0.0 There is a word 'JXM' in monitoring.md. I guess it's a typo for 'JMX'. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
Vlad Frolov created SPARK-1859: -- Summary: Linear, Ridge and Lasso Regressions with SGD yield unexpected results Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok. Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}
So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :) The resulting model has nan coeffs: {{array([ nan])}}. Furthermore, if you comment out the records one by one you will get:
* [-1.55897475e+296] coeff (the first record is commented),
* [-8.62115396e+104] coeff (the first two records are commented),
* etc.
It looks like the implemented regression algorithms diverge somehow. I get almost the same results for Ridge and Lasso. I've also tested these inputs in scikit-learn and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1630: --- Fix Version/s: (was: 1.0.0) 1.1.0 PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Fix For: 1.1.0 Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better to log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
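A hypothetical sketch of the behavior proposed in SPARK-1630 (names are illustrative, not the actual PythonRDD internals): skip null elements with a debug log instead of letting the write path throw an NPE.
{code}
import java.io.DataOutputStream

def writeStrings(iter: Iterator[String], out: DataOutputStream): Unit =
  iter.foreach {
    case null => ()            // the real code would log a DEBUG message here
    case s    => out.writeUTF(s)
  }
{code}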
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Fix Version/s: 1.1.0 flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Fix For: 1.1.0 The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is just in YARN-related files. I once reported it on the dev mail list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1487: Fix Version/s: 1.1.0 Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
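An illustrative usage for SPARK-1487 (assuming a SQLContext named sqlContext and a registered Parquet table; the table and column names are made up): a query whose WHERE clause could be pushed down into the Parquet scan, so non-matching records are skipped at read time rather than after deserialization.
{code}
// Hypothetical query; only the predicate-pushdown idea is the point here.
val filtered = sqlContext.sql("SELECT name FROM events_parquet WHERE status = 5")
{code}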
[jira] [Resolved] (SPARK-1340) Some Spark Streaming receivers are not restarted when worker fails
[ https://issues.apache.org/jira/browse/SPARK-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1340. -- Resolution: Fixed Resolved with https://issues.apache.org/jira/browse/SPARK-1332 Some Spark Streaming receivers are not restarted when worker fails -- Key: SPARK-1340 URL: https://issues.apache.org/jira/browse/SPARK-1340 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.0.0 For some streams, like the Kafka stream, the receiver does not get restarted if the worker running the receiver fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1848) Executors are mysteriously dying when using Spark on Mesos
Bouke van der Bijl created SPARK-1848: - Summary: Executors are mysteriously dying when using Spark on Mesos Key: SPARK-1848 URL: https://issues.apache.org/jira/browse/SPARK-1848 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 1.0.0 Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mesos 0.18.0 Spark Master Reporter: Bouke van der Bijl Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a We have 47 machines running Mesos that we're trying to run Spark jobs on, but they fail at some point because tasks have to get rescheduled too often, which is caused by Spark killing the tasks because of executors dying. When I look at the stderr or stdout of the Mesos slaves, there seems to be no indication of an error happening, and sometimes I can see a 14/05/15 17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from id, which would indicate that the executor just keeps going and hasn't actually died. If I add a Thread.dumpStack() at the location where the job is killed, this is the trace it returns: at java.lang.Thread.dumpStack(Thread.java:1364) at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87) at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287) What could cause this? Is this a setup problem with our cluster or a bug in Spark? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1810) The spark tar ball does not unzip into a separate folder when un-tarred.
[ https://issues.apache.org/jira/browse/SPARK-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1810. Resolution: Cannot Reproduce The spark tar ball does not unzip into a separate folder when un-tarred. Key: SPARK-1810 URL: https://issues.apache.org/jira/browse/SPARK-1810 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Environment: All environments Reporter: Manikandan Narayanaswamy Priority: Minor Labels: maven All other Hadoop components, when extracted, are contained within a newly created folder, but this is not the case for Spark. The Spark tar ball decompresses all files into the current working directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999115#comment-13999115 ] Patrick Wendell commented on SPARK-1741: This might end up in 1.0 or not, depending on whether we release the current RC. Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0, 1.0.1 `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
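A rough sketch of the convenience overload suggested in SPARK-1741 (the signature is illustrative, not the final API): wrap the Scala predict so Java callers get a JavaRDD of boxed doubles directly.
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.RegressionModel

// Map each vector through the existing predict(Vector) and box the result
// so Java sees JavaRDD[java.lang.Double] instead of an RDD of Objects.
def predictJava(model: RegressionModel, points: JavaRDD[Vector]): JavaRDD[java.lang.Double] =
  JavaRDD.fromRDD(points.rdd.map(p => java.lang.Double.valueOf(model.predict(p))))
{code}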
[jira] [Commented] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999034#comment-13999034 ] Michael Armbrust commented on SPARK-1487: - PR here: https://github.com/apache/spark/pull/511/ Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999601#comment-13999601 ] Patrick Wendell edited comment on SPARK-1154 at 5/16/14 5:00 AM: - [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Are you interested in fixing this? was (Author: pwendell): [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 
app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045 app-20140212173310-0626 app-20140218153442-0656 app-20140225232516-0016 app-20140225233540-0046 This problem is particularly bad if you have a whole bunch of fast jobs. Also, what makes the problem worse is that any jars for jobs are downloaded into the app-* folder, so that fills up the disk.
[jira] [Updated] (SPARK-1623) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name
[ https://issues.apache.org/jira/browse/SPARK-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1623: --- Fix Version/s: (was: 1.0.0) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name - Key: SPARK-1623 URL: https://issues.apache.org/jira/browse/SPARK-1623 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Niraj Suthar -- This message was sent by Atlassian JIRA (v6.2#6252)