[jira] [Updated] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-1836: - Description: Anand Avati partially traced the cause to the REPL wrapping classes in $outer classes. There are at least two major symptoms:

1. equals()

In the REPL, equals() (required in custom classes used as a key for groupByKey) seems to have to be written using isInstanceOf[] instead of the canonical match{}.

Spark Shell (equals uses match{}):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false
{noformat}

Spark Shell (equals uses isInstanceOf[]):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) =
    if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true
{noformat}

Scala Shell (equals uses match{}):
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true
{noformat}

2. lookup()
{noformat}
class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  override def hashCode = s.hashCode
  override def toString = s
}

val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r.lookup(new C("a"))

<console>:17: error: type mismatch;
 found   : C
 required: C
              r.lookup(new C("a"))
                       ^
{noformat}

See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E
[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993903#comment-13993903 ] Erik Erlandson commented on SPARK-1781: --- Ideally, pre-fab predicates could be defined on a per-parameter basis; they could live somewhere, either serialized in a file, or in the code, etc. For example, from SPARK-1779, memoryFraction should always be checked for being in [0.0, 1.0]. A caller of getDouble() should not have to specify that if they're asking for memoryFraction (and perhaps should not be allowed to override that requirement either). Generalized validity checking for configuration parameters -- Key: SPARK-1781 URL: https://issues.apache.org/jira/browse/SPARK-1781 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Issues like SPARK-1779 could be handled easily by a general mechanism for specifying whether or not a configuration parameter value is valid (and then excepting, or warning and switching to a default value, if it is not). I think it's possible to do this in a fairly lightweight fashion. -- This message was sent by Atlassian JIRA (v6.2#6252)
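As an illustration (not from the ticket), here is a minimal sketch of what such per-parameter checking might look like on top of SparkConf.getDouble; the validators map and validatedDouble helper are hypothetical names, not a proposed API:
{code}
import org.apache.spark.SparkConf

// Hypothetical registry of per-parameter validity predicates.
// memoryFraction-style parameters must fall in [0.0, 1.0].
val validators: Map[String, Double => Boolean] = Map(
  "spark.shuffle.memoryFraction" -> ((v: Double) => v >= 0.0 && v <= 1.0),
  "spark.storage.memoryFraction" -> ((v: Double) => v >= 0.0 && v <= 1.0)
)

// Fall back to the default (with a warning) when a configured value fails its
// predicate, instead of silently accepting it.
def validatedDouble(conf: SparkConf, key: String, default: Double): Double = {
  val value = conf.getDouble(key, default)
  validators.get(key) match {
    case Some(ok) if !ok(value) =>
      println(s"Invalid value $value for $key; falling back to default $default")
      default
    case _ => value
  }
}
{code}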
[jira] [Updated] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1647: - Fix Version/s: 1.1.0 Prevent data loss when Streaming driver goes down - Key: SPARK-1647 URL: https://issues.apache.org/jira/browse/SPARK-1647 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Fix For: 1.1.0 Currently, when the driver goes down, any uncheckpointed data is lost from within Spark. If the system from which messages are pulled can replay messages, the data may be available - but for some systems, like Flume, this is not the case. Also, all windowing information is lost for windowing functions. We must persist raw data somehow, and be able to replay this data if required. We also must persist windowing information with the data itself. This will likely require quite a bit of work to complete and probably will have to be split into several sub-jiras. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1850: - Description: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {code} *Explanation.* It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. was: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. 
Bad exception if multiple jars exist when running PySpark - Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1
{code}
Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
Traceback (most recent call last):
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
{code}
*Explanation.* It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1860) Standalone Worker cleanup should not clean up running applications by default
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-1860: The PR actually just disabled it, it didn't fix this. Standalone Worker cleanup should not clean up running applications by default - Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1839) PySpark take() does not launch a Spark job when it has to
[ https://issues.apache.org/jira/browse/SPARK-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1839: --- Fix Version/s: 1.1.0 PySpark take() does not launch a Spark job when it has to - Key: SPARK-1839 URL: https://issues.apache.org/jira/browse/SPARK-1839 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Hossein Falaki Fix For: 1.1.0 If you call take() or first() on a large FilteredRDD, the driver attempts to scan all partitions to find the first valid item. If the RDD is large this would fail or hang. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
Andrew Or created SPARK-1850: Summary: Bad exception if multiple jars exist when running PySpark Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1
Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
Traceback (most recent call last):
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1359: - Affects Version/s: 1.0.0 SGD implementation is not efficient --- Key: SPARK-1359 URL: https://issues.apache.org/jira/browse/SPARK-1359 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng The SGD implementation samples a mini-batch to compute the stochastic gradient. This is not efficient because examples are provided via an iterator interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1110) Clean up and clarify use of SPARK_HOME
[ https://issues.apache.org/jira/browse/SPARK-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1110. Resolution: Fixed This was subsumed by the other configuration clean-up. Clean up and clarify use of SPARK_HOME -- Key: SPARK-1110 URL: https://issues.apache.org/jira/browse/SPARK-1110 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Cogan Fix For: 1.0.0 In the spirit of SPARK-929 we should clean up the use of SPARK_HOME and, if possible, remove it entirely. We need to look through what this is used for. One use was allowing applications to run different versions of Spark in standalone mode. For instance, someone could submit an application with a custom SPARK_HOME and the Worker would launch an Executor using a different path for Spark. This use case is not widely used and maybe should just be removed. The existing constructors that take SPARK_HOME for this purpose should be deprecated and we should explain that SPARK_HOME is no longer used for this purpose. If there are other legitimate reasons for SPARK_HOME, we can keep it around... we need to audit the uses of it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-1849: -- Attachment: encoding_test Here's the windows-encoded file I was using to test with, if you'd like to play around. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think it's a use case Spark should fully support. So, my suggested fix, which I'd like some guidance on, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
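As an illustration (not from the ticket), a hedged sketch of the workaround implied above: pull the raw bytes out yourself and decode with the encoding you believe the file uses (ISO-8859-1 is assumed here, and the path is a made-up placeholder). Hadoop reuses Text objects and getBytes returns an over-sized backing array, so the valid range is copied first:
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val path = "hdfs:///logs/access.log"   // illustrative path

val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    // copy only the valid bytes out of the reused, over-allocated Text buffer
    val bytes = java.util.Arrays.copyOfRange(text.getBytes, 0, text.getLength)
    new String(bytes, "ISO-8859-1")    // assumed source encoding
  }
{code}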
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-911: -- Fix Version/s: (was: 1.0.0) Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell \[Tentatively assigned to me, but anyone can do this if they'd like!\] If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
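As an illustration (not from the ticket), a rough sketch of the idea, assuming the (K, V) RDD was sorted ascending with a RangePartitioner; the rangeScan helper is hypothetical, not a proposed API. Only partitions whose key range can overlap [lo, hi] are scanned, via PartitionPruningRDD:
{code}
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

def rangeScan[V](sorted: RDD[(Long, V)], part: RangePartitioner[Long, V],
                 lo: Long, hi: Long): RDD[(Long, V)] = {
  // with an ascending range partitioning, matching keys can only live in the
  // contiguous run of partitions between these two indices
  val first = part.getPartition(lo)
  val last  = part.getPartition(hi)
  PartitionPruningRDD.create(sorted, i => i >= first && i <= last)
    .filter { case (k, _) => k >= lo && k <= hi }
}
{code}
For the time-range example above, lo and hi would be the query's start and end timestamps.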
[jira] [Created] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
Tathagata Das created SPARK-1853: Summary: Show Streaming application code context (file, line number) in Spark Stages UI Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1741: --- Fix Version/s: 1.0.1 Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0, 1.0.1 `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
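As an illustration (not from the ticket), one possible shape for the Java-friendly overload, shown as a hedged sketch rather than the final API; it simply delegates to the existing Scala predict and boxes the result so Java callers see a JavaRDD of java.lang.Double instead of an RDD of Object:
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait JavaPredictable {
  def predict(testData: RDD[Vector]): RDD[Double]   // existing Scala method

  // Java-friendly overload: box the primitive doubles for Java callers
  def predict(testData: JavaRDD[Vector]): JavaRDD[java.lang.Double] =
    JavaRDD.fromRDD(predict(testData.rdd).map(d => java.lang.Double.valueOf(d)))
}
{code}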
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Fix Version/s: (was: 1.0.0) 1.1.0 Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Priority: Critical (was: Major) Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
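For reference (not from the ticket), a naive driver-based version that matches the signature above, assuming the result carries the reduced value once per partition; this is exactly the driver bottleneck the ticket wants an efficient AllReduce to avoid:
{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def allReduce[T: ClassTag](data: RDD[T], f: (T, T) => T): RDD[T] = {
  val reduced = data.reduce(f)                       // all partial results funnel to the driver
  val bcast   = data.sparkContext.broadcast(reduced) // driver redistributes the result
  data.mapPartitions(_ => Iterator(bcast.value), preservesPartitioning = true)
}
{code}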
[jira] [Updated] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1487: Fix Version/s: (was: 1.1.0) Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1704: --- Fix Version/s: (was: 1.0.0) 1.1.0 java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Labels: sql Fix For: 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1820) Make GenerateMimaIgnore @DeveloperApi annotation aware.
[ https://issues.apache.org/jira/browse/SPARK-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1820: --- Assignee: Prashant Sharma Make GenerateMimaIgnore @DeveloperApi annotation aware. --- Key: SPARK-1820 URL: https://issues.apache.org/jira/browse/SPARK-1820 Project: Spark Issue Type: Improvement Components: Build Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.1.0 Ignore all the classes with DeveloperApi annotation, while doing Mima checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1826) Some bad head notations in sparksql
[ https://issues.apache.org/jira/browse/SPARK-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998988#comment-13998988 ] Michael Armbrust commented on SPARK-1826: - Fixed in: https://github.com/apache/spark/pull/765 Some bad head notations in sparksql Key: SPARK-1826 URL: https://issues.apache.org/jira/browse/SPARK-1826 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.0.1 There are some obvious bad notations, such as sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1669) SQLContext.cacheTable() should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1669: -- Description: Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true was: Calling {{cacheTable()}} on some table {{t} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table is matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true SQLContext.cacheTable() should be idempotent Key: SPARK-1669 URL: https://issues.apache.org/jira/browse/SPARK-1669 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Labels: cache, column Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} to be cached multiple times. This semantics is different from {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true -- This message was sent by Atlassian JIRA (v6.2#6252)
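As an illustration (not from the ticket), a sketch of the check described above, reusing the class names from the description verbatim (imports of the internal Catalyst/execution classes are omitted); this is not the final implementation:
{code}
// true if the table's plan already wraps an in-memory columnar scan that is
// materialized in memory, per the two conditions listed in the description
def isAlreadyCached(plan: LogicalPlan): Boolean = plan match {
  case Subquery(_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _))) =>
    inMem.cachedColumnBuffers.getStorageLevel.useMemory
  case _ => false
}
{code}
cacheTable() could then return early when this predicate holds, making repeated calls no-ops.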
[jira] [Resolved] (SPARK-1633) Various examples for Scala and Java custom receiver, etc.
[ https://issues.apache.org/jira/browse/SPARK-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1633. -- Resolution: Fixed Fix Version/s: 1.0.0 Various examples for Scala and Java custom receiver, etc. -- Key: SPARK-1633 URL: https://issues.apache.org/jira/browse/SPARK-1633 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-732: -- Fix Version/s: (was: 1.0.0) Recomputation of RDDs may result in duplicated accumulator updates -- Key: SPARK-732 URL: https://issues.apache.org/jira/browse/SPARK-732 Project: Spark Issue Type: Bug Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 0.8.2 Reporter: Josh Rosen Assignee: Nan Zhu Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
{code}
val acc = sc.accumulator(0)
data.map(x => { acc += 1; f(x) })
data.count()
// acc should equal data.count() here
data.foreach{...}
// Now, acc = 2 * data.count() because the map() was recomputed.
{code}
I think that this behavior is incorrect, especially because this behavior allows the addition or removal of a cache() call to affect the outcome of a computation. There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494]. I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates. Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this. Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map { x => accum += x; ... } and rdd.map { x => accum += x; ... }.count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates). -- This message was sent by Atlassian JIRA (v6.2#6252)
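To make the cache() point above concrete, here is a small sketch with illustrative stand-in names (not from the ticket): caching the mapped RDD keeps the accumulator at data.count(), while removing the cache() call lets the second action recompute the map and double it:
{code}
val data = sc.parallelize(1 to 100)   // stand-in dataset
def f(x: Int) = x * 2                 // stand-in map function

val acc = sc.accumulator(0)
val mapped = data.map { x => acc += 1; f(x) }

mapped.cache()            // with this line, acc stays at data.count()
mapped.count()            // first action: increments acc once per element
mapped.foreach(_ => ())   // cached, so map() is not re-run and acc is unchanged
// drop the cache() call and this second action recomputes map(),
// leaving acc at 2 * data.count()
{code}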
[jira] [Updated] (SPARK-1389) Make numPartitions in Exchange configurable
[ https://issues.apache.org/jira/browse/SPARK-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1389: --- Fix Version/s: (was: 1.0.0) 1.1.0 Make numPartitions in Exchange configurable --- Key: SPARK-1389 URL: https://issues.apache.org/jira/browse/SPARK-1389 Project: Spark Issue Type: Improvement Reporter: Michael Armbrust Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1603) flaky test case in StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999241#comment-13999241 ] Tathagata Das commented on SPARK-1603: -- I think we havent seen the flakiness since then. So I am marking this as resolved. flaky test case in StreamingContextSuite Key: SPARK-1603 URL: https://issues.apache.org/jira/browse/SPARK-1603 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Assignee: Nan Zhu When Jenkins was testing 5 PRs at the same time, the test results in my PR shows that stop gracefully in StreamingContextSuite failed, the stacktrace is as {quote} stop gracefully *** FAILED *** (8 seconds, 350 milliseconds) [info] akka.actor.InvalidActorNameException: actor name [JobScheduler] is not unique! [info] at akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) [info] at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) [info] at akka.actor.ActorCell.reserveChild(ActorCell.scala:338) [info] at akka.actor.dungeon.Children$class.makeChild(Children.scala:186) [info] at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) [info] at akka.actor.ActorCell.attachChild(ActorCell.scala:338) [info] at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518) [info] at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57) [info] at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171) [info] at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) [info] at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) [info] at org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34) [info] at org.scalatest.Suite$class.run(Suite.scala:2303) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:362) [info] at org.scalatest.FunSuite$class.run(FunSuite.scala:1310) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208) [info] at org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34) [info] at
[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999562#comment-13999562 ] Mingyu Kim commented on SPARK-1154: --- I looked at the commit, and it seems like it wipes out app-* based on the last modification time of the directory itself. Because the modification time of a directory only changes when a child is added or removed, this may wipe out the app-* directory of a running Spark application if it has been running for more than TTL (unless new files/jars are added to the app directory once every while). I believe it should check the latest modification time of all the descendents of the app-* directory to decide whether to delete it or not. Am I mistaken? Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 
app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045
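As an illustration (not from the ticket), a sketch of the check suggested in the comment above: compare the TTL against the newest modification time anywhere under the app-* directory, not just the directory entry itself, so long-running applications are not wiped:
{code}
import java.io.File

// newest modification time of the directory and all of its descendants
def newestMtime(dir: File): Long = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  (dir.lastModified() +: children.map(newestMtime)).max
}

def shouldCleanUp(appDir: File, ttlMillis: Long): Boolean =
  System.currentTimeMillis() - newestMtime(appDir) > ttlMillis
{code}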
[jira] [Resolved] (SPARK-1826) Some bad head notations in sparksql
[ https://issues.apache.org/jira/browse/SPARK-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1826. - Resolution: Fixed Some bad head notations in sparksql Key: SPARK-1826 URL: https://issues.apache.org/jira/browse/SPARK-1826 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.0.1 There are some obvious bad notations, such as sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
Harry Brundage created SPARK-1849: - Summary: Broken UTF-8 encoded data gets character replacements and thus can't be fixed Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think it's a use case Spark should fully support. So, my suggested fix, which I'd like some guidance on, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999499#comment-13999499 ] Xiangrui Meng commented on SPARK-1782: -- Btw, this approach only gives us \Sigma and V. If we compute U via A V \Sigma^{-1} (current implementation in MLlib), very likely we lose orthogonality. For sparse SVD, the best package is PROPACK, which implements Lanczos bidiagonalization with partial reorthogonalization (http://sun.stanford.edu/~rmunk/PROPACK/). But let us use ARPACK now since we can call it from Breeze. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be propositional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. The overall model is master machine runs ARPACK, and distribute matrix-vector multiplication onto working executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On Spark/milib side, just need to implement the sparsematrix-vector multiplication. It might take some time to optimize and fully test this implementation, so set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
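Restating the math the comment relies on (standard SVD identities, not code from a PR): with the SVD of M, the Gram matrix M^T M has V as its eigenvectors and the squared singular values as its eigenvalues, which is exactly what ARPACK returns; U is then reconstructed by the step the comment warns can lose orthogonality numerically:
{noformat}
M = U \Sigma V^{T},        M^{T} M = V \Sigma^{2} V^{T}
\sigma_i = \sqrt{\lambda_i}   where \lambda_i is the i-th eigenvalue of M^{T} M
U = M V \Sigma^{-1}           (reconstruction step; may lose orthogonality)
{noformat}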
[jira] [Updated] (SPARK-1752) Standardize input/output format for vectors and labeled points
[ https://issues.apache.org/jira/browse/SPARK-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1752: - Fix Version/s: 1.1.0 Standardize input/output format for vectors and labeled points -- Key: SPARK-1752 URL: https://issues.apache.org/jira/browse/SPARK-1752 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: [v0,v1,..] 2. sparse vector: (size,[i0,i1],[v0,v1]) 3. labeled point: (label,vector) where (..) indicates a tuple and [...] indicates an array. Those are compatible with Python's syntax and can be easily parsed using `eval`. -- This message was sent by Atlassian JIRA (v6.2#6252)
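For illustration (the values below are made up, not from the ticket), concrete example strings matching the three proposed formats:
{code}
val dense   = "[1.0,0.0,3.0]"        // dense vector
val sparse  = "(3,[0,2],[1.0,3.0])"  // sparse vector: (size, indices, values)
val labeled = "(1.0,[1.0,0.0,3.0])"  // labeled point: (label, vector)
{code}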
[jira] [Commented] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
[ https://issues.apache.org/jira/browse/SPARK-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998680#comment-13998680 ] Takuya Ueshin commented on SPARK-1845: -- Pull-requested: https://github.com/apache/spark/pull/790 Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections. -- Key: SPARK-1845 URL: https://issues.apache.org/jira/browse/SPARK-1845 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When I execute {{orderBy}} or {{limit}} for {{SchemaRDD}} including {{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following exception: {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap {quote} and so on. This is because registrations of serializers for each concrete collections are missing in {{SparkSqlSerializer}}. I believe it should use {{AllScalaRegistrar}}. {{AllScalaRegistrar}} covers a lot of serializers for concrete classes of {{Seq}}, {{Map}} for {{ArrayType}}, {{MapType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
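As an illustration (not the actual SparkSqlSerializer code), a sketch of what the proposed change amounts to: apply chill's AllScalaRegistrar instead of registering individual classes, so the concrete Scala collection classes (::, Vector, HashMap$HashTrieMap, ...) get serializers; shown here against a bare Kryo instance:
{code}
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar

def newKryo(): Kryo = {
  val kryo = new Kryo()
  new AllScalaRegistrar().apply(kryo)  // registers serializers for Scala collection classes
  kryo
}
{code}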
[jira] [Resolved] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
[ https://issues.apache.org/jira/browse/SPARK-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1845. - Resolution: Fixed Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections. -- Key: SPARK-1845 URL: https://issues.apache.org/jira/browse/SPARK-1845 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin When I execute {{orderBy}} or {{limit}} for {{SchemaRDD}} including {{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following exception: {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap {quote} and so on. This is because registrations of serializers for each concrete collections are missing in {{SparkSqlSerializer}}. I believe it should use {{AllScalaRegistrar}}. {{AllScalaRegistrar}} covers a lot of serializers for concrete classes of {{Seq}}, {{Map}} for {{ArrayType}}, {{MapType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1850: - Description: {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. was: code Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' /code It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. 
Bad exception if multiple jars exist when running PySpark - Key: SPARK-1850 URL: https://issues.apache.org/jira/browse/SPARK-1850 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1 {code} Found multiple Spark assembly jars in /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: Traceback (most recent call last): File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py, line 43, in module sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 94, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py, line 180, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File /Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py, line 49, in launch_gateway gateway_port = int(proc.stdout.readline()) ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' {/code} It's trying to read the Java gateway port as an int from the sub-process' STDOUT. However, what it read was an error message, which is clearly not an int. We should differentiate between these cases and just propagate the original message if it's not an int. Right now, this exception is not very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
[ https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998549#comment-13998549 ] Prashant Sharma commented on SPARK-1830: PR at https://github.com/apache/spark/pull/771 Deploy failover, Make Persistence engine and LeaderAgent Pluggable. --- Key: SPARK-1830 URL: https://issues.apache.org/jira/browse/SPARK-1830 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Prashant Sharma Fix For: 1.1.0, 1.0.1 With current code base it is difficult to plugin an external user specified Persistence Engine or Election Agent. It would be good to expose this as a pluggable API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999226#comment-13999226 ] Tathagata Das commented on SPARK-944: - Hi Kanwal, the usual process of contributing to Spark is through pull requests on GitHub. That way your name comes up in the change list as a contributor. Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1022) Add unit tests for kafka streaming
[ https://issues.apache.org/jira/browse/SPARK-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1022: --- Fix Version/s: (was: 1.0.0) Add unit tests for kafka streaming -- Key: SPARK-1022 URL: https://issues.apache.org/jira/browse/SPARK-1022 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Saisai Shao It would be nice if we could add unit tests to verify elements of kafka's stream. Right now we do integration tests only which makes it hard to upgrade versions of kafka. The place to start here would be to look at how kafka tests itself and see if the functionality can be exposed to third party users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1749) DAGScheduler supervisor strategy broken with Mesos
[ https://issues.apache.org/jira/browse/SPARK-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1749: Fix Version/s: 1.0.1 DAGScheduler supervisor strategy broken with Mesos -- Key: SPARK-1749 URL: https://issues.apache.org/jira/browse/SPARK-1749 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Assignee: Mark Hamstra Priority: Blocker Labels: mesos, scheduler, scheduling Fix For: 1.0.1 Any bad Python code will trigger this bug, for example `sc.parallelize(range(100)).map(lambda n: undefined_variable * 2).collect()` will cause a `undefined_variable isn't defined`, which will cause spark to try to kill the task, resulting in the following stacktrace: java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:184) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:175) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:175) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147) at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295) at akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253) at akka.actor.ActorCell.handleFailure(ActorCell.scala:338) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447) at 
akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262) at akka.dispatch.Mailbox.run(Mailbox.scala:218) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This is because killTask isn't implemented for the MesosSchedulerBackend. I assume this isn't pyspark-specific, as there will be other instances where you might want to kill the task -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1603) flaky test case in StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1603. -- Resolution: Fixed flaky test case in StreamingContextSuite Key: SPARK-1603 URL: https://issues.apache.org/jira/browse/SPARK-1603 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Assignee: Nan Zhu When Jenkins was testing 5 PRs at the same time, the test results in my PR shows that stop gracefully in StreamingContextSuite failed, the stacktrace is as {quote} stop gracefully *** FAILED *** (8 seconds, 350 milliseconds) [info] akka.actor.InvalidActorNameException: actor name [JobScheduler] is not unique! [info] at akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) [info] at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) [info] at akka.actor.ActorCell.reserveChild(ActorCell.scala:338) [info] at akka.actor.dungeon.Children$class.makeChild(Children.scala:186) [info] at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) [info] at akka.actor.ActorCell.attachChild(ActorCell.scala:338) [info] at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518) [info] at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:57) [info] at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:434) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14$$anonfun$apply$mcV$sp$3.apply$mcVI$sp(StreamingContextSuite.scala:174) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply$mcV$sp(StreamingContextSuite.scala:163) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.apache.spark.streaming.StreamingContextSuite$$anonfun$14.apply(StreamingContextSuite.scala:159) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.streaming.StreamingContextSuite.withFixture(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198) [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:171) [info] at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260) [info] at org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326) [info] at 
org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304) [info] at org.apache.spark.streaming.StreamingContextSuite.runTests(StreamingContextSuite.scala:34) [info] at org.scalatest.Suite$class.run(Suite.scala:2303) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$FunSuite$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:362) [info] at org.scalatest.FunSuite$class.run(FunSuite.scala:1310) [info] at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:34) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:208) [info] at org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:34) [info] at org.scalatest.tools.ScalaTestFramework$ScalaTestRunner.run(ScalaTestFramework.scala:214) [info] at
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1442: Fix Version/s: 1.1.0 Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Fix For: 1.1.0 similiar to Hive, add window function support for catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1741. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 670 [https://github.com/apache/spark/pull/670] Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 `model.predict` returns a RDD of Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Assignee: (was: Aaron Davidson) Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1824) Python examples still take in master
[ https://issues.apache.org/jira/browse/SPARK-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1824: - Issue Type: Improvement (was: Bug) Python examples still take in master -- Key: SPARK-1824 URL: https://issues.apache.org/jira/browse/SPARK-1824 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1 A recent commit https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113 changed existing Spark examples in Scala and Java such that they no longer take in master as an argument. We forgot to do the same for Python. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-911: -- Description: If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. (was: \[Tentatively assigned to me, but anyone can do this if they'd like!\] If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range.) Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
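As an illustration of the proposed map pruning, a sketch built on the existing {{PartitionPruningRDD}} developer API (the object and method names here are made up, not part of the proposal):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object RangePruningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "range-pruning-sketch")

    // A sorted (K, V) RDD, e.g. events keyed by timestamp.
    val sorted: RDD[(Long, String)] =
      sc.parallelize((1L to 1000L).map(t => (t, s"event-$t")), 10).sortByKey()

    // Small driver-side range index: (partitionId, minKey, maxKey).
    val index: Array[(Int, Long, Long)] = sorted.mapPartitionsWithIndex { (pid, iter) =>
      if (iter.hasNext) {
        val keys = iter.map(_._1).toArray
        Iterator((pid, keys.min, keys.max))
      } else Iterator.empty
    }.collect()

    // Answer a range query by skipping partitions that cannot overlap [lo, hi].
    def rangeQuery(lo: Long, hi: Long): RDD[(Long, String)] = {
      val wanted = index.collect { case (pid, mn, mx) if mx >= lo && mn <= hi => pid }.toSet
      PartitionPruningRDD.create(sorted, wanted.contains)
        .filter { case (k, _) => k >= lo && k <= hi }
    }

    println(rangeQuery(100L, 120L).count())  // only the partitions covering [100, 120] are scanned
    sc.stop()
  }
}
{code}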
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Assignee: (was: Patrick Cogan) Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Description: When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. (was: This would make things nicer for writing automated tests of Spark.) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999602#comment-13999602 ] Patrick Wendell commented on SPARK-1860: I think it would be better to only start the TTL once an executor has finished and to only delete the specific folder used by the executor. Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.0.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Labels: starter (was: ) Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-944: -- Assignee: (was: Patrick Cogan) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1792) Missing Spark-Shell Configure Options
[ https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1792: --- Fix Version/s: 1.1.0 Missing Spark-Shell Configure Options - Key: SPARK-1792 URL: https://issues.apache.org/jira/browse/SPARK-1792 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Reporter: Joseph E. Gonzalez Fix For: 1.1.0 The `conf/spark-env.sh.template` does not have configure options for the spark shell. For example to enable Kryo for GraphX when using the spark shell in stand alone mode it appears you must add: {code} SPARK_SUBMIT_OPTS=-Dspark.serializer=org.apache.spark.serializer.KryoSerializer SPARK_SUBMIT_OPTS+=-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator {code} However SPARK_SUBMIT_OPTS is not documented anywhere. Perhaps the spark-shell should have its own options (e.g., SPARK_SHELL_OPTS). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1478: - Fix Version/s: 1.1.0 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor Fix For: 1.1.0 Flume-1915 added support for compression over the wire from avro sink to avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1820) Make GenerateMimaIgnore @DeveloperApi annotation aware.
[ https://issues.apache.org/jira/browse/SPARK-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1820: --- Affects Version/s: (was: 1.1.0) Make GenerateMimaIgnore @DeveloperApi annotation aware. --- Key: SPARK-1820 URL: https://issues.apache.org/jira/browse/SPARK-1820 Project: Spark Issue Type: Improvement Components: Build Reporter: Prashant Sharma Fix For: 1.1.0 Ignore all the classes with DeveloperApi annotation, while doing Mima checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999601#comment-13999601 ] Patrick Wendell commented on SPARK-1154: [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 
app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045 app-20140212173310-0626 app-20140218153442-0656 app-20140225232516-0016 app-20140225233540-0046 This problem is particularly bad if you have a whole bunch of fast jobs. Also what makes the problem worse is that any jars for jobs is downloaded into the app-* folder, so that fills up the disk particularly fast. I would like to propose two things: 1) Spark should have a cleanup thread (or actor) which periodically removes old app-* folders; This should not be the
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Description: When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 was:When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
[ https://issues.apache.org/jira/browse/SPARK-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1851: --- Assignee: Sandy Ryza Upgrade Avro dependency to 1.7.6 so Spark can read Avro files - Key: SPARK-1851 URL: https://issues.apache.org/jira/browse/SPARK-1851 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.0.0 I tried to set up a basic example getting a Spark job to read an Avro container file with Avro specifics. This results in a ClassNotFoundException: can't convert GenericData.Record to com.cloudera.sparkavro.User. The reason is: * When creating records, to decide whether to be specific or generic, Avro tries to load a class with the name specified in the schema. * Initially, executors just have the system jars (which include Avro), and load the app jars dynamically with a URLClassLoader that's set as the context classloader for the task threads. * Avro tries to load the generated classes with SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and goes up to the AppClassLoader. Avro 1.7.6 has a change (AVRO-987) that falls back to the Thread's context classloader when the SpecificData.class.getClassLoader() fails. I tested with Avro 1.7.6 and did not observe the problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
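For reference, a minimal sketch of the classloader fallback pattern that AVRO-987 introduces (illustrative only, not Avro's actual code): try the defining classloader first, then fall back to the thread context classloader, which on executors is the URLClassLoader holding the dynamically added application jars.
{code}
object ClassLoaderFallbackSketch {
  def loadClass(name: String): Class[_] = {
    try {
      Class.forName(name, true, getClass.getClassLoader)
    } catch {
      case _: ClassNotFoundException =>
        // Fall back to the context classloader, which can see the app jars.
        Class.forName(name, true, Thread.currentThread().getContextClassLoader)
    }
  }
}
{code}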
[jira] [Updated] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
[ https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-1830: --- Fix Version/s: 1.0.1 1.1.0 Deploy failover, Make Persistence engine and LeaderAgent Pluggable. --- Key: SPARK-1830 URL: https://issues.apache.org/jira/browse/SPARK-1830 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Prashant Sharma Fix For: 1.1.0, 1.0.1 With current code base it is difficult to plugin an external user specified Persistence Engine or Election Agent. It would be good to expose this as a pluggable API. -- This message was sent by Atlassian JIRA (v6.2#6252)
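To make the shape of the proposal concrete, a rough sketch of what a pluggable API could look like (these trait names and method signatures are illustrative only, not the actual interfaces from the PR):
{code}
// Illustrative only: a user-specified implementation could be selected via config.
trait PersistenceEngine {
  def persist(name: String, obj: AnyRef): Unit
  def unpersist(name: String): Unit
  def read[T](prefix: String): Seq[T]
}

trait LeaderElectionAgent {
  def stop(): Unit
}

// Example plug-in that persists nothing (the single-master default behavior).
class NoOpPersistenceEngine extends PersistenceEngine {
  override def persist(name: String, obj: AnyRef): Unit = {}
  override def unpersist(name: String): Unit = {}
  override def read[T](prefix: String): Seq[T] = Nil
}
{code}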
[jira] [Created] (SPARK-1846) RAT checks should exclude logs/ directory
Andrew Ash created SPARK-1846: - Summary: RAT checks should exclude logs/ directory Key: SPARK-1846 URL: https://issues.apache.org/jira/browse/SPARK-1846 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Andrew Ash When there are logs in the logs/ directory, the rat check from ./dev/check-license fails. ``` aash@aash-mbp ~/git/spark$ find logs -type f logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ./dev/check-license Could not find Apache license headers in the following files: !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out !? 
/Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ``` -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998807#comment-13998807 ] Michael Malak commented on SPARK-1836: -- Michael Ambrust: Indeed. Do you think I should add my additional case of equals() (and its workaround) as a comment to SPARK-1199 and mark this one as a duplicate? REPL $outer type mismatch causes lookup() and equals() problems --- Key: SPARK-1836 URL: https://issues.apache.org/jira/browse/SPARK-1836 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Michael Malak Anand Avati partially traced the cause to REPL wrapping classes in $outer classes. There are at least two major symptoms: 1. equals() = In REPL equals() (required in custom classes used as a key for groupByKey) seems to have to be written using instanceOf[] instead of the canonical match{} Spark Shell (equals uses match{}): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = o match { case that: C = that.s == s case _ = false } } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new ava.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = false {noformat} Spark Shell (equals uses isInstanceOf[]): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s = s) else false } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new ava.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = true {noformat} Scala Shell (equals uses match{}): {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = o match { case that: C = that.s == s case _ = false } } val x = new C(a) val bos = new java.io.ByteArrayOutputStream() val out = new java.io.ObjectOutputStream(bos) out.writeObject(x); val b = bos.toByteArray(); out.close bos.close val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C] x.equals(y) res: Boolean = true {noformat} 2. lookup() = {noformat} class C(val s:String) extends Serializable { override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false override def hashCode = s.hashCode override def toString = s } val r = sc.parallelize(Array((new C(a),11),(new C(a),12))) r.lookup(new C(a)) console:17: error: type mismatch; found : C required: C r.lookup(new C(a)) ^ {noformat} See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Fix Version/s: 1.1.0 Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1858) Update third-party Hadoop distros doc to list more distros
Matei Zaharia created SPARK-1858: Summary: Update third-party Hadoop distros doc to list more distros Key: SPARK-1858 URL: https://issues.apache.org/jira/browse/SPARK-1858 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares
[ https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1553: - Fix Version/s: 1.1.0 Support alternating nonnegative least-squares - Key: SPARK-1553 URL: https://issues.apache.org/jira/browse/SPARK-1553 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0 Reporter: Tor Myklebust Assignee: Tor Myklebust Fix For: 1.1.0 There's already an ALS implementation. It can be tweaked to support nonnegative least-squares by conditionally running a nonnegative least-squares solve instead of a least-squares solver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1854) Add a version of StreamingContext.fileStream that take hadoop conf object
Tathagata Das created SPARK-1854: Summary: Add a version of StreamingContext.fileStream that take hadoop conf object Key: SPARK-1854 URL: https://issues.apache.org/jira/browse/SPARK-1854 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000232#comment-14000232 ] Li Pu commented on SPARK-1782: -- [~mengxr] thank you for the comments! You are right, (A^T A) x can be done in a single pass. What we need is actually a eigs solver. Seems that mllib depends on Breeze 0.7 (though I cannot find this release version in scalanlp/breeze). I think the code would be cleaner if we call eigs in Breeze? or do you prefer to have more control in mllib by calling ARPACK directly? Also thank you for the pointer to PROPACK. I looked into the svd routine. It calls lapack dbdsqr for actual svd computation. I will try to find a better way to incorporate Fortran routines in a distributed way. For now we can use ARPACK. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be propositional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. The overall model is master machine runs ARPACK, and distribute matrix-vector multiplication onto working executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On Spark/milib side, just need to implement the sparsematrix-vector multiplication. It might take some time to optimize and fully test this implementation, so set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
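To illustrate the multiply the reverse-communication loop needs, a sketch (assuming the rows of A are stored as an RDD of Breeze sparse vectors; this is not the mllib implementation) that computes A^t*(A*x) in a single distributed pass, so the driver only ever handles length-n vectors:
{code}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
import org.apache.spark.rdd.RDD

object GramMultiplySketch {
  // Compute A^t * (A * x) in one distributed pass over the rows of A.
  def multiplyGramianBy(rows: RDD[BSV[Double]], x: BDV[Double]): BDV[Double] = {
    val n = x.length
    rows.map { row =>
      // scale = row . x, using only the row's non-zero entries
      var scale = 0.0
      row.activeIterator.foreach { case (i, v) => scale += v * x(i) }
      // this row's contribution to A^t (A x)
      val contrib = BDV.zeros[Double](n)
      row.activeIterator.foreach { case (i, v) => contrib(i) = v * scale }
      contrib
    }.reduce(_ + _)
  }
}
{code}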
[jira] [Updated] (SPARK-874) Have version of `sc.stop()` that blocks until all executors are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Component/s: Deploy Have version of `sc.stop()` that blocks until all executors are cleaned up. --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Patrick Cogan Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-944: -- Fix Version/s: (was: 1.0.0) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Patrick Cogan Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1368) HiveTableScan is slow
[ https://issues.apache.org/jira/browse/SPARK-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998619#comment-13998619 ] Cheng Lian commented on SPARK-1368: --- Corresponding PR: https://github.com/apache/spark/pull/758 HiveTableScan is slow - Key: SPARK-1368 URL: https://issues.apache.org/jira/browse/SPARK-1368 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.1.0 The major issues here are the use of functional programming (.map, .foreach) and the creation of a new Row object for each output tuple. We should switch to while loops in the critical path and a single MutableRow per partition. -- This message was sent by Atlassian JIRA (v6.2#6252)
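A generic sketch of the proposed style (not the HiveTableScan code itself): walk each partition's records while reusing a single mutable buffer, use a while loop over the columns, and copy only when a row is handed downstream.
{code}
import org.apache.spark.SparkContext

object ReusedRowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "reused-row-sketch")
    val input = sc.parallelize(Seq(Array("1", "a"), Array("2", "b"), Array("3", "c")))

    val converted = input.mapPartitions { iter =>
      val row = new Array[Any](2)            // one reusable buffer per partition
      iter.map { fields =>
        var i = 0
        while (i < fields.length) {          // while loop in the critical path
          row(i) = if (i == 0) fields(i).toInt else fields(i)
          i += 1
        }
        row.clone()                          // copy only when emitting downstream
      }
    }
    converted.collect().foreach(r => println(r.mkString(",")))
    sc.stop()
  }
}
{code}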
[jira] [Commented] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses
[ https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999123#comment-13999123 ] Xiangrui Meng commented on SPARK-1585: -- I think the gradient should pull the weights back. If I'm wrong, could you create an example code to demonstrate the problem? -Xiangrui Not robust Lasso causes Infinity on weights and losses -- Key: SPARK-1585 URL: https://issues.apache.org/jira/browse/SPARK-1585 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Reporter: Xusen Yin Assignee: Xusen Yin Fix For: 1.1.0 Lasso uses LeastSquaresGradient and L1Updater, but diff = brzWeights.dot(brzData) - label in LeastSquaresGradient would cause too big diff, then will affect the L1Updater, which increases weights exponentially. Small shrinkage value cannot lasso weights back to zero then. Finally, the weights and losses reach Infinity. For example, data = (0.5 repeats 10k times), weights = (0.6 repeats 10k times), then data.dot(weights) approximates 300+, the diff will be 300. Then L1Updater sets weights to approximate 300. In the next iteration, the weights will be set to approximate 3, and so on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
[ https://issues.apache.org/jira/browse/SPARK-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1851. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 795 [https://github.com/apache/spark/pull/795] Upgrade Avro dependency to 1.7.6 so Spark can read Avro files - Key: SPARK-1851 URL: https://issues.apache.org/jira/browse/SPARK-1851 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Priority: Critical Fix For: 1.0.0 I tried to set up a basic example getting a Spark job to read an Avro container file with Avro specifics. This results in a ClassNotFoundException: can't convert GenericData.Record to com.cloudera.sparkavro.User. The reason is: * When creating records, to decide whether to be specific or generic, Avro tries to load a class with the name specified in the schema. * Initially, executors just have the system jars (which include Avro), and load the app jars dynamically with a URLClassLoader that's set as the context classloader for the task threads. * Avro tries to load the generated classes with SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and goes up to the AppClassLoader. Avro 1.7.6 has a change (AVRO-987) that falls back to the Thread's context classloader when the SpecificData.class.getClassLoader() fails. I tested with Avro 1.7.6 and did not observe the problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
Xiangrui Meng created SPARK-1855: Summary: Provide memory-and-local-disk RDD checkpointing Key: SPARK-1855 URL: https://issues.apache.org/jira/browse/SPARK-1855 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution but are faster. It can help applications that require many iterations. -- This message was sent by Atlassian JIRA (v6.2#6252)
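A rough sketch of the effect that can be approximated today with replicated storage (this is not the proposed BlockRDD mechanism, and unlike a true checkpoint it does not truncate the lineage):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object LocalReplicatedCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "local-checkpoint-sketch")
    var current = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 10) {
      current = current.map(_ * 0.99)
      if (i % 5 == 0) {
        // Keep two replicas in memory / local disk so a single executor loss
        // does not force recomputing the whole lineage.
        current.persist(StorageLevel.MEMORY_AND_DISK_2)
        current.count()  // materialize eagerly
      }
    }
    println(current.sum())
    sc.stop()
  }
}
{code}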
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Fix Version/s: 1.1.0 Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
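For context, a sketch of the semantics being proposed, emulated with today's primitives (a driver-side reduce followed by a broadcast), which is precisely the driver bottleneck an efficient AllReduce would remove; the object and method names are made up:
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object AllReduceEmulation {
  // Every partition of the result holds the fully combined value.
  def allReduce[T: ClassTag](rdd: RDD[T], op: (T, T) => T): RDD[T] = {
    val combined = rdd.sparkContext.broadcast(rdd.reduce(op))
    rdd.mapPartitions(_ => Iterator.single(combined.value), preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "allreduce-sketch")
    val data = sc.parallelize(1 to 100, 4)
    println(allReduce[Int](data, _ + _).collect().mkString(","))  // prints 5050 four times
    sc.stop()
  }
}
{code}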
[jira] [Created] (SPARK-1852) SparkSQL Queries with Sorts run before the user asks them to
Michael Armbrust created SPARK-1852: --- Summary: SparkSQL Queries with Sorts run before the user asks them to Key: SPARK-1852 URL: https://issues.apache.org/jira/browse/SPARK-1852 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is related to [SPARK-1021] but will not be fixed by that since we do our own partitioning. Part of the problem here is that we calculate the range partitioning too eagerly. Though this could also be alleviated by avoiding the call to toRdd for non DDL queries. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1729) Make Flume pull data from source, rather than the current push model
[ https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1729: - Fix Version/s: 1.1.0 Make Flume pull data from source, rather than the current push model Key: SPARK-1729 URL: https://issues.apache.org/jira/browse/SPARK-1729 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan Fix For: 1.1.0 This makes sure that the if the Spark executor running the receiver goes down, the new receiver on a new node can still get data from Flume. This is not possible in the current model, as Flume is configured to push data to a executor/worker and if that worker is down, Flume cant push data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1230) Enable SparkContext.addJars() to load classes not in CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1230. Resolution: Incomplete I forget what this actually means (hah) so I'm gonna close it for now. Enable SparkContext.addJars() to load classes not in CLASSPATH -- Key: SPARK-1230 URL: https://issues.apache.org/jira/browse/SPARK-1230 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1638. Resolution: Duplicate Executors fail to come up if spark.executor.extraJavaOptions is set -- Key: SPARK-1638 URL: https://issues.apache.org/jira/browse/SPARK-1638 Project: Spark Issue Type: Bug Components: Spark Core Environment: Bring up a cluster in EC2 using spark-ec2 scripts Reporter: Kalpit Shah Fix For: 1.0.0 If you try to launch a PySpark shell with spark.executor.extraJavaOptions set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any of the workers. I see the following error in log file : Spark Executor Command: /usr/lib/jvm/java/bin/java -cp /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar: -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker app-20140423224526- Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
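For illustration only (a naive sketch, not the actual fix): the error arises because the whole option string reaches the JVM as one argument, so it has to be split into individual tokens before the executor command is built. A real implementation must also respect quoting.
{code}
object SplitJavaOptsSketch {
  def main(args: Array[String]): Unit = {
    val extraJavaOpts = "-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails"
    // Naive whitespace split, shown only to illustrate the failure mode.
    val command = Seq("java") ++ extraJavaOpts.split("\\s+").filter(_.nonEmpty) ++
      Seq("-cp", "spark-assembly.jar", "org.apache.spark.executor.CoarseGrainedExecutorBackend")
    println(command.mkString(" "))
  }
}
{code}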
[jira] [Updated] (SPARK-1368) HiveTableScan is slow
[ https://issues.apache.org/jira/browse/SPARK-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1368: Assignee: Cheng Lian HiveTableScan is slow - Key: SPARK-1368 URL: https://issues.apache.org/jira/browse/SPARK-1368 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.1.0 The major issues here are the use of functional programming (.map, .foreach) and the creation of a new Row object for each output tuple. We should switch to while loops in the critical path and a single MutableRow per partition. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1847) Pushdown filters on non-required parquet columns
Michael Armbrust created SPARK-1847: --- Summary: Pushdown filters on non-required parquet columns Key: SPARK-1847 URL: https://issues.apache.org/jira/browse/SPARK-1847 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust From Andre: TODO: we currently only filter on non-nullable (Parquet REQUIRED) attributes until https://github.com/Parquet/parquet-mr/issues/371 has been resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1768) History Server enhancements
[ https://issues.apache.org/jira/browse/SPARK-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1768: --- Fix Version/s: 1.1.0 History Server enhancements --- Key: SPARK-1768 URL: https://issues.apache.org/jira/browse/SPARK-1768 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.1.0 The history server currently has some limitations; the one that currently concerns me the most is that it limits the number of applications it will show, to avoid having to hold all applications in memory. It would be better if the code were smarter and able to show any application available in the history storage. Also, thinking forward a little bit (I'm thinking SPARK-1537), it would be nice to separate the serving logic from the logic to access app log data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1862) Add build support for MapR
Patrick Wendell created SPARK-1862: -- Summary: Add build support for MapR Key: SPARK-1862 URL: https://issues.apache.org/jira/browse/SPARK-1862 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.0.0 It would be nice to add support for some of the other distros in the build. MapR is one that I've done before, so it's a starting point. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1853: - Fix Version/s: 1.1.0 Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Fix For: 1.1.0 Right now, the code context (file and line number) shown for streaming jobs in the stages UI is meaningless, as it refers to a random line inside the internal DStream code rather than the user's application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1682) Add gradient descent w/o sampling and RDA L1 updater
[ https://issues.apache.org/jira/browse/SPARK-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1682: --- Fix Version/s: (was: 1.0.0) Add gradient descent w/o sampling and RDA L1 updater Key: SPARK-1682 URL: https://issues.apache.org/jira/browse/SPARK-1682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Dong Wang The GradientDescent optimizer does sampling before a gradient step. When the input data is already shuffled beforehand, it is possible to scan the data and take a gradient step for each data instance. This could potentially be more efficient. Also add an enhanced RDA L1 updater, which can produce even sparser solutions of comparable quality compared with the standard L1 updater. Reference: Lin Xiao, Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization, Journal of Machine Learning Research 11 (2010) 2543-2596. Small fix: add options to the BinaryClassification example to read and write a model file -- This message was sent by Atlassian JIRA (v6.2#6252)
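For reference on SPARK-1682, a rough sketch of an RDA-style L1 update following the cited Xiao (2010) paper; the parameters and averaging are simplified here, and the actual updater proposed in the issue may differ.
{code}
// gBar is the running average of all past gradients, t the iteration count,
// lambda the L1 regularization strength, gamma a step-size parameter.
// Coordinates whose average gradient is below lambda are truncated to zero,
// which is what yields the sparse solutions mentioned above.
def rdaL1Update(gBar: Array[Double], t: Int, lambda: Double, gamma: Double): Array[Double] =
  gBar.map { g =>
    if (math.abs(g) <= lambda) 0.0
    else -(math.sqrt(t) / gamma) * (g - lambda * math.signum(g))
  }
{code}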
[jira] [Updated] (SPARK-1272) Don't fail job if some local directories are buggy
[ https://issues.apache.org/jira/browse/SPARK-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1272: --- Fix Version/s: (was: 1.0.0) 1.1.0 Don't fail job if some local directories are buggy -- Key: SPARK-1272 URL: https://issues.apache.org/jira/browse/SPARK-1272 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Patrick Wendell Assignee: OuyangJin Fix For: 1.1.0 If Spark cannot create shuffle directories inside a local directory, it might make sense to just log an error and continue, provided that at least one valid shuffle directory exists. Otherwise, if a single disk is wonky, the entire job can fail. The downside is that this might mask failures if the person actually misconfigures the local directories to point to the wrong disk(s). -- This message was sent by Atlassian JIRA (v6.2#6252)
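A minimal sketch of the tolerance proposed in SPARK-1272 (directory names and structure are illustrative): try every configured local directory, keep the usable ones, and only fail when none work.
{code}
import java.io.File

// Keep directories where a shuffle subdirectory exists or can be created;
// a bad disk is skipped (the real code would log an error) instead of
// failing the whole job, unless no directory is usable at all.
def usableShuffleDirs(localDirs: Seq[String]): Seq[File] = {
  val usable = localDirs.map(d => new File(d, "shuffle")).filter(dir => dir.isDirectory || dir.mkdirs())
  require(usable.nonEmpty, "No usable local directories for shuffle files")
  usable
}
{code}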
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999493#comment-13999493 ] Xiangrui Meng commented on SPARK-1782: -- This sounds good to me. Let's assume that A is m-by-n where n is small (1e6). Note that (A^T A) x = (A^T A)^T x can be computed in a single pass, which is sum_i (a_i^T x) a_i. So we don't need to implement A^T y, which simplifies the task. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in MLlib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply the square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine. The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
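To make the single-pass product from the SPARK-1782 comment concrete: (A^T A) x = sum_i (a_i^T x) a_i, so each row contributes independently and the sum is one map plus one reduce over the row RDD. A rough sketch with dense rows (the real implementation would use sparse rows):
{code}
import org.apache.spark.rdd.RDD

// One distributed pass: for each row a_i compute the scalar a_i^T x, scale
// the row by it, and sum the scaled rows across the cluster.
def multiplyGram(rows: RDD[Array[Double]], x: Array[Double]): Array[Double] =
  rows.map { a =>
    val aDotX = a.zip(x).map { case (ai, xi) => ai * xi }.sum   // a_i^T x
    a.map(_ * aDotX)                                            // (a_i^T x) a_i
  }.reduce((u, v) => u.zip(v).map { case (ui, vi) => ui + vi })
{code}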
[jira] [Updated] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1580: - Fix Version/s: 1.1.0 ALS: Estimate communication and computation costs given a partitioner - Key: SPARK-1580 URL: https://issues.apache.org/jira/browse/SPARK-1580 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Priority: Minor Fix For: 1.1.0 It would be nice to be able to estimate the amount of work needed to solve an ALS problem. The chief components of this work are computation time---time spent forming and solving the least squares problems---and communication cost---the number of bytes sent across the network. Communication cost depends heavily on how the users and products are partitioned. We currently do not try to cluster users or products so that fewer feature vectors need to be communicated. This is intended as a first step toward that end---we ought to be able to tell whether one partitioning is better than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy
[ https://issues.apache.org/jira/browse/SPARK-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998557#comment-13998557 ] Sandy Ryza commented on SPARK-1813: --- https://github.com/apache/spark/pull/789 is what I mean. [~mrid...@yahoo-inc.com], are there common uses of Kryo you're saying this misses? Add a utility to SparkConf that makes using Kryo really easy Key: SPARK-1813 URL: https://issues.apache.org/jira/browse/SPARK-1813 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza It would be nice to have a method in SparkConf that makes it really easy to turn on Kryo serialization and register a set of classes. Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}
It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyClass1], classOf[MyClass2]))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Summary: Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished (was: Have version of `sc.stop()` that blocks until all executors are cleaned up.) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 This would make things nicer for writing automated tests of Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1856) Standardize MLlib interfaces
Xiangrui Meng created SPARK-1856: Summary: Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Critical Fix For: 1.1.0 Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1860) Standalone Worker cleanup should not clean up running applications by default
Aaron Davidson created SPARK-1860: - Summary: Standalone Worker cleanup should not clean up running applications by default Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files
Xiangrui Meng created SPARK-1861: Summary: ArrayIndexOutOfBoundsException when reading bzip2 files Key: SPARK-1861 URL: https://issues.apache.org/jira/browse/SPARK-1861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not thread-safe, and Spark may run multiple tasks in the same JVM, which leads to this error. This is not a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses
[ https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1585: --- Fix Version/s: (was: 1.0.0) 1.1.0 Not robust Lasso causes Infinity on weights and losses -- Key: SPARK-1585 URL: https://issues.apache.org/jira/browse/SPARK-1585 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Reporter: Xusen Yin Assignee: Xusen Yin Fix For: 1.1.0 Lasso uses LeastSquaresGradient and L1Updater, but diff = brzWeights.dot(brzData) - label in LeastSquaresGradient can produce a very large diff, which then affects the L1Updater and increases the weights exponentially. A small shrinkage value then cannot lasso the weights back to zero. Finally, the weights and losses reach Infinity. For example, with data = (0.5 repeated 10k times) and weights = (0.6 repeated 10k times), data.dot(weights) approximates 300+, so the diff will be about 300. The L1Updater then sets the weights to approximately 300. In the next iteration, the weights will be set to approximately 3, and so on. -- This message was sent by Atlassian JIRA (v6.2#6252)
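To illustrate the diff described in SPARK-1585, a plain-Scala sketch using the values from the description (the label is assumed for illustration): with unscaled features, the residual dwarfs the label, and the following weight update is proportional to it.
{code}
// Illustration only; feature values taken from the description above, label assumed.
val data    = Array.fill(10000)(0.5)
val weights = Array.fill(10000)(0.6)
val label   = 1.0
val diff    = data.zip(weights).map { case (x, w) => x * w }.sum - label
// diff is far larger than the label, so an unscaled L1 step overshoots.
{code}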
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Priority: Critical (was: Major) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1765) Modify a typo in monitoring.md
[ https://issues.apache.org/jira/browse/SPARK-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1765. Resolution: Fixed Fix Version/s: 1.0.0 Modify a typo in monitoring.md -- Key: SPARK-1765 URL: https://issues.apache.org/jira/browse/SPARK-1765 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta Priority: Minor Fix For: 1.0.0 There is a word 'JXM' in monitoring.md. I guess it's a typo for 'JMX'. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
Vlad Frolov created SPARK-1859: -- Summary: Linear, Ridge and Lasso Regressions with SGD yield unexpected results Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok. Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}
So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :) The resulting model has nan coeffs: {{array([ nan])}}. Furthermore, if you comment out the records one by one you will get:
* [-1.55897475e+296] coeff (the first record is commented),
* [-8.62115396e+104] coeff (the first two records are commented),
* etc.
It looks like the implemented regression algorithms diverge somehow. I get almost the same results for Ridge and Lasso. I've also tested these inputs in scikit-learn and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1630: --- Fix Version/s: (was: 1.0.0) 1.1.0 PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Fix For: 1.1.0 Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better to log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
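A hypothetical sketch of the behavior proposed in SPARK-1630 (names are illustrative, not the actual PythonRDD internals): skip null elements with a debug log instead of letting the write path throw an NPE.
{code}
import java.io.DataOutputStream

def writeStrings(iter: Iterator[String], out: DataOutputStream): Unit =
  iter.foreach {
    case null => ()            // the real code would log a DEBUG message here
    case s    => out.writeUTF(s)
  }
{code}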
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Fix Version/s: 1.1.0 flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nan Zhu Fix For: 1.1.0 The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is just in YARN-related files. I once reported it on the dev mail list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1487: Fix Version/s: 1.1.0 Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
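An illustrative usage for SPARK-1487 (assuming a SQLContext named sqlContext and a registered Parquet table; the table and column names are made up): a query whose WHERE clause could be pushed down into the Parquet scan, so non-matching records are skipped at read time rather than after deserialization.
{code}
// Hypothetical query; only the predicate-pushdown idea is the point here.
val filtered = sqlContext.sql("SELECT name FROM events_parquet WHERE status = 5")
{code}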
[jira] [Resolved] (SPARK-1340) Some Spark Streaming receivers are not restarted when worker fails
[ https://issues.apache.org/jira/browse/SPARK-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1340. -- Resolution: Fixed Resolved with https://issues.apache.org/jira/browse/SPARK-1332 Some Spark Streaming receivers are not restarted when worker fails -- Key: SPARK-1340 URL: https://issues.apache.org/jira/browse/SPARK-1340 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.0.0 For some streams, like the Kafka stream, the receiver does not get restarted if the worker running the receiver fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1848) Executors are mysteriously dying when using Spark on Mesos
Bouke van der Bijl created SPARK-1848: - Summary: Executors are mysteriously dying when using Spark on Mesos Key: SPARK-1848 URL: https://issues.apache.org/jira/browse/SPARK-1848 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 1.0.0 Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mesos 0.18.0 Spark Master Reporter: Bouke van der Bijl Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a We have 47 machines running Mesos that we're trying to run Spark jobs on, but they fail at some point because tasks have to get rescheduled too often, which is caused by Spark killing the tasks because of executors dying. When I look at the stderr or stdout of the Mesos slaves, there seems to be no indication of an error happening, and sometimes I can see a 14/05/15 17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from id, which would indicate that the executor just keeps going and hasn't actually died. If I add a Thread.dumpStack() at the location where the job is killed, this is the trace it returns: at java.lang.Thread.dumpStack(Thread.java:1364) at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87) at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287) What could cause this? Is this a setup problem with our cluster or a bug in Spark? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1810) The spark tar ball does not unzip into a separate folder when un-tarred.
[ https://issues.apache.org/jira/browse/SPARK-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1810. Resolution: Cannot Reproduce The spark tar ball does not unzip into a separate folder when un-tarred. Key: SPARK-1810 URL: https://issues.apache.org/jira/browse/SPARK-1810 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Environment: All environments Reporter: Manikandan Narayanaswamy Priority: Minor Labels: maven All other Hadoop components, when extracted, are contained within a newly created folder, but this is not the case for Spark. The Spark tar ball decompresses all files into the current working directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999115#comment-13999115 ] Patrick Wendell commented on SPARK-1741: This might end up in 1.0 or not, depending on whether we release the current RC. Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0, 1.0.1 `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
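A rough sketch of the convenience overload suggested in SPARK-1741 (the signature is illustrative, not the final API): wrap the Scala predict so Java callers get a JavaRDD of boxed doubles directly.
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.RegressionModel

// Map each vector through the existing predict(Vector) and box the result
// so Java sees JavaRDD[java.lang.Double] instead of an RDD of Objects.
def predictJava(model: RegressionModel, points: JavaRDD[Vector]): JavaRDD[java.lang.Double] =
  JavaRDD.fromRDD(points.rdd.map(p => java.lang.Double.valueOf(model.predict(p))))
{code}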
[jira] [Commented] (SPARK-1487) Support record filtering via predicate pushdown in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999034#comment-13999034 ] Michael Armbrust commented on SPARK-1487: - PR here: https://github.com/apache/spark/pull/511/ Support record filtering via predicate pushdown in Parquet -- Key: SPARK-1487 URL: https://issues.apache.org/jira/browse/SPARK-1487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Andre Schumacher Assignee: Andre Schumacher Fix For: 1.1.0 Parquet has support for column filters, which can be used to avoid reading and de-serializing records that fail the column filter condition. This can lead to potentially large savings, depending on the number of columns filtered by and how many records actually pass the filter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1154) Spark fills up disk with app-* folders
[ https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999601#comment-13999601 ] Patrick Wendell edited comment on SPARK-1154 at 5/16/14 5:00 AM: - [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Are you interested in fixing this? was (Author: pwendell): [~mkim] yes you are correct - this is broken. Checkout SPARK-1860. Spark fills up disk with app-* folders -- Key: SPARK-1154 URL: https://issues.apache.org/jira/browse/SPARK-1154 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Evan Chan Assignee: Evan Chan Priority: Critical Labels: starter Fix For: 1.0.0 Current version of Spark fills up the disk with many app-* folders: $ ls /var/lib/spark app-20140210022347-0597 app-20140212173327-0627 app-20140218154110-0657 app-20140225232537-0017 app-20140225233548-0047 app-20140210022407-0598 app-20140212173347-0628 app-20140218154130-0658 app-20140225232551-0018 app-20140225233556-0048 app-20140210022427-0599 app-20140212173754-0629 app-20140218164232-0659 app-20140225232611-0019 app-20140225233603-0049 app-20140210022447-0600 app-20140212182235-0630 app-20140218165133-0660 app-20140225232802-0020 app-20140225233610-0050 app-20140210022508-0601 app-20140212182256-0631 app-20140218165148-0661 app-20140225232822-0021 app-20140225233617-0051 app-20140210022528-0602 app-2014021314-0632 app-20140218165225-0662 app-20140225232940-0022 app-20140225233624-0052 app-20140211024356-0603 app-20140213002026-0633 app-20140218165249-0663 app-20140225233002-0023 app-20140225233631-0053 app-20140211024417-0604 app-20140213154948-0634 app-20140218172030-0664 app-20140225233056-0024 app-20140225233725-0054 app-20140211024437-0605 app-20140213171810-0635 app-20140218193853-0665 app-20140225233108-0025 app-20140225233731-0055 app-20140211024457-0606 app-20140213193637-0636 app-20140218194442-0666 app-20140225233124-0026 app-20140225233733-0056 app-20140211024517-0607 app-20140214011513-0637 app-20140218194746-0667 app-20140225233133-0027 app-20140225233734-0057 app-20140211024538-0608 app-20140214012151-0638 app-20140218194822-0668 app-20140225233147-0028 app-20140225233749-0058 app-20140211193443-0609 app-20140214013134-0639 app-20140218212317-0669 app-20140225233208-0029 app-20140225233759-0059 app-20140211195210-0610 app-20140214013332-0640 app-20140225180142- app-20140225233215-0030 app-20140225233809-0060 app-20140211213935-0611 app-20140214013642-0641 app-20140225180411-0001 app-20140225233224-0031 app-20140225233828-0061 app-20140211214227-0612 app-20140214014246-0642 app-20140225180431-0002 app-20140225233232-0032 app-20140225234719-0062 app-20140211215317-0613 app-20140214014607-0643 app-20140225180452-0003 app-20140225233239-0033 app-20140226032845-0063 app-20140211224601-0614 app-20140214184943-0644 app-20140225180512-0004 app-20140225233320-0034 app-20140226033004-0064 app-20140212022206-0615 app-20140214185118-0645 app-20140225180533-0005 app-20140225233328-0035 app-20140226033119-0065 app-2014021206-0616 app-20140214185851-0646 app-20140225180553-0006 app-20140225233354-0036 app-2014022604-0066 app-20140212022246-0617 app-20140214222856-0647 app-20140225181115-0007 app-20140225233402-0037 app-20140226033354-0067 app-20140212043704-0618 app-20140214231312-0648 app-20140225181244-0008 app-20140225233409-0038 app-20140226033538-0068 app-20140212043724-0619 app-20140214231434-0649 app-20140225182051-0009 app-20140225233416-0039 app-20140226033826-0069 app-20140212043745-0620 
app-20140214231542-0650 app-20140225183009-0010 app-20140225233426-0040 app-20140226034002-0070 app-20140212044016-0621 app-20140214231616-0651 app-20140225184133-0011 app-20140225233432-0041 app-20140226034053-0071 app-20140212044203-0622 app-20140214233016-0652 app-20140225184318-0012 app-20140225233439-0042 app-20140226034234-0072 app-20140212044224-0623 app-20140214233037-0653 app-20140225184709-0013 app-20140225233447-0043 app-20140226034426-0073 app-20140212045034-0624 app-20140218153242-0654 app-20140225184844-0014 app-20140225233526-0044 app-20140226034447-0074 app-20140212045119-0625 app-20140218153341-0655 app-20140225190051-0015 app-20140225233534-0045 app-20140212173310-0626 app-20140218153442-0656 app-20140225232516-0016 app-20140225233540-0046 This problem is particularly bad if you have a whole bunch of fast jobs. Also, what makes the problem worse is that any jars for jobs are downloaded into the app-* folder, so that fills up the disk.
[jira] [Updated] (SPARK-1623) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name
[ https://issues.apache.org/jira/browse/SPARK-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1623: --- Fix Version/s: (was: 1.0.0) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name - Key: SPARK-1623 URL: https://issues.apache.org/jira/browse/SPARK-1623 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Niraj Suthar -- This message was sent by Atlassian JIRA (v6.2#6252)