Matrix Multiplication of two RDD[Array[Double]]'s

2014-05-18 Thread Liquan Pei
Hi, I am currently implementing an algorithm involving matrix multiplication. Basically, I have matrices represented as RDD[Array[Double]]. For example, if I have A: RDD[Array[Double]] and B: RDD[Array[Double]], what would be the most efficient way to get C = A * B? Both A and B are large, so it
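
One possible way to approach the question above, offered as a minimal sketch rather than the most efficient answer: treat each Array[Double] as one matrix row, recover row indices with zipWithIndex, join A's entries with B's rows on the shared inner index, and sum the scaled rows per output row. The object name RowMatrixMultiplySketch and the method multiply are hypothetical. The sketch assumes the order of elements in each RDD is the row order, and that A's column count equals B's row count.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark or MLlib.
object RowMatrixMultiplySketch {
  // Assumes each Array[Double] is one row and RDD order is row order.
  def multiply(a: RDD[Array[Double]], b: RDD[Array[Double]]): RDD[Array[Double]] = {
    // Key every entry of A by its column index k: (k, (i, a_ik)).
    val aByCol = a.zipWithIndex().flatMap { case (row, i) =>
      row.zipWithIndex.map { case (v, k) => (k.toLong, (i, v)) }
    }
    // Key every row of B by its row index k: (k, B(k, :)).
    val bByRow = b.zipWithIndex().map { case (row, k) => (k, row) }

    aByCol.join(bByRow)
      // Each match contributes a_ik * B(k, :) to output row i.
      .map { case (_, ((i, aik), bRow)) => (i, bRow.map(_ * aik)) }
      // Sum the partial rows element-wise per output row index.
      .reduceByKey { (x, y) => x.zip(y).map { case (p, q) => p + q } }
      // Restore row order before dropping the index.
      .sortByKey()
      .values
  }
}
```

This shuffles every entry of A, so it is only a baseline. Blocking the matrices or broadcasting the smaller one are the usual next steps when one side fits in memory.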

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-18 Thread Mridul Muralidharan
On 18-May-2014 5:05 am, Mark Hamstra m...@clearstorydata.com wrote: I don't understand. We never said that interfaces wouldn't change from 0.9 Agreed. to 1.0. What we are committing to is stability going forward from the 1.0.0 baseline. Nobody is disputing that backward-incompatible

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-18 Thread Mridul Muralidharan
So I think I need to clarify a few things here, particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-) Most of the issues I mentioned are internal implementation details of Spark core, which means we can enhance them in the future without

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried `rdd.map { i => System.getProperty("java.class.path") }.collect()` but didn't

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Jacek Laskowski
Hi, I'm curious whether it's a common approach to have discussions in JIRA rather than here. I don't think it's the ASF way. Regards, Jacek Laskowski http://blog.japila.pl On 17 May 2014 23:55, Matei Zaharia matei.zaha...@gmail.com wrote: We do actually have replicated StorageLevels in Spark. You can

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Andrew Ash
The nice thing about putting discussion on the Jira is that everything about the bug is in one place. So people looking to understand the discussion a few years from now only have to look on the jira ticket rather than also search the mailing list archives and hope commenters all put the string

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@db - it's possible that you aren't including the jar in the classpath of your driver program (I think this is what mridul was suggesting). It would be helpful to see the stack trace of the CNFE. - Patrick On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell pwend...@gmail.com wrote: @xiangrui -

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@xiangrui - we don't expect these to be present on the system classpath, because they get dynamically added by Spark (e.g. your application can call sc.addJar well after the JVMs have started). @db - I'm pretty surprised to see that behavior. It's definitely not intended that users need

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
I took the always fun task of testing it on Windows, and unfortunately, I found some small problems with the prebuilt packages due to recent changes to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn’t quite match

Re: Matrix Multiplication of two RDD[Array[Double]]'s

2014-05-18 Thread Andrew Ash
Hi Liquan, There is some work being done on implementing linear algebra algorithms on Spark for use in higher-level machine learning algorithms. That work is happening in the MLlib project, which has an org.apache.spark.mllib.linalg package you may find useful. See

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
JIRA comments are mirrored to the iss...@spark.apache.org list, so people who want to get them by email can do so. In theory one should also be able to reply to one of those emails and have the message show up in JIRA, but I don’t think ours is configured that way. I’m not sure why it wouldn’t

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Jacek Laskowski
On Sun, May 18, 2014 at 8:28 PM, Andrew Ash and...@andrewash.com wrote: The nice thing about putting discussion on the Jira is that everything about the bug is in one place. So people looking to understand the discussion a few years from now only have to look on the jira ticket rather than

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
Hi Patrick, If spark-submit works correctly, the user only needs to specify runtime jars via `--jars` instead of using `sc.addJar`. Is that correct? I checked SparkSubmit and yarn.Client but didn't find any code to handle `args.jars` for YARN mode. So I don't know where in the code the jars in the

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
Hi Sandy, It is hard to imagine that a user needs to create an object in that way. Since the jars are already in distributed cache before the executor starts, is there any reason we cannot add the locally cached jars to classpath directly? Best, Xiangrui On Sun, May 18, 2014 at 4:00 PM, Sandy

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
BTW in Spark the consensus so far was that we’d use the dev@ list for high-level discussions (e.g. change in the development process, major features, proposals of new components, release votes) and keep lower-level issue tracking in JIRA. This is just how the project operated before so it was

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
Hey Xiangrui, If the jars are placed in the distributed cache and loaded statically, as the primary app jar is in YARN, then it shouldn't be an issue. Other jars, however, including additional jars that are sc.addJar'd and jars specified with the spark-submit --jars argument, are loaded

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread DB Tsai
The reflection actually works, but you need to get the loader via `val loader = Thread.currentThread.getContextClassLoader`, which is set by the Spark executor. Our team verified this and uses it as a workaround. Sincerely, DB Tsai --- My Blog:
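
The workaround described above could look roughly like the following sketch, which loads a class by name through the thread's context class loader and instantiates it reflectively. The object name ContextLoaderSketch and the method newInstance are hypothetical, and the sketch assumes the target class has a public no-argument constructor. It uses only JVM reflection, so it demonstrates the lookup path rather than Spark's own loading logic.

```scala
// Hypothetical helper illustrating the reflection workaround; not Spark code.
object ContextLoaderSketch {
  def newInstance(className: String): Any = {
    // On a Spark executor the context class loader is expected to see jars
    // added via sc.addJar; fall back to this class's loader if none is set.
    val loader = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    // Resolve the class through that loader instead of the caller's default one.
    val clazz = Class.forName(className, true, loader)
    // Assumes a public no-argument constructor exists.
    clazz.getDeclaredConstructor().newInstance()
  }
}
```

Inside a task, the call would pass the fully qualified name of a class from the added jar, for example a hypothetical `com.example.MyModel`. Outside Spark, `ContextLoaderSketch.newInstance("java.util.ArrayList")` exercises the same path.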

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread DB Tsai
The jars are included in my driver, and I can successfully use them in the driver. I'm working on a patch, and it's almost working. Will submit a PR soon. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Fwd: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread DB Tsai
Since the additional jars added by sc.addJar are served through an HTTP server, even if it works, we still want a better way for scalability reasons (imagine thousands of workers downloading jars from the driver). If we ignore the fundamental scalability issue, this can be fixed by using the

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
Alright, I’ve opened https://github.com/apache/spark/pull/819 with the Windows fixes. I also found one other likely bug, https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages for Hadoop1 built in this RC. I think this is due to Hadoop 1’s security code depending on a

Re: can RDD be shared across mutil spark applications?

2014-05-18 Thread qingyang li
Thanks for sharing. I am using Tachyon to store RDDs now. 2014-05-18 12:02 GMT+08:00 Christopher Nguyen c...@adatao.com: Qing Yang, Andy is correct in answering your direct question. At the same time, depending on your context, you may be able to apply a pattern where you turn the single

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Tom Graves
No ideas offhand; I'll take a look tomorrow. Tom On Sunday, May 18, 2014 7:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, I’ve opened https://github.com/apache/spark/pull/819 with the Windows fixes. I also found one other likely bug,

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Patrick Wendell
Hey Matei - the issue you found is not related to security. This patch a few days ago broke builds for Hadoop 1 with YARN support enabled. The patch directly altered the way we deal with commons-lang dependency, which is what is at the base of this stack trace.

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Mridul Muralidharan
My bad ... I was replying via mobile, and I did not realize responses to JIRA mails were not mirrored to JIRA - unlike PR responses ! Regards, Mridul On Sun, May 18, 2014 at 2:50 AM, Matei Zaharia matei.zaha...@gmail.com wrote: We do actually have replicated StorageLevels in Spark. You can use