Re: Spark as an application library vs infra

2014-07-27 Thread Sandy Ryza
At Cloudera we recommend bundling your application separately from the Spark libraries. The two biggest reasons are:
* No need to modify your application jar when upgrading or applying a patch.
* When running on YARN, the Spark jar can be cached as a YARN local resource, meaning it doesn't need to

subscribe

2014-07-27 Thread James Todd

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-27 Thread Jianshi Huang
Hi Andrew, Thanks for the reply, I figured out the cause of the issue. Some resource files were missing from the JARs. A class's initialization depends on those resource files, so it threw that exception. I added the resource files explicitly to the --jars option and it worked fine. The "Caused by..." message

spark checkpoint details

2014-07-27 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me with the query below. I'm using JavaStreamingContext to read streaming files from an HDFS shared directory. When I start the Spark Streaming job it reads files from the HDFS shared directory and does some processing. When I stop and restart the job it again reads old file
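The behavior being described usually comes down to nothing recording which files were already handled across restarts. The idea of checkpointing processed input can be sketched outside Spark with a small ledger file (plain Python, not the Spark checkpoint API; the file names and ledger path below are hypothetical):

```python
import json
import os

def process_new_files(incoming, ledger_path):
    """Process only files not yet recorded in the checkpoint ledger."""
    done = set()
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            done = set(json.load(f))
    newly_processed = [name for name in incoming if name not in done]
    # ... real per-file work would happen here ...
    with open(ledger_path, "w") as f:
        json.dump(sorted(done | set(newly_processed)), f)
    return newly_processed
```

On the first run every file is processed; a "restarted" run pointed at the same ledger skips the old files and picks up only new ones, which is the behavior the question is after.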

Re: Spark as an application library vs infra

2014-07-27 Thread Tobias Pfeiffer
Mayur, I don't know if I exactly understand the context of what you are asking, but let me just mention issues I had with deploying. * As my application is a streaming application, it doesn't read any files from disk, so I have no Hadoop/HDFS in place and there is no need for it, eith

Re: Spark as an application library vs infra

2014-07-27 Thread Koert Kuipers
I used to do 1) but couldn't get it to work on YARN, and the trend seemed towards 2) using spark-submit, so I gave in. The main promise of 2) is that you can provide an application that can run on multiple Hadoop and Spark versions. However, for that to become true Spark needs to address the issue of us

Re: Spark as an application library vs infra

2014-07-27 Thread Krishna Sankar
- IMHO, #2 is preferred as it could work in any environment (Mesos, Standalone et al). While Spark needs HDFS (for any decent distributed system), YARN is not required at all - Mesos is a lot better. - Also, managing the app with an appropriate bootstrap/deployment framework is more flexi

RE: Strange exception on coalesce()

2014-07-27 Thread innowireless TaeYun Kim
Thank you. It works. (I've applied the changed source code to my local 1.0.0 source) -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday, July 25, 2014 11:47 PM To: user@spark.apache.org Subject: Re: Strange exception on coalesce() I'm pretty sure this was already

Spark as an application library vs infra

2014-07-27 Thread Mayur Rustagi
Based on some discussions with my application users, I have been trying to come up with a standard way to deploy applications built on Spark: 1. Bundle the version of Spark with your application and ask users to store it in HDFS before referencing it in YARN to boot your application. 2. Provide ways to

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng
I think this is nice to have. Feel free to create a JIRA for it and it would be great if you can send a PR. Thanks! -Xiangrui On Thu, Jul 24, 2014 at 12:39 PM, SK wrote: > > Hi, > > The mllib.clustering.kmeans implementation supports a random or parallel > initialization mode to pick the initial
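The feature being requested — seeding k-means from caller-supplied centers rather than random or parallel initialization — is easy to picture. A minimal sketch of Lloyd's iterations starting from explicit centers (plain Python, illustrative only, not the MLlib API):

```python
def kmeans_from_centers(points, centers, iterations=10):
    """Run Lloyd's algorithm starting from explicitly given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # Assign each point to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its assigned cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = [sum(coord) / len(cluster) for coord in zip(*cluster)]
    return centers

# Two well-separated groups, with one seed placed near each group.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_from_centers(pts, [(0.0, 0.5), (9.0, 9.0)]))  # → [[0.0, 0.5], [10.0, 10.5]]
```

The only change relative to the existing initialization modes is where the first `centers` list comes from, which is why exposing it as an option is a small, self-contained addition.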

Re: KMeans: expensiveness of large vectors

2014-07-27 Thread Xiangrui Meng
If you have an m-by-n dataset and train a k-means model with k centers, the cost of each iteration is O(m * n * k) (assuming dense data). Since m * n * k == n * m * k, ideally you would expect the same run time. However: 1. Communication. We need to broadcast current centers (m * k), do the computati

Maximum jobs finish very soon, some of them take longer time.

2014-07-27 Thread Sarthak Dash
Hello everyone, I am trying out Spark for the first time, and after a month of work I am stuck with an issue. I have a very simple program that, given a directed graph with node/edge parameters and a particular node, tries to figure out all the siblings (in the traditional sense) of the given n
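For reference on the problem statement: "siblings in the traditional sense" — nodes sharing at least one parent with the given node — can be computed from an edge list in a few lines. A plain-Python sketch (the graph and node names here are hypothetical, not from the poster's data):

```python
from collections import defaultdict

def siblings(edges, node):
    """Return nodes sharing at least one parent with `node` in a directed graph."""
    parents = defaultdict(set)   # child  -> set of parents
    children = defaultdict(set)  # parent -> set of children
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
    result = set()
    for p in parents[node]:
        result |= children[p]    # all children of any of node's parents
    result.discard(node)         # a node is not its own sibling
    return result

edges = [("a", "b"), ("a", "c"), ("d", "c"), ("d", "e")]
print(sorted(siblings(edges, "c")))  # → ['b', 'e']
```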

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-27 Thread lokesh.gidra
I have confirmed it is not GC related. Oprofile shows stop-the-world pauses separately from regular Java methods. However, I was wrong when I said that the amount of time spent in writeObject0 is much more in local[n] mode as compared to standalone mode. It is instead the hashCode function. Th

Re: Spark MLlib vs BIDMach Benchmark

2014-07-27 Thread Ameet Talwalkar
To add to the last point, multimodel training is something we've explored as part of the MLbase Optimizer, and we've seen some nice speedups. This feature will be added to MLlib soon (not sure if it'll make it into the 1.1 release though). On Sat, Jul 26, 2014 at 11:27 PM, Matei Zaharia wrote:

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread DB Tsai
Could you help provide a test case to verify this issue and open a JIRA to track it? Also, are you interested in submitting a PR to fix it? Thanks. Sent from my Google Nexus 5 On Jul 27, 2014 11:07 AM, "Aureliano Buendia" wrote: > Hi, > > The recently added NNLS implementation in MLlib returns

Re: SparkSQL extensions

2014-07-27 Thread Michael Armbrust
Ah, I understand now. That sounds pretty useful and is something we would currently plan very inefficiently. On Sun, Jul 27, 2014 at 1:07 AM, Christos Kozanitis wrote: > Thanks Michael for the recommendations. Actually the region-join (or I > could name it range-join or interval-join) that I w

MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread Aureliano Buendia
Hi, The recently added NNLS implementation in MLlib returns wrong solutions. This is not data specific; just try any data in R's nnls, and then the same data in MLlib's NNLS. The results are very different. Also, the selected algorithm, Polyak (1969), is not the best one around. The most popular one
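For readers who want an independent reference point when comparing solvers: NNLS (minimize ||Ax - b||² subject to x ≥ 0) can be approximated with projected gradient descent. This is a crude illustrative sketch — not the Polyak method, not an active-set solver, and not MLlib's implementation — but it is enough to sanity-check small cases:

```python
def nnls_projected_gradient(A, b, lr=0.01, iterations=5000):
    """Approximate argmin ||Ax - b||^2 subject to x >= 0 via projected gradient."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iterations):
        # residual r = Ax - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # gradient g = 2 * A^T r
        g = [2 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step, then project onto the nonnegative orthant
        x = [max(0.0, x[j] - lr * g[j]) for j in range(n)]
    return x

# The unconstrained least-squares solution here is (1, -1);
# the nonnegativity constraint should clamp the second coordinate to 0.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
print(nnls_projected_gradient(A, b))  # ≈ [1.0, 0.0]
```

Running the same tiny systems through R's nnls, this sketch, and MLlib's solver is a quick way to produce the concrete test case DB Tsai asks for in the reply above.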

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-27 Thread Aaron Davidson
I see. There should not be a significant algorithmic difference between those two cases, as far as I can think, but there is a good bit of "local-mode-only" logic in Spark. One typical problem we see on large-heap, many-core JVMs, though, is much more time spent in garbage collection. I'm not sure

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-27 Thread lokesh.gidra
I am comparing the total time spent finishing the job. What I am comparing, to be precise, is on a 48-core machine: the performance of local[48] vs. standalone mode with 8 nodes of 6 cores each (totalling 48 cores) on localhost. In this comparison, the standalone mode outperfo

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-27 Thread Aaron Davidson
What are you comparing in your last experiment? Time spent in writeObject0 on a 100-core machine (!) vs a cluster? On Sat, Jul 26, 2014 at 11:59 PM, lokesh.gidra wrote: > Thanks a lot for clarifying this. This explains why there is less > serialization happening with lesser parallelism. There w

Re: akka 2.3.x?

2014-07-27 Thread yardena
Thanks Matei! I looked at the pull request, but we are not yet ready to move to Scala 2.11 and at this point prefer to upgrade only Akka, so I filed https://issues.apache.org/jira/browse/SPARK-2707 as a separate issue. Yardena -- View this message in context: http://apache-spark-user-list.

Re: SparkSQL extensions

2014-07-27 Thread Christos Kozanitis
Thanks Michael for the recommendations. Actually the region-join (or I could name it range-join or interval-join) that I was thinking should join the entries of two tables with inequality predicates. For example if table A(col1 int, col2 int) contains entries (1,4) and (10,12) and table b(c1 int, c
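The inequality predicate being described — pairing rows of A with rows of B whenever their intervals overlap — can be prototyped with a nested-loop comparison. A plain-Python sketch (the A values (1,4) and (10,12) are from the message; the B values are made up for illustration):

```python
def interval_join(a_rows, b_rows):
    """Join (col1, col2) intervals from A with (c1, c2) intervals from B that overlap."""
    out = []
    for a1, a2 in a_rows:
        for b1, b2 in b_rows:
            # Two intervals overlap iff each one starts before the other ends.
            if a1 <= b2 and b1 <= a2:
                out.append(((a1, a2), (b1, b2)))
    return out

A = [(1, 4), (10, 12)]
B = [(3, 5), (13, 20)]
print(interval_join(A, B))  # → [((1, 4), (3, 5))]
```

The nested loop is the O(|A| * |B|) plan Michael alludes to; a dedicated range-join would instead sort or index one side (e.g. an interval tree) so each probe touches only candidate rows.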

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-27 Thread lokesh.gidra
Thanks a lot for clarifying this. This explains why there is less serialization happening with less parallelism. There would be less network communication, and hence less serialization, right? But then if we compare 100 cores in local mode vs. 10 nodes of 10 cores each in standalone mode, then am