At Cloudera we recommend bundling your application separately from the
Spark libraries. The two biggest reasons are:
* No need to modify your application jar when upgrading or applying a patch.
* When running on YARN, the Spark jar can be cached as a YARN local
resource, meaning it doesn't need to
Hi Andrew,
Thanks for the reply, I figured out the cause of the issue. Some resource
files were missing from the JARs. A class's initialization depends on those
resource files, so it threw that exception.
I appended the resource files explicitly to the --jars option and it worked
fine.
The "Caused by..." message
Hi Team,
Could you please help me with the query below.
I'm using JavaStreamingContext to read streaming files from an HDFS shared
directory. When I start the Spark streaming job, it reads files from the HDFS
shared directory and does some processing. When I stop and restart the job, it
is again reading old file
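If the symptom is that a restarted job reprocesses files it already handled, the usual remedy in Spark Streaming is checkpointing, which persists progress across restarts. The plain-Python sketch below illustrates only the bookkeeping idea (remember processed file names in durable state so a restart skips them); the directory and state-file names are hypothetical, and this is not Spark API code.

```python
import json
import os

def process_new_files(incoming_dir, state_path):
    """Process only files not seen before; persist the seen-set across restarts.

    A plain-Python sketch of the bookkeeping that Spark Streaming's
    checkpointing provides. incoming_dir and state_path are made-up names.
    """
    # Load the set of already-processed file names, if any survived a restart.
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))
    # Only files we have not seen before are candidates for processing.
    new_files = sorted(f for f in os.listdir(incoming_dir) if f not in seen)
    for name in new_files:
        pass  # real per-file work would go here
    # Persist the updated seen-set so a restart does not reprocess old files.
    seen.update(new_files)
    with open(state_path, "w") as f:
        json.dump(sorted(seen), f)
    return new_files
```

Calling this twice over the same directory processes each file exactly once, even across process restarts, because the state file survives.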
Mayur,
I don't know if I exactly understand the context of what you are asking,
but let me just mention issues I had with deploying.
* As my application is a streaming application, it doesn't read any files
from disk, so I have no Hadoop/HDFS in place and there is no need for it,
eith
I used to do 1) but couldn't get it to work on YARN, and the trend seemed
to be towards 2) using spark-submit, so I gave in.
The main promise of 2) is that you can provide an application that can run
on multiple Hadoop and Spark versions. However, for that to become true
Spark needs to address the issue of us
- IMHO, #2 is preferred as it could work in any environment (Mesos,
Standalone et al). While Spark needs HDFS (for any decent distributed
system), YARN is not required at all - Mesos is a lot better.
- Also managing the app with appropriate bootstrap/deployment framework
is more flexi
Thank you. It works.
(I've applied the changed source code to my local 1.0.0 source)
-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday, July 25, 2014 11:47 PM
To: user@spark.apache.org
Subject: Re: Strange exception on coalesce()
I'm pretty sure this was already
Based on some discussions with my application users, I have been trying to come
up with a standard way to deploy applications built on Spark:
1. Bundle the version of Spark with your application and ask users to store it
in HDFS before referencing it in YARN to boot your application.
2. Provide ways to
I think this is nice to have. Feel free to create a JIRA for it and it
would be great if you can send a PR. Thanks! -Xiangrui
On Thu, Jul 24, 2014 at 12:39 PM, SK wrote:
>
> Hi,
>
> The mllib.clustering.kmeans implementation supports a random or parallel
> initialization mode to pick the initial
If you have an m-by-n dataset and train a k-means model with k clusters,
the cost of each iteration is O(m * n * k) (assuming dense data).
Since m * n * k == n * m * k, ideally you would expect the same run
time. However:
1. Communication. We need to broadcast the current centers (m * k), do the
computati
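The arithmetic above can be checked directly: the per-iteration flop term is symmetric in m and n, while the broadcast term is not, which is one reason the two data orientations need not run in the same time. A toy cost model (purely illustrative arithmetic following the thread's convention of m-by-n data with dimension-m points, not measured Spark behaviour):

```python
def kmeans_iter_costs(m, n, k):
    """Rough per-iteration cost model for dense k-means on an m-by-n dataset
    (n points of dimension m) with k centers, per the conventions above.
    Illustrative only, not a measurement of MLlib."""
    flops = m * n * k   # distance computation: every point vs. every center
    broadcast = m * k   # current centers shipped to every worker
    return flops, broadcast

# Transposing the data leaves the flop count unchanged but changes the
# broadcast size, so run times can legitimately differ:
f1, b1 = kmeans_iter_costs(1000, 10, 5)   # 1000-dimensional points
f2, b2 = kmeans_iter_costs(10, 1000, 5)   # 10-dimensional points
```

Here f1 == f2, but the broadcast term shrinks from 5000 values to 50 after transposing.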
Hello everyone,
I am trying out Spark for the first time, and after a month of work - I am
stuck with an issue. I have a very simple program that, given a directed
Graph with nodes/edges parameters and a particular node, tries to figure out
all the siblings (in the traditional sense) of the given n
I have confirmed, it is not GC related. Oprofile shows stop-the-world
pauses separately from the regular java methods.
However, I was wrong when I said that the amount of time spent in
writeObject0 is much higher in local[n] mode than in standalone mode.
It is instead the hashCode function. Th
To add to the last point, multimodel training is something we've explored
as part of the MLbase Optimizer, and we've seen some nice speedups. This
feature will be added to MLlib soon (not sure if it'll make it into the 1.1
release though).
On Sat, Jul 26, 2014 at 11:27 PM, Matei Zaharia
wrote:
Could you provide a test case to verify this issue and open a JIRA to track
it? Also, would you be interested in submitting a PR to fix it? Thanks.
Sent from my Google Nexus 5
On Jul 27, 2014 11:07 AM, "Aureliano Buendia" wrote:
> Hi,
>
> The recently added NNLS implementation in MLlib returns
Ah, I understand now. That sounds pretty useful and is something we would
currently plan very inefficiently.
On Sun, Jul 27, 2014 at 1:07 AM, Christos Kozanitis
wrote:
> Thanks Michael for the recommendations. Actually the region-join (or I
> could name it range-join or interval-join) that I w
Hi,
The recently added NNLS implementation in MLlib returns wrong solutions.
This is not data specific, just try any data in R's nnls, and then the same
data in MLlib's NNLS. The results are very different.
Also, the chosen algorithm, Polyak (1969), is not the best one around. The
most popular one
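To make "wrong solutions" concrete: whatever algorithm is used, an NNLS solver must return a nonnegative x that minimizes ||Ax - b||. Below is a tiny reference solver using projected gradient descent in plain Python; it is not MLlib's Polyak-style implementation, just an independent baseline one could compare a solver's output against. The matrix, vector, and step size are made up for illustration.

```python
def nnls_pg(A, b, iters=5000, step=0.1):
    """Tiny reference NNLS via projected gradient descent (plain Python lists).

    Not MLlib's solver -- a baseline for sanity-checking that a candidate
    NNLS solution is nonnegative and near-optimal. step must be below
    2 / (2 * lambda_max(A^T A)) for convergence; 0.1 suits the demo below.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # gradient of ||Ax - b||^2 is g = 2 A^T r
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step, then project onto the nonnegative orthant
        x = [max(0.0, x[j] - step * g[j]) for j in range(n)]
    return x
```

For A = [[1,0],[0,1],[1,1]] and b = [1,-1,0], the unconstrained least-squares solution is [1, -1], but the NNLS solution clamps the negative coordinate and lands at [0.5, 0].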
I see. There should not be a significant algorithmic difference between
those two cases, as far as I can tell, but there is a good bit of
"local-mode-only" logic in Spark.
One typical problem we see on large-heap, many-core JVMs, though, is much
more time spent in garbage collection. I'm not sure
I am comparing the total time spent finishing the job. What I am
comparing, to be precise, on a 48-core machine, is the performance of
local[48] vs. standalone mode with 8 nodes of 6 cores each (totalling 48
cores) on localhost. In this comparison, the standalone mode
outperfo
What are you comparing in your last experiment? Time spent in writeObject0
on a 100-core machine (!) vs a cluster?
On Sat, Jul 26, 2014 at 11:59 PM, lokesh.gidra
wrote:
> Thanks a lot for clarifying this. This explains why there is less
> serialization happening with lesser parallelism. There w
Thanks Matei!
I looked at the pull request, but we are not yet ready to move to Scala 2.11
and at this point prefer to upgrade only Akka, so I filed
https://issues.apache.org/jira/browse/SPARK-2707 as a separate issue.
Yardena
Thanks Michael for the recommendations. Actually the region-join (or I
could name it range-join or interval-join) that I was thinking should join
the entries of two tables with inequality predicates. For example if table
A(col1 int, col2 int) contains entries (1,4) and (10,12) and table b(c1
int, c
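A minimal sketch of such a range-join as a naive nested loop in plain Python (the sample data below is made up, since table B's contents were truncated above; a production version would use an interval tree or a sort-merge pass instead of the quadratic loop):

```python
def range_join(a_rows, b_rows):
    """Naive nested-loop range join: pair rows whose intervals overlap.

    Each row is a (start, end) tuple; rows a and b join when
    a.start <= b.end and b.start <= a.end (inequality predicates,
    as in the interval-join discussed above). Illustrative sketch only.
    """
    out = []
    for a in a_rows:
        for b in b_rows:
            if a[0] <= b[1] and b[0] <= a[1]:
                out.append((a, b))
    return out
```

For example, joining A = [(1,4), (10,12)] against a hypothetical B = [(3,5), (20,30)] pairs (1,4) with (3,5) and nothing else.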
Thanks a lot for clarifying this. This explains why there is less
serialization happening with lower parallelism: there would be less network
communication, and hence less serialization, right?
But then if we compare 100 cores in local mode vs. 10 nodes of 10 cores each
in standalone mode, then am