Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That would help in diagnosing the
issue. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
At the reduceByKey stage, it takes
Hi,
This is on a 4-node cluster, each node with 32 cores/256GB RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB. Spark executor memory is also 192GB.
This is on the first iteration. K=50. Here’s the code I use:
http://pastebin.com/2yXL3y8i , which is a
Hi,
I have few questions regarding log file management in spark:
1. Currently I did not find any way to modify the log file name for
executors/drivers. It's hardcoded as stdout and stderr. Also, there is no log
rotation.
In case of streaming application this will grow forever and become
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than
0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
exceeded*
But I had set the following in spark-env.sh and
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans algorithm in mllib clustering for
your problem.
Thanks. Let me try with a smaller K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:
K = 50 is certainly a large
To be clear on what your configuration will do:
- SPARK_DAEMON_MEMORY=8g will make your standalone master and worker
schedulers have a lot of memory. These do not impact the actual amount of
useful memory given to executors or your driver, however, so you probably
don't need to set this.
-
PS: you have a typo in DEAMON - it's DAEMON. Thanks, Latin.
On Mar 24, 2014 7:25 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
than 0 times;
I am also facing the same problem. I have implemented Serializable for my
code, but the exception is thrown from third-party libraries over which I have
no control.
Exception in thread main org.apache.spark.SparkException: Job aborted:
Task not serializable: java.io.NotSerializableException: (lib
Can someone answer this question, please?
Specifically about the Serializable implementation of dependent jars?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p3087.html
Sent from the Apache
Yes, my input data is partitioned in a completely random manner, so each
worker that produces shuffle data processes only a part of it. The way I
understand it is that before each stage, each worker needs to distribute the
correct partitions (based on hash key ranges?) to other workers. And this
is
Hello,
Has anyone got any ideas? I am not quite sure if my problem is an exact fit
for Spark, since in reality in this
section of my program I am not really doing a reduce job, simply a group-by
and partition.
Would calling pipe on the partitioned JavaRDD do the trick? Are there any
examples
Patrick, correct. I have a 16-node cluster. On 14 machines out of 16,
the inode usage was about 50%. On two of the slaves, one had inode usage
of 96% and on the other it was 100%. When I went into /tmp on these two
nodes - there were a bunch of /tmp/spark* subdirectories which I
deleted. This
Oh, glad to know it's that simple!
Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.
Nick
On Monday, March 24, 2014, Patrick
We now have a way to work around this.
For the classes that can't easily implement Serializable, we wrap the class in a
Scala object.
For example:
class A {} // This class is not serializable.

object AHolder {
  private val a: A = new A()
  def get: A = a
}
This
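A similar trick works on the PySpark side, where the analogous failure is a pickling error: keep the non-serializable object behind a module-level getter so closures capture only the getter, and the object is built lazily where it is used. This is a plain-Python sketch of the idea; `Unpicklable` stands in for the third-party class you cannot modify:

```python
import pickle

class Unpicklable:
    """Stand-in for a third-party class we cannot modify or serialize."""
    def __reduce__(self):
        raise TypeError("this object cannot be pickled")

_instance = None

def get_instance():
    """Lazily build the instance where it is needed, instead of shipping it."""
    global _instance
    if _instance is None:
        _instance = Unpicklable()
    return _instance

def use_it(x):
    # References get_instance, not the instance itself, so serializing
    # this function never tries to serialize the object.
    obj = get_instance()
    return (x, type(obj).__name__)
```

The holder defers construction until first use on each worker, which is exactly what the Scala `object` wrapper above achieves.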
Dear all,
Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map?
Thanks
Jaonary
What does this error mean:
@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by:
Another thing I have noticed is that out of my master+15 slaves, two
slaves always carry a higher inode load. So for example right now I am
running an intensive job that takes about an hour to finish and two
slaves have been showing an increase in inode consumption (they are
about 10% above
I've seen two cases most commonly:
The first is when I need to create some processing object to process each
record. If that object creation is expensive, creating one per record
becomes prohibitive. So instead, we use mapPartitions, and create one per
partition, and use it on each record in the
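That pattern can be sketched in plain Python (the names are illustrative, and the list-of-lists stands in for the partition iterators that mapPartitions would hand you):

```python
def expensive_setup():
    # Stand-in for a costly object: a parser, DB connection, model, ...
    return {"uses": 0}

def process_partition(records):
    # One setup per partition, reused for every record in it --
    # the saving mapPartitions buys over per-record map.
    helper = expensive_setup()
    for r in records:
        helper["uses"] += 1
        yield r * 2

partitions = [[1, 2, 3], [4, 5]]   # two simulated partitions
result = [list(process_partition(p)) for p in partitions]
# result == [[2, 4, 6], [8, 10]], built with only two helper objects
```

With plain map, `expensive_setup()` would have run five times instead of twice.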
Hi All,
I found out why this problem exists. Consider the following scenario:
- a DStream is created from any source. (I've checked with file and socket)
- No actions are applied to this DStream
- Sliding Window operation is applied to this DStream and an action is applied
to the sliding window.
Mark,
This appears to be a Scala-only feature. :(
Patrick,
Are we planning to add this to PySpark?
Nick
On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.comwrote:
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
Oh, I also forgot to mention:
I start the master and workers (call ./sbin/start-all.sh), and then start
the shell:
MASTER=spark://localhost:7077 ./bin/spark-shell
Then I get the exceptions...
Thanks
I have a DStream like this:
..RDD[a,b],RDD[b,c]..
Is there a way to remove duplicates across the entire DStream? I.e., I would like
the output to be (by removing one of the b's):
..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c]..
Thanks
-Adrian
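One way to sketch this outside Spark is to keep a running set of everything seen in earlier batches. Note the caveats: in a real streaming job this state would have to live in something like updateStateByKey or an external store, and the set grows without bound. A minimal plain-Python sketch:

```python
seen = set()

def dedupe_batch(batch):
    """Drop elements already emitted by any earlier batch."""
    fresh = [x for x in batch if x not in seen]
    seen.update(fresh)
    return fresh

stream = [["a", "b"], ["b", "c"]]   # the ..RDD[a,b],RDD[b,c].. example
deduped = [dedupe_batch(b) for b in stream]
# deduped == [["a", "b"], ["c"]] -- the second "b" is removed
```

This keeps the first occurrence, matching the ..RDD[a,b],RDD[c].. variant of the desired output.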
After a long time (maybe a month) I could do a fresh build for
2.0.0-mr1-cdh4.5.0...I was using the cached files in .ivy2/cache.
My case is especially painful since I have to build behind a firewall...
@Sean thanks for the fix...I think we should add a test for https/firewall
compilation as
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is the number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on
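That complexity bound is easy to turn into a back-of-the-envelope check. In the sketch below, d = 10 is a made-up dimension, since the thread only mentions n (50M rows) and k (50):

```python
def kmeans_cost_per_iter(n, d, k):
    """Distance evaluations per Lloyd iteration, per the O(n * d * k) bound."""
    return n * d * k

# 50M rows, hypothetical d = 10 features, k = 50 clusters
cost = kmeans_cost_per_iter(50_000_000, 10, 50)
# cost == 25_000_000_000 multiply-adds per iteration
```

At a few billion FLOPS per core, one iteration at this scale is seconds of work once it is spread over enough workers.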
Hi,
I have a very simple use case:
I have an RDD as follows:
d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
Now, I want to remove all the duplicates from a column and return the
remaining frame.
For example:
If I want to remove the duplicates based on column 1,
then basically I would remove either row
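In Spark one would typically key each row by the chosen column and keep one row per key (e.g. keyBy plus reduceByKey). As a plain-Python sketch of the same idea, keeping the first row seen for each key:

```python
def distinct_by_column(rows, col):
    """Keep the first row seen for each value of the given column."""
    seen = set()
    kept = []
    for row in rows:
        if row[col] not in seen:
            seen.add(row[col])
            kept.append(row)
    return kept

d = [[1, 2, 3, 4], [1, 5, 2, 3], [2, 3, 4, 5]]
result = distinct_by_column(d, 0)
# result == [[1, 2, 3, 4], [2, 3, 4, 5]] -- the second row shares key 1
```

Which of the two duplicate rows survives is arbitrary in a distributed reduce; this sketch just fixes it as "first seen".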
1. Not sure on this; I don't believe we change the defaults from Java.
2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
than memory heap size itself)
3. If you want to have 8 GB executors then, yes, only two can run on each
16 GB node. (In fact, you should also keep a
Hello,
I'm interested in extending the comparison between GraphX and GraphLab
presented in Xin et al. (2013). The evaluation presented there is rather
limited as it only compares the frameworks for one algorithm (PageRank) on
a cluster with a fixed number of nodes. Are there any graph algorithms
Niko,
Comparing some other components will be very useful as well... svd++ from
GraphX vs the same algorithm in GraphLab... also mllib.recommendation.als
implicit/explicit compared to the collaborative filtering toolkit in
GraphLab...
To stress test: what's the biggest sparse dataset that you
Just found this issue and wanted to link it here, in case somebody finds this
thread later:
https://spark-project.atlassian.net/browse/SPARK-939
On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote:
Hi Jaka,
I’d recommend rebuilding Spark with a new version of the HTTPClient
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a standalone Scala application? I can't,
and I figure I must be missing something obvious!
I'm trying to follow the instructions here as close to word for word as
possible:
Hi, Diana,
See my inlined answer
--
Nan Zhu
On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start page of
the Spark home page to run a standalone Scala application? I can't, and I
figure I must be missing
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You don't need to put your app
under SPARK_HOME. I would create it in its own folder somewhere, it
follows the rules of any standalone scala program (including the
layout). In the guide,
Hi Niko,
The GraphX team recently wrote a longer paper with more benchmarks and
optimizations: http://arxiv.org/abs/1402.2394
Regarding the performance of GraphX vs. GraphLab, I believe GraphX
currently outperforms GraphLab only in end-to-end benchmarks of pipelines
involving both graph-parallel
Yana: Thanks. Can you give me a transcript of the actual commands you are
running?
Thanks!
Diana
On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska yana.kadiy...@gmail.comwrote:
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You
I have what I would call unexpected behaviour when using window on a stream.
I have 2 windowed streams with a 5s batch interval. One window stream is
(5s,5s)=smallWindow and the other (10s,5s)=bigWindow
What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is
of the size
Diana,
Anywhere on the filesystem you have read/write access (you need not be
in your spark home directory):
mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .
Make sure that myproject.sbt
Thanks, Nan Zhu.
You say that my problems are because you are in Spark directory, don't
need to do that actually, the dependency on Spark is resolved by sbt
I did try it initially in what I thought was a much more typical place,
e.g. ~/mywork/sparktest1. But as I said in my email:
(Just for
Thanks Ognen.
Unfortunately I'm not able to follow your instructions either. In
particular:
sbt compile
sbt run arguments if any
This doesn't work for me because there's no program on my path called
sbt. The instructions in the Quick Start guide are specific that I
should call
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example
a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()
On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Mark,
This appears to be a Scala-only
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?
Thanks
Yeah, that's exactly what I did. Unfortunately it doesn't work:
$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
There is no direct way to get this in pyspark, but you can get it from the
underlying java
Ah crud, I guess you are right, I am using the sbt I installed manually
with my Scala installation.
Well, here is what you can do:
mkdir ~/bin
cd ~/bin
wget
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar
vi sbt
Put the following contents into
Hi, Diana,
You don’t need to use spark-distributed sbt
just download sbt from its official website and set your PATH to the right place
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote:
Yeah, that's exactly what I did. Unfortunately it doesn't work:
For instance, I need to work with an RDD in terms of N parts. Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?
On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat walrusthe...@gmail.comwrote:
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions,
Thanks for your help, everyone. Several folks have explained that I can
surely solve the problem by installing sbt.
But I'm trying to get the instructions working *as written on the Spark
website*. The instructions not only don't have you install sbt
separately...they actually specifically have
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy in situations where you have sparse data and want
to compact it (e.g. data after applying a strict filter) OR you know the
magic number of partitions according to your cluster which will be
Ognen:
I don't know why your process is hanging, sorry. But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file. The file is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)
(Presumably it does this because it
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put
to put things as files into the filesystem (and sc.textFile to get stuff
out of them in Spark) and I see that they appear to be saved as files
that are replicated across 3 out of the 16 nodes in the hdfs cluster
(which is
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10,
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls
at the same point in each case.
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@thefreemanlab
On Mar 23, 2014, at 9:56
Hello Sanjay,
Yes, your understanding of lazy semantics is correct. But ideally
every batch should read based on the batch interval provided in the
StreamingContext. Can you open a JIRA on this?
On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani
sanjay_a...@yahoo.com wrote:
Hi All,
I found
Yes, I believe that is current behavior. Essentially, the first few
RDDs will be partial windows (assuming window duration > sliding
interval).
TD
On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu
amoc...@verticalscope.com wrote:
I have what I would call unexpected behaviour when using window on a
Syed,
Thanks for the tip. I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions.) It sounds like, however,
this won't bottleneck my processing power. If this sets off any
Just so I can close this thread (in case anyone else runs into this
stuff) - I did sleep through the basics of Spark ;). The answer on why
my job is in waiting state (hanging) is here:
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
Ognen
On 3/24/14,
Hi guys,
thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard
@Patrick, thanks for the release calendar
On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell pwend...@gmail.comwrote:
Hey All,
I think the old thread is here:
Hi, I have a large data set of numbers, i.e., an RDD, and wanted to perform a
computation only on groups of two values at a time. For
example, 1,2,3,4,5,6,7... is an RDD. Can I group the RDD into
(1,2),(3,4),(5,6)...? And perform the respective computations in an
efficient manner? As we don't have a way to index
Diana, I think you are correct - I just installed
wget
http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz
and indeed I see the same error that you see
It looks like in previous versions sbt-launch used to just come down
in the
I found that I never read the document carefully, and I never found that the Spark
documentation suggests using the Spark-distributed sbt...
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
Thanks for your help, everyone. Several folks have explained that I can
We need someone who can explain, with a short code snippet on the given
example, so that we get a clear-cut idea of RDD indexing.
Please help.
Hi,
sc.parallelize(Array.tabulate(100)(i => i)).filter( _ % 20 == 0
).coalesce(5,true).glom.collect yields
Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())
How do I get something more like:
Array(Array(0), Array(20), Array(40), Array(60), Array(80))
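The glommed output shows coalesce packed everything into one partition; what's wanted is a round-robin spread. In Spark itself one would attach spread-out keys and repartition, but the target assignment can be sketched in plain Python:

```python
def round_robin(elements, num_partitions):
    """Assign elements to partitions one at a time, cycling through them."""
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(elements):
        parts[i % num_partitions].append(x)
    return parts

result = round_robin([0, 20, 40, 60, 80], 5)
# result == [[0], [20], [40], [60], [80]]
```

With five elements and five partitions, each partition receives exactly one element, which is the desired glom output.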
There is also https://github.com/apache/spark/pull/18 against the current
repo which may be easier to apply.
On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh a...@adatao.com wrote:
Hi Jaonary,
You can find the code for k-fold CV in
https://github.com/apache/incubator-spark/pull/448. I have
It is suggested implicitly in giving you the command ./sbt/sbt. The
separately installed sbt isn't in a folder called sbt, whereas Spark's
version is. And more relevantly, just a few paragraphs earlier in the
tutorial you execute the command sbt/sbt assembly which definitely refers
to the spark
Partition your input into an even number of partitions, then
use mapPartitions to operate on each Iterator[Int].
Maybe there are more efficient ways...
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
Hi, I have large data set of numbers ie RDD and wanted to perform a
Yes, actually even for Spark, I mostly use the sbt I installed... so I always
miss this issue...
If you can reproduce the problem with the Spark-distributed sbt, I suggest
proposing a PR to fix the document before 0.9.1 is officially released.
Best,
--
Nan Zhu
On Monday, March 24, 2014
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a long running job) until I restricted both spark.cores.max
_and_
I didn't group the integers, but processed them in groups of two.
First, partition the data:
scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
Then process each partition, handling its elements in groups.
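The per-partition pairing step can be sketched in plain Python as the kind of generator one would hand to mapPartitions. A leftover element at a partition boundary is simply dropped here, which is why the earlier advice to partition the input evenly matters:

```python
def pairwise(iterator):
    """Yield consecutive elements of a partition two at a time."""
    it = iter(iterator)
    for first in it:
        second = next(it, None)
        if second is None:
            break   # odd element left over at the partition boundary
        yield (first, second)

result = list(pairwise([1, 2, 3, 4, 5, 6, 7]))
# result == [(1, 2), (3, 4), (5, 6)] -- the trailing 7 has no partner
```

Pairing happens only within a partition; pairs that would straddle two partitions are lost, so partition boundaries must fall on even offsets for a complete pairing.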
points.foreach(p => p.y = another_value) will return a new modified RDD.
2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw:
Dear all,
I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:
case class AppDataPoint(input_y : Double,
Hi hequn, a related question: does that mean the memory usage will be doubled? And
furthermore, if the compute function in an RDD is not idempotent, the RDD will
change during the job run, is that right?
----- Original Message -----
From: hequn cheng chenghe...@gmail.com
Sent: 2014/3/25 9:35
To:
No, it won't. The type of RDD#foreach is Unit, so it doesn't return an
RDD. The utility of foreach is purely for the side effects it generates,
not for its return value -- and modifying an RDD in place via foreach is
generally not a very good idea.
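The difference is easy to see in miniature. These two helpers mimic the shapes of RDD.foreach and RDD.map in plain Python (not the Spark API itself):

```python
def foreach(seq, f):
    """Like RDD.foreach: run f for its side effects, return nothing."""
    for x in seq:
        f(x)
    # implicitly returns None, so nothing can be chained after it

def mapped(seq, f):
    """Like RDD.map: build and return a new collection, leaving seq alone."""
    return [f(x) for x in seq]

points = [1, 2, 3]
doubled = mapped(points, lambda p: p * 2)
# doubled == [2, 4, 6], while foreach(points, ...) returns None
```

This is why the earlier `points.foreach(...).collect()` cannot work: there is no value to call collect() on.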
On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Thanks again.
If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
The master here is the
First question:
If you save your modified RDD like this:
points.foreach(p => p.y = another_value).collect() or
points.foreach(p => p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized, and this will not use any worker's
memory.
If you have more transformations after the map(), the
Hi hequn, I dug into the source of Spark a bit deeper, and I got some ideas.
Firstly, immutability is a feature of RDDs but not a solid rule; there are ways to
change it, for example, an RDD with a non-idempotent compute function, though
it is really a bad design to make that function non-idempotent