My question is: why is it that when I sort the objects retrieved from the
sequence files, there's a ton more memory used than when I just build the
objects manually? It doesn't make sense to me. I'm theoretically performing
the same operation on both.
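For illustration, a common pattern when sorting data read from Hadoop sequence files (a sketch only, not necessarily the resolution of this thread): Hadoop record readers may reuse the same Writable instances across records, so holding references to them behaves very differently from building fresh objects. Copying into plain Strings first avoids that; the path and context settings here are hypothetical.

    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    JavaSparkContext sc = new JavaSparkContext("local[4]", "seqfile-sort");
    // The record reader may hand back the same Text instances for every
    // record, so copy to immutable Strings before sorting or caching.
    JavaPairRDD<String, String> copied = sc
        .sequenceFile("hdfs:///path/to/data", Text.class, Text.class)
        .mapToPair(new PairFunction<Tuple2<Text, Text>, String, String>() {
          public Tuple2<String, String> call(Tuple2<Text, Text> t) {
            return new Tuple2<String, String>(t._1().toString(), t._2().toString());
          }
        });
    copied.sortByKey().count();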
on master only? Is the spark.default.parallelism system property being set
prior to creating the SparkContext?
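For reference, a minimal sketch of the check being suggested (0.8-era configuration style; master URL and value are hypothetical). System properties are read when the SparkContext is constructed, so they must be set beforehand:

    // Set before the context is created; properties set afterwards are
    // not picked up by the scheduler.
    System.setProperty("spark.default.parallelism", "128");
    JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "sort-job");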
On Mon, Dec 9, 2013 at 10:45 PM, Matt Cheah <mch...@palantir.com> wrote:
Thanks for the prompt response. For the sort, the sequence file is 129GB in
size in HDFS. I have 10 EC2 m2.4xlarge instances.
0. None of
these configurations lets me sort the dataset without the cluster collapsing.
-Matt Cheah
From: Matei Zaharia [matei.zaha...@gmail.com]
Sent: Monday, December 09, 2013 7:02 PM
To: user@spark.incubator.apache.org
Cc: Mingyu Kim
Subject: Re: Hadoop RDD
From: Ashish Rangole [arang...@gmail.com]
Sent: Monday, December 09, 2013 7:41 PM
To: user@spark.incubator.apache.org
Subject: Re: JavaRDD, Specify number of tasks
AFAIK yes. IIRC, there is a 2nd parameter numPartitions that one can provide
to these operations.
On Dec 9, 2013 8:19 PM, "Matt Cheah" <mch...@palantir.com> wrote:
Hi,
When I use a JavaPairRDD's groupByKey(), reduceByKey(), or sortByKey(), is
there a way for me to specify the number of reduce tasks, as there is on a
Scala RDD? Or do I have to set them all to use spark.default.parallelism?
Thanks,
-Matt Cheah
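A minimal sketch of the numPartitions overloads being described (the partition count and sample data are hypothetical):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function2;
    import scala.Tuple2;

    JavaSparkContext sc = new JavaSparkContext("local[4]", "numPartitions-demo");
    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a", 1),
        new Tuple2<String, Integer>("b", 2),
        new Tuple2<String, Integer>("a", 3)));

    // Each shuffle operation takes an optional numPartitions argument
    // (64 here) instead of falling back to spark.default.parallelism.
    pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    }, 64);
    pairs.groupByKey(64);
    pairs.sortByKey(true, 64);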
formed?
And does this have to be a concern when RDDs are retrieved while Spark is
running against a cluster? Or will I only see these anomalies when running
Spark on local[N]?
Thanks! Hope that wasn't too confusing,
-Matt Cheah
Thanks a lot for that. There are definitely a lot of subtleties that we need
to consider. We appreciate the thorough explanation!
-Matt Cheah
From: Aaron Davidson <ilike...@gmail.com>
Reply-To: user@spark.incubator.apache.org
es will be (presumably
proportional to the size of the dataset).
Thanks for the quick response!
-Matt Cheah
From: Aaron Davidson <ilike...@gmail.com>
Reply-To: user@spark.incubator.apache.org
Our use case involves large group-by queries, so I was trying to simulate
this kind of workload in the Spark shell.
Is there any good way to consistently get these kinds of queries to work?
Assume that during the general use-case it can't be known a-priori how many
groups there will be.
Thanks,
-Matt Cheah
about the ramifications of turning up this value, but I was wondering what
the actual maximum value that can be set for it is. I'll benchmark the
performance hit accordingly.
Thanks!
-Matt Cheah
n't working here. Thanks!
-Matt Cheah
Actually, we want the opposite – we want as much data to be computed as
possible.
It's only for benchmarking purposes, of course.
-Matt Cheah
From: Matei Zaharia <matei.zaha...@gmail.com>
Reply-To: user@spark.incubator.apache.org
idea of how long transforming the whole dataset takes.
Thanks,
-Matt Cheah
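One way to time a full pass without collecting results to the driver (a sketch; 'transformed' is an assumed JavaRDD): an action such as count() forces every partition to be computed.

    // count() is an action, so it forces the whole lineage to execute,
    // while only the count itself comes back to the driver.
    long start = System.currentTimeMillis();
    long records = transformed.count(); // 'transformed' is an assumed JavaRDD
    System.out.println("Full pass over " + records + " records took "
        + (System.currentTimeMillis() - start) + " ms");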
I'm reading the paper now, thanks. It states that 100-node clusters were used.
Is it typical in the field to have 100-node clusters at the 1TB scale? We were
expecting to use ~10 nodes.
I'm still pretty new to cluster computing, so I'm just not sure how people set
these up.
-Matt Cheah
Hi everyone,
I notice the AMPLab benchmark page provides some numbers on GBs of data:
https://amplab.cs.berkeley.edu/benchmark/ I was wondering if similar benchmark
numbers existed for even larger data sets, in the terabytes if possible.
Also, are there any for just raw Spark, i.e. no Shark?
It seems odd to me that I'd have to do so, especially since the tuning guide
suggests using Externalizable:
http://spark.incubator.apache.org/docs/latest/tuning.html
-Matt Cheah
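For context, a minimal sketch of the Externalizable approach the tuning guide mentions (the class and its fields are hypothetical):

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    public class Record implements Externalizable {
      private long id;
      private double value;

      public Record() {} // Externalizable requires a public no-arg constructor

      public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);      // write only the raw fields, skipping the
        out.writeDouble(value); // default serialization metadata
      }

      public void readExternal(ObjectInput in) throws IOException,
          ClassNotFoundException {
        id = in.readLong();
        value = in.readDouble();
      }
    }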
From: Andrew Ash <and...@andrewash.com>
Reply-To: user@spark.incubator.apache.org
I'm running on a Spark cluster generated by the EC2 scripts. This doesn't
happen if I'm running things with local[N]. Any ideas?
Thanks,
-Matt Cheah
project's Ivy XML file. We also want to give users the EC2 scripts as an easy
way to get started with setting up a Spark cluster. The EC2 scripts would
ideally set up the cluster with CDH4, the version of Hadoop that our version
of the product is built against.
-Matt Cheah
On 11/26/13 12:10 PM
to easily spawn clusters with the spark-ec2
scripts – but we want Spark to be built against the same Hadoop jars in both
cases.
Thanks,
-Matt Cheah
want to create a SparkContext per compute-session to sandbox the jars in each
user's job.
Is this a use case that could be done by only using one SparkContext in the JVM?
-Matt Cheah
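A sketch of the per-session approach described above, using the constructor that takes a jar list (master URL and paths are hypothetical):

    // Each compute-session gets its own context created with only that
    // user's jars, keeping classpaths sandboxed per session.
    JavaSparkContext session = new JavaSparkContext(
        "spark://master:7077",                    // master URL (hypothetical)
        "session-user42",                         // app name
        "/opt/spark",                             // Spark home on the workers
        new String[] { "/jobs/user42/job.jar" }); // this session's jars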
From: Dmitriy Lyubimov <dlie...@gmail.com>
Reply-To: user@spark.incubator.apache.org
allowing all traffic is bad…
-Matt Cheah
From: Aaron Davidson <ilike...@gmail.com>
Reply-To: user@spark.incubator.apache.org
Date: Monday, November 18, 2013 8:28 PM
To: user@spark.incubator.apache.org
I was wondering if other EC2 nodes could have their
firewalls configured to allow this.
We don't want to deploy the web server on the master node of the spark cluster.
Thanks,
-Matt Cheah
overridden
without explicitly saving the RDD to disk?
-Matt Cheah
From: Andrew Winings <mch...@palantir.com>
Date: Friday, November 1, 2013 3:51 PM
To: user@spark.incubator.apache.org
s that the entire RDD is being
collected on-heap in the local case. Am I misunderstanding the documentation?
Thanks,
-Matt Cheah
Hi everyone,
I see there is a take() function for RDDs, getting the first n elements. Is
there a way to get the last n elements?
Thanks,
-Matt Cheah
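There is no built-in counterpart to take() for the tail. One workaround sketch, assuming the ordering comes from a sort key ('pairs' and 'n' are hypothetical):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import scala.Tuple2;

    // Descending sort makes the last n keys come first; reverse the
    // local list afterwards to restore ascending order.
    List<Tuple2<String, Integer>> lastN = new ArrayList<Tuple2<String, Integer>>(
        pairs.sortByKey(false).take(n));
    Collections.reverse(lastN);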
Streaming it all back to the driver seems wasteful when in reality we could
fetch chunks of it at a time and load only parts into driver memory, as
opposed to using 2GB of RAM on the driver. In fact, I don't know what the
maximum frame size that can be set would be via spark.akka.framesize.
-Matt Cheah
From: Mark Hamstra <m...@clearstorydata.com>
Reply-To: user@spark.incubator.apache.org
read back out again to get
this sequential behavior.
I appreciate the discussion though. Quite enlightening.
Thanks,
-Matt Cheah
From: Christopher Nguyen <c...@adatao.com>
Date: Tuesday, October 22, 2013 2:23 PM
To: user@spark.incubator.apache.org
associative and commutative.
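A small sketch of that requirement ('numbers' is an assumed JavaRDD<Integer>): the function passed to reduce() is applied across partitions in no guaranteed order, so it must be associative and commutative, as addition is here.

    // Addition is associative and commutative, so the result is the
    // same regardless of the order in which partitions are combined.
    Integer total = numbers.reduce(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });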
On Tue, Oct 22, 2013 at 12:28 PM, Matt Cheah <mch...@palantir.com> wrote:
Hi everyone,
I have a driver holding a reference to an RDD. The driver would like to "visit"
each item in the RDD in order, say with a visitor object that invokes
visit
internal iterator() method. In
some cases, we get a stack trace (running locally with 3 threads). I've
included the stack trace below.
Thanks,
-Matt Cheah
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.
Ah, I misunderstood the functionality then – I was under the impression that
exactly that fraction would be returned.
Thanks,
-Matt Cheah
From: Aaron Davidson <ilike...@gmail.com>
Reply-To: user@spark.incubator.apache.org
representation as printed by Eclipse is 0.14285714285714285. The resulting RDD
ends up returning 2 items instead of 1.
Is it expected to get that much error in precision? I'd rather not use the
takeSample() function, which would materialize the whole sample in the
driver's memory.
Thanks,
-Matt Cheah
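A sketch of the behavior in question ('numbers' is an assumed JavaRDD<Integer>; the seed is arbitrary): sample() draws each element independently with probability equal to the fraction, so the result size only approximates fraction × count.

    // With 7 elements and fraction 1/7 the *expected* sample size is 1,
    // but any size from 0 to 7 is possible on a given run.
    JavaRDD<Integer> sampled = numbers.sample(false, 1.0 / 7.0, 42);
    System.out.println(sampled.count() + " of " + numbers.count() + " sampled");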