Sorting data from sequence files is overly memory intensive

2013-12-11 Thread Matt Cheah
My question is: why is it that when I sort the objects retrieved from the sequence files, there's a ton more memory used than just building the objects manually? It doesn't make sense to me. I'm theoretically performing the same operation on both
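
A sketch of the operation under discussion, with illustrative paths and Text/IntWritable record types, written against the later mapToPair-style Java API rather than anything from this thread. One known wrinkle with sequence files is that Spark's sequenceFile() reuses a single Writable instance per record, so records are usually copied into plain Java objects before sorting (and the per-object footprint of those copies differs from hand-built objects). The same copy step also bears on the "Hadoop RDD incorrect data" thread below.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "sort-example");
    JavaPairRDD<Text, IntWritable> raw =
            sc.sequenceFile("hdfs:///path/to/input", Text.class, IntWritable.class);

    // Copy each record out of the reused Writable into fresh Java objects.
    JavaPairRDD<String, Integer> copied = raw.mapToPair(
            new PairFunction<Tuple2<Text, IntWritable>, String, Integer>() {
                public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> t) {
                    return new Tuple2<String, Integer>(t._1().toString(), t._2().get());
                }
            });

    copied.sortByKey(true).saveAsTextFile("hdfs:///path/to/output");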

Re: Hadoop RDD incorrect data

2013-12-10 Thread Matt Cheah
on master only? Is the default parallelism system property being set prior to creating SparkContext? On Mon, Dec 9, 2013 at 10:45 PM, Matt Cheah <mch...@palantir.com> wrote: Thanks for the prompt response. For the sort, the sequence file is 129GB in size in HDFS. I have 10 EC2 m2.4
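
A minimal sketch of the pattern the reply is asking about (master URL and value illustrative): in 0.8-era Spark, settings such as spark.default.parallelism were read from Java system properties, so they only take effect if set before the SparkContext is constructed.

    // Must run before the context is created, or the setting is ignored.
    System.setProperty("spark.default.parallelism", "120");
    JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "sort-job");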

RE: JavaRDD, Specify number of tasks

2013-12-09 Thread Matt Cheah
From: Ashish Rangole [arang...@gmail.com] Sent: Monday, December 09, 2013 7:41 PM To: user@spark.incubator.apache.org Subject: Re: JavaRDD, Specify number of tasks AFAIK yes. IIRC, there is a 2nd parameter numPartitions that one can provide to these operations. On Dec 9, 2013 8:19 PM, "
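
A sketch of the numPartitions overloads the reply refers to, assuming an existing JavaSparkContext sc and a Spark version whose JavaPairRDD exposes them (data and partition count are illustrative). Each shuffle operation takes an optional numPartitions argument that sets the number of reduce tasks for that operation alone, overriding spark.default.parallelism.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<String, Integer>("a", 1),
            new Tuple2<String, Integer>("b", 2)));

    int numPartitions = 8;  // reduce tasks for each shuffle below
    pairs.groupByKey(numPartitions);
    pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer x, Integer y) { return x + y; }
    }, numPartitions);
    pairs.sortByKey(true, numPartitions);  // ascending, 8 reduce tasks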

RE: Hadoop RDD incorrect data

2013-12-09 Thread Matt Cheah
0. None of these configurations lets me sort the dataset without the cluster collapsing. -Matt Cheah From: Matei Zaharia [matei.zaha...@gmail.com] Sent: Monday, December 09, 2013 7:02 PM To: user@spark.incubator.apache.org Cc: Mingyu Kim Subject: Re: Hadoop RDD

RE: JavaRDD, Specify number of tasks

2013-12-09 Thread Matt Cheah
December 09, 2013 7:41 PM To: user@spark.incubator.apache.org Subject: Re: JavaRDD, Specify number of tasks AFAIK yes. IIRC, there is a 2nd parameter numPartitions that one can provide to these operations. On Dec 9, 2013 8:19 PM, "Matt Cheah" <mch...@palantir.com>

JavaRDD, Specify number of tasks

2013-12-09 Thread Matt Cheah
Hi, When I use a JavaPairRDD's groupByKey(), reduceByKey(), or sortByKey(), is there a way for me to specify the number of reduce tasks, as there is in a Scala RDD? Or do I have to set them all to use spark.default.parallelism? Thanks, -Matt Cheah (feels like I've been asking

Hadoop RDD incorrect data

2013-12-09 Thread Matt Cheah
formed? And is this a concern when RDDs are retrieved while Spark is run against a cluster? Or will I only see these anomalies if I'm running Spark on local[N]? Thanks! Hope that wasn't too confusing, -Matt Cheah

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Thanks a lot for that. There are definitely a lot of subtleties that we need to consider. We appreciate the thorough explanation! -Matt Cheah From: Aaron Davidson <ilike...@gmail.com> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
es will be (presumably proportional to the size of the dataset). Thanks for the quick response! -Matt Cheah From: Aaron Davidson <ilike...@gmail.com> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apach

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
group-by queries, so I was trying to simulate this kind of workload in the Spark shell. Is there any good way to consistently get these kinds of queries to work? Assume that during the general use case it can't be known a priori how many groups there will be. Thanks, -Matt Cheah

groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Our use case has involved large group-by queries, so I was trying to simulate this kind of workload in the Spark shell. Is there any good way to consistently get these kinds of queries to work? Assume that during the general use case it can't be known a priori how many groups there will be.
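
A hedged sketch of the usual workaround when group sizes are unbounded: restate the query as a per-key aggregation, so values are combined incrementally instead of any group being materialized whole, and raise the partition count so an unknown number of groups spreads across many reduce tasks. The 'records' RDD and its Record/getGroupKey() type are hypothetical stand-ins, not anything from this thread.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // JavaRDD<Record> records = ...;  // hypothetical input
    JavaPairRDD<String, Long> countsPerGroup = records
            .mapToPair(new PairFunction<Record, String, Long>() {
                public Tuple2<String, Long> call(Record r) {
                    return new Tuple2<String, Long>(r.getGroupKey(), 1L);
                }
            })
            // reduceByKey combines values pairwise on the map side, so no
            // single group is ever held in memory at once.
            .reduceByKey(new Function2<Long, Long, Long>() {
                public Long call(Long a, Long b) { return a + b; }
            }, 200);  // many partitions, since the group count is unknown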

Biggest spark.akka.framesize possible

2013-12-07 Thread Matt Cheah
about the ramifications of turning up this value, but I was wondering what the actual maximum value that can be set for it is. I'll benchmark the performance hit accordingly. Thanks! -Matt Cheah
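
A sketch of setting the property (master URL and value illustrative). spark.akka.frameSize is given in MB and, in this era, went through a system property before the context was created. As for the maximum: the serialized frame length is held in a signed 32-bit byte count, so values approaching 2 GB (~2047 MB) are a hard ceiling regardless of configuration — an assumption based on the cap that later Spark releases enforce explicitly.

    // Must be set before the SparkContext is constructed.
    System.setProperty("spark.akka.frameSize", "512");  // in MB
    JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "big-frames");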

Using Distcp when EC2 deployed with CDH4

2013-12-06 Thread Matt Cheah
n't working here. Thanks! -Matt Cheah

Re: takeSample() computation

2013-12-05 Thread Matt Cheah
Actually, we want the opposite – we want as much data to be computed as possible. It's only for benchmarking purposes, of course. -Matt Cheah From: Matei Zaharia <matei.zaha...@gmail.com> Reply-To: "user@spark.incubator.apache

takeSample() computation

2013-12-05 Thread Matt Cheah
idea of how long transforming the whole dataset takes. Thanks, -Matt Cheah
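
A sketch contrasting the two sampling APIs, assuming a JavaRDD<String> named lines is in scope. sample() is a lazy transformation that returns a distributed RDD, so the full computation happens (and can be timed) only when an action runs on it; takeSample() is an action that computes the dataset and ships the sampled elements back to the driver.

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> tenPercent = lines.sample(false, 0.10, 42L);  // lazy, stays distributed
    List<String> thousand = lines.takeSample(false, 1000, 42L);   // eager, driver-side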

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matt Cheah
I'm reading the paper now, thanks. It states 100-node clusters were used. Is it typical in the field to have 100-node clusters for the 1TB scale? We were expecting to use ~10 nodes. I'm still pretty new to cluster computing, so I'm just not sure how people have set these up. -

Benchmark numbers for terabytes of data

2013-12-03 Thread Matt Cheah
Hi everyone, I notice the benchmark page for AMPLab provides some numbers on GBs of data: https://amplab.cs.berkeley.edu/benchmark/ I was wondering if similar benchmark numbers existed for even larger data sets, in the terabytes if possible. Also, are there any for just raw Spark, i.e. no Shark

Re: Serializable incompatible with Externalizable error

2013-12-03 Thread Matt Cheah
s odd to me that I'd have to do so, especially since the tuning guide suggests using Externalizable: http://spark.incubator.apache.org/docs/latest/tuning.html -Matt Cheah From: Andrew Ash <and...@andrewash.com> Reply-To: "user@spark.incubator.apache.org" <user@sp

Serializable incompatible with Externalizable error

2013-12-02 Thread Matt Cheah
I'm running on a Spark cluster generated by the EC2 scripts. This doesn't happen if I'm running things with local[N]. Any ideas? Thanks, -Matt Cheah
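
A minimal sketch of a correctly implemented Externalizable class (the class name and fields are illustrative). Externalizable requires a public no-arg constructor, and in standard Java serialization the "Serializable incompatible with Externalizable" InvalidClassException arises when the writer and reader disagree on which of the two interfaces the class implements — e.g., mismatched jar versions on the cluster versus the driver, which would also explain why the error shows up on EC2 but not on local[N].

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    public class Point implements Externalizable {
        private int x;
        private int y;

        public Point() {}  // public no-arg constructor required by Externalizable

        public Point(int x, int y) { this.x = x; this.y = y; }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }

        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            x = in.readInt();
            y = in.readInt();
        }
    }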

Re: Spark, EC2, and CDH4 Questions

2013-11-26 Thread Matt Cheah
project's Ivy XML file. We also want to give users the EC2 scripts as an easy way to get started with setting up a Spark cluster. The EC2 scripts would ideally set up the cluster with CDH4, the version of Hadoop that our version of the product is built against. -Matt Cheah On 11/26/13 12:10 PM

Spark, EC2, and CDH4 Questions

2013-11-26 Thread Matt Cheah
to easily spawn clusters with the spark-ec2 scripts – but we want Spark to be built against the same Hadoop jars in both cases. Thanks, -Matt Cheah

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Matt Cheah
want to create a SparkContext per compute-session to sandbox the jars in each user's job. Is this a use case that could be handled by using only one SparkContext in the JVM? -Matt Cheah From: Dmitriy Lyubimov <dlie...@gmail.com> Reply-To: "user@spark.incubator.a

Re: EC2 node submit jobs to separate Spark Cluster

2013-11-19 Thread Matt Cheah
y allowing all traffic is bad… -Matt Cheah From: Aaron Davidson <ilike...@gmail.com> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org> Date: Monday, November 18, 2013 8:28 PM To: "

EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Matt Cheah
I was wondering if other EC2 nodes could have their firewalls configured to allow this. We don't want to deploy the web server on the master node of the Spark cluster. Thanks, -Matt Cheah

Re: Out of memory building RDD on local[N]

2013-11-01 Thread Matt Cheah
overridden without explicitly saving the RDD to disk? -Matt Cheah From: Andrew Winings <mch...@palantir.com> Date: Friday, November 1, 2013 3:51 PM To: "user@spark.incubator.apache.org" <user@spark.incubator.apac

Out of memory building RDD on local[N]

2013-11-01 Thread Matt Cheah
s that the entire RDD is being collected on-heap in the local case. Am I misunderstanding the documentation? Thanks, -Matt Cheah
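A sketch of the usual mitigation, assuming an existing JavaSparkContext sc (path illustrative): disk-backed storage levels let partitions that don't fit on the heap spill to local disk instead of accumulating in memory, without the program explicitly saving the RDD anywhere.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.StorageLevels;

    JavaRDD<String> big = sc.textFile("hdfs:///path/to/big-input");
    // Keep what fits in memory; write the remaining partitions to local disk.
    big.persist(StorageLevels.MEMORY_AND_DISK);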

Take last k elements from RDD?

2013-10-24 Thread Matt Cheah
Hi everyone, I see there is a take() function for RDDs that gets the first n elements. Is there a way to get the last n elements? Thanks, -Matt Cheah
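
A hedged sketch of one way to take the last k elements, assuming a JavaRDD<String> rdd and an int k in scope, and a Spark version with zipWithIndex (which postdates this thread): index every element in RDD order, then keep only those whose index falls in the final k positions. Note this costs extra passes over the data (count() plus the job zipWithIndex runs internally).

    import java.util.List;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    final long cutoff = rdd.count() - k;
    List<String> lastK = rdd.zipWithIndex()
            .filter(new Function<Tuple2<String, Long>, Boolean>() {
                public Boolean call(Tuple2<String, Long> t) { return t._2() >= cutoff; }
            })
            .map(new Function<Tuple2<String, Long>, String>() {
                public String call(Tuple2<String, Long> t) { return t._1(); }
            })
            .collect();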

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
Streaming it all back to the driver seems wasteful, when in reality we could fetch chunks of it at a time and load only parts into driver memory, as opposed to using 2GB of RAM on the driver. In fact, I don't know what the maximum frame size that can be set would be via spark.akka.frameSize

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
at a time and load only parts into driver memory, as opposed to using 2GB of RAM on the driver. In fact, I don't know what the maximum frame size that can be set would be via spark.akka.frameSize. -Matt Cheah From: Mark Hamstra <m...@clearstorydata.com> Reply-To: "user@spark

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
read back out again to get this sequential behavior. I appreciate the discussion, though. Quite enlightening. Thanks, -Matt Cheah From: Christopher Nguyen <c...@adatao.com> Date: Tuesday, October 22, 2013 2:23 PM To: "user@sp

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
associative and commutative. On Tue, Oct 22, 2013 at 12:28 PM, Matt Cheah <mch...@palantir.com> wrote: Hi everyone, I have a driver holding a reference to an RDD. The driver would like to "visit" each item in the RDD in order, say with a visitor object that invokes visit

Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
internal iterator() method. In some cases, we get a stack trace (running locally with 3 threads). I've included the stack trace below. Thanks, -Matt Cheah org.apache.spark.SparkException: Error communicating with MapOutputTracker at org.apache.spark.MapOutputTracker.
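
A hedged sketch of the chunked-fetch behavior the thread is after, using toLocalIterator, which was added to Spark after this thread (around 1.0): it streams one partition at a time to the driver, so only a single partition ever needs to fit in driver memory. The 'visitor' object is the hypothetical visitor from the question.

    import java.util.Iterator;

    Iterator<String> it = rdd.toLocalIterator();
    while (it.hasNext()) {
        visitor.visit(it.next());  // elements arrive in RDD order, partition by partition
    }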

Re: RDD sample fraction precision

2013-10-21 Thread Matt Cheah
Ah, I misunderstood the functionality then – I was under the impression that exactly that fraction would be returned. Thanks, -Matt Cheah From: Aaron Davidson <ilike...@gmail.com> Reply-To: "user@spark.incubator.apache.org"

RDD sample fraction precision

2013-10-21 Thread Matt Cheah
representation as printed by Eclipse is 0.14285714285714285. The resulting RDD ends up getting 2 items back instead of 1. Is it expected to get that much error in precision? I'd rather not use the takeSample() function, which would materialize the whole sample in the driver's memory. Thanks, -Matt Cheah
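
A sketch of why sample() is inexact, assuming an existing JavaSparkContext sc: each element is kept independently with probability f (Bernoulli sampling), so the returned count is binomially distributed around f * n rather than being exactly f * n. For n = 7 and f = 1/7, getting back 0, 1, or 2 elements are all unremarkable outcomes.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
    JavaRDD<Integer> sampled = data.sample(false, 1.0 / 7.0, 42L);
    System.out.println(sampled.count());  // varies with the seed; not always 1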