I believe our full 60 days of data contains over ten million unique entities. For 10 days I'm not sure, but it should be in the millions; I haven't verified that myself, though. That's the scale of the RDD we're writing to disk (each entry is entityId -> profile).
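For readers unfamiliar with the pattern being discussed, a disk-persisted entityId -> profile RDD might look roughly like the following Scala sketch. The `Profile` type, paths, and app name are invented for illustration; this is not the actual Velos flow, just the general shape of `persist(DISK_ONLY)` plus a final write:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ProfileWriter {
  // Stand-in for whatever the real profile type is.
  case class Profile(attributes: Map[String, String])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("profile-writer"))

    // profiles: RDD[(String, Profile)] keyed by entityId, built upstream.
    val profiles = sc.objectFile[(String, Profile)]("hdfs:///input/profiles")

    // DISK_ONLY keeps every computed partition on local disk instead of
    // executor memory, at the cost of (de)serialization on each reuse.
    // This is what inflates on-disk size (e.g. the ~200 GB mentioned below).
    profiles.persist(StorageLevel.DISK_ONLY)

    // Final output: one serialized (entityId, Profile) pair per record.
    profiles.saveAsObjectFile("hdfs:///output/profiles")

    sc.stop()
  }
}
```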
I think it's hard to know how Spark will hold up without trying it yourself, on your own flow. Also, keep in mind this was with a Spark Standalone cluster; perhaps Mesos or YARN would hold up better.

On Tue, Jul 8, 2014 at 1:04 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

> I'll respond for Dan.
>
> Our test dataset was a total of 10 GB of input data (the full production
> dataset for this particular dataflow would be roughly 60 GB).
>
> I'm not sure what the size of the final output data was, but I think it
> was on the order of 20 GB for the given 10 GB of input data. Also, I can
> say that when we were experimenting with persist(DISK_ONLY), the size of
> all RDDs on disk was around 200 GB, which gives a sense of overall
> transient memory usage with no persistence.
>
> In terms of our test cluster, we had 15 nodes. Each node had 24 cores
> and 2 workers. Each executor got 14 GB of memory.
>
> -Suren
>
> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
>
>> When you say "large data sets", how large?
>> Thanks
>>
>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>
>> From a development perspective, I vastly prefer Spark to MapReduce. The
>> MapReduce API is very constrained; Spark's API feels much more natural
>> to me. Testing and local development are also very easy: creating a
>> local Spark context is trivial, and it reads local files. For your unit
>> tests you can just have them create a local context and execute your
>> flow with some test data. Even better, you can do ad-hoc work in the
>> Spark shell, and if you want that in your production code it will look
>> exactly the same.
>>
>> Unfortunately, the picture isn't so rosy when it gets to production. In
>> my experience, Spark simply doesn't scale to the volumes that MapReduce
>> will handle. Not with a Standalone cluster anyway; maybe Mesos or YARN
>> would be better, but I haven't had the opportunity to try them.
>> I find jobs tend to just hang forever for no apparent reason on large
>> data sets (though smaller than what I push through MapReduce).
>>
>> I am hopeful the situation will improve (Spark is developing quickly),
>> but if you have large amounts of data you should proceed with caution.
>>
>> Keep in mind there are some frameworks for Hadoop which can hide the
>> ugly MapReduce API behind something very similar in form to Spark's
>> API, e.g. Apache Crunch. So you might consider those as well.
>>
>> (Note: the above is with Spark 1.0.0.)
>>
>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com> wrote:
>>
>>> Hello Experts,
>>>
>>> I am doing some comparative study on the below:
>>>
>>> Spark vs Impala
>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>> implementation to Spark?
>>>
>>> Please share your thoughts and expertise.
>>>
>>> Thanks,
>>> Santosh
>>>
>>> ------------------------------
>>>
>>> This message is for the designated recipient only and may contain
>>> privileged, proprietary, or otherwise confidential information. If you
>>> have received it in error, please notify the sender immediately and
>>> delete the original. Any other use of the e-mail by you is prohibited.
>>> Where allowed by local law, electronic communications with Accenture
>>> and its affiliates, including e-mail and instant messaging (including
>>> content), may be scanned by our systems for the purposes of information
>>> security and assessment of internal compliance with Accenture policy.
>>> ______________________________________________________________________________________
>>>
>>> www.accenture.com

>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegm...@velos.io W: www.velos.io

> --
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
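As a postscript for readers: the "create a local context and execute your flow with some test data" approach Daniel describes in this thread might look like the following minimal Scala sketch. The flow under test (`wordLengths`) and all data are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalContextExample {
  // A trivial stand-in for a real dataflow: map each word to its length.
  def wordLengths(sc: SparkContext, words: Seq[String]): Map[String, Int] =
    sc.parallelize(words).map(w => (w, w.length)).collectAsMap().toMap

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with 2 threads; no cluster needed,
    // which is what makes unit testing and ad-hoc work so convenient.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val sc = new SparkContext(conf)
    try {
      val result = wordLengths(sc, Seq("spark", "mapreduce"))
      assert(result == Map("spark" -> 5, "mapreduce" -> 9))
    } finally {
      sc.stop()
    }
  }
}
```

The same `wordLengths` function can run unchanged against a real cluster by supplying a different master URL, which is the "looks exactly the same in production" property mentioned above.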