I don't have those numbers off-hand, though if I recall correctly the
shuffle spill to disk was coming to several gigabytes per node.

The MapReduce pipeline takes about 2-3 hours, I think, for the full 60-day
data set. Spark chugs along fine for a while and then hangs. We restructured
the flow a few times; in the last iteration it was hanging while trying to
save the feature profiles, with just a couple of tasks remaining (those
tasks ran for 10+ hours before we killed the job). In a previous iteration
we did get it to run through, but only after breaking the flow into two
parts: first saving the raw profiles out to disk, then reading them back in
for scoring.
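
The split looked roughly like this - just a sketch, with the profile
type, paths, and scoring logic all replaced by made-up stand-ins:

    import org.apache.spark.SparkContext

    // Stand-in for our real profile type.
    case class Profile(id: String, features: Array[Double])

    val sc = new SparkContext("spark://master:7077", "profile-flow")

    // Phase 1: build the raw profiles and persist them to disk.
    // Writing them out here cuts the lineage, so scoring starts
    // from a fresh, simple RDD instead of the whole upstream DAG.
    val events = sc.textFile("hdfs:///events")  // placeholder path
    val rawProfiles = events.map(line => Profile(line, Array(1.0)))
    rawProfiles.saveAsObjectFile("hdfs:///tmp/raw-profiles")

    // Phase 2: read the profiles back in and score them.
    val profiles = sc.objectFile[Profile]("hdfs:///tmp/raw-profiles")
    profiles.map(p => (p.id, p.features.sum)).saveAsTextFile("hdfs:///scores")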

That was on just 10 days of data, by the way - one sixth of what the
MapReduce flow normally runs through on the same cluster.

I haven't tracked down the cause. YMMV.


On Mon, Jul 7, 2014 at 8:14 PM, Soumya Simanta <soumya.sima...@gmail.com>
wrote:

>
>
> Daniel,
>
> Do you mind sharing the size of your cluster and the production data
> volumes ?
>
> Thanks
> Soumya
>
> On Jul 7, 2014, at 3:39 PM, Daniel Siegmann <daniel.siegm...@velos.io>
> wrote:
>
> From a development perspective, I vastly prefer Spark to MapReduce. The
> MapReduce API is very constrained; Spark's API feels much more natural to
> me. Testing and local development are also very easy: creating a local
> Spark context is trivial, and it reads local files. For your unit tests,
> you can just have them create a local context and execute your flow with
> some test data. Even better, you can do ad-hoc work in the Spark shell,
> and if you move that into your production code it will look exactly the
> same.
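>
> For instance, a self-contained unit test of a (made-up) word count flow
> can be as simple as this - a sketch, not our actual test code:
>
>     import org.apache.spark.SparkContext
>     import org.apache.spark.SparkContext._  // pair RDD functions
>
>     // A local context runs in-process; no cluster required.
>     val sc = new SparkContext("local", "word-count-test")
>     val words = sc.parallelize(Seq("a", "b", "a"))
>     val counts = words.map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
>     assert(counts("a") == 2 && counts("b") == 1)
>     sc.stop()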
>
> Unfortunately, the picture isn't so rosy when it comes to production. In
> my experience, Spark simply doesn't scale to the volumes that MapReduce
> will handle - not with a Standalone cluster, anyway. Maybe Mesos or YARN
> would be better, but I haven't had the opportunity to try them. I find
> jobs tend to just hang forever, for no apparent reason, on large data
> sets (though still smaller than what I push through MapReduce).
>
> I am hopeful the situation will improve - Spark is developing quickly -
> but if you have large amounts of data you should proceed with caution.
>
> Keep in mind there are some frameworks for Hadoop which hide the ugly
> MapReduce API behind something very similar in form to Spark's API, e.g.
> Apache Crunch. So you might consider those as well.
>
> (Note: the above is with Spark 1.0.0.)
>
>
>
> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
> wrote:
>
>> Hello Experts,
>>
>> I am doing a comparative study of the following:
>>
>> Spark vs Impala
>>
>> Spark vs MapReduce. Is it worth migrating from an existing MR
>> implementation to Spark?
>>
>> Please share your thoughts and expertise.
>>
>> Thanks,
>> Santosh
>>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
