We've kind of hijacked Santosh's original thread, so apologies for that. Let
me try to get back to his original question on Map/Reduce versus Spark.

I would say it's worth migrating from M/R, with the following thoughts.

Just my opinion, but I would summarize the latest emails in this thread as
follows: Spark can scale to datasets in the tens and hundreds of GBs. I've
seen some companies talk about TBs of data, but I'm unclear whether that is
for a single flow.

At the same time, some folks I've seen on the user group (my team included)
have a lot of difficulty with datasets of the same size. That points to
environmental issues (machines, cluster mode, etc.), the nature of the data,
the nature of the transforms/flow complexity (though Kevin's experience runs
counter to the latter, which is very positive), or us just doing something
subtly wrong.

My overall opinion right now is that Map/Reduce is generally easier to get
working on very large, heterogeneous datasets, but Spark's programming model
is the right way to go and worth the effort.

Libraries like Scoobi, Scrunch and Scalding (and their associated Java
counterparts) provide a Spark-like wrapper around Map/Reduce, but my guess
is that, since they are limited to Map/Reduce under the covers, they cannot
do some of the optimizations that Spark can, such as collapsing several
transforms into a single stage.
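
To make that concrete, here is a rough sketch (paths and field names are
invented) of the kind of chain I mean. Spark pipelines the narrow
transformations (map/filter) into a single stage and only breaks the stage
at the reduceByKey shuffle, whereas each Map/Reduce job gives you exactly
one map and one reduce:

    import org.apache.spark.{SparkConf, SparkContext}

    object StageCollapseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("stage-collapse").setMaster("local[2]"))

        // map -> filter -> map are narrow transformations; Spark pipelines
        // them into one stage, touching each input record only once.
        val counts = sc.textFile("hdfs:///tmp/events.tsv")  // hypothetical path
          .map(_.split("\t"))
          .filter(_.length > 2)
          .map(fields => (fields(1), 1L))
          .reduceByKey(_ + _)  // the shuffle here is the only stage boundary

        counts.saveAsTextFile("hdfs:///tmp/event-counts")
        sc.stop()
      }
    }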

In addition, my company believes that having batch, streaming and SQL (ad
hoc querying) on a single platform has worthwhile benefits.
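
As a small illustration of the batch + SQL half of that (only a sketch
against the Spark 1.x SQLContext API we've been using; the schema, table
name and paths are invented, and the register call below was renamed
registerTempTable in 1.1):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Invented schema, purely for illustration.
    case class Event(userId: String, amount: Double)

    object BatchPlusSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("batch-plus-sql").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD

        // "Batch" side: an ordinary RDD built with normal transformations.
        val events = sc.textFile("hdfs:///tmp/events.csv")  // hypothetical path
          .map(_.split(","))
          .map(f => Event(f(0), f(1).toDouble))

        // "SQL" side: the same data registered as a table and queried ad hoc.
        events.registerAsTable("events")  // registerTempTable in Spark 1.1+
        sqlContext.sql("SELECT userId, SUM(amount) FROM events GROUP BY userId")
          .collect()
          .foreach(println)

        sc.stop()
      }
    }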

We're still relatively new to Spark (a few months in), so I would also love
to hear more from others in the community.

-Suren



On Tue, Jul 8, 2014 at 2:17 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:

> Also, the exact same flow completed fine with 1 GB of input data.
>
> -Suren
>
>
> On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> How wide are the rows of data, either the raw input data or any generated
>> intermediate data?
>>
>> We are at a loss as to why our flow doesn't complete. We banged our heads
>> against it for a few weeks.
>>
>> -Suren
>>
>>
>>
>> On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>> Nothing particularly custom. We've tested with small (4-node)
>>> development clusters, single-node pseudo-clusters, and bigger, using
>>> plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
>>> Spark local, and Spark on YARN (client and cluster) modes, with total
>>> memory resources ranging from 4 GB to 256 GB+.
>>>
>>> K
>>>
>>>
>>>
>>> On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
>>>
>>> To clarify, we are not persisting to disk. That was just one of the
>>> experiments we did because of some issues we had along the way.
>>>
>>>  At this time, we are NOT using persist but cannot get the flow to
>>> complete in Standalone Cluster mode. We do not have a YARN-capable cluster
>>> at this time.
>>>
>>>  We agree with what you're saying. Your results are what we were hoping
>>> for and expecting. :-)  Unfortunately we still haven't gotten the flow to
>>> run end to end on this relatively small dataset.
>>>
>>> It must be something related to our cluster, standalone mode, or our
>>> flow, but as far as we can tell, we are not doing anything unusual.
>>>
>>>  Did you do any custom configuration? Any advice would be appreciated.
>>>
>>>  -Suren
>>>
>>>
>>>
>>>
>>> On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey <kevin.mar...@oracle.com>
>>> wrote:
>>>
>>>>  It seems to me that you're not taking full advantage of the lazy
>>>> evaluation, especially persisting to disk only.  While it might be true
>>>> that the cumulative size of the RDDs looks like it's 300GB, only a small
>>>> portion of that should be resident at any one time.  We've evaluated data
>>>> sets much greater than 10GB in Spark using the Spark master and Spark with
>>>> Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
>>>> is that it reports the actual memory *demand*, not just the memory
>>>> requested for driver and workers.  Processing a 60GB data set through
>>>> thousands of stages in a rather complex set of analytics and
>>>> transformations consumed a total cluster resource (divided among all
>>>> workers and driver) of only 9GB.  We were somewhat startled at first by
>>>> this result, thinking that it would be much greater, but realized that it
>>>> is a consequence of Spark's lazy evaluation model.  This is even with
>>>> several intermediate computations being cached as input to multiple
>>>> evaluation paths.
>>>>
>>>> Good luck.
>>>>
>>>> Kevin
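
(For anyone skimming the thread, Kevin's point about laziness is roughly the
pattern below: transformations only describe lineage, nothing is materialized
until an action runs, so only the partitions in flight need to be resident,
and cache() pins an intermediate result that feeds multiple downstream paths.
Names and paths are invented; assume sc is the usual SparkContext, e.g. in
spark-shell.)

    // Hypothetical record type and parser, just for illustration.
    case class Record(userId: String, ok: Boolean)
    def parse(line: String): Record = {
      val f = line.split("\t"); Record(f(0), f.length > 1)
    }

    val raw     = sc.textFile("hdfs:///tmp/raw")  // nothing is read yet
    val parsed  = raw.map(parse)                  // still nothing computed
    val cleaned = parsed.filter(_.ok).cache()     // pinned: two paths reuse it

    // Each action triggers a job; the cached RDD is computed once and reused.
    val total  = cleaned.count()
    val byUser = cleaned.map(r => (r.userId, 1L)).reduceByKey(_ + _)
    byUser.saveAsTextFile("hdfs:///tmp/by-user")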
>>>>
>>>>
>>>>
>>>> On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
>>>>
>>>> I'll respond for Dan.
>>>>
>>>> Our test dataset was a total of 10 GB of input data (the full
>>>> production dataset for this particular dataflow would be roughly 60 GB).
>>>>
>>>> I'm not sure what the size of the final output data was, but I think it
>>>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>>>> say that when we were experimenting with persist(DISK_ONLY), the size of
>>>> all RDDs on disk was around 200 GB, which gives a sense of the overall
>>>> transient memory usage with no persistence.
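
(For anyone wanting to repeat that measurement: the persist experiment was
literally just swapping the storage level on the intermediate RDDs, along
the lines of the sketch below. DISK_ONLY writes each computed partition to
local disk instead of caching it in memory, which is what let us total up
the ~200 GB on disk. Path and names are invented; assume sc is a
SparkContext as usual.)

    import org.apache.spark.storage.StorageLevel

    // Stand-in for one of the flow's intermediate RDDs.
    val intermediate = sc.textFile("hdfs:///tmp/intermediate").map(_.split("\t"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // DISK_ONLY instead spills every computed partition of this RDD to disk.
    intermediate.persist(StorageLevel.DISK_ONLY)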
>>>>
>>>> In terms of our test cluster, we had 15 nodes, each with 24 cores and
>>>> 2 workers. Each executor got 14 GB of memory.
>>>>
>>>>  -Suren
>>>>
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>>>> wrote:
>>>>
>>>>>  When you say "large data sets", how large?
>>>>> Thanks
>>>>>
>>>>>
>>>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>>>
>>>>> From a development perspective, I vastly prefer Spark to MapReduce.
>>>>> The MapReduce API is very constrained; Spark's API feels much more
>>>>> natural to me. Testing and local development are also very easy -
>>>>> creating a local Spark context is trivial, and it reads local files. For
>>>>> your unit tests you can just have them create a local context and execute
>>>>> your flow with some test data. Even better, you can do ad-hoc work in the
>>>>> Spark shell, and if you want that in your production code it will look
>>>>> exactly the same.
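
(The local-context testing pattern Daniel describes looks roughly like the
sketch below; the wordCount flow and names are stand-ins of mine, and I'm
using a bare main with an assert rather than any particular test framework.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object LocalContextTestSketch {
      // The flow under test, factored as a function over RDDs so the same
      // code runs on local test data and on real data on the cluster.
      def wordCount(lines: RDD[String]): RDD[(String, Long)] =
        lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)

      def main(args: Array[String]): Unit = {
        // A local, in-process "cluster" with 2 threads; no Hadoop setup needed.
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("wordcount-test"))
        try {
          val result = wordCount(sc.parallelize(Seq("a b a"))).collectAsMap()
          assert(result("a") == 2L && result("b") == 1L)
        } finally {
          sc.stop()
        }
      }
    }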
>>>>>
>>>>>  Unfortunately, the picture isn't so rosy when it gets to production.
>>>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce
>>>>> will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
>>>>> would be better, but I haven't had the opportunity to try them. I find 
>>>>> jobs
>>>>> tend to just hang forever for no apparent reason on large data sets (but
>>>>> smaller than what I push through MapReduce).
>>>>>
>>>>>  I am hopeful the situation will improve - Spark is developing
>>>>> quickly - but if you have large amounts of data you should proceed with
>>>>> caution.
>>>>>
>>>>>  Keep in mind there are some frameworks for Hadoop which can hide the
>>>>> ugly MapReduce with something very similar in form to Spark's API; e.g.
>>>>> Apache Crunch. So you might consider those as well.
>>>>>
>>>>>  (Note: the above is with Spark 1.0.0.)
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>>>> wrote:
>>>>>
>>>>>>  Hello Experts,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am doing some comparative study on the below:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Spark vs Impala
>>>>>>
>>>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>>>> implementation to Spark?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please share your thoughts and expertise.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Santosh
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>  Daniel Siegmann, Software Developer
>>>>> Velos
>>>>>  Accelerating Machine Learning
>>>>>
>>>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
