I believe our full 60 days of data contains over ten million unique
entities. For a 10-day window I'm not sure, but it should be in the
millions; I haven't verified that myself, though. That's the scale of the
RDD we're writing to disk (each entry is entityId -> profile).
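
For a rough picture, the write is shaped something like the sketch below.
All names, paths, and the merge logic here are placeholders for
illustration, not our actual flow:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD functions in Spark 1.0

  // Sketch: build a pair RDD of entityId -> profile and write it out.
  val sc = new SparkContext(new SparkConf().setAppName("profile-export"))
  val profiles = sc.textFile("hdfs:///data/events")   // placeholder path
    .map { line =>
      val fields = line.split("\t")                   // placeholder layout
      (fields(0), fields.drop(1).mkString("|"))       // (entityId, profile)
    }
    .reduceByKey(_ + "|" + _)                         // toy profile merge
  profiles.saveAsObjectFile("hdfs:///data/profiles")  // the write in question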

I think it's hard to know how Spark will hold up without trying it
yourself, on your own flow. Also, keep in mind this was with a Spark
Standalone cluster; perhaps Mesos or YARN would hold up better.


On Tue, Jul 8, 2014 at 1:04 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:

> I'll respond for Dan.
>
> Our test dataset was a total of 10 GB of input data (full production
> dataset for this particular dataflow would be 60 GB roughly).
>
> I'm not sure of the size of the final output data, but I think it was on
> the order of 20 GB for the given 10 GB of input. Also, I can say that
> when we were experimenting with persist(DISK_ONLY), the size of all RDDs
> on disk was around 200 GB, which gives a sense of the overall transient
> data volume when nothing is persisted.
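>
> Concretely, the experiment amounted to pinning each intermediate RDD to
> disk along these lines (a sketch only; the parse logic, paths, and stage
> names are placeholders, and our real stages are omitted):
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkContext._  // pair-RDD functions in Spark 1.0
>   import org.apache.spark.storage.StorageLevel
>
>   // Sketch: persist each stage to disk so its materialized size
>   // shows up in the web UI's storage tab.
>   def run(sc: SparkContext): Unit = {
>     val input  = sc.textFile("hdfs:///data/input")
>     val stage1 = input.map(_.split("\t")).persist(StorageLevel.DISK_ONLY)
>     val stage2 = stage1.map(f => (f(0), 1L)).reduceByKey(_ + _)
>       .persist(StorageLevel.DISK_ONLY)
>     stage2.count()  // force materialization
>   }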
>
> In terms of our test cluster, we had 15 nodes. Each node had 24 cores
> and ran 2 workers. Each executor got 14 GB of memory.
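>
> On the application side that corresponds to settings roughly like the
> following (the master URL and app name are placeholders; the
> two-workers-per-node split itself is set in spark-env.sh on each node,
> via SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_MEMORY):
>
>   val conf = new org.apache.spark.SparkConf()
>     .setAppName("test-dataflow")          // placeholder name
>     .setMaster("spark://master:7077")     // placeholder master URL
>     .set("spark.executor.memory", "14g")  // 14 GB per executor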
>
> -Suren
>
>
>
> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
> wrote:
>
>>  When you say "large data sets", how large?
>> Thanks
>>
>>
>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>
>> From a development perspective, I vastly prefer Spark to MapReduce. The
>> MapReduce API is very constrained; Spark's API feels much more natural
>> to me. Testing and local development are also very easy - creating a
>> local Spark context is trivial, and it reads local files. For your unit
>> tests, you can just have them create a local context and execute your
>> flow with some test data. Even better, you can do ad hoc work in the
>> Spark shell, and if you want that code in production it will look
>> exactly the same.
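>>
>> For example, a test can spin up its own context along these lines (a
>> minimal sketch; the word-count flow stands in for your real one):
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>   import org.apache.spark.SparkContext._  // pair-RDD functions in 1.0
>>
>>   // Minimal sketch: "local[2]" runs everything in-process, so the
>>   // test needs no cluster and can read plain local files or test data.
>>   val sc = new SparkContext(
>>     new SparkConf().setMaster("local[2]").setAppName("unit-test"))
>>   try {
>>     val counts = sc.parallelize(Seq("a", "b", "a"))
>>       .map(word => (word, 1)).reduceByKey(_ + _).collectAsMap()
>>     assert(counts("a") == 2)
>>   } finally {
>>     sc.stop()
>>   }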
>>
>> Unfortunately, the picture isn't so rosy when it gets to production. In
>> my experience, Spark simply doesn't scale to the volumes that MapReduce
>> will handle. Not with a Standalone cluster, anyway - maybe Mesos or YARN
>> would be better, but I haven't had the opportunity to try them. I find
>> jobs tend to just hang forever, for no apparent reason, on large data
>> sets (though still smaller than what I push through MapReduce).
>>
>>  I am hopeful the situation will improve - Spark is developing quickly -
>> but if you have large amounts of data you should proceed with caution.
>>
>> Keep in mind there are some frameworks for Hadoop which can hide the
>> ugly MapReduce API behind something very similar in form to Spark's API,
>> e.g. Apache Crunch. So you might consider those as well.
>>
>>  (Note: the above is with Spark 1.0.0.)
>>
>>
>>
>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>> wrote:
>>
>>>  Hello Experts,
>>>
>>>
>>>
>>> I am doing some comparative study on the below:
>>>
>>>
>>>
>>> Spark vs Impala
>>>
>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>> implementation to Spark?
>>>
>>>
>>>
>>>
>>>
>>> Please share your thoughts and expertise.
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Santosh
>>>
>>>
>>
>>
>>
>> --
>>  Daniel Siegmann, Software Developer
>> Velos
>>  Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegm...@velos.io W: www.velos.io
>>
>>
>>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
