A task is created per input split, and input splits are created one
per block of each input file by default. If the block size is 60~200
MB, 1~3 GB of memory per task is enough.
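
A rough back-of-envelope sketch in Java (the file size, block size,
and in-memory overhead factor below are assumptions for illustration,
not Hama defaults):

  // One split (and therefore one task) per HDFS block; per-task memory
  // is estimated as the block size times an assumed record overhead.
  public class SplitMemoryEstimate {
    public static void main(String[] args) {
      long fileSize  = 10L * 1024 * 1024 * 1024;  // hypothetical 10 GB input file
      long blockSize = 128L * 1024 * 1024;        // hypothetical 128 MB block size
      double overhead = 10.0;                     // assumed in-memory blow-up vs. raw bytes

      long splits = (fileSize + blockSize - 1) / blockSize;  // one split per block
      long perTaskMem = (long) (blockSize * overhead);       // ~1.25 GB in this example

      System.out.printf("splits/tasks: %d, ~%d MB per task%n",
          splits, perTaskMem / (1024 * 1024));
    }
  }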

Yeah, there's still a queueing/messaging scalability issue as you
know. However, in my experience, the message bundler and compressor
are mainly responsible for the poor scalability and consume huge
amounts of memory. This is more urgent than the "queue" issue.
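
To make the point concrete, here is a hypothetical sketch (not Hama's
actual bundler/compressor code) of why buffering and then compressing
a whole superstep's outgoing messages in memory is expensive:

  import java.io.ByteArrayOutputStream;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.zip.GZIPOutputStream;

  // Every outgoing message is held on the heap until the superstep ends,
  // then compressed into a second in-memory buffer, so peak memory grows
  // with the per-superstep message volume.
  public class BundleMemorySketch {
    public static void main(String[] args) throws Exception {
      List<byte[]> bundle = new ArrayList<>();
      for (int i = 0; i < 1_000_000; i++) {
        bundle.add(("msg-" + i).getBytes());       // buffered, not streamed to disk
      }
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
        for (byte[] m : bundle) gz.write(m);       // compression needs another copy
      }
      System.out.println("compressed bundle bytes: " + out.size());
    }
  }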

On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
<[email protected]> wrote:
>>
>>  not disk-based.
>
>
> So how do you want to achieve scalability without that?
> In order to process tasks independently of each other (not in parallel,
> but e.g. in small mini batches), you have to save the state. RAM is
> limited and can't store huge states persistently (in case of crashes).
>
> 2012/12/7 Suraj Menon <[email protected]>
>
>> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <[email protected]
>> >wrote:
>>
>> > I think large data processing capability is more important than fault
>> > tolerance at the moment.
>> >
>>
>> +1
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon