P.S. Additionally, we should focus more on memory efficiency and fast
parallel algorithms, not disk-based approaches.

What I meant is parallel algorithms. For example, something along the lines
of the sketch below.
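
This is a purely in-memory superstep: each peer broadcasts a locally computed
value and reduces whatever it receives, with no disk involved. It's only a
rough sketch; the BSP method names are written from memory, so treat them as
approximate, and computeLocalValue() is just a placeholder for the real
per-peer work.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class InMemoryMaxBSP extends
    BSP<NullWritable, NullWritable, NullWritable, LongWritable, LongWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, LongWritable, LongWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // each peer computes something locally, entirely in memory
    long local = computeLocalValue(peer.getPeerName());

    // broadcast the local value to every peer (including ourselves)
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new LongWritable(local));
    }
    peer.sync();

    // reduce the received messages to a global maximum
    long max = Long.MIN_VALUE;
    LongWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      max = Math.max(max, msg.get());
    }
    peer.write(NullWritable.get(), new LongWritable(max));
  }

  // placeholder for the real per-peer computation
  private long computeLocalValue(String peerName) {
    return peerName.hashCode();
  }
}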

On Sat, Dec 8, 2012 at 9:26 PM, Edward J. Yoon <[email protected]> wrote:
> The 'GraphJobRunner' BSP program already showed why the disk-queue is
> important. The user can always run into memory issues.
>
> But I'm talking about our tasks' priorities. High-performance computers
> and their parts are cheap and getting cheaper. And I'm sure that
> message-passing and in-memory technologies are receiving attention as
> a near-future trend.
>
> In my case, the memory is 40 GB per node. I want to confirm whether
> Hama is a good candidate (ASAP). Hama can't process large data yet, but
> the Hama team is currently working on YARN, FT, and the disk-queue.
>
> On Sat, Dec 8, 2012 at 6:28 PM, Thomas Jungblut
> <[email protected]> wrote:
>> Yes, that's nothing new; my rule of thumb is 10x the input size.
>> Which is bad, but scalability has to be addressed on multiple levels.
>> Spilling the graph to disk is just one part, because the graph consumes at
>> least half of the memory for really sparse graphs.
>> The other part is messaging; removing the bundling and the compression will
>> not save you much space.
>> We are writing messages to disk for fault tolerance anyway, so why not write
>> them to disk directly and then bundle/compress on the fly while sending
>> (e.g. in 32 MB chunks)?
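>>
>> Roughly what I mean, as a quick sketch with plain JDK streams (the class
>> and method names are just placeholders, not anything in our code base):
>>
>> import java.io.BufferedInputStream;
>> import java.io.BufferedOutputStream;
>> import java.io.ByteArrayOutputStream;
>> import java.io.DataOutputStream;
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.io.FileOutputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.util.zip.DeflaterOutputStream;
>>
>> // Sketch: spill every outgoing message straight to local disk instead of
>> // keeping a bundle in RAM, then compress chunk by chunk while sending.
>> class DiskSpillingSender {
>>   private static final int CHUNK_SIZE = 32 * 1024 * 1024; // e.g. 32 MB chunks
>>
>>   private final File spillFile;
>>   private final DataOutputStream spill;
>>
>>   DiskSpillingSender(File dir) throws IOException {
>>     spillFile = File.createTempFile("outgoing", ".msg", dir);
>>     spill = new DataOutputStream(
>>         new BufferedOutputStream(new FileOutputStream(spillFile)));
>>   }
>>
>>   // called for every outgoing message; nothing accumulates in memory
>>   void add(byte[] serializedMessage) throws IOException {
>>     spill.writeInt(serializedMessage.length);
>>     spill.write(serializedMessage);
>>   }
>>
>>   // at sync time: stream the spill file back and compress on the fly, so
>>   // memory use is bounded by CHUNK_SIZE rather than by the message volume
>>   void flushTo(DataOutputStream wire) throws IOException {
>>     spill.close();
>>     byte[] buf = new byte[CHUNK_SIZE];
>>     InputStream in = new BufferedInputStream(new FileInputStream(spillFile));
>>     try {
>>       int read;
>>       while ((read = in.read(buf)) > 0) {
>>         ByteArrayOutputStream chunk = new ByteArrayOutputStream();
>>         DeflaterOutputStream def = new DeflaterOutputStream(chunk);
>>         def.write(buf, 0, read);
>>         def.finish();
>>         byte[] compressed = chunk.toByteArray();
>>         wire.writeInt(compressed.length);
>>         wire.write(compressed);
>>       }
>>     } finally {
>>       in.close();
>>       spillFile.delete();
>>     }
>>   }
>> }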
>>
>> 2012/12/8 Edward J. Yoon <[email protected]>
>>
>>> A task is created per input split, and input splits are created one per
>>> block of each input file by default. If the block size is 60~200 MB, then
>>> 1~3 GB of memory per task is enough.
>>>
>>> Yeah, there's still a queueing/messaging scalability issue, as you
>>> know. However, in my experience, the message bundler and compressor are
>>> mainly responsible for poor scalability, and they consume huge amounts of
>>> memory. This is more urgent than the "queue".
>>>
>>> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
>>> <[email protected]> wrote:
>>> >>
>>> >>  not disk-based.
>>> >
>>> >
>>> > So how do you want to achieve scalability without that?
>>> > In order to process tasks independently of each other (not in parallel,
>>> > but e.g. in small mini-batches), you have to save the state. RAM is
>>> > limited and can't store huge states (persistently, in case of crashes).
>>> >
>>> > 2012/12/7 Suraj Menon <[email protected]>
>>> >
>>> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <[email protected]> wrote:
>>> >>
>>> >> > I think large data processing capability is more important than fault
>>> >> > tolerance at the moment.
>>> >> >
>>> >>
>>> >> +1
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
