P.S. Additionally, we should focus more on memory efficiency and fast parallel algorithms, not on disk-based approaches.
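
To make that concrete, here is a rough, untested sketch of the kind of
purely in-memory, message-passing BSP job I have in mind. I'm writing the
Hama BSP API from memory, so class and method names may be slightly off,
and InMemorySum / loadPartitionIntoMemory() are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

/**
 * Each peer keeps its partition in RAM, computes a partial result, and
 * exchanges only small messages -- no disk-backed queue involved.
 */
public class InMemorySum extends
    BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {

    // Stand-in for whatever data this peer already holds in memory.
    double[] partition = loadPartitionIntoMemory();

    double partial = 0.0;
    for (double v : partition) {
      partial += v;
    }

    // Send only the small partial result to a designated master peer.
    String master = peer.getPeerName(0);
    peer.send(master, new DoubleWritable(partial));
    peer.sync();

    // The master drains its in-memory message queue and writes the total.
    if (peer.getPeerName().equals(master)) {
      double total = 0.0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        total += msg.get();
      }
      peer.write(new Text("sum"), new DoubleWritable(total));
    }
  }

  private double[] loadPartitionIntoMemory() {
    // Placeholder: in a real job this would come from the input split.
    return new double[] { 1.0, 2.0, 3.0 };
  }
}

The whole exchange stays in RAM as long as each peer's partition does;
that is the kind of algorithm I'd like us to optimize for.
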
What I meant is example parallel algorithms, like the sketch above. (I've
also put a rough illustration of Thomas's on-the-fly bundling/compression
idea at the bottom of this mail.)

On Sat, Dec 8, 2012 at 9:26 PM, Edward J. Yoon <[email protected]> wrote:
> The 'GraphJobRunner' BSP program has already shown why the disk-queue
> is important. Users can always run into memory issues.
>
> But I'm talking about our tasks' priorities. High-performance computers
> and their parts are cheap and getting cheaper, and I'm sure
> message-passing and in-memory technologies are receiving attention as a
> near-future trend.
>
> In my case, the memory is 40 GB per node. I want to confirm (ASAP)
> whether Hama is a good candidate. Hama can't process large data yet,
> but the Hama team is currently working on YARN, FT, and the disk-queue.
>
> On Sat, Dec 8, 2012 at 6:28 PM, Thomas Jungblut
> <[email protected]> wrote:
>> Yes, that's nothing new; my rule of thumb is 10x the input size.
>> Which is bad, but scalability must be addressed on multiple levels.
>> Spilling the graph to disk is just one part, because the graph
>> consumes at least half of the memory for really sparse graphs.
>> The other part is messaging; removing the bundling and the compression
>> will not save you much space.
>> We are writing messages to disk for fault tolerance anyway, so why not
>> write them directly and then bundle/compress on the fly while sending
>> (e.g. in 32 MB chunks)?
>>
>> 2012/12/8 Edward J. Yoon <[email protected]>
>>
>>> A task is created per input split, and input splits are created one
>>> per block of each input file by default. If the block size is
>>> 60~200 MB, 1~3 GB of memory per task is enough.
>>>
>>> Yeah, there's still a queueing/messaging scalability issue, as you
>>> know. However, in my experience, the message bundler and compressor
>>> are mainly responsible for the poor scalability and consume huge
>>> amounts of memory. This is more urgent than the "queue".
>>>
>>> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
>>> <[email protected]> wrote:
>>> >>
>>> >> not disk-based.
>>> >
>>> > So how do you want to achieve scalability without that?
>>> > In order to process tasks independently of each other (not in
>>> > parallel, but e.g. in small mini-batches), you have to save the
>>> > state. RAM is limited and can't hold huge states (which must be
>>> > persistent in case of crashes).
>>> >
>>> > 2012/12/7 Suraj Menon <[email protected]>
>>> >
>>> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon
>>> >> <[email protected]> wrote:
>>> >>
>>> >> > I think large data processing capability is more important than
>>> >> > fault tolerance at the moment.
>>> >>
>>> >> +1
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
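
P.P.S. To illustrate Thomas's point above about writing messages out
directly and bundling/compressing them on the fly while sending in
fixed-size chunks: the sketch below uses only plain java.util.zip and
Hadoop's Writable, not Hama's actual messenger code, and the class name,
chunk size, and stream wiring are all made up:

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.io.Writable;

/**
 * Illustrative only: serialize messages straight into a compressed
 * stream and flush a chunk to the wire whenever roughly 32 MB of
 * uncompressed message data has accumulated, instead of bundling
 * everything in memory first.
 */
public class ChunkedMessageSender {

  private static final int CHUNK_BYTES = 32 * 1024 * 1024; // ~32 MB chunks

  private final OutputStream wire; // socket/RPC stream to the remote peer
  private GZIPOutputStream compressor;
  private DataOutputStream out;

  public ChunkedMessageSender(OutputStream wire) throws IOException {
    this.wire = wire;
    startChunk();
  }

  private void startChunk() throws IOException {
    compressor = new GZIPOutputStream(wire);
    out = new DataOutputStream(compressor);
  }

  /** Serialize and compress one message; flush the chunk when full. */
  public void send(Writable message) throws IOException {
    message.write(out);
    if (out.size() >= CHUNK_BYTES) {
      finishChunk();
      startChunk();
    }
  }

  private void finishChunk() throws IOException {
    compressor.finish(); // push this chunk's compressed bytes to the wire
    wire.flush();
  }

  /** Flush the last (possibly partial) chunk. */
  public void close() throws IOException {
    finishChunk();
  }
}

Each chunk is an independent gzip member, so the receiver can decompress
and process chunks as they arrive instead of waiting for one huge bundle.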
