I've been reading about the ϕ accrual failure detector. It seems to be widely used in distributed databases; I guess the reason is the real-time nature of database operations. If I'm wrong, please correct me.
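For reference, here is a minimal sketch of how a ϕ value can be computed from heartbeat arrival times. It assumes inter-arrival times are roughly normally distributed and takes ϕ as -log10 of the probability that a heartbeat would still arrive later than the time already elapsed. The class name, window size, `min_std` floor, and the threshold of 8.0 are illustrative assumptions, not taken from Hama or any particular library.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative phi accrual detector (hypothetical, not Hama code)."""

    def __init__(self, window_size=100, min_std=0.1):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so perfectly regular heartbeats
        # do not make the detector infinitely sensitive (assumed value).
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; grows as the silence gets longer."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under a normal model.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return math.inf
        return -math.log10(p_later)

    def is_suspect(self, now, threshold=8.0):
        """True once phi exceeds the threshold (8.0 is an assumed default)."""
        return self.phi(now) > threshold


# Ten heartbeats, one second apart, then silence.
detector = PhiAccrualDetector()
for t in range(10):
    detector.heartbeat(float(t))

for now in (9.5, 10.0, 12.0):
    print(now, detector.phi(now), detector.is_suspect(now))
```

The threshold and `min_std` are the knobs that control sensitivity: raising either one makes the detector slower to suspect a worker, which is the trade-off that matters for long-running batch tasks.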
In our case, it's batch job processing, so I'm not sure we really need to adopt the ϕ failure detector. During job processing, overly sensitive detection could be a hindrance rather than a help. What do you think?

On Mon, Mar 28, 2011 at 3:14 PM, Edward J. Yoon <[email protected]> wrote:
> I'm attaching the IRC chat log here.
>
> In this thread, we'll talk about HAMA-370 "Fault Tolerant" system
> design and future architecture. :)
>
> ----
> [14:12] == server : stross.freenode.net [Corvallis, OR, USA]
> [14:12] == idle : 0 days 0 hours 0 minutes 6 seconds [connected:
> Mon Mar 28 12:42:54 2011]
> [14:12] == End of WHOIS
> [14:14] <edyoon_korea> I'm heading out to lunch. CU~
> [14:25] <chl5011> Sorry, I can not see the differences. I think that's
> because I view adapting to e.g. mapreduce2.0 as the same as
> standalone mode; both of which have fault tolerance, etc. features.
> Why would users want to run hama without those features?
> [14:29] <chl5011> Just curious. I am not keen on porting anything
> to new arch (e.g. mesos) immediately before issues are getting clear.
> It is just that when thinking of fault tolerance issue, we may also
> need to take the communication, nexus (master/ workers) etc. issue
> into account.
> [14:38] <edyoon_korea> Oh, OK. I think, it's a mis-communication. 1)
> Basically, hama cluster should be able to handle their jobs without
> other helps. 2) at the same time, we should consider compatibility
> with hadoop or mesos. Right?
> [14:46] <chl5011> Regarding the first issue, it looks like mesos or
> mapreduce 2.0 is not suitable for hama because they separate
> scheduling from the original function of the master server (in our
> case it is bspmaster).
> [14:48] <chl5011> Then we might take the original approach which
> simply makes bspmaster fault tolerant ( zookeeper + multiple masters)
> and tasks fault tolerant with e.g. checkpoint + re-executing failed
> tasks.
> [14:54] <edyoon_korea> yes.
> [14:56] <edyoon_korea> so i'm still not sure, that HAMA-370 is really
> necessary for us.
> [14:59] <chl5011> I think HAMA 370 can be seen as part of 363 as the
> monitoring is a broader issue which should cover the probing of the
> process failure.
> [15:00] <chl5011> the master can deterministically identify if a process
> fails without needing to know the usage of network, etc.
> [15:01] <chl5011> because the worker does not send any report back (using
> udp).
> [15:02] <chl5011> But I think we should implement 363 because it
> covers more issues such as which groomserver the master should assign
> task to.
> [15:07] <edyoon_korea> if you are OK, let's move this discussion to
> our mailing list.
> [15:08] <chl5011> np.
>
>
> --
> Best Regards, Edward J. Yoon
> http://blog.udanax.org
> http://twitter.com/eddieyoon
>

--
Best Regards, Edward J. Yoon
http://blog.udanax.org
http://twitter.com/eddieyoon
