I'm attaching the IRC chat log here. In this thread, we'll talk about the HAMA-370 "Fault Tolerant" system design and future architecture. :)
----
[14:14] <edyoon_korea> I'm heading out to lunch. CU~
[14:25] <chl5011> Sorry, I cannot see the difference. I think that's because I view adapting to e.g. mapreduce 2.0 as the same as standalone mode; both have fault tolerance, etc. features. Why would users want to run hama without those features?
[14:29] <chl5011> Just curious. I am not keen on porting anything to a new arch (e.g. mesos) immediately, before the issues are clear. It is just that when thinking about the fault tolerance issue, we may also need to take the communication, nexus (master/workers), etc. issues into account.
[14:38] <edyoon_korea> Oh, OK. I think it's a miscommunication. 1) Basically, a hama cluster should be able to handle its jobs without outside help. 2) At the same time, we should consider compatibility with hadoop or mesos. Right?
[14:46] <chl5011> Regarding the first issue, it looks like mesos or mapreduce 2.0 is not suitable for hama because they separate scheduling from the original function of the master server (in our case it is bspmaster).
[14:48] <chl5011> Then we might take the original approach, which simply makes bspmaster fault tolerant (zookeeper + multiple masters) and tasks fault tolerant with e.g. checkpoint + re-executing failed tasks.
[14:54] <edyoon_korea> yes.
[14:56] <edyoon_korea> so i'm still not sure that HAMA-370 is really necessary for us.
[14:59] <chl5011> I think HAMA-370 can be seen as part of 363, as monitoring is a broader issue which should cover the probing of process failure.
[15:00] <chl5011> the master can deterministically identify whether a process fails without needing to know the usage of network, etc.
[15:01] <chl5011> because the worker does not send any report back (using udp).
[15:02] <chl5011> But I think we should implement 363 because it covers more issues, such as which groomserver the master should assign tasks to.
[15:07] <edyoon_korea> if you are OK, let's move this discussion to our mailing list.
[15:08] <chl5011> np.

--
Best Regards, Edward J. Yoon
http://blog.udanax.org
http://twitter.com/eddieyoon
