[DISCUSS] "Fault Tolerant" system design and consideration of compatibility for the hadoop nextgen or mesos platform.

Edward J. Yoon Sun, 27 Mar 2011 23:15:23 -0700

I'm attaching the IRC chat log here.

In this thread, we'll talk about HAMA-370 "Fault Tolerant" system
design and future architecture. :)


----
[14:14] <edyoon_korea> I'm heading out to lunch. CU~
[14:25] <chl5011> Sorry, I cannot see the difference. I think that's
because I view adapting to e.g. mapreduce 2.0 as the same as
standalone mode; both have features such as fault tolerance.
Why would users want to run hama without those features?
[14:29] <chl5011> Just curious. I am not keen on porting anything
to a new arch (e.g. mesos) right away, before the issues become clear.
It is just that when thinking about the fault tolerance issue, we may
also need to take the communication, nexus (master/workers), etc.
issues into account.
[14:38] <edyoon_korea> Oh, OK. I think it's a miscommunication. 1)
Basically, a hama cluster should be able to handle its jobs without
outside help. 2) At the same time, we should consider compatibility
with hadoop or mesos. Right?
[14:46] <chl5011> Regarding the first issue, it looks like mesos or
mapreduce 2.0 is not suitable for hama because they separate
scheduling from the original function of the master server (in our
case it is bspmaster).
[14:48] <chl5011> Then we might take the original approach, which
simply makes bspmaster fault tolerant (zookeeper + multiple masters)
and tasks fault tolerant with e.g. checkpoint + re-executing failed
tasks.
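
For illustration only, the checkpoint-and-re-execute idea for task fault
tolerance could look roughly like the Python sketch below. This is not
Hama code: CheckpointStore, run_task, and step_fn are hypothetical names,
the in-memory store stands in for durable storage, and a RuntimeError
stands in for a task crash.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for durable checkpoint storage (hypothetical)."""

    def __init__(self):
        self._saved = {}

    def save(self, task_id, superstep, state):
        # Store a copy so later in-place changes cannot corrupt the checkpoint.
        self._saved[task_id] = (superstep, copy.deepcopy(state))

    def load(self, task_id):
        # Returns (next superstep to run, state), or None if nothing was saved.
        if task_id not in self._saved:
            return None
        superstep, state = self._saved[task_id]
        return superstep, copy.deepcopy(state)


def run_task(task_id, step_fn, initial_state, total_supersteps, store,
             max_attempts=3):
    """Run a task, resuming from its last checkpoint after each failure."""
    for _ in range(max_attempts):
        checkpoint = store.load(task_id)
        if checkpoint is None:
            superstep, state = 0, copy.deepcopy(initial_state)
        else:
            superstep, state = checkpoint
        try:
            while superstep < total_supersteps:
                state = step_fn(superstep, state)
                superstep += 1
                store.save(task_id, superstep, state)
            return state
        except RuntimeError:
            # Simulated task failure: re-execute from the last checkpoint.
            continue
    raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")
```

In this sketch a failed task repeats only the superstep that crashed,
because every completed superstep has already been saved.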
[14:54] <edyoon_korea> yes.
[14:56] <edyoon_korea> so i'm still not sure that HAMA-370 is really
necessary for us.
[14:59] <chl5011> I think HAMA-370 can be seen as part of 363, as
monitoring is a broader issue which should cover probing for
process failure.
[15:00] <chl5011> the master can deterministically identify whether a
process has failed without needing to know the network usage, etc.
[15:01] <chl5011> because the worker does not send any report back (using udp).
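
One reading of the point above is that the master treats a worker as
failed once its reports stop arriving. A minimal Python sketch of that
idea follows, assuming a fixed silence timeout; HeartbeatMonitor, the
10-second timeout, and the worker ids are hypothetical and not part of
Hama, and the UDP transport itself is left out.

```python
import time


class HeartbeatMonitor:
    """Tracks the last report time per worker and flags silent workers."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the sketch runs without sleeping
        self.last_seen = {}

    def record_report(self, worker_id):
        # Called whenever a report (e.g. a UDP datagram) arrives from a worker.
        self.last_seen[worker_id] = self.clock()

    def failed_workers(self):
        # A worker is presumed failed once it has been silent longer
        # than the timeout.
        now = self.clock()
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Usage with a fake clock (values are illustrative).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=10, clock=lambda: fake_now[0])
monitor.record_report("groom-1")
monitor.record_report("groom-2")
fake_now[0] = 8.0
monitor.record_report("groom-2")
fake_now[0] = 15.0
print(monitor.failed_workers())  # prints ['groom-1']
```

The check needs only arrival times of reports, not their contents, which
matches the claim that no network usage data is required.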
[15:02] <chl5011> But I think we should implement 363 because it
covers more issues, such as which groomserver the master should assign
tasks to.
[15:07] <edyoon_korea> if you are OK, let's move this discussion to
our mailing list.
[15:08] <chl5011> np.


-- 
Best Regards, Edward J. Yoon
http://blog.udanax.org
http://twitter.com/eddieyoon
