Re: [DISCUSS] "Fault Tolerant" system design and consideration of compatibility for the hadoop nextgen or mesos platform.

Edward J. Yoon Mon, 28 Mar 2011 19:35:05 -0700

I'm reading about the ϕ failure detector. It seems to be widely used in
distributed databases. I guess the reason is that databases serve
real-time operations. If I am wrong, please correct me.


In our case, it's batch job processing. I'm not sure that we really
need to adopt the ϕ failure detector. During job processing, overly
sensitive detection could be a hindrance rather than a help.
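
For reference, here is a minimal sketch of how a ϕ (phi accrual)
failure detector turns heartbeat gaps into a suspicion level. It is not
Hama code: the class and method names are hypothetical, and it assumes
a simplified exponential model of heartbeat inter-arrival times, so
treat it as an illustration only.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy phi accrual failure detector (assumed exponential model)."""

    def __init__(self, window_size=100):
        # Sliding window of recent gaps between heartbeats, in seconds.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0  # no history yet, so nothing to suspect
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0  # degenerate history; avoid dividing by zero
        elapsed = now - self.last_heartbeat
        # P(next heartbeat arrives later than `elapsed`) = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

With heartbeats one second apart, 10 seconds of silence gives ϕ of
about 4.3. The sensitivity is set entirely by the threshold on ϕ, so a
batch system could pick a high threshold to tolerate short pauses.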

What do you think?

On Mon, Mar 28, 2011 at 3:14 PM, Edward J. Yoon <[email protected]> wrote:
> I'm attaching the IRC chat log here.
>
> In this thread, we'll talk about HAMA-370 "Fault Tolerant" system
> design and future architecture. :)
>
> ----
> [14:14] <edyoon_korea> I'm heading out to lunch. CU~
> [14:25] <chl5011> Sorry, I can not see the difference. I think that's
> because I view adapting to e.g. mapreduce 2.0 as the same as
> standalone mode; both have fault tolerance and similar features.
> Why would users want to run hama without those features?
> [14:29] <chl5011> Just curious. I am not keen on porting anything
> to a new arch (e.g. mesos) immediately, before the issues become clear.
> It is just that when thinking about the fault tolerance issue, we may
> also need to take the communication, nexus (master/workers), etc.
> issues into account.
> [14:38] <edyoon_korea> Oh, OK. I think it's a miscommunication. 1)
> Basically, a hama cluster should be able to handle its jobs without
> outside help. 2) At the same time, we should consider compatibility
> with hadoop or mesos. Right?
> [14:46] <chl5011> Regarding the first issue, it looks like mesos or
> mapreduce 2.0 is not suitable for hama because they separate
> scheduling from the original function of the master server (in our
> case it is bspmaster).
> [14:48] <chl5011> Then we might take the original approach, which
> simply makes bspmaster fault tolerant (zookeeper + multiple masters)
> and tasks fault tolerant with e.g. checkpointing + re-executing failed
> tasks.
> [14:54] <edyoon_korea> yes.
> [14:56] <edyoon_korea> so I'm still not sure that HAMA-370 is really
> necessary for us.
> [14:59] <chl5011> I think HAMA-370 can be seen as part of 363, as
> monitoring is a broader issue which should cover the probing of
> process failure.
> [15:00] <chl5011> the master can deterministically identify whether a
> process has failed without needing to know the usage of network, etc.
> [15:01] <chl5011> because the worker does not send any report back
> (using udp).
> [15:02] <chl5011> But I think we should implement 363 because it
> covers more issues, such as which groomserver the master should assign
> tasks to.
> [15:07] <edyoon_korea> if that's OK with you, let's move this
> discussion to our mailing list.
> [15:08] <chl5011> np.
>
>
> --
> Best Regards, Edward J. Yoon
> http://blog.udanax.org
> http://twitter.com/eddieyoon
>



-- 
Best Regards, Edward J. Yoon
http://blog.udanax.org
http://twitter.com/eddieyoon
