Re: [DISCUSS] "Fault Tolerant" system design and consideration of compatibility for the hadoop nextgen or mesos platform.

Chia-Hung Lin Mon, 28 Mar 2011 21:33:51 -0700

Failure detection was introduced to work around the FLP impossibility
result for consensus, but it is also useful in asynchronous
distributed systems in general. For instance, [1] presents an accrual
failure detector and argues that such a service is valuable for system
management, replication, etc. [2] suggests that failure detection
should be a basic service of a distributed system for handling
scenarios in which failures occur.


Regarding worker failure, I think hadoop metrics can be applied for
collecting the internal statistics of the groomserver/jvm. But for
quickly identifying when a failure occurs (regardless of whether it is
a network or host failure) without knowing the internal state of a
process, phi accrual failure detection seems to fit the case better.
Also, when considering the scenario of master failure, failure
detection would be required.
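
To make the phi accrual idea concrete, here is a minimal sketch in
Python. It is only an illustration, not Hama or Hadoop code: the class
name PhiAccrualDetector and the parameters window_size and min_std are
hypothetical, and it assumes heartbeat inter-arrival times are roughly
normally distributed. The monitor records heartbeat arrival times and
reports a suspicion level phi = -log10(P(next heartbeat arrives later
than the time elapsed so far)).

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative phi accrual detector (hypothetical, not a Hama API)."""

    def __init__(self, window_size=100, min_std=0.1):
        # Sliding window of recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None
        # Lower bound on the standard deviation (an assumption) so that
        # perfectly regular heartbeats do not cause a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Return the suspicion level at time `now`."""
        if not self.intervals:
            return 0.0  # not enough history to judge
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_heartbeat
        # Tail probability of a normal distribution: chance that the
        # next heartbeat arrives later than `elapsed`.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on underflow
        return -math.log10(p_later)


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in range(11):  # one heartbeat per second
        detector.heartbeat(float(t))
    for now in (10.5, 12.0, 13.0):
        print(now, detector.phi(now))
```

A master could treat a groomserver as suspected once phi exceeds some
threshold. The threshold value is a tuning choice, not something fixed
by this sketch, and raising it makes the detector less sensitive.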

I think the overall design can basically mimic what has been done in
hadoop, with further measures to prevent issues that previously
occurred in hadoop, if any.

[1] A gossip-style failure detection service.
http://portal.acm.org/citation.cfm?id=866975
[2] A Fault Detection Service for Wide Area Distributed Computations.
http://portal.acm.org/citation.cfm?id=823194



2011/3/29 Edward J. Yoon <[email protected]>:
> I'm reading about the ϕ failure detector. It seems to be widely used
> in distributed databases. I guess the reason is the real-time nature
> of database operations. If I am wrong, please correct me.
>
> In our case, it's batch job processing. I'm not sure that we really
> need to adopt the ϕ failure detector. During job processing, overly
> sensitive detection could be a hindrance rather than a help.
>
> What do you think?
>
> On Mon, Mar 28, 2011 at 3:14 PM, Edward J. Yoon <[email protected]> wrote:
>> I'm attaching the IRC chat log here.
>>
>> In this thread, we'll talk about HAMA-370 "Fault Tolerant" system
>> design and future architecture. :)
>>
>> ----
>> [14:12] ==  server   : stross.freenode.net [Corvallis, OR, USA]
>> [14:12] ==  idle     : 0 days 0 hours 0 minutes 6 seconds [connected:
>> Mon Mar 28 12:42:54 2011]
>> [14:12] == End of WHOIS
>> [14:14] <edyoon_korea> I'm heading out to lunch. CU~
>> [14:25] <chl5011> Sorry, I cannot see the difference. I think that's
>> because I view adapting to e.g. mapreduce2.0 as the same as
>> standalone mode; both have fault tolerance, etc. features.
>> Why would users want to run hama without those features?
>> [14:29] <chl5011> Just curious. I am not keen on porting anything
>> to a new arch (e.g. mesos) immediately, before the issues become
>> clear. It is just that when thinking about the fault tolerance
>> issue, we may also need to take the communication, nexus
>> (master/workers), etc. issues into account.
>> [14:38] <edyoon_korea> Oh, OK. I think it's a miscommunication. 1)
>> Basically, a hama cluster should be able to handle its jobs without
>> other help. 2) At the same time, we should consider compatibility
>> with hadoop or mesos. Right?
>> [14:46] <chl5011> Regarding the first issue, it looks like mesos or
>> mapreduce 2.0 is not suitable for hama because they separate
>> scheduling from the original function of the master server (in our
>> case it is bspmaster).
>> [14:48] <chl5011> Then we might take the original approach, which
>> simply makes bspmaster fault tolerant (zookeeper + multiple masters)
>> and tasks fault tolerant with e.g. checkpoint + re-executing failed
>> tasks.
>> [14:54] <edyoon_korea> yes.
>> [14:56] <edyoon_korea> so i'm still not sure that HAMA-370 is really
>> necessary for us.
>> [14:59] <chl5011> I think HAMA-370 can be seen as part of 363, as
>> monitoring is a broader issue which should cover the probing of
>> process failure.
>> [15:00] <chl5011> the master can deterministically identify whether
>> a process has failed without needing to know the usage of network, etc.
>> [15:01] <chl5011> because the worker does not send any report back
>> (using udp).
>> [15:02] <chl5011> But I think we should implement 363 because it
>> covers more issues, such as which groomserver the master should
>> assign tasks to.
>> [15:07] <edyoon_korea> if you are OK, let's move this discussion to
>> our mailing list.
>> [15:08] <chl5011> np.
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> http://blog.udanax.org
>> http://twitter.com/eddieyoon
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> http://blog.udanax.org
> http://twitter.com/eddieyoon
>



-- 
ChiaHung Lin @ nuk, tw
