[Hadoop Wiki] Update of "ZooKeeper/GSoCFailureDetector" by AbmarBarros

Apache Wiki Mon, 16 Aug 2010 11:45:52 -0700


The "ZooKeeper/GSoCFailureDetector" page has been changed by AbmarBarros.
http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector?action=diff&rev1=11&rev2=12

--------------------------------------------------

   1. --(Refactor server-side code of the client-server monitoring to use the 
proposed FailureDetector interface)--
   1. --(Refactor the code of the server-server monitoring to use the proposed 
FailureDetector interface)--
   1. --(Make the failure detection and its parameters configurable on the 
server (to server-server and client-server monitoring))--
-  1. Evaluate the QoS metrics with experimentation
+  1. --(Evaluate the QoS metrics with experimentation)--
   1. --(Write Forrest docs)--
  
  == Related JIRA ==
@@ -106, +106 @@

   * Made Chen's alpha parameter configurable, and not a quarter of the timeout
  
  ==== 16/Aug/10 ====
+  * Finished experimentation and wrote the experiment report
- 
- ----
  
  == Experimentation ==
  
@@ -115, +114 @@

  
   * '''First batch of tests''':
    * 1 client and 1 server connected by a transcontinental link (Campina 
Grande-Brazil / Newark-USA)
+   * client running for 10 min (on average)
    * link = 1MBps, 250ms
    * timeout = 5000ms
    * replication = 5
@@ -126, +126 @@

  
   * '''Second batch of tests''':
    * 200 clients and 1 server connected in an emulated WAN in emulab
+   * clients running for 10 min (on average)
    * link = 2MBps, 250ms, message loss probability of 0.1 
    * timeout = 5000ms
-   * used the following failure detectors with default parameters:
+   * used the following failure detectors with fixed parameters:
     * Fixed heartbeat
     * Chen (alpha = 1250)
     * Bertier (moderationstep = 1000)
+    * Phi-accrual (threshold = 2)
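
To make the role of the alpha parameter listed above concrete, here is a minimal Python sketch of the expected-arrival estimation described by Chen et al. It is not the ZooKeeper implementation. The class name `ChenEstimator`, the 1000 ms heartbeat interval, and the window size are assumptions made for illustration; only alpha = 1250 comes from the test setup.

```python
from collections import deque


class ChenEstimator:
    """Hypothetical sketch of Chen et al.'s expected-arrival estimation."""

    def __init__(self, interval_ms, alpha_ms, window_size=1000):
        self.interval_ms = interval_ms  # assumed fixed sending interval
        self.alpha_ms = alpha_ms        # safety margin added to the estimate
        self.samples = deque(maxlen=window_size)  # (sequence number, arrival time)

    def heartbeat(self, seq, arrival_ms):
        """Record the arrival time of heartbeat number `seq`."""
        self.samples.append((seq, arrival_ms))

    def expected_arrival(self):
        """Estimate when the next heartbeat should arrive."""
        if not self.samples:
            raise ValueError("no heartbeat received yet")
        # Average offset between each arrival and its nominal sending time.
        offset = sum(a - self.interval_ms * s for s, a in self.samples) / len(self.samples)
        next_seq = self.samples[-1][0] + 1
        return offset + next_seq * self.interval_ms

    def deadline(self):
        """Time after which the monitored node becomes suspected."""
        return self.expected_arrival() + self.alpha_ms


est = ChenEstimator(interval_ms=1000, alpha_ms=1250)
for seq, arrival in [(1, 1250), (2, 2260), (3, 3240)]:
    est.heartbeat(seq, arrival)
```

In this sketch a larger alpha pushes the deadline further out, trading detection time for fewer false suspicions, which is the trade-off the administrator has to tune.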
  
  ==== Results ====
+  * '''First batch of tests''':
+ 
+  * '''Second batch of tests''':
+    * In these tests, the Fixed heartbeat and Bertier strategies did not 
produce any false suspicions. With the given alpha, Chen's produced 13/200 
false suspicions, and the Phi-accrual, with the windowminsize parameter equal 
to 0, falsely suspected all the clients. Below, we show the average detection 
time of all methods but the Phi-accrual: 
+ 
+    * The Phi-accrual method must be evaluated again with a better 
windowminsize parameter and in a scenario with a longer duration, so that the 
warm-up period is not considered.   
  
  ==== Concluding remarks ====
  
  As expected, we noticed that the fixed heartbeat method works well when we 
run ZooKeeper in a controlled environment, where the network behavior is 
predictable. In these cases we can tune the fixed timeout after some network 
analysis. However, in scenarios where the network behavior changes, such as in 
a WAN, the adaptive methods can be a good pick. Below, there is an overview of 
each failure detector:
   * '''Fixed heartbeat''': On average, with default parameters, the fixed 
heartbeat strategy had the highest detection time, but no false suspicions. 
However, if the timeout is not well chosen, failures may take a long time to be 
detected, or the false suspicion rate may increase. As said before, this 
strategy is useful in a controlled environment, in which the network can be 
characterized.
   * '''Chen''': This strategy requires some assumptions about the network, 
since the administrator needs to define the alpha parameter, the safety margin 
for the estimation. However, with default parameters, the method of Chen et al. 
performed well in a WAN deployment. It managed to decrease the average 
detection time with a low false suspicion rate.
-  * '''Bertier''': Bertier et al. initially proposed a failure detector that 
requires no assumptions about the network, only a single moderation step that 
is added to the estimation when a heartbeat is received while the monitored 
process is in the suspected state. With these experiments, we have come to the 
same conclusion as Hayashibara et al.: this failure detector is very sensitive 
to message loss and to fluctuation in the arrival times of heartbeats. In this 
sense, the moderation step turned out to be an important parameter for this 
failure detector. With a moderation step of 1000, Bertier's failure detector 
reached a lower average detection time than Chen's method, higher than the 
fixed heartbeat strategy, however there were no false suspicions.
+  * '''Bertier''': Bertier et al. initially proposed a failure detector that 
requires no assumptions about the network, only a single moderation step that 
is added to the estimation when a heartbeat is received while the monitored 
process is in the suspected state. With these experiments, we have come to the 
same conclusion as Hayashibara et al.: this failure detector is very sensitive 
to message loss and to fluctuation in the arrival times of heartbeats. In this 
sense, the moderation step turned out to be an important parameter for this 
failure detector. With a moderation step of 1000, Bertier's failure detector 
reached a higher average detection time than Chen's method, but a lower one 
than the fixed heartbeat strategy. It is worth mentioning that Bertier's 
failure detector was primarily designed to be used over local area networks 
(LANs), that is, environments in which messages are seldom lost.
-  * '''Phi-accrual''':
+  * '''Phi-accrual''': The phi-accrual is the method that requires the least 
information about the network behavior. However, it relies on a large sampling 
window to produce a good estimation. As we could see, in the experiments in 
which a minimum window size was not used, there was a huge number of false 
suspicions. The effect of the threshold is only noticeable when there is some 
deviation from the average. The phi-accrual stands out in a WAN with unknown 
behavior, but it is mandatory to set a good (high) initial timeout value for 
the warm-up period of the method, which lasts until the minimum window size is 
reached.
+   
  ----
  == Design decisions ==
  
@@ -154, +163 @@

   * Decided to use the FD on the same thread of the application
  
  ==== How to use application message in adaptive failure detectors? ====
+  * Decided to just delay the estimated arrival time of the next message, and 
not to use this message in the timeout adaptation.
  
  ==== Due to the usage of application messages as heartbeat, the actual 
heartbeats are not sent regularly. How to compute the next estimated arrival 
time? ====
+  * Decided to use heartbeat interarrival times. When an application message 
is received, the time of the last received heartbeat is shifted. 
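
One possible reading of the two decisions above is sketched below in Python. This is an interpretation, not the actual module: an application message moves the reference time forward, which delays the next estimated arrival, but it adds no sample to the window used for adaptation. The class `ArrivalTracker` and its method names are hypothetical.

```python
from collections import deque


class ArrivalTracker:
    """Hypothetical tracker of heartbeat interarrival times."""

    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)
        self.reference_ms = None  # time of the last (possibly shifted) heartbeat

    def on_heartbeat(self, now_ms):
        """A real heartbeat contributes an interarrival sample."""
        if self.reference_ms is not None:
            self.intervals.append(now_ms - self.reference_ms)
        self.reference_ms = now_ms

    def on_application_message(self, now_ms):
        """An application message only shifts the reference time."""
        if self.reference_ms is None or now_ms > self.reference_ms:
            self.reference_ms = now_ms

    def next_expected_ms(self):
        """Reference time plus the mean interarrival time, or None."""
        if self.reference_ms is None or not self.intervals:
            return None
        return self.reference_ms + sum(self.intervals) / len(self.intervals)


tracker = ArrivalTracker()
for t in (0, 1000, 2000):
    tracker.on_heartbeat(t)
expected_before = tracker.next_expected_ms()
tracker.on_application_message(2400)
expected_after = tracker.next_expected_ms()
```

In this sketch the application message at 2400 ms postpones the next expected arrival by 400 ms and leaves the sampling window unchanged.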
  
  ==== How to report sampling window statistical data from Learners to Leader? 
====
+  * Decided to do heartbeat tracking on the Learners; the mean and standard 
deviation of the heartbeat interarrival times are then reported to the Leader. 
A new method in the FailureDetector interface was created to comply with this 
requirement.
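
As a rough illustration of what a Learner could report, the Python sketch below reduces a sampling window to its mean and standard deviation. The names `WindowStats` and `summarize` are hypothetical and do not correspond to the method added to the FailureDetector interface, whose signature is not given here. Using the population standard deviation is also an assumption.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class WindowStats:
    """Hypothetical summary a Learner could send to the Leader."""
    mean_ms: float
    std_ms: float
    samples: int


def summarize(intervals_ms):
    """Reduce a window of interarrival times to mean and standard deviation."""
    if not intervals_ms:
        return WindowStats(0.0, 0.0, 0)
    return WindowStats(fmean(intervals_ms), pstdev(intervals_ms), len(intervals_ms))


stats = summarize([900, 1000, 1100])
```

Reporting only these two values keeps the message small, at the cost of the Leader not seeing the individual samples.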
  
  == Future work ==
   * Update the C client to use the Failure Detector module. This may require 
implementing all failure detectors in C. 
[[https://issues.apache.org/jira/browse/ZOOKEEPER-848]]
   * Analyze the overhead of the timeout computation on adaptive FDs.
-  * Contrast adaptive FDs behaviour with sampling window full and in a warm-up 
period (when sampling window is not full). 
+  * Contrast adaptive FDs behaviour with sampling window full and in a warm-up 
period (when sampling window is not full).
+  * Extend experimentation in order to cover other scenarios, such as 
different number of nodes, experiment duration, infrastructure (link 
characteristics) and failure detection parameters.
