A question that has appeared several times on this list is: "how should
a node monitor its peers to quickly detect failed nodes without wasting
too much bandwidth?" We are writing to announce a toolkit that can help
address this question. 

Failure detection is an important problem in P2P systems, where
detecting failures quickly enables the system to avoid timeouts, to
reduce stale entries, and to heal faster. But determining the right
detection strategy is nontrivial. On the one hand, a node can spend all
its bandwidth pinging peers, which reduces failure detection latency
but leaves no bandwidth for real work. On the other hand, a node can
ping its peers very infrequently, which saves bandwidth but increases
detection latency. Between these two extremes there is an optimal point
where each peer is pinged at just the right frequency for its failure
probability, such that the bandwidth spent on failure detection stays
below a desired limit and failure detection latency is minimal.
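
To make the tradeoff concrete, here is a minimal sketch under a
deliberately simple model that we are assuming for illustration only; it
is not taken from the paper or the toolkit. Suppose peer i fails with
probability p_i, is pinged at rate f_i, and a failure goes unnoticed for
about 1/(2*f_i) seconds on average. Minimizing the sum of p_i/(2*f_i)
under a fixed total ping rate gives, via a Lagrange multiplier, f_i
proportional to sqrt(p_i). The function names, the byte-based budget,
and the per-ping cost are all hypothetical.

```python
import math


def allocate_ping_rates(failure_probs, bandwidth_budget, ping_cost):
    """Split a ping budget across peers in proportion to sqrt(p_i).

    failure_probs: dict peer -> assumed failure probability (hypothetical input)
    bandwidth_budget: bytes per second available for pings (assumption)
    ping_cost: bytes consumed by one ping (assumption)
    Returns dict peer -> pings per second.
    """
    if bandwidth_budget <= 0 or ping_cost <= 0:
        raise ValueError("budget and ping cost must be positive")
    if not failure_probs:
        return {}
    total_rate = bandwidth_budget / ping_cost
    weights = {peer: math.sqrt(p) for peer, p in failure_probs.items()}
    norm = sum(weights.values())
    if norm == 0:
        # No peer is expected to fail: fall back to an even split.
        return {peer: total_rate / len(weights) for peer in weights}
    return {peer: total_rate * w / norm for peer, w in weights.items()}


def expected_latency_cost(failure_probs, rates):
    """Sum of p_i / (2 * f_i), the quantity minimized in this toy model."""
    cost = 0.0
    for peer, p in failure_probs.items():
        if p == 0:
            continue  # a peer that never fails contributes nothing
        cost += p / (2 * rates[peer])
    return cost


probs = {"a": 0.04, "b": 0.01}
rates = allocate_ping_rates(probs, bandwidth_budget=300, ping_cost=100)
uniform = {peer: 1.5 for peer in probs}  # same total rate, split evenly
```

With these example numbers the less reliable peer "a" is pinged twice as
often as "b", and the weighted latency cost is lower than with an even
split of the same budget.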

In a recent paper, we determined the optimal strategy for such a failure
detector. We now have Python and Java implementations of this detector
that can be used in P2P applications. The application defines how pings
are implemented (e.g. ICMP, RMI, or some other app-level protocol); the
toolkit allows pings to be piggy-backed onto naturally occurring
application traffic; and a simple demo that monitors web servers is
included.
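
For readers wondering what "app-defined pings" and piggy-backing might
look like in practice, here is a rough sketch. It is not the toolkit's
API: the PeerMonitor class, its method names, and the per-peer intervals
are hypothetical, and we assume one plausible reading of piggy-backing,
namely that any application message received from a peer counts as
proof of liveness, so an explicit ping is sent only when a peer has
been silent for too long.

```python
class PeerMonitor:
    """Toy failure detector with a pluggable ping function (hypothetical API)."""

    def __init__(self, ping_fn, intervals):
        # ping_fn(peer) -> bool is supplied by the application; it could
        # wrap ICMP, RMI, or any app-level request.
        self.ping_fn = ping_fn
        # intervals: peer -> seconds allowed between liveness proofs.
        self.intervals = dict(intervals)
        self.last_heard = {peer: 0.0 for peer in self.intervals}
        self.failed = set()

    def note_traffic(self, peer, now):
        """Record application traffic from peer as a free liveness proof."""
        self.last_heard[peer] = now
        self.failed.discard(peer)

    def poll(self, now):
        """Ping only peers that have been silent too long; return those pinged."""
        pinged = []
        for peer, interval in self.intervals.items():
            if now - self.last_heard[peer] < interval:
                continue  # heard from recently, no ping needed
            pinged.append(peer)
            if self.ping_fn(peer):
                self.last_heard[peer] = now
                self.failed.discard(peer)
            else:
                self.failed.add(peer)
        return pinged


# Usage with a fake ping function and an explicit clock (no network needed).
alive = {"a": True, "b": False}
monitor = PeerMonitor(lambda peer: alive[peer], {"a": 10, "b": 5})
monitor.note_traffic("a", 8.0)       # "a" just sent us an application message
first_round = monitor.poll(10.0)     # only the silent peer "b" is pinged
second_round = monitor.poll(18.0)    # "a" has now been silent for 10 seconds
```

Passing the clock in explicitly keeps the sketch easy to test; a real
application would presumably drive poll() from a timer and derive the
intervals from its bandwidth budget.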

The code is GPL'ed and available here:
    http://www.cs.cornell.edu/People/egs/sqrt-s/

If you have questions, please direct them to sqrts AT
systems.cs.cornell.edu.

Hope this is useful, and happy hacking,
Gün Sirer &
Kelvin So.


_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
