[
https://issues.apache.org/jira/browse/CASSANDRA-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036471#comment-13036471
]
paul cannon commented on CASSANDRA-2597:
----------------------------------------
Neither are wrong, as far as getting valid answers. Both are wrong in that they
do much more work than necessary.
h3. FailureDetector/ArrivalWindow
In creating their failure predictor between gossip nodes in a distributed
system, the original Cassandra authors made a modification to the sample φ
Accrual Failure Detector implementation from the [original
paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf].
They mention in their own [Cassandra
paper|http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf]
that "Although the original paper suggests that the distribution is
approximated by the Gaussian distribution we found the Exponential Distribution
to be a better approximation, because of the nature of the gossip channel and
its impact on latency." Nothing more is said on the topic, but it was likely
because the original Phi Accrual paper implementation was expecting regular
heartbeat messages, while ArrivalWindow measures only the intervals between
reception of Gossip 'Syn', 'Ack', and 'Ack2' messages from a given endpoint.
Regular message transmissions experiencing typical random jitter [will follow a
normal
distribution|http://www.maxim-ic.com/app-notes/index.mvp/id/1916/CMP/WP-34],
but since gossip messages from endpoint A to endpoint B are sent at random
intervals, they likely make up a [Poisson
process|http://en.wikipedia.org/wiki/Poisson_process], making the exponential
distribution appropriate.
I'll show the math here, since someone at some point will probably wonder wtf
is going on in this code again, and hopefully I can save them the depth of
exploration I went to.
Take the definition of {{P_later}}, which determines how likely it is that
endpoint B has failed. The {{t}} parameter is the amount of time elapsed since
the last Syn/Ack/Ack2 gossip message seen from endpoint B:
{code}
P_later(t) = 1 - F(t)
{code}
where {{F(t)}} is the
[CDF|http://en.wikipedia.org/wiki/Cumulative_distribution_function] of the
event distribution. For the exponential distribution, the CDF is {{1 -
e^(-Lt)}}, where {{L}} is the rate parameter.
{code}
P_later(t) = 1 - (1 - e^(-Lt))
{code}
The maximum likelihood estimation for the rate parameter L is given by
{{1/mean}}, where mean is the arithmetic mean of observed times from the actual
data (here, the most recent gossip message inter-arrival times from endpoint
B). It is this rate parameter we expect to vary over time, making necessary the
storage of the sliding window of arrival intervals.
{code}
P_later(t) = 1 - (1 - e^(-t/mean))
{code}
The original Cassandra authors stopped here. The Apache Cassandra developers
made the obvious simplification:
{code}
P_later(t) = e^(-t/mean)
{code}
But I will go further and look at the way {{P_later}} is used in the phi
calculation:
{code}
phi(t) = -log10(P_later(t))
{code}
Expanding to:
{code}
phi(t) = -log10(e^(-t/mean))
{code}
But wait, the log of an exponentiation? Doesn't that mean...
{code}
phi(t) = -log(e^(-t/mean)) / log(10)
= (t/mean) / log(10)
{code}
so approximately
{code}
phi(t) = 0.4342945 * t/mean
{code}
Yep, that's a hell of a lot simpler for computers to calculate than
{code}
(-1) * Math.log10(Math.pow(Math.e, ((-1) * (t)/mean)))
{code}
, the way we have been doing it.
h3. DynamicEndpointSnitch/AdaptiveLatencyTracker
This version originated as an optimization of BoundedStatsDeque plus
ArrivalWindow. The aim was to sort known endpoints by their {{phi}} values,
assuming the same constant {{t}} value for each.
{code}
score(E) = phi_E(0.0001)
{code}
(E is the endpoint, and {{phi_E}} is the {{phi}} function which uses the mean
of recent message inter-arrival times from E, which we'll call {{E_mean}}).
Expanded:
{code}
score(E) = -log10(P_later_E(0.0001))
P_later_E(t) = e^(-t/E_mean)
{code}
However, when the developer found that the {{score()}} values were going _down_
for nodes with higher average latency instead of up, he most likely looked at
the original version of {{P_later()}} and added back one of the "one minuses",
and not the other, because that made the signs work as expected. It currently
uses:
{code}
P_later_E(t) = 1 - e^(-t/E_mean)
{code}
However, this was probably misguided: the scores were going down because the
phi accrual failure detector assigns higher badness values to nodes with a low
recent latency than to nodes with high recent latency, given the same {{t}}
values for both (because it waits longer to convict a node with high recent
latency). There's no good mathematical reason for the
{{AdaptiveLatencyTracker.p()}} method to look exactly like it does. However,
it's mathematically correct in that it will indeed sort endpoints by recent
latency.
{code}
score(E_mean) = -log10(1 - e^(-0.0001/E_mean))
{code}
The {{score()}} method is only used as a measure to sort endpoints. However, as
defined, it is monotonically increasing \(*), meaning that the exact same
sorting could be made using the identity function:
{code}
score(E_mean) = E_mean
{code}
I will attach a patch which makes these simplifications and adjusts the related
unit tests accordingly.
\(*) The math to show this seems unnecessary here. It's not too hard to work
through though.
Jira wiki syntax sucks.
> inconsistent implementation of 'cumulative distribution function' for
> Exponential Distribution
> ----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-2597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2597
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: paul cannon
> Priority: Minor
> Fix For: 0.7.7
>
>
> As reported on the mailing list
> (http://mail-archives.apache.org/mod_mbox/cassandra-dev/201104.mbox/%[email protected]%3E),
> {quote}
> I just found there are two implementations of 'cumulative distribution
> function' for Exponential Distribution and there are inconsistent :
> *FailureDetector*
> {code:java}
> org.apache.cassandra.gms.ArrivalWindow.p(double)
> double p(double t)
> {
> double mean = mean();
> double exponent = (-1)*(t)/mean;
> return *Math.pow(Math.E, exponent)*;
> }
> {code}
> *DynamicEndpointSnitch*
> {code:java}
> org.apache.cassandra.locator.AdaptiveLatencyTracker.p(double)
> double p(double t)
> {
> double mean = mean();
> double exponent = (-1) * (t) / mean;
> return *1 - Math.pow( Math.E, exponent);*
> }
> {code}
> According to the Exponential Distribution cumulative distribution function
> definition<http://en.wikipedia.org/wiki/Exponential_distribution#Cumulative_distribution_function>,
> the later one is correct
> {quote}
> ... however FailureDetector has been working as advertised for some time now.
> Does this mean the Snitch version is actually wrong?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira