[ 
https://issues.apache.org/jira/browse/CASSANDRA-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036471#comment-13036471
 ] 

paul cannon commented on CASSANDRA-2597:
----------------------------------------

Neither are wrong, as far as getting valid answers. Both are wrong in that they 
do much more work than necessary.

h3. FailureDetector/ArrivalWindow

In creating their failure predictor between gossip nodes in a distributed 
system, the original Cassandra authors made a modification to the sample φ 
Accrual Failure Detector implementation from the [original 
paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf].
 They mention in their own [Cassandra 
paper|http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf]
 that "Although the original paper suggests that the distribution is 
approximated by the Gaussian distribution we found the Exponential Distribution 
to be a better approximation, because of the nature of the gossip channel and 
its impact on latency." Nothing more is said on the topic, but it was likely 
because the original Phi Accrual paper implementation was expecting regular 
heartbeat messages, while ArrivalWindow measures only the intervals between 
reception of Gossip 'Syn', 'Ack', and 'Ack2' messages from a given endpoint. 
Regular message transmissions experiencing typical random jitter [will follow a 
normal 
distribution|http://www.maxim-ic.com/app-notes/index.mvp/id/1916/CMP/WP-34], 
but since gossip messages from endpoint A to endpoint B are sent at random 
intervals, they likely make up a [Poisson 
process|http://en.wikipedia.org/wiki/Poisson_process], making the exponential 
distribution appropriate.

I'll show the math here, since someone at some point will probably wonder wtf 
is going on in this code again, and hopefully I can save them the depth of 
exploration I went to.

Take the definition of {{P_later}}, which determines how likely it is that 
endpoint B has failed. The {{t}} parameter is the amount of time elapsed since 
the last Syn/Ack/Ack2 gossip message seen from endpoint B:

{code}
    P_later(t) = 1 - F(t)
{code}

where {{F(t)}} is the 
[CDF|http://en.wikipedia.org/wiki/Cumulative_distribution_function] of the 
event distribution. For the exponential distribution, the CDF is {{1 - 
e^(-Lt)}}, where {{L}} is the rate parameter.

{code}
    P_later(t) = 1 - (1 - e^(-Lt))
{code}

The maximum likelihood estimation for the rate parameter L is given by 
{{1/mean}}, where mean is the arithmetic mean of observed times from the actual 
data (here, the most recent gossip message inter-arrival times from endpoint 
B). It is this rate parameter we expect to vary over time, making necessary the 
storage of the sliding window of arrival intervals.

{code}
    P_later(t) = 1 - (1 - e^(-t/mean))
{code}

The original Cassandra authors stopped here. The Apache Cassandra developers 
made the obvious simplification:

{code}
    P_later(t) = e^(-t/mean)
{code}

But I will go further and look at the way {{P_later}} is used in the phi 
calculation:

{code}
    phi(t) = -log10(P_later(t))
{code}

Expanding to:

{code}
    phi(t) = -log10(e^(-t/mean))
{code}

But wait, the log of an exponentiation? Doesn't that mean...

{code}
    phi(t) = -log(e^(-t/mean)) / log(10)
           = (t/mean) / log(10)
{code}

so approximately

{code}
    phi(t) = 0.4342945 * t/mean
{code}

Yep, that's a hell of a lot simpler for computers to calculate than

{code}
    (-1) * Math.log10(Math.pow(Math.e, ((-1) * (t)/mean)))
{code}

, the way we have been doing it.

h3. DynamicEndpointSnitch/AdaptiveLatencyTracker

This version originated as an optimization of BoundedStatsDeque plus 
ArrivalWindow. The aim was to sort known endpoints by their {{phi}} values, 
assuming the same constant {{t}} value for each.

{code}
    score(E) = phi_E(0.0001)
{code}

(E is the endpoint, and {{phi_E}} is the {{phi}} function which uses the mean 
of recent message inter-arrival times from E, which we'll call {{E_mean}}). 
Expanded:

{code}
score(E) = -log10(P_later_E(0.0001))

P_later_E(t) = e^(-t/E_mean)
{code}

However, when the developer found that the {{score()}} values were going _down_ 
for nodes with higher average latency instead of up, he most likely looked at 
the original version of {{P_later()}} and added back one of the "one minuses", 
and not the other, because that made the signs work as expected. It currently 
uses:

{code}
    P_later_E(t) = 1 - e^(-t/E_mean)
{code}

However, this was probably misguided: the scores were going down because the 
phi accrual failure detector assigns higher badness values to nodes with a low 
recent latency than to nodes with high recent latency, given the same {{t}} 
values for both (because it waits longer to convict a node with high recent 
latency). There's no good mathematical reason for the 
{{AdaptiveLatencyTracker.p()}} method to look exactly like it does. However, 
it's mathematically correct in that it will indeed sort endpoints by recent 
latency.

{code}
    score(E_mean) = -log10(1 - e^(-0.0001/E_mean))
{code}

The {{score()}} method is only used as a measure to sort endpoints. However, as 
defined, it is monotonically increasing \(*), meaning that the exact same 
sorting could be made using the identity function:

{code}
    score(E_mean) = E_mean
{code}

I will attach a patch which makes these simplifications and adjusts the related 
unit tests accordingly.

\(*) The math to show this seems unnecessary here. It's not too hard to work 
through though.

Jira wiki syntax sucks.

> inconsistent implementation of 'cumulative distribution function' for 
> Exponential Distribution
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2597
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2597
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: paul cannon
>            Priority: Minor
>             Fix For: 0.7.7
>
>
> As reported on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/cassandra-dev/201104.mbox/%[email protected]%3E),
> {quote}
> I just found there are two implementations of 'cumulative distribution
> function' for Exponential Distribution and there are inconsistent :
> *FailureDetector*
> {code:java}
> org.apache.cassandra.gms.ArrivalWindow.p(double)
>    double p(double t)
>    {
>        double mean = mean();
>        double exponent = (-1)*(t)/mean;
>        return *Math.pow(Math.E, exponent)*;
>    }
> {code}
> *DynamicEndpointSnitch*
> {code:java}
> org.apache.cassandra.locator.AdaptiveLatencyTracker.p(double)
>    double p(double t)
>    {
>        double mean = mean();
>        double exponent = (-1) * (t) / mean;
>        return *1 - Math.pow( Math.E, exponent);*
>    }
> {code}
> According to the  Exponential Distribution cumulative distribution function
> definition<http://en.wikipedia.org/wiki/Exponential_distribution#Cumulative_distribution_function>,
> the later one is correct
> {quote}
> ... however FailureDetector has been working as advertised for some time now. 
>  Does this mean the Snitch version is actually wrong?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to