>
> The ‘plfit’ implementation also uses the maximum likelihood principle to
> determine alpha for a given xmin; When xmin is not given in advance, the
> algorithm will attempt to find its optimal value *for which the p-value of
> a Kolmogorov-Smirnov test between the fitted distribution and the original
> sample is the largest.*
>
This is not true; plfit does the opposite and looks for the _smallest_
value of the _KS test statistic_ instead (not the p-value); see this line
in the plfit code:

https://github.com/ntamas/plfit/blob/master/src/plfit.c#L1155

Basically, the test statistic of the Kolmogorov-Smirnov test is the largest
absolute difference between the observed and the fitted CDF along the Y
axis. A small test statistic means a good fit. However, it is probably true
that a smaller test statistic means a larger p-value, so in some sense the
documentation is correct, but it does not describe exactly what's going on
behind the scenes in the algorithm.
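
Just to make this concrete, here is a rough sketch in R of the idea (this
is not the actual plfit code; the real implementation also handles
discrete samples and is more careful numerically), assuming a continuous
power law and a numeric sample x:

    ks_stat <- function(x, xmin) {
      # keep only the tail that the power law is supposed to describe
      tail_x <- sort(x[x >= xmin])
      n <- length(tail_x)
      # maximum likelihood estimate of the exponent for this xmin
      alpha <- 1 + n / sum(log(tail_x / xmin))
      # CDF of the fitted power law, evaluated at the observed points
      fitted <- 1 - (tail_x / xmin)^(-(alpha - 1))
      # empirical CDF of the tail
      empirical <- seq_len(n) / n
      # KS statistic: the largest vertical distance between the two CDFs
      max(abs(empirical - fitted))
    }

    best_xmin <- function(x) {
      candidates <- sort(unique(x))
      candidates <- head(candidates, -1)  # keep at least two tail points
      d <- vapply(candidates, function(xm) ks_stat(x, xm), numeric(1))
      candidates[which.min(d)]            # smallest D wins, as in plfit
    }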

> KS.p    Numeric scalar, the p-value of the Kolmogorov-Smirnov test. *Small
> p-values (less than 0.05) indicate that the test rejected the hypothesis
> that the original data could have been drawn from the fitted power-law
> distribution*.
>
> This suggests that large KS.p means greater likelihood that the
> distribution could have come from the power-law distribution.
>

Let me explain what the emphasized part means; I think it is correct.

The KS test goes like this. You have a null hypothesis that the observed
sample was drawn from a certain power-law distribution (whose parameters we
have determined with the fitting process). You calculate the test statistic
D, which is constructed in a way that smaller D values mean that the
observed sample is "more similar" to the CDF of the fitted power law.

The p-value is then the probability that, given that the null hypothesis is
true, the test statistic is larger than or equal to the test statistic that
we have observed. So, if your test statistic is, say, 0.02 and the
corresponding p-value is, say, 0.04, it means that _if_ the sample came
from the fitted distribution, the probability of seeing a test statistic
that is larger than or equal to 0.02 would be 0.04 (in other words, such
an outcome is unlikely).
It does _not_ say anything about whether the sample was really drawn from
the fitted distribution; in other words, it does _not_ say anything about
whether the fitted distribution is correct or not. It simply says: _if_ the
null hypothesis is true, it is unlikely that you would have achieved a
result that is at least as extreme as the one that you have seen.
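
If it helps, this definition can be turned into a small Monte Carlo
sketch (purely an illustration of the definition; it is not how plfit
computes its p-value). It reuses the ks_stat function from the sketch
above, with alpha and xmin being the fitted parameters:

    p_value_mc <- function(d_observed, alpha, xmin, n, replicates = 1000) {
      d_null <- replicate(replicates, {
        # draw a sample of size n from the fitted power law (inverse CDF
        # method); here the null hypothesis is true by construction
        x <- xmin * runif(n)^(-1 / (alpha - 1))
        ks_stat(x, xmin)
      })
      # fraction of null samples at least as "extreme" as the observed one
      mean(d_null >= d_observed)
    }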

The typical interpretation of the p-value is that small p-values mean that
your null hypothesis is most likely not true (it was "disproven" by the
test), while large p-values mean that your null hypothesis could either be
true or false, and the test could not disprove it. A statistical test can
never "confirm" your null hypothesis, but that is usually not the goal
anyway, because the null hypothesis tends to be something "uninteresting".

The power-law fit is an odd beast, though: here, first you perform a
_fitting_ of the parameters of the power-law distribution to your observed
data, and _then_ perform a test where the null hypothesis is that the fit
is good. In this case, the null hypothesis is _exactly_ what you are
looking for, and a small p-value means that the test "refuted" the null
hypothesis, hence your fit is not good. Large p-values are good. Small
p-values mean that no matter how hard you try to fit a power-law to your
observed sample, it is most likely not a power-law distributed sample.
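
With igraph's fit_power_law this reading boils down to something like the
following sketch (the graph g is just an arbitrary example; KS.p is the
field from the documentation quoted above):

    library(igraph)

    g <- sample_pa(10000)   # example graph, only to have something to fit
    fit <- fit_power_law(degree(g), implementation = "plfit")
    if (fit$KS.p < 0.05) {
      # the test rejected the fit: the degree sequence is most likely not
      # power-law distributed, no matter how the parameters are tuned
      message("power-law fit rejected")
    } else {
      # the test could not reject the fit; this does NOT prove a power
      # law, it only means the test found no evidence against it
      message("power-law fit not rejected")
    }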

> In a complete graph, each of N vertices has degree N-1; definitely not a
> power-law. Yet: [...]
>
> $KS.p
> [1] 1
>
> If the explanation of KS.p is correct, this suggests a strong fit to power
> law,
>
No, it does not. Large p-values do not mean anything; it is as if the test
were throwing its hands in the air and saying something like "I have no
idea whether your data is a power law or not".
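
For reference, the complete-graph experiment quoted above was presumably
something along these lines (the original code was elided from the quote):

    library(igraph)

    g <- make_full_graph(1000)   # every vertex has degree 999
    fit <- fit_power_law(degree(g), implementation = "plfit")
    fit$KS.p                     # reported as 1 in the quoted message

A degenerate degree sequence like this gives the test essentially nothing
to work with, which is exactly the hands-in-the-air situation described
above.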


> However, looking at the other extreme, let's generate a distribution
> expected to follow the power law:
>
> sfp <- sample_fitness_pl(1000, 50000, 2.2)
>
1000 vertices is probably too small to observe a "real" power law; your
sample will suffer from finite size effects, and I think that's why the
test says that it is probably not a power-law. Another problem could be the
number of edges; with 50000 edges on 1000 vertices your mean degree will
be around 100 (2 × 50000 / 1000, since each edge has two endpoints), which
is not very typical for "real" power-laws. Discrete power-law distributions
like the Yule-Simon or the zeta distribution have means closer to 1; for
instance, for the Yule-Simon distribution the mean is s / (s - 1), where s
is its shape parameter.
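
A quick sanity check of these numbers in R:

    # mean degree of an undirected graph with 1000 vertices and 50000
    # edges; each edge contributes two endpoints
    2 * 50000 / 1000      # 100

    # mean of a Yule-Simon distribution with shape parameter s, s > 1
    # (2.2 is used only as an example value here)
    s <- 2.2
    s / (s - 1)           # about 1.83, i.e. much closer to 1 than 100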

T.
