[
https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669588#comment-13669588
]
Phil Steitz edited comment on MATH-984 at 5/29/13 7:12 PM:
-----------------------------------------------------------
Thanks, I get the second problem now. To really address that issue, I think we
would need to depart from the current simple equal-sized bins model. I have
thought before about introducing either a new class or a config option for
EmpiricalDistribution that supports alternative binning structures, such as:
1. equiprobable bins (so bin size is not constant)
2. variable bin sizes (break the total range into subranges and allow a number
of fixed-size bins to be specified for each subrange, so the grid could become
very fine in densely packed subranges).
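To make option 1 concrete, here is a minimal sketch of how equiprobable bin edges could be derived from the sorted sample. The class and method names (EquiprobableBins, edges, binIndex) are hypothetical and are not part of Commons Math; the sketch only assumes that bin boundaries are taken at evenly spaced order statistics.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Commons Math class): equiprobable bin edges. */
final class EquiprobableBins {

    /**
     * Returns binCount + 1 ascending edges taken at evenly spaced order
     * statistics, so each bin holds roughly the same number of observations.
     * Heavily tied data can produce repeated edges (zero-width bins).
     */
    static double[] edges(double[] data, int binCount) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] edges = new double[binCount + 1];
        for (int i = 0; i <= binCount; i++) {
            // Position of the i-th boundary within the sorted sample.
            int idx = (int) Math.round((double) i * (sorted.length - 1) / binCount);
            edges[i] = sorted[idx];
        }
        return edges;
    }

    /**
     * Returns the index of the bin containing x. A value on an interior edge
     * goes to the upper bin; values at or beyond the last edge go to the last bin.
     */
    static int binIndex(double[] edges, double x) {
        int binCount = edges.length - 1;
        for (int i = 0; i < binCount; i++) {
            if (x < edges[i + 1]) {
                return i;
            }
        }
        return binCount - 1;
    }
}
```

With skewed data the bins near the bulk of the sample become narrow and the tail bin becomes wide, which is the behaviour option 1 is after.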
Regarding the default kernel choice, in the absence of any information about
the within-bin distributions, I would expect the (correctly truncated) Gaussian
smoother to perform better than uniform or triangular. I don't have a proof of
this statement either, but I will see if I can hunt down some references,
starting with [1], which the class javadoc cites as the basis for the current
implementation.
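For illustration, one way a "correctly truncated" Gaussian draw within a bin might look is simple rejection sampling. This is only a sketch under stated assumptions: the class name, the attempt limit, and the uniform fallback are all hypothetical choices, and nothing here is taken from the EmpiricalDistribution source.

```java
import java.util.Random;

/** Hypothetical sketch (not Commons Math code): Gaussian draw truncated to a bin. */
final class TruncatedGaussianSampler {

    /** Assumed cap on rejection attempts before falling back. */
    private static final int MAX_ATTEMPTS = 1000;

    /**
     * Draws from a Gaussian with the given mean and standard deviation,
     * rejecting any value outside [lower, upper]. If every attempt is
     * rejected, falls back to a uniform draw over the bin (a design choice
     * of this sketch, not something the issue prescribes).
     */
    static double sample(Random rng, double mean, double sd, double lower, double upper) {
        if (!(lower < upper) || !(sd > 0)) {
            throw new IllegalArgumentException("need lower < upper and sd > 0");
        }
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            double x = mean + sd * rng.nextGaussian();
            if (x >= lower && x <= upper) {
                return x;
            }
        }
        return lower + (upper - lower) * rng.nextDouble();
    }
}
```

Because every returned value lies inside the bin, non-negative input data can no longer yield negative samples from the first bin.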
As [1] states, heavy tails create problems for this approach, and in some cases
an alternative to the Gaussian kernel may work better. This could actually be
tested case by case by comparing probabilities computed using the
Distribution methods now implemented in the class with "true" empirical
probabilities from the raw data. Some experiments doing this with different
kernels and data would be interesting to look at.
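A rough sketch of such a comparison follows. It computes the "true" empirical probability P(X <= x) from the raw data and reports the largest absolute gap to a model CDF at the observed points. The names are hypothetical, the model CDF is passed in as a plain function (wiring it to the class's cumulative probability method is left as an assumption), and the result is a simplified check at the data points rather than the full Kolmogorov-Smirnov statistic.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: compare a model CDF with raw empirical probabilities. */
final class EmpiricalCheck {

    /** Fraction of observations <= x; the array must be sorted ascending. */
    static double empiricalCdf(double[] sorted, double x) {
        int lo = 0;
        int hi = sorted.length;
        // Binary search for the first index whose value exceeds x.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Largest |modelCdf(x) - empiricalCdf(x)| over the observed values. */
    static double maxAbsDifference(double[] data, DoubleUnaryOperator modelCdf) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double worst = 0.0;
        for (double x : sorted) {
            double gap = Math.abs(modelCdf.applyAsDouble(x) - empiricalCdf(sorted, x));
            worst = Math.max(worst, gap);
        }
        return worst;
    }
}
```

Running this for each candidate kernel on the same data set would give one number per kernel to compare.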
[1] http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver2_4.html
> Incorrect (bugged) generating function getNextValue() in
> .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
> Key: MATH-984
> URL: https://issues.apache.org/jira/browse/MATH-984
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 3.2, 3.1.1
> Environment: all
> Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in
> org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all distributions that are single-tailed or
> bounded, for example data resembling Exponential or Lognormal
> distributions.
> The problem can easily be seen in the code and tested.
> In the latest version of the code
> ...
> return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. Obviously
> a Gaussian distribution is not bounded, and it sometimes generates numbers
> outside the bin. When that happens in the last bin, it will generate wrong
> numbers.
> For example, for non-negative empirical data it will generate negative rubbish.
> Additionally, the proposed algorithm boldly returns only the mean value of
> the bin when the bin holds a single value! This makes the generating function
> unusable for heavy-tailed distributions with a small number of values (for
> example, computer network traffic).
> Finally, the use of Gaussian smoothing within the bin will greatly change
> some properties of the empirical distribution.
> The proposed method should be reworked to be applicable to real data, which
> often have limited ranges (either non-negative or bounded on both sides).