[jira] [Comment Edited] (MATH-984) Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution

Phil Steitz (JIRA) Wed, 29 May 2013 12:12:22 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669588#comment-13669588
 ]


Phil Steitz edited comment on MATH-984 at 5/29/13 7:12 PM:
-----------------------------------------------------------

Thanks, I get the second problem now.  To really address that issue, I think we 
would need to depart from the current simple equal-sized bins model.  I have 
thought before about introducing either a new class or a config option for 
EmpiricalDistribution that supports alternative binning structures, such as:

1. equiprobable bins (so bin size is not constant)
2. variable bin sizes (break the total range into subranges, allow a number of 
fixed-size bins to be specified for each range, so grid could become very fine 
in densely packed subranges).
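
As a rough illustration of option 1 (not code from Commons Math; the class and 
method names below are hypothetical), equiprobable bin edges could be taken 
from the sample quantiles, here with linear interpolation between order 
statistics:

```java
import java.util.Arrays;

final class EquiprobableBins {

    /**
     * Returns binCount + 1 edges placed at evenly spaced empirical quantiles,
     * so each bin holds roughly the same share of the observations.
     * Bin widths therefore vary with the local density of the data.
     */
    static double[] edges(double[] data, int binCount) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] edges = new double[binCount + 1];
        for (int i = 0; i <= binCount; i++) {
            // Fractional index of the i-th quantile in the sorted sample.
            double pos = (double) i * (sorted.length - 1) / binCount;
            int lo = (int) Math.floor(pos);
            int hi = Math.min(lo + 1, sorted.length - 1);
            // Interpolate linearly between the two neighbouring order statistics.
            edges[i] = sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
        }
        return edges;
    }
}
```

With heavily tied data, neighbouring edges can coincide, so a real 
implementation would have to decide how to handle zero-width bins.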

Regarding the default kernel choice, in the absence of any information about 
the within-bin distributions, I would expect the (correctly truncated) Gaussian 
smoother to perform better than uniform or triangular. I don't have a proof of 
this statement either, but I will see if I can hunt down some references, 
starting with [1], which the class javadoc cites as the basis for the current 
implementation.
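
A minimal sketch of what "correctly truncated" could mean in practice, assuming 
rejection sampling against the bin bounds.  The class, method, and parameter 
names are made up for illustration; this is not the EmpiricalDistribution 
implementation:

```java
import java.util.Random;

final class TruncatedGaussianSketch {

    private static final int MAX_ATTEMPTS = 1000;

    /**
     * Draws from a Gaussian with the given mean and standard deviation,
     * restricted to [lower, upper], by redrawing until a value lands inside.
     */
    static double sample(Random rng, double mean, double sd,
                         double lower, double upper) {
        if (!(lower < upper)) {
            throw new IllegalArgumentException("lower must be below upper");
        }
        if (sd <= 0) {
            // Degenerate kernel: no spread, so return the mean clamped to the bin.
            return Math.min(Math.max(mean, lower), upper);
        }
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            double x = mean + sd * rng.nextGaussian();
            if (x >= lower && x <= upper) {
                return x; // accepted: x follows the truncated Gaussian
            }
        }
        // Assumption of this sketch: if nearly all mass lies outside the bin,
        // fall back to a uniform draw rather than looping indefinitely.
        return lower + (upper - lower) * rng.nextDouble();
    }
}
```

Values produced this way never leave the bin, so non-negative input data cannot 
yield negative samples.  Rejection is the simplest approach; inverting the 
truncated CDF would avoid the retry loop.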

As [1] states, heavy tails create problems for this approach, and in some cases 
an alternative to the Gaussian kernel may work better.  This could actually be 
tested on a case-by-case basis by comparing probabilities computed using the 
Distribution methods now implemented in the class with "true" empirical 
probabilities from the raw data.  Some experiments doing this with different 
kernels and data would be interesting to look at.
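
One way such an experiment could be set up is sketched below.  It assumes the 
fitted model is available as a plain CDF function (in practice presumably the 
distribution's cumulative probability method) and reports the largest gap 
between that CDF and the raw-data empirical CDF at the observed values.  The 
class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

final class CdfComparison {

    /**
     * Largest absolute difference between modelCdf and the empirical CDF
     * (fraction of observations <= x), evaluated at each distinct data value.
     */
    static double maxAbsDifference(double[] data, DoubleUnaryOperator modelCdf) {
        if (data.length == 0) {
            throw new IllegalArgumentException("need at least one observation");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double max = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            // For tied values, evaluate only at the last copy so the
            // empirical CDF counts every observation equal to x.
            if (i + 1 < sorted.length && sorted[i + 1] == sorted[i]) {
                continue;
            }
            double empirical = (i + 1) / (double) sorted.length;
            double model = modelCdf.applyAsDouble(sorted[i]);
            max = Math.max(max, Math.abs(model - empirical));
        }
        return max;
    }
}
```

Running this for the same data with different kernels would give one number per 
kernel to compare; a smaller value means the smoothed distribution tracks the 
raw data more closely at the observed points.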

[1] http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver2_4.html

                
> Incorrect (bugged) generating function getNextValue() in 
> .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
>                 Key: MATH-984
>                 URL: https://issues.apache.org/jira/browse/MATH-984
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.2, 3.1.1
>         Environment: all
>            Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in 
> org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all distributions that are single-tailed or 
> bounded, for example data resembling exponential or lognormal 
> distributions.
> The problem can easily be seen in the code and tested.
> In the latest version of the code
> ...
>               return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. A 
> Gaussian distribution is unbounded, so it sometimes generates numbers 
> outside the bin. When this happens in the last bin, it generates wrong 
> numbers. 
> For example, for non-negative empirical data it will generate negative rubbish.
>   Additionally, the proposed algorithm simply returns the mean value of 
> the bin when the bin contains only one value! This makes the generating 
> function unusable for heavy-tailed distributions with a small number of 
> values (for example, computer network traffic).
> Finally, the use of Gaussian smoothing within the bin will greatly change 
> some properties of the empirical distribution.
> The proposed method should be reworked to be applicable to real data, which 
> often have limited ranges (either non-negative or bounded on both sides).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
