[ 
https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674430#comment-13674430
 ] 

Phil Steitz commented on MATH-984:
----------------------------------

Thanks, Radoslav.  I agree with your point 2 and after looking at the code some 
more, I am going to retract my comment above about direct implementation of 
getNextValue being more efficient.  I think the simplest and best fix for both 
of these problems is to fix getKernel to return a uniform distribution on one 
value for singleton bins, drop the implementation of sample() and have 
getNextValue delegate to the parent's (inversion-based) sample() 
implementation.  The probability methods basically truncate the bin kernels 
now.  The problem is in the direct implementation of sampling.
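
To illustrate why inversion-based sampling stays inside the bins, here is a minimal, dependency-free sketch. It is not the Commons Math code: the class name BinnedInversionSampler and its methods are hypothetical, and the within-bin kernel is assumed to be uniform on the bin (rather than a truncated Gaussian) so that the inverse CDF has a closed form. A singleton bin, where lower equals upper, simply yields its one value.

```java
import java.util.Random;

/**
 * Hypothetical sketch, not the Commons Math implementation.
 * Each bin has a lower bound, an upper bound and a probability mass.
 * Assumption: the within-bin kernel is uniform on [lower, upper].
 */
final class BinnedInversionSampler {
    private final double[] lower;
    private final double[] upper;
    private final double[] cumulative; // cumulative[i] = normalized mass of bins 0..i
    private final Random rng;

    BinnedInversionSampler(double[] lower, double[] upper, double[] mass, long seed) {
        this.lower = lower.clone();
        this.upper = upper.clone();
        this.cumulative = new double[mass.length];
        double sum = 0.0;
        for (int i = 0; i < mass.length; i++) {
            sum += mass[i];
            cumulative[i] = sum;
        }
        // Normalize so that the last entry is exactly 1.
        for (int i = 0; i < mass.length; i++) {
            cumulative[i] /= sum;
        }
        this.rng = new Random(seed);
    }

    /** Inverse CDF: maps p in [0, 1] to a value inside the bin that contains p. */
    double inverseCumulativeProbability(double p) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p out of range: " + p);
        }
        // Find the first bin whose cumulative mass reaches p.
        int i = 0;
        while (i < cumulative.length - 1 && p > cumulative[i]) {
            i++;
        }
        double before = (i == 0) ? 0.0 : cumulative[i - 1];
        double binMass = cumulative[i] - before;
        if (binMass == 0.0 || lower[i] == upper[i]) {
            return lower[i]; // singleton (or empty) bin: only one possible value
        }
        // Position of p within the bin's share of probability, in [0, 1].
        double fraction = (p - before) / binMass;
        return lower[i] + fraction * (upper[i] - lower[i]);
    }

    /** Sampling by inversion: a uniform draw pushed through the inverse CDF. */
    double sample() {
        return inverseCumulativeProbability(rng.nextDouble());
    }
}
```

Because every draw is mapped through the inverse CDF, a sample can never fall below the lowest bin bound or above the highest one, which is the property the direct Gaussian sampling lacks.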
                
> Incorrect (bugged) generating function getNextValue() in 
> .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
>                 Key: MATH-984
>                 URL: https://issues.apache.org/jira/browse/MATH-984
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.2, 3.1.1
>         Environment: all
>            Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in 
> org.apache.commons.math3.random.EmpiricalDistribution
> generates wrong values for all distributions that are single-tailed or 
> bounded, for example data resembling exponential or lognormal 
> distributions.
> The problem is easy to see in the code and to test.
> In the latest version of the code
> ...
>               return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. A 
> Gaussian distribution is unbounded, so it sometimes generates numbers 
> outside the bin. When this happens in the last bin, the generated numbers 
> are wrong. 
> For example, for non-negative empirical data it will generate negative 
> values.
>   Additionally, the algorithm returns only the mean value of the bin when 
> the bin contains a single value. This makes the generating function 
> unusable for heavy-tailed distributions with a small number of values (for 
> example, computer network traffic).
> Finally, Gaussian smoothing within the bin will greatly change some 
> properties of the empirical distribution.
> The method should be reworked to be applicable to real data, which often 
> have limited ranges (either non-negative or bounded on both sides).
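
The leak described in the report can be reproduced without Commons Math. The following is a minimal sketch under stated assumptions: the class GaussianBinLeakDemo and its method countOutside are hypothetical names, and the bin is modelled only by its mean, standard deviation and bounds. It counts how many unbounded Gaussian draws land outside the bin they are meant to smooth.

```java
import java.util.Random;

/** Hypothetical demo, not taken from Commons Math. */
final class GaussianBinLeakDemo {
    /** Counts draws from N(mean, sd^2) that fall outside [lower, upper]. */
    static int countOutside(double mean, double sd, double lower, double upper,
                            int draws, long seed) {
        Random rng = new Random(seed);
        int outside = 0;
        for (int i = 0; i < draws; i++) {
            // Unbounded Gaussian kernel centred on the bin mean.
            double x = mean + sd * rng.nextGaussian();
            if (x < lower || x > upper) {
                outside++;
            }
        }
        return outside;
    }

    public static void main(String[] args) {
        // A bin covering [0, 1] of non-negative data, with mean 0.5 and sd 0.5.
        int leaked = countOutside(0.5, 0.5, 0.0, 1.0, 10_000, 1L);
        System.out.println("draws outside the bin: " + leaked);
    }
}
```

With a bin that starts at zero, every draw below the lower bound is a negative value that non-negative data could never have produced.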

