[ 
https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669059#comment-13669059
 ] 

Radoslav Tsvetkov commented on MATH-984:
----------------------------------------

Regarding the second issue:
Consider a long-tailed distribution such as the one shown at
http://en.wikipedia.org/wiki/Long_tail (for example network traffic, where
90% of the data volume comes from less than 10% of the connections).
In this case we have extremely widely spread _important_ values, each with
only a _single_ occurrence.
If we want to generate a similar variable (for example a larger sample), we
will get fixed values for all of these bins in the long tail.
This means that 10% of the generated values will be _fixed_ values: their
respective bin means!
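
To make the point concrete, here is a minimal sketch that counts how many
observations end up alone in their bin. It assumes equal-width bins between
the sample minimum and maximum; the class SingletonBinCheck, its methods and
the sample data are made up for illustration and are not part of Commons Math.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Commons Math): finds observations that sit alone in a bin. */
final class SingletonBinCheck {

    /** Assigns each value to one of binCount equal-width bins over [min, max]; returns per-bin counts. */
    static int[] binCounts(double[] data, int binCount) {
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        double width = (max - min) / binCount;
        int[] counts = new int[binCount];
        for (double x : data) {
            int i = width > 0 ? (int) ((x - min) / width) : 0;
            counts[Math.min(i, binCount - 1)]++; // the maximum falls into the last bin
        }
        return counts;
    }

    /** Fraction of observations that are the only member of their bin. */
    static double singletonFraction(double[] data, int binCount) {
        int singles = 0;
        for (int c : binCounts(data, binCount)) {
            if (c == 1) {
                singles++;
            }
        }
        return (double) singles / data.length;
    }

    public static void main(String[] args) {
        // Invented long-tailed sample: many small values, two isolated large ones.
        double[] traffic = {1, 1, 1, 2, 2, 3, 50, 100};
        System.out.println(singletonFraction(traffic, 10));
    }
}
```

If a single-value bin always returns its mean, this fraction is the share of
the original observations that can only ever be reproduced as one constant
each.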


Last issue: I would question the use of a Gaussian kernel at all. Without
having a mathematical proof, I nevertheless suspect that it could distort the
parameters of the generated data when the empirical data are non-Gaussian
(for example Pareto, Tweedie, ...).

Why don't we stick with a triangular or uniform distribution as the default
kernel within the bin?
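
A rough sketch of what such a bounded within-bin kernel could look like,
using only java.util.Random. The class BoundedBinKernel and its method names
are hypothetical, not the existing Commons Math API, and the code assumes
lower < upper and lower <= mode <= upper.

```java
import java.util.Random;

/** Hypothetical within-bin samplers (not part of Commons Math). */
final class BoundedBinKernel {

    /** Uniform draw between the bin edges. */
    static double sampleUniform(Random rng, double lower, double upper) {
        return lower + rng.nextDouble() * (upper - lower);
    }

    /** Triangular draw on [lower, upper] with its peak at mode, obtained by inverting the CDF. */
    static double sampleTriangular(Random rng, double lower, double upper, double mode) {
        double u = rng.nextDouble();
        double range = upper - lower;
        double cut = (mode - lower) / range; // CDF value at the mode
        double x = u < cut
                ? lower + Math.sqrt(u * range * (mode - lower))
                : upper - Math.sqrt((1 - u) * range * (upper - mode));
        return Math.max(lower, Math.min(upper, x)); // guard against rounding at the edges
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        // Example bin [0, 1] with most of its mass near 0.2.
        for (int i = 0; i < 5; i++) {
            System.out.println(sampleTriangular(rng, 0.0, 1.0, 0.2));
        }
    }
}
```

Both samplers stay inside the bin by construction, so non-negative data could
not produce negative values. Note that the mean of a triangular distribution
is (lower + upper + mode) / 3, so putting the mode at the bin mean does not
reproduce the bin mean exactly.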
                
> Incorrect (bugged) generating function getNextValue() in 
> .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
>                 Key: MATH-984
>                 URL: https://issues.apache.org/jira/browse/MATH-984
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.2, 3.1.1
>         Environment: all
>            Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in 
> org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all distributions that are single-tailed or 
> bounded, for example data resembling exponential or lognormal 
> distributions.
> The problem can easily be seen in the code and tested.
> In the latest version of the code
> ...
>               return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. 
> Obviously a Gaussian distribution is unbounded, and it sometimes generates 
> numbers outside the bin. When this happens in the last bin, it will generate 
> wrong numbers. 
> For example, for empirical non-negative data it will generate negative rubbish.
>   Additionally, the proposed algorithm simply returns the mean value of the 
> bin when the bin contains only one value! This makes the generating function 
> unusable for heavy-tailed distributions with a small number of values (for 
> example computer network traffic).
> Finally, the use of Gaussian smoothing within the bin will greatly change 
> some properties of the empirical distribution.
> The proposed method should be reworked to be applicable to real data, which 
> often have limited ranges (either non-negative or bounded on both sides).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
