[jira] [Comment Edited] (MATH-1120) Need Percentile computations that can be matched with standard spreadsheet formula

Venkatesha Murthy TS (JIRA) Wed, 18 Jun 2014 11:28:52 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036080#comment-14036080
 ]


Venkatesha Murthy TS edited comment on MATH-1120 at 6/18/14 6:27 PM:
---------------------------------------------------------------------

Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am 
attaching my new patch with today's date(18-jun). However please advise if i 
need to remove the old patch file if it confuses.

Please find my response below. The new patch has the suggested changes in the 
switch case for nan handling; But; However i have my view points on the 
different default nan strategies associated to types. Please permit me to 
explain (sorry for long summary)

First, 
i would like to leave the Default implementation of Percentile as-is (Meaning 
in my MATH-1120 patch it is mapped to Type.CM)since otherwise we will break 
user old expectation even for non nan and non inf entries as well. The existing 
Percentile implementation also did a copy of the array slice before doing kth 
slection(the old evaluate method can be refered) and even i am doing the same 
it slightly in different circumstance in the switch case with nan filtering. 
The existing tests could fail if we change the default types to say R_8 (please 
refer to PercentileTest.java code as well to see the finer variations in 
expected values that is being looked at for different types. I have used R 
execution as the basis for setting those values).  

Secondly,
Percentile.java header comment states somewhere to an effect that NaNs would be 
(left as-is and) handled by java's default sort behavior and no removal being 
done. So for me to map this behaviour to new implementation; it was 
NaNStrategy.FIXED that came close and didnt require any of the existing test 
cases for the existing Percentile behavior to change. What i am re-iterating 
here is the existing behavior tests have completely passed with new Type.CM and 
FIXED. (And now i have added several more tests including different types as 
well).

Thirdly,
While all the R_x (where x :[1-9])  types as run and verified by R tool; seemed 
to clearly convey the NaNs needed to be removed and hence you see that i have 
used different strategy NaNStrategy#REMOVED. 
I agree while multiple defaults are not wise to have ; however; if we are 
forced to have Apache CM as supported type (which is not one of R_x types) and 
we have the need to support multiple variants (R1- R9) ; then it is inevitable 
to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a 
controlled manner as different use cases may exist for needing this variation 
in nan handling. Therefore IMO while we could avoid the public access to change 
these defaults; it is relevant to support these variations of nan handling on a 
per type and allow atleast sub classes to override if a rare need arises.
While the very name NaNStrategy reminds me of  different ways to look at that; 
i feel we will be much restricted  if we just said that we stick to one way of 
NaNHandling for all types. Please let me know your thoughts.

Next,
Regarding the PivotingStrategy; At first, i wanted to convey here that to have 
all the partitioning, pivoting and selection in separated classes/enums than 
inside main class. I have made it as static due to the fact that; it is more of 
a non-functional requirement and felt that it need not be set for every 
instance (more of a global setting that doesnt vary across types). 
Please correct me here and let know if it still needs to be per instance. 
I also made it package accessable/settable solely because medianOf3 method had 
been package level for the sole intent of possible overriding of the same 
within that package. Meaning; if some one really needed to tinker around 
pivoting they need a way to do it which i have provided it using a strategy.

Next,
In the current patch that i am going to attach as new dated patch (since you 
have already started looking at the old one ; which i would leave it as is). I 
find many utility type methods; replaceNaN, removeNaN( Predicated Lists ) and 
copyOf(values, begin.length) and as well as KthSelector with PivotingStrategy 
etc all of which can perhaps make its way to MathArrays and MathUtils. Please 
let me know.If so i will once again re-factor these changes and submit the 
patch.

Thanks for reading this through and for your time in reviewing . Please let me 
know your opinion on all of these.

thanks
venkat.



was (Author: vmurthy):
Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am 
attaching my new patch with today's date(18-jun). However please advise if i 
need to remove the old patch file if it confuses.

Please find my response below. The new patch has the suggested changes in the 
switch case for nan handling; But; However i have my view points on the 
different default nan strategies associated to types. Please permit me to 
explain (sorry for long summary)

First, 
i would like to leave the Default implementation of Percentile as-is (Meaning 
in my MATH-1120 patch it is mapped to Type.CM)since otherwise we will break 
user old expectation even for non nan and non inf entries as well. The existing 
tests does fail if we change the default types (please refer to 
PercentileTest.java code as well to see the finer variations that is being 
looked at for different types)

Secondly,
Percentile.java header comment states somewhere to an effect that NaNs would be 
(left as-is and) handled by java's default sort behavior and no removal being 
done. So for me to map this behaviour to new implementation; it was 
NaNStrategy.FIXED that came close and didnt require any of the existing test 
cases for the existing Percentile behavior to change. What i am re-iterating 
here is the existing behavior tests have completely passed with new Type.CM and 
FIXED. (And now i have added several more tests including different types as 
well).

Thirdly,
While all the R_x (where x :[1-9])  types as run and verified by R tool; seemed 
to clearly convey the NaNs needed to be removed and hence you see that i have 
used different strategy NaNStrategy#REMOVED. 
I agree while multiple defaults are not wise to have ; however; if we are 
forced to have Apache CM as supported type (which is not one of R_x types) and 
we have the need to support multiple variants (R1- R9) ; then it is inevitable 
to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a 
controlled manner as different use cases may exist for needing this variation 
in nan handling. Therefore IMO while we could avoid the public access to change 
these defaults; it is relevant to support these variations of nan handling on a 
per type and allow atleast sub classes to override if a rare need arises.
While the very name NaNStrategy reminds me of  different ways to look at that; 
i feel we will be much restricted  if we just said that we stick to one way of 
NaNHandling for all types. Please let me know your thoughts.

Next,
Regarding the PivotingStrategy; At first, i wanted to convey here that to have 
all the partitioning, pivoting and selection in separated classes/enums than 
inside main class. I have made it as static due to the fact that; it is more of 
a non-functional requirement and felt that it need not be set for every 
instance (more of a global setting that doesnt vary across types). 
Please correct me here and let know if it still needs to be per instance. 
I also made it package accessable/settable solely because medianOf3 method had 
been package level for the sole intent of possible overriding of the same 
within that package. Meaning; if some one really needed to tinker around 
pivoting they need a way to do it which i have provided it using a strategy.

Next,
In the current patch that i am going to attach as new dated patch (since you 
have already started looking at the old one ; which i would leave it as is). I 
find many utility type methods; replaceNaN, removeNaN( Predicated Lists ) and 
copyOf(values, begin.length) and as well as KthSelector with PivotingStrategy 
etc all of which can perhaps make its way to MathArrays and MathUtils. Please 
let me know.If so i will once again re-factor these changes and submit the 
patch.

Thanks for reading this through and for your time in reviewing . Please let me 
know your opinion on all of these.

thanks
venkat.


> Need Percentile computations that can be matched with standard spreadsheet 
> formula
> ----------------------------------------------------------------------------------
>
>                 Key: MATH-1120
>                 URL: https://issues.apache.org/jira/browse/MATH-1120
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.2
>            Reporter: Venkatesha Murthy TS
>              Labels: Percentile
>             Fix For: 4.0
>
>         Attachments: 18-jun-percentile-with-estimation-patch, 
> excel-percentile-patch, percentile-with-estimation-patch, r-output.txt
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The current Percentile implementation assumes and hard-codes the quantile pth 
> position as 
> p * (N+1)/100 and provides a kth selected value.
> However if we need to verify compare/contrast with standard statistical tools 
> such as say MS Excel; it would be good to provide an extensible way of 
> morphing this selection of position than hard code.
> For example in order to generate the percentile closely matching with MS 
> Excel the position required may be [p*(N-1)/100]+1.
> Please let me know if i could submit this as a patch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (MATH-1120) Need Percentile computations that can be matched with standard spreadsheet formula

Reply via email to