Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

venkatesha m Wed, 18 Jun 2014 11:28:52 -0700

Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am 
attaching my new patch with today's date(18-jun). However please advise if i 
need to remove the old patch file if it confuses.

Please find my response below. The new patch has the suggested changes in the 
switch case for nan handling; But; However i have my view points on the 
different default nan strategies associated to types. Please permit me to 
explain (sorry for long summary)

First,
i would like to leave the Default implementation of Percentile as-is (Meaning 
in my MATH-1120 patch it is mapped to Type.CM)since otherwise we will break 
user old expectation even for non nan and non inf entries as well. The existing 
Percentile implementation also did a copy of the array slice before doing kth 
slection(the old evaluate method can be refered) and even i am doing the same 
it slightly in different circumstance in the switch case with nan filtering. 
The existing tests could fail if we change the default types to say R_8 (please 
refer to PercentileTest.java code as well to see the finer variations in 
expected values that is being looked at for different types. I have used R 
execution as the basis for setting those values).

Secondly,
Percentile.java header comment states somewhere to an effect that NaNs would be 
(left as-is and) handled by java's default sort behavior and no removal being 
done. So for me to map this behaviour to new implementation; it was 
NaNStrategy.FIXED that came close and didnt require any of the existing test 
cases for the existing Percentile behavior to change. What i am re-iterating 
here is the existing behavior tests have completely passed with new Type.CM and 
FIXED. (And now i have added several more tests including different types as 
well).

Thirdly,
While all the R_x (where x :[1-9]) types as run and verified by R tool; seemed 
to clearly convey the NaNs needed to be removed and hence you see that i have 
used different strategy NaNStrategy#REMOVED.
I agree while multiple defaults are not wise to have ; however; if we are 
forced to have Apache CM as supported type (which is not one of R_x types) and 
we have the need to support multiple variants (R1- R9) ; then it is inevitable 
to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a 
controlled manner as different use cases may exist for needing this variation 
in nan handling. Therefore IMO while we could avoid the public access to change 
these defaults; it is relevant to support these variations of nan handling on a 
per type and allow atleast sub classes to override if a rare need arises.
While the very name NaNStrategy reminds me of different ways to look at that; i 
feel we will be much restricted if we just said that we stick to one way of 
NaNHandling for all types. Please let me know your thoughts.

Next,
Regarding the PivotingStrategy; At first, i wanted to convey here that to have 
all the partitioning, pivoting and selection in separated classes/enums than 
inside main class. I have made it as static due to the fact that; it is more of 
a non-functional requirement and felt that it need not be set for every 
instance (more of a global setting that doesnt vary across types).
Please correct me here and let know if it still needs to be per instance.
I also made it package accessable/settable solely because medianOf3 method had 
been package level for the sole intent of possible overriding of the same 
within that package. Meaning; if some one really needed to tinker around 
pivoting they need a way to do it which i have provided it using a strategy.

Next,
In the current patch that i am going to attach as new dated patch (since you 
have already started looking at the old one ; which i would leave it as is). I 
find many utility type methods; replaceNaN, removeNaN( Predicated Lists ) and 
copyOf(values, begin.length) and as well as KthSelector with PivotingStrategy 
etc all of which can perhaps make its way to MathArrays and MathUtils. Please 
let me know.If so i will once again re-factor these changes and submit the 
patch.

Thanks for reading this through and for your time in reviewing . Please let me 
know your opinion on all of these.

thanks
venkat.

--------------------------------------------
On Wed, 18/6/14, Gilles <gil...@harfang.homelinux.org> wrote:

 Subject: Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support 
on variations in, percentile calculation
 To: dev@commons.apache.org
 Date: Wednesday, 18 June, 2014, 8:37 PM

 On Wed, 18 Jun 2014 16:39:12 +0200,
 Luc Maisonobe wrote:
 > Hi Gilles and Venkat,
 > 
 > Le 18/06/2014 15:40, Gilles a écrit :
 >> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe
 wrote:
 >>> Le 18/06/2014 14:32, Gilles a écrit :
 >>>> Hello Luc.
 >>>> 
 >>>>>> 
 >>>>>> See
 >>>>>>   https://issues.apache.org/jira/browse/MATH-1129
 >>>>>> 
 >>>>>> The problem reported was due to the
 sorting procedure not behaving
 >>>>>> correctly in the presence of NaN.
 >>>>>> This procedure could be replaced by
 an equivalent method from the JDK:
 >>>>>>   java.util.Arrays.sort(double[],int,int)
 >>>>>> 
 >>>>>> Any objection?
 >>>>> 
 >>>>> If you imply removing the select,
 medianOf3 and partition methods,
 >>>>> yes I
 >>>>> would object. If you imply replacing
 only the insertionSort method used
 >>>>> for small sub-arrays, then I almost
 agree.
 >>>> 
 >>>> Issue 1129 concerns the latter: See the
 comment in "insertionSort" in
 >>>> the
 >>>> current code.
 >>>> 
 >>>> However, for the former, you should really
 have a look at MATH-1120
 >>>>   https://issues.apache.org/jira/browse/MATH-1120
 >>>> The proposed patch indeed touches those
 methods.
 >>> 
 >>> So I think it is worth fixing MATH-1120 first,
 with the NaNStrategy and
 >>> go back to MATH-1129 afterwards, to see if it
 still applies of if
 >>> MATH-1120 also fixed it.
 >> 
 >> As per my last comment on MATH-1120, the
 "NaNStrategy" part of the patch
 >> is an extension of the initial feature request.
 > 
 > Sure, but it is a really good addition and seems fair
 to consider.
 > 
 >> 
 >> As per the Javadoc, the CM code was originally
 meant to handle (in some
 >> way), the presence of NaN values within the data.
 So I thought that this
 >> should be fixed before extending the API (which
 entailed additional
 >> questions as the proposed patch is fairly
 "massive").
 > 
 > Yes, the patch is a big one, and it has been works on
 for a while.
 > As it has quite stabilized by now, I think it would be
 the good time
 > to include it (with some editing I could do) and to
 start working with
 > separate patches applied on top of this one.
 > 
 > If we let it out of the code longer, it will become
 more and more
 > difficult for the contributor to work on it and to us
 to include it.
 > 
 >> 
 >> Especially, the new code is clearly untested for
 performance, and this
 >> seems to be an important argument by your previous
 post.
 > 
 > Yes, it is important as some issues were raised on it
 (if you remember
 > well, they were raised by one researcher from another
 lab working on the
 > same big project as you, just after the symposium when
 we met).
 > 
 >> 
 >> So, it is possible to add "isNaN" conditional, as I
 did in "insertionSort"
 >> and fix the current code without additional
 impact.
 > 
 > I really consider it as an ugly hack rather than
 attempting to really
 > fix the mess I did in this class.
 > 
 >> 
 >> Or, we redesign (as per MATH-1120) which implies
 forgetting about
 >> performance for now. The patch removes
 "insertionSort" altogether!
 > 
 > Well, it is just one line calling
 >   Arrays.sort(work, begin, end);
 > I can put the insertion sort back here while editing
 the patch before
 > committing it.
 > 
 >> IMHO, it is too many changes at once...
 > 
 > You know the popular saying about « premature
 optimization ». I think it
 > would be fair to temporarily forget about performance,
 fix the issue and
 > set up a new ground to work on based on MATH-1120
 patch, then improve
 > this by small corrections once the first huge patch has
 been committed,
 > and then look back at performances.

 That's fine with me. I was really waiting from someone else
 to apply
 this patch. ;-)

 Gilles

 > 
 > 
 >> 
 >> What is the meaning of percentile when NaNs are
 kept in the data
 >> (strategy "FIXED")?
 >> For all the other strategies, the NaNs will not
 cause a problem
 >> within the algorithms being discussed here (being
 replaced by +inf
 >> or -inf, or removed, or raising an exception).
 > 
 > This is one point we should really discuss, but it can
 be done later.
 > I agree with you, FIXED is not appropriate here. We
 could simply raise
 > an exception if it is selected as not being meaningful
 for this algorithm.
 > 
 >> 
 >> 
 >>> 
 >>> I will have a look at the proposed patch,
 including your comments.
 >> 
 >> Thanks.
 > 
 > So I did have a look and really think it is worth
 applying it. I agree
 > with your comments and will include your change for the
 NaNs recoding.
 > This change alone would solve MATH-1129 by the way.
 > 
 > I also am quite puzzled with the various default
 strategies and would
 > propose only one. I would even suggest to select
 immediately R8 instead
 > of our legacy method, as it would nevertheless be an
 improvement and
 > would not break existing code (only provide better
 results).
 > 
 > As there are several customization area (method,
 pivoting NaN strategy),
 > it would also be interesting to use the builder
 approach we started in
 > optimization. This would also allow us to pass an
 argument for the
 > RANDOM pivoting where the user could select a specific
 random generator
 > with specific seed, I really do not like having a
 random generator
 > blindly created with the automatic time dependent
 seed.
 > 
 > Another point I already changed is the medianOf3 method
 throwing an
 > exception in addition to being deprecated. It is really
 not friendly to
 > users. So I replaced the exception by a call to the
 appropriate enum
 > method (but of course let the deprecation in place).
 > 
 > I also changed the pivoting strategy to be an instance
 field rather than
 > a class field, just as all the other customization
 parameters.
 > 
 > best regards,
 > Luc
 > 
 >> 
 >> Gilles
 >> 
 >>>> [...]

 ---------------------------------------------------------------------
 To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
 For additional commands, e-mail: dev-h...@commons.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Reply via email to