Re: [math] use case for NaNStrategy.FIXED?

Phil Steitz Sun, 13 Jul 2014 09:18:06 -0700

On 7/13/14, 2:40 AM, Luc Maisonobe wrote:
> Hi all,
>
> Le 01/07/2014 08:52, Luc Maisonobe a écrit :
>> Le 30/06/2014 22:00, Phil Steitz a écrit :
>>> On 6/30/14, 12:33 PM, Luc Maisonobe wrote:
>>>> Le 30/06/2014 19:15, Phil Steitz a écrit :
>>>>>> On Jun 30, 2014, at 9:59 AM, Gilles <gil...@harfang.homelinux.org>
>>>>>> wrote:
>>>>>>
>>>>>> On Mon, 30 Jun 2014 09:08:00 -0700, Phil Steitz wrote:
>>>>>>>> On Jun 29, 2014, at 3:41 PM, Gilles
>>>>>>>> <gil...@harfang.homelinux.org> wrote:
>>>>>>>>
>>>>>>>>>> On Sun, 29 Jun 2014 14:39:51 -0700, Phil Steitz wrote:
>>>>>>>>>>> On 6/29/14, 2:30 PM, Gilles wrote:
>>>>>>>>>>>> On Sun, 29 Jun 2014 10:25:58 -0700, Phil Steitz wrote: 
>>>>>>>>>>>> On 6/29/14, 9:48 AM, venkatesha murthy wrote: On Sun,
>>>>>>>>>>>> Jun 29, 2014 at 1:25 AM, Phil Steitz 
>>>>>>>>>>>> <phil.ste...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 6/25/14, 12:12 AM, Luc Maisonobe wrote: Le
>>>>>>>>>>>>>> 25/06/2014 06:05, venkatesha murthy a écrit :
>>>>>>>>>>>>>>> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe 
>>>>>>>>>>>>>>> <l...@spaceroots.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi Venkat,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Le 23/06/2014 21:08, venkatesha murthy a écrit
>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc
>>>>>>>>>>>>>>>>> Maisonobe <l...@spaceroots.org>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> While looking further in Percentile class
>>>>>>>>>>>>>>>>>> for MATH-1120, I have found another problem
>>>>>>>>>>>>>>>>>> in the current implementation. 
>>>>>>>>>>>>>>>>>> NaNStrategy.FIXED
>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> leave the NaNs in place, but at the end of
>>>>>>>>>>>>>>>>>> the KthSelector.select method, a call to
>>>>>>>>>>>>>>>>>> Arrays.sort moves the NaNs to the end of 
>>>>>>>>>>>>>>>>>> the small sub-array. What is really
>>>>>>>>>>>>>>>>>> implemented here is therefore closer to 
>>>>>>>>>>>>>>>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED.
>>>>>>>>>>>>>>>>>> This always occur in the test cases because
>>>>>>>>>>>>>>>>>> they use very short arrays, and we 
>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>> switch to
>>>>>>>>>>>>>>>>>> this part of the select method.
>>>>>>>>>>>>>>>>> Are NaNs considered higher than +Inf ? If
>>>>>>>>>>>>>>>>> MAXIMAL represent replacing for +inf ; you
>>>>>>>>>>>>>>>>> need something to indicate beyond this for
>>>>>>>>>>>>>>>>> NaN.
>>>>>>>>>>>>>>>> Well, we can just keep the NaN themselves and
>>>>>>>>>>>>>>>> handled them appropriately, hoping not to trash
>>>>>>>>>>>>>>>> performances too much.
>>>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What is the test input you see an issue and
>>>>>>>>>>>>>>>>> what is the actual error you are seeing.
>>>>>>>>>>>>>>>>> Please share the test case.
>>>>>>>>>>>>>>>> Just look at
>>>>>>>>>>>>>>>> PercentileTest.testReplaceNanInRange(). The 
>>>>>>>>>>>>>>>> first check in the test corresponds to a
>>>>>>>>>>>>>>>> Percentile configuration at 50% percentile, and
>>>>>>>>>>>>>>>> NaNStrategy.FIXED. The array has an odd number
>>>>>>>>>>>>>>>> of entries, so the 50% percentile is exactly
>>>>>>>>>>>>>>>> one of the entries: the one at index 5 in the 
>>>>>>>>>>>>>>>> final array.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The initial ordering is { 0, 1, 2, 3, 4, NaN,
>>>>>>>>>>>>>>>> NaN, 5, 7, NaN, 8 }. So for the
>>>>>>>>>>>>>>>> NaNStrategy.FIXED setting, it should not be
>>>>>>>>>>>>>>>> modified at all in the selection algorithm and
>>>>>>>>>>>>>>>> the result for 50% should be the first NaN of
>>>>>>>>>>>>>>>> the array, at index 5. In fact, due to the
>>>>>>>>>>>>>>>> Arrays.sort, we *do* reorder the array into  {
>>>>>>>>>>>>>>>> 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so the
>>>>>>>>>>>>>>>> result is 5.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Agreed. just verified by putting back the
>>>>>>>>>>>>>>>> earlier insertionSort
>>>>>>>>>>>>> function.
>>>>>>>>>>>>>>>> If we use NaNStrategy.MAXIMAL and any quantile
>>>>>>>>>>>>>>>> above 67%, we get as a result
>>>>>>>>>>>>>>>> Double.POSITIVE_INFINITY instead of
>>>>>>>>>>>>>>>> Double.NaN.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If we agree to leave FIXED as unchanged
>>>>>>>>>>>>>>>> behaviour with your
>>>>>>>>>>>>> insertionSort
>>>>>>>>>>>>>>> code; then atleast MAXIMAL/MINIMAL should be
>>>>>>>>>>>>>>> allowed for transformation
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> NaN to +/-Inf
>>>>>>>>>>>>>> I'm fine with it, as long as we clearly state it in
>>>>>>>>>>>>>> the NaNStrategy documentation, saying somethig
>>>>>>>>>>>>>> like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When MAXIMAL/MINIMAL are used, the NaNs are
>>>>>>>>>>>>>> effecively *replaced* by +/- infinity, hence the
>>>>>>>>>>>>>> results will be computed as if infinities were 
>>>>>>>>>>>>>> there in the first place.
>>>>>>>>>>>>> -1 - this is mixing concerns.  NaNStrategy exists for
>>>>>>>>>>>>> one specific purpose - to turn extend the ordering on
>>>>>>>>>>>>> doubles a total ordering. Strategies prescribe only
>>>>>>>>>>>>> how NaNs are to be treated in the ordering.  Side
>>>>>>>>>>>>> effects on the input array don't make sense in this 
>>>>>>>>>>>>> context.  The use case for which this class was
>>>>>>>>>>>>> created was rank transformations.  Returning
>>>>>>>>>>>>> infinities in rank transformations would blow up
>>>>>>>>>>>>> computations in these classes.  If a floating point 
>>>>>>>>>>>>> computations need to transform NaNs, the specific
>>>>>>>>>>>>> stats / other classes that do this transformation
>>>>>>>>>>>>> should perform and document the behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Phil
>>>>>>>>>>>> OK. Agreed realized it later.  Hence i have not
>>>>>>>>>>>> touched NaNStrategy in my patch(MATH-1132) . Now i have
>>>>>>>>>>>> added a separate transformer to transform NaNs but it
>>>>>>>>>>>> uses NanStrategy.  Please let me know on this as this 
>>>>>>>>>>>> trasnformation itself can be used in different places.
>>>>>>>>>>> Honestly, I don't see the value of this.  I would be more
>>>>>>>>>>> favorable to an addition to MathArrays supporting NaN (or
>>>>>>>>>>> infinity) filtering / transformation.  Several years back
>>>>>>>>>>> we made the decision to just let NaNs propagate in
>>>>>>>>>>> computations, which basically means that if you feed NaNs
>>>>>>>>>>> to [math] you are going to get NaNs out in results. If we
>>>>>>>>>>> do get into the business of filtering NaNs (or infinities
>>>>>>>>>>> for that matter), I think it is probably best to do that
>>>>>>>>>>> in MathArrays or somesuch - i.e., don't decorate APIs
>>>>>>>>>>> with NaN or infinity-handling handling descriptions.
>>>>>>>>>>> That would be a huge task and would likely end up
>>>>>>>>>>> bleeding implementation detail into the APIs.  As I said
>>>>>>>>>>> above, NaNStrategy has to exist for rank transformations
>>>>>>>>>>> to be well-defined.  NaN-handling in floating point 
>>>>>>>>>>> computations is defined and as long as we fully document
>>>>>>>>>>> what our algorithms do, I think it is fair enough to "let
>>>>>>>>>>> the NaNs fall where they may" - which basically means
>>>>>>>>>>> users need to do the filtering / transformation
>>>>>>>>>>> themselves.
>>>>>>>>>> So, do I understand correctly that it's OK to add the
>>>>>>>>>> "remove" and "replace" utility methods (cf. MATH-1130) in
>>>>>>>>>> "MathArrays"? Or does this thread conclude that we don't
>>>>>>>>>> have a use for this within Commons Math?
>>>>>>>>> You raise a good point here.  Personally, I don't see an
>>>>>>>>> internal use and if we are sticking with MathArrays including
>>>>>>>>> only things we have internal use for, I would say no, don't
>>>>>>>>> include them.
>>>>>>>> A third way to look at it would be that since some algorithms
>>>>>>>> require being passed "clean" data (e.g. without NaNs), we could
>>>>>>>> provide some simple means to do so. A user could then write,
>>>>>>>> e.g. --- double[] data = { 1, Double.NaN, 3, 6, 5 };
>>>>>>>>
>>>>>>>> Percentile p = new Percentile(); double q =
>>>>>>>> p.evaluate(MathArrays.replace(data, Double.NaN,
>>>>>>>> Double.POSITIVE_INFINITY), 0.1); ---
>>>>>>>>
>>>>>>>> Then, if we provide that utility, is it still necessary to
>>>>>>>> complicate the Percentile code with a NaNStrategy parameter?
>>>>>>> I may be missing something, but it seems to me that percentile
>>>>>>> does need a NaNStrategy in the presence if NaNs like the other
>>>>>>> order statistics do (basically to define their position In the
>>>>>>> ordering).  It doesn't logically need more than that.  I would
>>>>>>> be ok defaulting the strategy or even returning NaN in the
>>>>>>> presence of NaNs (or whenever interpolation involves a NaN). All
>>>>>>> of this can be documented.
>>>>>> I am _surely_ missing something because I don't figure out how NaNs
>>>>>> can be useful when computing those statistics... :-}
>>>>> Good point.  Basically the different strategies allow you to
>>>>> effectively ignore or work around them so stats can be well-defined.
>>>>> Without NaNStrategy, order statistics including NaNs are undefined.
>>>> The problem with Percentile is that the elements are both reordered
>>>> *and* used when interpolating between two values. Tor the rank methods,
>>>> only reordering was needed so the various NaNStrategies could be
>>>> implemented by replacing at will with +/- infinity and a classical sort
>>>> would do the ordering for us.
>>>>
>>>> Here, we do interpolate on the values, so if we first replace NaN with
>>>> +infinity to put them at the end, we may interpolate between a finite
>>>> value and +inf and have +inf as a result, whereas if the NaN "value" was
>>>> preserved and some magic used to sort the NaN, we would end up
>>>> interpolating with a NaN and get a NaN as a result. Setting up the magic
>>>> for MINIMAL and MAXIMAL seems achievable withing the sort, as we already
>>>> have a specific sort (in fact not a sort but a selection algorithm,
>>>> which is the core of the class). Setting up the magic for FIXED seems
>>>> more difficult.
>>>>
>>>> So I guess implementing NaNStrategies may be done in different ways
>>>> depending on how they will be used by the calling algorithm. I don't yet
>>>> know if we can set up a general method to be used for all statistics.
>>>> Looking only at Percentile, replacement with +/- infinity already seems
>>>> to not be an appropriate solution.
>>> I don't think we should try to redefine algorithms so they can
>>> handle NaNs.  When you do floating point computations with NaNs, you
>>> get NaNs.  That should be understood to be the contract.  For rank
>>> statistics, such as Percentile, we need to extend the ordering on
>>> doubles to include them in ordering for the algorithm to be
>>> well-defined.  If we subsequently do floating point computations
>>> with the NaNs, the result should be NaN.  I think we should either
>>> just simplify things and say we *always* return NaN whenever NaNs
>>> are present for Percentile, or do the following:
>>>
>>> 0) use the prescribed NaN strategy to determine the ordering - so
>>> NaNs end up either at the bottom, at the top or in their original
>>> position (FIXED).
>>> 1) if interpolation involves a NaN element, the result returned is NaN.
>> This is exactly what I say. If NaNs are in, NaNs should be out. As for
>> now we do replace them, we get infinities instead of NaNs. I would
>> prefer getting NaNs there as it is more consistent with the purpose of
>> NaNs in floating point encoding.
>>
>> So I think we agree: NaNs should not be changed, but they should be
>> reordered. Perhaps we could have something simple like NaNStrategy
>> having a method:
>>
>>   boolean shouldSwap(int i1, double d1, int i2, double i2)
>>
>> that would use both the index and values of the element to check if they
>> should be swapped or not. This method would be used both for traditional
>> sorting for ranking and in selection algorithms for percentile. I first
>> thought we should simply have NaN implement Comparator<Double> but this
>> would not work for FIXED, whereas the above method would work (simply
>> returning false as soon as there is a NaN, regardless of it being at a
>> lower or upper index).
>>
>> I will take a look at this option.
> This does not work as is. The problem is lack of transitivity. Suppose
> that we have three indices i < j < k, with a[i] > a[k] and a[j] = NaN.
>
> Then if during the sort/selection we compare elements at indices i and k
> first, we will swap them, which is correct behaviour regardless of the
> NaN strategy used. If we compare elemntes at indices i and j first and
> then elements at indices j and k and we use FIXED NaNStrategy, then
> we won't swap the two elements and the middle NaN would induced a
> non-globally sorted array. I think FIXED should act as a river flowing
> over rocks : the water (non-NaN elements) would move freely around and
> be sorted, and the rocks (NaN elements) would stay were they are but do
> not prevent water flow.


That is what FIXED is supposed to mean.  If the original array is
x_0, ..., x_n and x_i is NaN, then after applying the FIXED
NaNStrategy to sort the array, x_i will still be NaN.  If this is
inconvenient to support for Percentile, I would be fine with making
selection of this NaNStrategy illegal - i.e, throwing IAE if it is
specified.  As I said before, I would also be OK with making the
presence of NaN values in the input trigger (documented) IAE.

Phil
>
> I have found another way to handle this, using an indirection array that
> could "jump" over NaNs for FIXED strategy or push them to either end for
> MINIMAL or MAXIMAL strategy.
>
> I will try this in the next few days, as time permit.
>
> best regards,
> Luc
>
>> best regards,
>> Luc
>>
>>> Phil
>>>> best regards,
>>>> Luc
>>>>
>>>>> Phil
>>>>>
>>>>>>> I still don't see a within-math use for recoding methods; though
>>>>>>> I do see the value from a user perspective.  The problem is if we
>>>>>>> open this door, we risk MathArrays becoming a mini commons lang.
>>>>>> If Commons Lang already provides a feature, then we indeed don't
>>>>>> need to offer the same within CM (if there is no internal use). Can
>>>>>> someone confirm?
>>>>>>
>>>>>>
>>>>>> Otherwise we could define a new inner interface in MathArrays,
>>>>>> similar to "MathArrays.Function": --- public class MathArrays {
>>>>>>
>>>>>> // ...
>>>>>>
>>>>>> public interface Transformer { /** Transform the whole array */ 
>>>>>> double[] transform(double[] input); /** Transform the array in the
>>>>>> interval [from, to) and return the whole array */ double[]
>>>>>> transform(double[] input, int from, int to); /** Transform the
>>>>>> array in the interval [from, to) and return the transformed
>>>>>> sub-array */ double[] transformAndSplice(double[] input, int from,
>>>>>> int to); } } ---
>>>>>>
>>>>>> Then we can provide "replace" and "remove" transformers through
>>>>>> factory methods (example/untested code): --- public class
>>>>>> MathArrays {
>>>>>>
>>>>>> // ...
>>>>>>
>>>>>> public static Transformer createReplaceTransformer(final double
>>>>>> old, final double replace) { return new Transformer() { public
>>>>>> double[] transform(double[] input) { final int len = input.length; 
>>>>>> final double[] output = new double[len]; for (int i = 0; i < len;
>>>>>> i++) { output[i] = Precision.equalsIncludingNaN(input[i], old) ? 
>>>>>> replace : input[i]; } return output; }
>>>>>>
>>>>>> // etc. }; } } ---
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>
>>>>>
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] use case for NaNStrategy.FIXED?

Reply via email to