Re: [math] speeding up percentile based statistics

Mikkel Meyer Andersen Sun, 26 Sep 2010 09:55:53 -0700
I don't see any season for computing hash value if pointers don't match.
Den 2010 9 26 18:52 skrev "Luc Maisonobe" <luc.maison...@free.fr>:
> Le 26/09/2010 18:21, Ted Dunning a écrit :
>> I was about to write this suggestion down.
>>
>> Instead, I will simply concur.
>>
>> +1
>>
>> The safe version can do the checksum or make a copy. Either is safe and
the
>> costs are similar (O(n) on every call and no memory or O(1) on every call
>> after the first, O(n) memory).
>
> OK. Then I will have a copy of the data in the class and for each
> evaluate method with a double[] values parameter there will also be a
> method without this parameter that will run on the already provided
> array. The implementations with the value parameter will have an
> implementation roughly similar to:
>
> public double evaluate(final double[] values, final double p) {
> checkCache(values);
> return evaluate(p);
> }
>
> private void checkCache(final double[] values) {
> final int hash = Arrays.hashcode(values);
> if ((values != originalValues) || (hash != originalHash)) {
> clearCache();
> originalValues = values;
> originalHash = hash;
> copiedValues = values.clone();
> }
> }
>
> Starting to work on it ...
>
> Luc
>
>>
>> I would tend toward the copy option.
>>
>> On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen <m...@mikl.dk>
wrote:
>>
>>> Hi,
>>>
>>> Why not simply make 3) the default, but also supply a
>>> *IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we
>>> might as well do it instead of not doing any check at all). This means
>>> that users who read the documentation can gain something when they
>>> explicitly ask for it, while users who don't read the docs are still
>>> safe.
>>>
>>> Cheers, Mikkel.
>>>
>>> 2010/9/26 Luc Maisonobe <luc.maison...@free.fr>:
>>>> Le 23/09/2010 20:57, Luc Maisonobe a écrit :
>>>>> Hi,
>>>>>
>>>>> Resuming this thread, as I will be soon able to work on this (I hope).
>>>>> I propose to start by solving
>>>>> <https://issues.apache.org/jira/browse/MATH-417> using a simple
>>>>> partition-based selection algorithm and preserving pivot information.
>>>>> This would help both the single call use-case (we don't sort
everything)
>>>>> and the multiple call case (we don't redo on subsequent calls work
>>>>> already done in the first calls.
>>>>
>>>> The current API of the Percentile method is derived from
>>>> AbstractUnivariateStatistic and provides several evaluate methods that
>>>> have the data array as a parameter.
>>>>
>>>> This is compatible with the one-call use case without any semantic
>>>> change, but does not fit the multiple-calls use case seamlessly.
>>>>
>>>> If we want to cache some work already done on the array between calls
>>>> with different values of p (the percentile value), we have to check if
>>>> the array is still the same as the cached one on which we have done
some
>>>> partitioning and saved some pivots. This can be done by storing a
>>>> reference to the original array in addition to a copy of its content
>>>> (which we reorganize) and with an "if (values == originalValues)"
>>>> statement. This would however not protect against changes in the array
>>>> elements themselves done by the user between calls. A better protection
>>>> could be to compute a hash code of the array content and compare it to
>>>> the original one at each call. However, this would add a linear cost.
>>>> The user who encountered the performances problems added an addValues
>>>> method to set up the array once and simply ignored the values passed as
>>>> an argument to the evaluate methods. This breaks the semantics of the
>>> API.
>>>>
>>>> So I see four choices about partitioning caching and would like to have
>>>> people express which one they prefer. A clearCache or similar public
>>>> method would be made available to users to ensure clean computation if
>>>> the caching mechanism gets on their way.
>>>>
>>>> 1) do nothing to check the array is the same between calls and blindly
>>>> assumes it IS the same. Users would really need to call clearCache
>>>> when they provide a new array
>>>> pros: very simple
>>>> cons: error-prone for the user as it relies only on reading and
>>>> understanding a documentation that would change with new version
>>>>
>>>> 2) check only array reference equality (with ==) to check the array
>>>> is the same as the previous call or not. Users would need to call
>>>> clearCache only if they use the same array but have changed its
>>>> content.
>>>> pros: trade-off between efficiency and reliability,
>>>> handles most cases properly
>>>> cons: may be wrong in corner cases
>>>>
>>>> 3) check array content using an hash code. Users would generally don't
>>>> need to call clearCache at all so it would not be provided
>>>> pros: works in all cases
>>>> cons: add linear cost for array checking
>>>>
>>>> 4) remove the double[] values parameter from the API and use a separate
>>>> addValues method, hence the class will reset its cache only when
>>>> addValues is called
>>>> pros: works in all cases
>>>> cons: Percentile could not implement UnivariateStatistic anymore
>>>>
>>>> My preference is choice 2.
>>>>
>>>> What do you think ?
>>>>
>>>> Luc
>>>>
>>>>>
>>>>> Partition-based algorithms are O(n) in expected time but not in worst
>>>>> case time. If end users ask for it later, we could improve again by
>>>>> adding the Musser's introselect that guarantees O(n) behavior even in
>>>>> worst case.
>>>>>
>>>>> Another possibility would be to have an additional method that would
>>>>> compute percentiles for several thresholds at once, but this would
>>>>> require returning an array of results and I think it would be
difficult
>>>>> to mix with other statistics (mainly the DescriptiveStatistics.apply
>>>>> method, which is used by the users I talked to).
>>>>>
>>>>> Would this meet everyone concerns ?
>>>>> Luc
>>>>>
>>>>>
>>>>> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
>>>>>> Ted,
>>>>>>
>>>>>>> O(n), not n
>>>>>>>
>>>>>>> Expected case is n + n/2 + n/4 ... < 2n
>>>>>>
>>>>>> Yes, that I am already more incline to buy.
>>>>>>
>>>>>> Dim.
>>>>>>
>>>
----------------------------------------------------------------------------
>>>>>>
>>>>>> Dimitri Pourbaix *
>>>>>> Institut d'Astronomie et d'Astrophysique * Don't worry, be happy
>>>>>> CP 226, office 2.N4.211, building NO * and CARPE DIEM.
>>>>>> Universite Libre de Bruxelles *
>>>>>> Boulevard du Triomphe * Tel : +32-2-650.35.71
>>>>>> B-1050 Bruxelles * Fax : +32-2-650.42.26
>>>>>> http://sb9.astro.ulb.ac.be/~pourbaix * mailto:
>>> pourb...@astro.ulb.ac.be
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
Re: [math] speeding up percentile based statistics

Reply via email to