[jira] [Commented] (STATISTICS-63) Port o.a.c.math.stat.ranking to a commons-statistics-ranking module

Alex Herbert (Jira) Tue, 06 Dec 2022 08:22:11 -0800


    [ 
https://issues.apache.org/jira/browse/STATISTICS-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643943#comment-17643943
 ]


Alex Herbert commented on STATISTICS-63:
----------------------------------------

I have ported the math4 ranking package to a new statistics ranking module. 
This only has one class NaturalRanking and 3 interfaces to define functionality.
h2. Conversion

I have made some changes to the code during the conversion.

Bug Fix: Support ranking no data. The CM version would throw a runtime 
exception for an array of length 0.

Changed the RankingAlgorithm interface to implement DoubleUnaryOperator . This 
changes the method name from {{rank}} to {{{}apply{}}}. It allows chaining 
functions using the default method inherited from java.util.function.Function.

The RANDOM resolution of ties requires an optional source of randomness.

I changed the public API to not use UniformRandomProvider. A source of 
randomness can be provided using LongSupplier for randomness. If not provided 
then it will default to SplittableRandom. This minimised the public API of the 
module to only require interface provided by the JDK. You can create a 
NaturalRanking which uses a RNG of choice using a method reference:
{code:java}
NaturalRanking nr = new NaturalRanking(
    new SplittableRandom()::nextLong);{code}
I removed all the checked Math exceptions and replaced with 
IllegalArgumentException.

Changed the RANDOM tie resolution to create unique ranks. The previous 
implementation created random ranks in a range and may have created more ties 
in the output ranking. The RANDOM method is now similar to SEQUENTIAL but the 
order for ties is shuffled.

All rankings are now stable. Repeat input of a ranking array into the
same algorithm will not change it.

Updated the test suite. Existing tests were refactored and extras added.

Added test that the RANDOM strategy is actually random.

Fixed the test for randomness that was commented out (see also MATH-1361).

Optimise code:

Use a specialized primitive IntList to avoid List<Integer>.

Only build up the list of ties when at least one tie is found. For no ties then
no list is accumulated. Drop repeat use of new ArrayList<>() for list.clear().

Filter data only once for NaNs. This removed the requirement to change NaNs 
later to inf, throw exceptions or track points to ignore.

Noted that NaN will naturally be put at the end by the sort. The algorithm 
counts the NaNs and does not process them. Thus it does not have to maintain an 
array holding all the positions of the NaNs in the original data.
h2. Benchmark

I have created a benchmark for the NaturalRank implementation. It creates data 
of a given length with a fraction of the total length being assigned as ties. 
The number of tied-regions can be specified. The tied region is cut up into n 
random lengths using a Dirichlet distribution (see[ String cutting 
(wikipedia)|https://en.wikipedia.org/wiki/Dirichlet_distribution#String_cutting]).
 The data does not contain NaNs.

A test has been added to check that data created by the benchmark is ranked 
exactly the same by all ranking implementations.

For a baseline the input data is copied into an array of (index, value) pairs 
and sorted using the value. The rank order is assigned back to the index 
corresponding to the input. This method does not resolve ties.
||length||ranking||tieFraction||ties||Score||Relative||
|10000|baseline|0|20|1435940.0| |
|10000|baseline|0.1|20|1405690.6| |
|10000|baseline|0.5|20|1302646.1| |
|10000|math3|0|20|1612933.3|1.12|
|10000|math3|0.1|20|1594389.8|1.13|
|10000|math3|0.5|20|1393053.9|1.07|
|10000|statistics|0|20|1423274.7|0.99|
|10000|statistics|0.1|20|1446174.6|1.03|
|10000|statistics|0.5|20|1293842.5|0.99|

 Notes:
 * Arrays.sort is the bulk of the time for ranking
 * Having a lot of ties improve sorting speed (i.e. the JDK sort implementation 
can recognise sequences of tied data).
 * The statistics implementation is close to the same speed as the baseline.
 * The statistics implementation is faster than math3.

 

> Port o.a.c.math.stat.ranking to a commons-statistics-ranking module
> -------------------------------------------------------------------
>
>                 Key: STATISTICS-63
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-63
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: ranking
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Major
>
> The o.a.c.math4.legacy.stat.ranking package contains:
> {noformat}
> NaNStrategy.java
> NaturalRanking.java
> RankingAlgorithm.java
> TiesStrategy.java{noformat}
> There are no dependencies on other math packages.
> The TiesStrategy enum contains a RANDOM option:
> {noformat}
> "Ties get a random integral value from among applicable ranks."{noformat}
> I would suggest this is changed to
> {noformat}
> "Ties get a randomly assigned unique value from among applicable 
> ranks."{noformat}
> This is a minor change. But it allows ties to always be distinguished, which 
> seems to be the purpose of a tie strategy. The current implementation in math 
> just picks a random number and so ties can be resolved by assigning the same 
> rank to multiple points (thus not resolving anything).
> For example:
> {noformat}
> [0, 1, 1, 1, 2]{noformat}
> Can have an output of:
> {noformat}
> [0, 1, 2, 3, 4]
> [0, 1, 1, 1, 4]
> [0, 3, 3, 3, 4]
> etc{noformat}
> The suggested change would enumerate the ranks for the ties and then shuffle 
> them. All ranks would then be unique:
> {noformat}
> [0, 1, 2, 3, 4]
> [0, 1, 3, 2, 4]
> [0, 3, 2, 1, 4]
> etc{noformat}
> A second issue with the ranking package is it brings in a dependency on 
> UniformRandomProvider. If you do not supply one then an instance is created 
> (which may not be needed).
> Given that we now support JDK 8 I suggest the default uses an instance of 
> {{{}SplittableRandom{}}}. The user can override this by supplying a source of 
> random bits as a {{{}LongSupplier{}}}. This can be used as a source of 
> randomness for UniformRandomProvider from RNG. This is a functional interface 
> and using the long bits it can create random rank indices as required. The 
> package then does not expose non-JDK interfaces in its public API.
> Currently the NaturalRanking class has 6 constructors to set combinations for 
> the three properties: TiesStrategy; NaNStragtegy; and source of randomness. 
> Current API:
> {noformat}
> public NaturalRanking()
> public NaturalRanking(TiesStrategy)
> public NaturalRanking(NaNStrategy)
> public NaturalRanking(NaNStrategy, TiesStrategy)
> public NaturalRanking(UniformRandomProvider)
> public NaturalRanking(NaNStrategy, UniformRandomProvider){noformat}
> The constructors that accept a TiesStrategy create a generator even though 
> the TiesStrategy may not require it (i.e. is not RANDOM). The generator 
> should be created on demand when ties occur in the data.
> Note: The set of constructors could be changed to a builder pattern. This 
> would add builder object creation overhead for any new strategy. It also does 
> not allow implicit setting of the TiesStrategy to RANDOM if a constructor 
> with a source of randomness is used. An initial port can maintain the current 
> 6 constructors. It can be changed before the first release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (STATISTICS-63) Port o.a.c.math.stat.ranking to a commons-statistics-ranking module

Reply via email to