[jira] [Commented] (NUMBERS-206) Selection API

Alex Herbert (Jira) Tue, 18 Jun 2024 08:41:09 -0700


    [ 
https://issues.apache.org/jira/browse/NUMBERS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855978#comment-17855978
 ]


Alex Herbert commented on NUMBERS-206:
--------------------------------------

h1. Performance on Different Data Types

I copied the current double select implementation for other primitive types 
that cannot be easily sorted using a histogram (i.e. char, short and byte). 
These were tested using the BM data on length 50000:

!SelectDataType.png|width=800,height=600!

There is very little difference between the float and double types; float is 
marginally faster. The integer types long and int are noticeably faster (around 
20-25%).

This difference is approximately the same when selection is repeated using 
uniform random data, for example:
||k||double||float||long||int||
|100|11395639641|11247566058|8978986808|8784180316|
|Relative| 1.0|0.987|0.787|0.771|
h2. Use Cases

Note that using the double implementation will be able to provide the same 
result for all other datatypes if they are converted to a double, with the 
exception of any long data that exceeds 53-bits of precision. The downsides 
are: a (small) performance loss; and an increase in memory consumption. Memory 
overhead will vary based on whether the original data is changed to be stored 
as a double, or copied to a double array for selection.

Before adding support for other data types, the use cases should be considered. 
Currently selection is used in the Statistics package for quantile estimation 
of data. Typical numeric analysis would use a double data value. The Statistics 
descriptive package also supports the other primitive types used in Java 
streams (long and int). If numeric analysis is being performed on discrete data 
then the input could be a long or an int. Note the Statistics distributions 
package implements discrete distributions using the int primitive data type. 
The use of int and smaller primitive types for numeric data is typical for 
images (e.g. 16-bit or 8-bit image data). The use of a long data type is 
unnecessary except for extremely large counts. This may occur for example when 
analysing file sizes for millions of files. However a file of size 53-bits 
would be 8+ petabytes. This is an unrealistic size for a single file; a large 
datastore may exceed this size as a total. Since selection would be used on 
single elements then partial sorting of long array data of file sizes could be 
performed using a double.

Due to the improved performance over a double, and possible application to 
various discrete data it may be useful to add selection support for int[] data 
(and a matching Quantile implementation).

 

> Selection API
> -------------
>
>                 Key: NUMBERS-206
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-206
>             Project: Commons Numbers
>          Issue Type: New Feature
>          Components: arrays
>            Reporter: Alex Herbert
>            Priority: Major
>             Fix For: 1.2
>
>         Attachments: SelectDataType.png
>
>
> Create a selection API to select the k-th largest element in an array. This 
> places at k the same value that would be at k in a fully sorted array.
> {code:java}
> public final class Selection {
>     public static void select(double[] a, int k);
>     public static void select(double[] a, int from, int to, int k);
>     public static void select(double[] a, int[] k);
>     public static void select(double[] a, int from, int to, int[] k);
>     // Extend to other primitive data types that are not easily sorted (e.g. 
> long, float, int)
> {code}
> Note: This API will support multiple points (int[] k) for use in quantile 
> estimation of array data by interpolation of neighbouring values (see 
> STATISTICS-85).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NUMBERS-206) Selection API

Reply via email to