[ https://issues.apache.org/jira/browse/NUMBERS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855978#comment-17855978 ]
Alex Herbert commented on NUMBERS-206: -------------------------------------- h1. Performance on Different Data Types I copied the current double select implementation for other primitive types that cannot be easily sorted using a histogram (i.e. char, short and byte). These were tested using the BM data on length 50000: !SelectDataType.png|width=800,height=600! There is very little difference between the float and double types; float is marginally faster. The integer types long and int are noticeably faster (around 20-25%). This difference is approximately the same when selection is repeated using uniform random data, for example: ||k||double||float||long||int|| |100|11395639641|11247566058|8978986808|8784180316| |Relative| 1.0|0.987|0.787|0.771| h2. Use Cases Note that using the double implementation will be able to provide the same result for all other datatypes if they are converted to a double, with the exception of any long data that exceeds 53-bits of precision. The downsides are: a (small) performance loss; and an increase in memory consumption. Memory overhead will vary based on whether the original data is changed to be stored as a double, or copied to a double array for selection. Before adding support for other data types, the use cases should be considered. Currently selection is used in the Statistics package for quantile estimation of data. Typical numeric analysis would use a double data value. The Statistics descriptive package also supports the other primitive types used in Java streams (long and int). If numeric analysis is being performed on discrete data then the input could be a long or an int. Note the Statistics distributions package implements discrete distributions using the int primitive data type. The use of int and smaller primitive types for numeric data is typical for images (e.g. 16-bit or 8-bit image data). The use of a long data type is unnecessary except for extremely large counts. This may occur for example when analysing file sizes for millions of files. However a file of size 53-bits would be 8+ petabytes. This is an unrealistic size for a single file; a large datastore may exceed this size as a total. Since selection would be used on single elements then partial sorting of long array data of file sizes could be performed using a double. Due to the improved performance over a double, and possible application to various discrete data it may be useful to add selection support for int[] data (and a matching Quantile implementation). > Selection API > ------------- > > Key: NUMBERS-206 > URL: https://issues.apache.org/jira/browse/NUMBERS-206 > Project: Commons Numbers > Issue Type: New Feature > Components: arrays > Reporter: Alex Herbert > Priority: Major > Fix For: 1.2 > > Attachments: SelectDataType.png > > > Create a selection API to select the k-th largest element in an array. This > places at k the same value that would be at k in a fully sorted array. > {code:java} > public final class Selection { > public static void select(double[] a, int k); > public static void select(double[] a, int from, int to, int k); > public static void select(double[] a, int[] k); > public static void select(double[] a, int from, int to, int[] k); > // Extend to other primitive data types that are not easily sorted (e.g. > long, float, int) > {code} > Note: This API will support multiple points (int[] k) for use in quantile > estimation of array data by interpolation of neighbouring values (see > STATISTICS-85). -- This message was sent by Atlassian Jira (v8.20.10#820010)