[ 
https://issues.apache.org/jira/browse/RNG-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925254#comment-16925254
 ] 

Alex D Herbert commented on RNG-114:
------------------------------------

I've built a benchmark to investigate the shuffle of a List.

The benchmarking is done using an ArrayList and LinkedList. In the JDK the 
ArrayList is marked with an interface {{java.util.RandomAccess}} that states 
that the list supports fast (generally constant time) random access. This 
allows the JDK to handle lists using appropriate algorithms.

Here is the existing code from ListSampler against 
{{java.util.Collections.shuffle}}:
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|Collections|41.43|92.26| | |
|10|PermutationSampler|67.97|104.04|1.64|1.13|
|100|Collections|437.58|762.51| | |
|100|PermutationSampler|476.92|2152.72|1.09|2.82|
|1000|Collections|4280.79|8275.67| | |
|1000|PermutationSampler|4631.31|331530.26|1.08|40.06|
|10000|Collections|42823.80|112289.16| | |
|10000|PermutationSampler|52537.36|38748538.37|1.23|345.08|

The existing code is a bit slower for ArrayLists and, as expected, the existing 
code does not handle LinkedList.

The current code in ListSampler creates an array of indices, shuffles them 
using the PermutationSampler, then uses those indices with a copy of the array 
to rewrite the original array. In contrast Collections shuffles a RandomAccess 
list in place, or extracts an array, shuffle that in-place and then rewrites to 
the original array using a list iterator.

An update of the algorithm has two options:
 # PermutationSamplerRandomAccess: Continue to delegate the shuffle to the 
PermuationSampler. However detect if the list is RandomAccess and then rewrite 
to the original array using either direct set of each index or via the list 
iterator.
 # DirectRandomAccess: Copy the method of Collections and shuffle direct if 
possible

The method to delegate to PermutationSampler currently duplicates the input 
List and also creates a list of integers of the same size for a shuffle. In 
contrast the method of Collections has no memory overhead for DirectAccess 
lists and only the overhead of a list duplication for non DirectAccess lists.
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|Collections|41.43|92.26| | |
|10|DirectRandomAccess|61.99|90.88|1.50|0.99|
|10|PermutationSamplerRandomAccess|67.14|101.13|1.62|1.10|
|100|Collections|437.58|762.51| | |
|100|DirectRandomAccess|406.57|722.86|0.93|0.95|
|100|PermutationSamplerRandomAccess|459.10|703.42|1.05|0.92|
|1000|Collections|4280.79|8275.67| | |
|1000|DirectRandomAccess|4309.85|8322.40|1.01|1.01|
|1000|PermutationSamplerRandomAccess|4849.94|8929.17|1.13|1.08|
|10000|Collections|42823.80|112289.16| | |
|10000|DirectRandomAccess|42645.86|109335.42|1.00|0.97|
|10000|PermutationSamplerRandomAccess|50631.56|113643.49|1.18|1.01|

Here the modification of the original PermutationSampler has removed the 
quadratic time behaviour for LinkedLists. This method remains a bit slower for 
ArrayLists.

However these results show that for a full list shuffle using the method from 
Collections is faster (and is also more memory efficient).
h2. Extension to directional shuffle

The ListSampler also supports a partial shuffle of a list from a start point 
towards the head or tail of the list. This is where it is appropriate to 
delegate to the PermutationSampler as the shuffle handles this. The results can 
be copied back to the original list from the appropriate range that was 
shuffled.

An update of the algorithm has three options:
 # PermutationSamplerRandomAccess: Continue to delegate the shuffle to the 
PermuationSampler. However detect if the list is RandomAccess and then rewrite 
to the original array using either direct set of each index or via the list 
iterator. The update of the list should detect the range to target in the 
update (the present method does not do this an updates the entire list).
 # DirectRandomAccessDirectional: Copy the method of Collections and shuffle 
direct if possible. Modify the shuffle to do a directional shuffle and then 
update only the target range in the original list.
 # DirectRandomAccessSublist: Exploit the {{List.subList(int, int)}} method to 
extract a sublist from the input list and then shuffle using the direct method 
adapted from Collections.

Note for the use of a sublist:
 - ArrayLists return a list that is marked with DirectAccess.
 - LinkedLists that are converted to arrays will only require extra memory for 
the part of the list that is shuffled.

Here are the results for a shuffle of an array twice. Once from the middle to 
the head and once from the middle to the tail:
||Size||Method||Array||Linked||Relative Array||Relative Linked||
|10|DirectRandomAccessSublist|46.16|186.48|1.00|1.00|
|10|DirectRandomAccessDirectional|41.99|146.28|0.91|0.78|
|10|PermutationSamplerRandomAccess|105.35|193.64|2.28|1.04|
|100|DirectRandomAccessSublist|500.83|1021.95|1.00|1.00|
|100|DirectRandomAccessDirectional|480.11|1125.24|0.96|1.10|
|100|PermutationSamplerRandomAccess|589.58|1111.30|1.18|1.09|
|1000|DirectRandomAccessSublist|4734.42|10101.32|1.00|1.00|
|1000|DirectRandomAccessDirectional|4676.64|11379.03|0.99|1.13|
|1000|PermutationSamplerRandomAccess|6038.58|13147.48|1.28|1.30|
|10000|DirectRandomAccessSublist|47020.68|155079.67|1.00|1.00|
|10000|DirectRandomAccessDirectional|46537.58|179918.52|0.99|1.16|
|10000|PermutationSamplerRandomAccess|68735.10|171949.83|1.46|1.11|

The relative speed is shown against the sublist method. This overall is the 
fastest method across the range of array sizes and types. The direct partial 
shuffle of an ArrayList is faster for small lists. This may be due to the 
removal of the list iterator which adds an offset to all get and set accessor 
methods. When the list is larger this effect is less noticeable as speed may be 
dominated by memory access overhead to the larger array.

These results suggest altering the ListSampler to shuffle using a method 
similar to the method in Collections. Support for the partial shuffle of lists 
can be provided using a sublist.
h2. Optimisation

Most of the algorithms in Collections detect DirectAccess lists and switch 
algorithms. When the non DirectAccess list is small the algorithm for the 
direct access list is faster than the alternative. For shuffle the LinkedList 
can use the direct method when the size is less than 5. The Collections source 
code state that these sizes should be regularly revisited. I have tested lists 
of size 3-8. The method to shuffle an array in place is faster for lists under 
size 4:

!LinkedListShufflePerformance.png!

I have uploaded the benchmark and the modified ListSampler to use a shuffle 
based on Collections in [PR 66|https://github.com/apache/commons-rng/pull/66].

> Improve ListSampler shuffle algorithm to detect instances of RandomAccess
> -------------------------------------------------------------------------
>
>                 Key: RNG-114
>                 URL: https://issues.apache.org/jira/browse/RNG-114
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.2
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>         Attachments: LinkedListShufflePerformance.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The ListSampler shuffle algorithm uses the input list {{set(index)}} method. 
> If this runs in constant time then the algorithm is fast. If the list is not 
> an instance of {{java.util.RandomAccess}} such as a LinkedList the algorithm 
> will suffer from performance slowdown for large lists.
> This is handled in the JDK Collections.shuffle method by detecting if the 
> list is not an instance of RandomAccess. If so the list {{set(index)}} method 
> is avoided and the list is traversed for update using its iterator.
> Tasks:
>  * Benchmark the ListSampler method verses Collections.shuffle with ArrayList 
> and LinkedList
>  * Update the ListSampler to have comparable performance to 
> Collections.shuffle



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to