Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-25 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

Well explained, Vladimir.

I propose to adopt max_heap / 128 as the limit, at minimum, to stay below a 1%
footprint.

The factors are then 7 for int and 8 for long, at minimum.

For a 1 GB heap, it gives 8 MB, so 2M ints or 1M longs.
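A minimal sketch of the proposed cap with the numbers above (class and variable names are hypothetical, not the PR's code):

```java
// Hypothetical sketch: cap the sort buffer at max_heap / 128 bytes (< 1% of
// the heap), then derive the element limits from the element widths.
public final class ProposedBufferLimit {
    public static void main(String[] args) {
        long maxHeapBytes   = 1L << 30;            // assume a 1 GB heap, as in the example
        long maxBufferBytes = maxHeapBytes >> 7;   // max_heap / 128 = 8 MB
        long maxInts        = maxBufferBytes >> 2; // 4-byte ints  -> 2M elements
        long maxLongs       = maxBufferBytes >> 3; // 8-byte longs -> 1M elements
        // prints: 8388608 bytes, 2097152 ints, 1048576 longs
        System.out.println(maxBufferBytes + " bytes, " + maxInts + " ints, " + maxLongs + " longs");
    }
}
```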

Laurent

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-15 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

I checked the latest code, line 128:

Max length of the additional buffer, limited to max_heap / 64 or 256M elements
(2 GB max):

private static final int MAX_BUFFER_LENGTH =
(int) Math.min(Runtime.getRuntime().maxMemory() >> 6, 256L << 20);

So the current upper limit on the buffer length is max_heap_bytes / 64 elements
(which is max_heap / 16 bytes for int arrays and max_heap / 8 bytes for long
arrays), or 256M elements (2 GB) when the heap is huge.
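To illustrate, the expression from the PR can be evaluated for a few heap sizes (only the `Math.min(...)` expression comes from the PR; the wrapper class and method are illustrative):

```java
// Evaluates the PR's MAX_BUFFER_LENGTH expression (quoted above) for sample
// heap sizes passed in place of Runtime.getRuntime().maxMemory().
public final class CurrentBufferLimit {
    static int maxBufferLength(long maxHeapBytes) {
        return (int) Math.min(maxHeapBytes >> 6, 256L << 20);
    }

    public static void main(String[] args) {
        System.out.println(maxBufferLength(1L << 30)); // 1 GB heap  -> 16777216 (16M elements)
        System.out.println(maxBufferLength(1L << 36)); // 64 GB heap -> 268435456 (capped at 256M)
    }
}
```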

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-15 Thread Laurent Bourgès
On Fri, 13 May 2022 17:48:50 GMT, Piotr Tarsa  wrote:

> allocating extra buffers and catching OOME when sorting primitives is rather 
> unsatisfactory. you're not giving a reliable option for sorting under low 
> memory conditions. IMO at least the single-threaded primitives (ints, longs, 
> floats, etc all non-objects) sorting should be frugal when it comes to memory 
> usage.
> 
> just my 2 cents.

DPQS uses several sorting algorithms; merge sort & the new radix sort need an
extra buffer, unlike insertion sort, mixed insertion sort, and the single- and
dual-pivot quicksorts.
In this PR an upper limit on the heap usage, min(max_heap / 16, 128mb), is
applied to reduce the memory footprint.
Moreover, catching OOME now lets DPQS fall back to in-place but slower
algorithms if the extra buffer can not be allocated.
The upper limit could be made configurable through a system property if it is
really critical.
To sum up, the extra allocations are now under control in this PR.
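The fallback described above can be sketched as follows (a simplified illustration, not the actual DPQS code; `Arrays.sort` stands in for both sort paths):

```java
import java.util.Arrays;

// Simplified sketch of the allocate-or-fall-back pattern: try to allocate the
// extra buffer, and fall back to an in-place algorithm on OutOfMemoryError.
public final class FallbackSort {
    public static void sort(int[] a) {
        int[] buffer;
        try {
            buffer = new int[a.length];   // extra buffer for merge/radix sort
        } catch (OutOfMemoryError oome) {
            Arrays.sort(a);               // in-place, slower, allocation-free path
            return;
        }
        bufferedSort(a, buffer);
    }

    // Hypothetical stand-in for the buffered (merge/radix) sort.
    private static void bufferedSort(int[] a, int[] buffer) {
        Arrays.sort(a);
    }
}
```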

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-06 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

I am improving the test code (JMH + robust estimators) to get more robust
results on small arrays (20 - 1000 elements).
I hope to have new results within 1 or 2 weeks.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v10]

2022-01-12 Thread Laurent Bourgès
On Mon, 29 Nov 2021 21:16:32 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Updated javadoc
>   * Optimized insertion sort threshold
>   * Refactored parallel sorting section
>   * Improved step for pivot candidates
>   * Changed condition for Radix sort

Please review this PR (it has been ready for months), or at least leave comments, please!

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v10]

2021-12-12 Thread Laurent Bourgès
On Mon, 29 Nov 2021 21:16:32 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Updated javadoc
>   * Optimized insertion sort threshold
>   * Refactored parallel sorting section
>   * Improved step for pivot candidates
>   * Changed condition for Radix sort

Vladimir,
I updated the benchmarking code and here are the complete test results (system
vs DPQS, 21.5 vs 21.11):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/DPQS-sort-1k-1M-stats.out

Full results:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/sort-1k-1M-stats.out

All good on 1k, 10k, 100k, 1M integer arrays.
Congratulations!

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: 8265891: (ch) InputStream returned by Channels.newInputStream should override transferTo [v13]

2021-12-06 Thread Laurent Bourgès
On Mon, 6 Dec 2021 07:07:03 GMT, Markus KARG  wrote:

>> Markus KARG has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Draft: Eliminated duplicate code using lambda expressions
>>  - Draft: Use blocking mode also for target channel
>
> Please keep this PR open. More commits will follow on the next weeks.

Looks promising, keep it moving, @mkarg!

-

PR: https://git.openjdk.java.net/jdk/pull/4263


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-28 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

As JDK 18 Ramp Down 0 is approaching and JDK 19 is opening soon, maybe this
PR review should be planned for JDK 19 build 1?

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-17 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

Core-libs reviewers, could you please have a look before JDK 18 RDP 0?

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-15 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

As I am not an official OpenJDK reviewer, I cannot approve, but I do support this change.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8277095 : Empty streams create too many objects

2021-11-15 Thread Laurent Bourgès
On Fri, 5 Nov 2021 12:53:46 GMT, kabutz  wrote:

> This is a draft proposal for how we could improve stream performance for the 
> case where the streams are empty. Empty collections are common-place. If we 
> iterate over them with an Iterator, we would have to create one small 
> Iterator object (which could often be eliminated) and if it is empty we are 
> done. However, with Streams we first have to build up the entire pipeline, 
> until we realize that there is no work to do. With this example, we change 
> Collection#stream() to first check if the collection is empty, and if it is, 
> we simply return an EmptyStream. We also have EmptyIntStream, EmptyLongStream 
> and EmptyDoubleStream. We have taken great care for these to have the same 
> characteristics and behaviour as the streams returned by Stream.empty(), 
> IntStream.empty(), etc. 
> 
> Some of the JDK tests fail with this, due to ClassCastExceptions (our 
> EmptyStream is not an AbstractPipeline) and AssertionError, since we can call 
> some methods repeatedly on the stream without it failing. On the plus side, 
> creating a complex stream on an empty stream gives us upwards of 50x increase 
> in performance due to a much smaller object allocation rate. This PR includes 
> the code for the change, unit tests and also a JMH benchmark to demonstrate 
> the improvement.

It looks like a good idea to reduce the pipeline overhead!

-

PR: https://git.openjdk.java.net/jdk/pull/6275


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v6]

2021-10-08 Thread Laurent Bourgès
On Wed, 6 Oct 2021 21:21:29 GMT, iaroslavski 
 wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Added more comments

An upper limit on the array size could be used to disable radix sort and
reduce the memory footprint and GC overhead...

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v6]

2021-10-08 Thread Laurent Bourgès
On Wed, 6 Oct 2021 21:21:29 GMT, iaroslavski 
 wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Added more comments

I want to give my point of view on the new memory requirements:
- as radix sort is used for length > 4K, a temporary buffer (size = array
length) will be needed more often, so more young-gen GC passes will happen,
and maybe full GCs if this buffer is really big (half the heap?)
- if OOME is encountered when the buffer is allocated, then DPQS is used
(in-place, no allocation), so the new sorting is more robust than DPQS in
JDK 17

Sorting tests could be run with verbose GC to ascertain the increased GC work
and latency...

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Fri, 14 May 2021 17:50:11 GMT, Piotr Tarsa 
 wrote:

>> I made a JMH test on jdk16 to test count4 (xor) performance:
>> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/results/count_xor
>> 
>> 
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayXorBenchmark.arrayAndOriginal 100 avgt 20 684561,951 ± 2177,120 
>> ns/op
>> ArrayXorBenchmark.arrayXorOriginal 100 avgt 20 777255,465 ± 1815,136 
>> ns/op
>> ArrayXorBenchmark.arrayXor_Masked 100 avgt 20 814163,377 ± 2665,436 ns/op
>> ArrayXorBenchmark.arrayXor_Unsafe 100 avgt 20 707273,922 ± 2994,380 ns/op
>> 
>> Masked xor does not get optimized by C2 either.
>> 
>> Using Unsafe is better, see:
>> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/src/main/java/edu/sorting/bench/ArrayXorBenchmark.java
>> 
>> If you want, I could make another radixsort() using on or off heap count 
>> buffers and Unsafe, as I did in Marlin to avoid bound checks...
>
> I think an uncontroversial and efficient solution would be to replace
> 
> count4[(a[i] >>> 24) ^ 0x80]--;
> 
> with
> 
> count4[(a[i] >>> 24) & 0xFF]--;
> 
> and after finishing counting, first half of `count4` should be swapped with 
> second half of `count4`.

FYI, I made another radixsortNew() using Unsafe to access all the count arrays:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/dbfbd731ffd798defc75528cc1db04063bdb4619/src/edu/sorting/DualPivotQuicksort202105.java#L795

But it is only 2% faster in radix-sort-benchmark (1M arrays):
https://github.com/bourgesl/radix-sort-benchmark/blob/main/results/2021-ref/cmp-16-1M-full.log

It does not seem worth the extra complexity for a few percent, I think.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 20 May 2021 21:21:26 GMT, Nils Eliasson  wrote:

>> A small update of the XorINodes type calculation makes the bound check go 
>> away. Testing now.
>
> That bounds check shouldn't be there, it's an obvious case and will be fixed: 
> https://github.com/openjdk/jdk/pull/4136

Thanks for fixing HotSpot C2 (xor) so quickly!

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Fri, 14 May 2021 07:41:19 GMT, Richard Startin 
 wrote:

>> Great analysis on C2, richard.
>> 
>> maybe (x ^ 0x80) &0xFF would help C2 to eliminate bound checks...
>
> I don't know Laurent, I find the handling of signed order over-complicated. 
> Subtracting `Integer.MIN_VALUE` is really cheap...

I made a JMH test on jdk16 to test count4 (xor) performance:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/results/count_xor


Benchmark (size) Mode Cnt Score Error Units
ArrayXorBenchmark.arrayAndOriginal 100 avgt 20 684561,951 ± 2177,120 ns/op
ArrayXorBenchmark.arrayXorOriginal 100 avgt 20 777255,465 ± 1815,136 ns/op
ArrayXorBenchmark.arrayXor_Masked 100 avgt 20 814163,377 ± 2665,436 ns/op
ArrayXorBenchmark.arrayXor_Unsafe 100 avgt 20 707273,922 ± 2994,380 ns/op

Masked xor does not get optimized by C2 either.

Using Unsafe is better, see:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/src/main/java/edu/sorting/bench/ArrayXorBenchmark.java

If you want, I could make another radixsort() using on or off heap count 
buffers and Unsafe, as I did in Marlin to avoid bound checks...

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 13 May 2021 09:53:37 GMT, Richard Startin 
 wrote:

>> Laurent, the date in this class is not the date of our last commit,
>> this date is the date when I have final idea regarding to Radix sort,
>> therefore, I prefer to keep 2020.06.14
>
> Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I 
> believe this work derives from an unsigned radix sort I implemented on 
> 10/04/2021 
> https://github.com/richardstartin/radix-sort-benchmark/commit/ab4da230e1d0ac68e5ee2cee38d71c7e7d50f49b#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226
>  which has numerous structural similarities to this work:
> * Producing all four histograms in one pass
> * Skipping passes based on detecting the total in the histogram
> * Bailing out of the skip detection if a nonzero value not equal to the total 
> is encountered
> * Manually unrolling the LSD radix sort loop in order to avoid array copies
> 
> My implementation from 10th April is below for reference:
> 
>   public static void unrollOnePassHistogramsSkipLevels(int[] data) {
> int[] histogram1 = new int[257];
> int[] histogram2 = new int[257];
> int[] histogram3 = new int[257];
> int[] histogram4 = new int[257];
> 
> for (int value : data) {
>   ++histogram1[(value & 0xFF) + 1];
>   ++histogram2[((value >>> 8) & 0xFF) + 1];
>   ++histogram3[((value >>> 16) & 0xFF) + 1];
>   ++histogram4[(value >>> 24) + 1];
> }
> boolean skipLevel1 = canSkipLevel(histogram1, data.length);
> boolean skipLevel2 = canSkipLevel(histogram2, data.length);
> boolean skipLevel3 = canSkipLevel(histogram3, data.length);
> boolean skipLevel4 = canSkipLevel(histogram4, data.length);
> 
> if (skipLevel1 && skipLevel2 && skipLevel3 && skipLevel4) {
>   return;
> }
> int[] copy = new int[data.length];
> 
> int[] source = data;
> int[] dest = copy;
> 
> if (!skipLevel1) {
>   for (int i = 1; i < histogram1.length; ++i) {
> histogram1[i] += histogram1[i - 1];
>   }
>   for (int value : source) {
> dest[histogram1[value & 0xFF]++] = value;
>   }
>   if (!skipLevel2 || !skipLevel3 || !skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel2) {
>   for (int i = 1; i < histogram2.length; ++i) {
> histogram2[i] += histogram2[i - 1];
>   }
>   for (int value : source) {
> dest[histogram2[(value >>> 8) & 0xFF]++] = value;
>   }
>   if (!skipLevel3 || !skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel3) {
>   for (int i = 1; i < histogram3.length; ++i) {
> histogram3[i] += histogram3[i - 1];
>   }
>   for (int value : data) {
> dest[histogram3[(value >>> 16) & 0xFF]++] = value;
>   }
>   if (!skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel4) {
>   for (int i = 1; i < histogram4.length; ++i) {
> histogram4[i] += histogram4[i - 1];
>   }
>   for (int value : source) {
> dest[histogram4[value >>> 24]++] = value;
>   }
> }
> if (dest != data) {
>   System.arraycopy(dest, 0, data, 0, data.length);
> }
>   }
> 
>   private static boolean canSkipLevel(int[] histogram, int dataSize) {
> for (int count : histogram) {
>   if (count == dataSize) {
> return true;
>   } else if (count > 0) {
> return false;
>   }
> }
> return true;
>   }
> 
> 
> Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with 
> me about doing so. On 25/04/2021 there was a new implementation of 
> `DualPivotQuicksort` with a signed radix sort but the same structural 
> similarities, and with the same method and variable names in places 
> https://github.com/bourgesl/radix-sort-benchmark/commit/90ff7e427da0fa49f374bff0241fb2487bd87bde#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609
> 
> 
> // TODO add javadoc
> private static void radixSort(Sorter sorter, int[] a, int low, int high) {
> int[] b;
> // LBO: prealloc (high - low) +1 element:
> if (sorter == null || (b = sorter.b) == null || b.length < (high - 
> low)) {
> // System.out.println("alloc b: " + (high - low));
> b = new int[high - low];
> }
> 
> int[] count1, count2, count3, count4;
> if (sorter != null) {
> sorter.resetRadixBuffers();
> count1 = sorter.count1;
> count2 = sorter.count2;
> count3 = sorter.count3;
> count4 = sorter.count4;
> } else {
> // System.out.println("alloc radix buffers(4x256)");
> count1 = new int[256];
> count2 = new int[256];
> count3 = new int[256];
>  

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 13 May 2021 21:44:38 GMT, Richard Startin 
 wrote:

>> For what it's worth, I benchmarked this implementation radix sort ([adapted 
>> here to fit in to my 
>> harness](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR681-R710782))
>>  against a [signed 
>> variant](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR396-R478)
>>  of what I have claimed this work was derived from and the proposed 
>> implementation does not perform favourably on uniformly random data:
>> 
>> 
>> 
>> Benchmark (bits)  (padding)  
>> (scenario)  (seed)   (size)  Mode  Cnt  Score Error  Units
>> RadixSortBenchmark.jdk17  7 
>> UNIFORM   0  100  avgt5  11301.950 ± 113.691  us/op
>> RadixSortBenchmark.jdk23  7 
>> UNIFORM   0  100  avgt5  11792.351 ±  60.757  us/op
>> RadixSortBenchmark.jdk30  7 
>> UNIFORM   0  100  avgt5  11184.616 ±  67.094  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  17  7 
>> UNIFORM   0  100  avgt5   9564.626 ±  69.497  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  23  7 
>> UNIFORM   0  100  avgt5   9432.085 ±  58.983  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  30  7 
>> UNIFORM   0  100  avgt5  10772.975 ±  51.848  us/op
>> 
>> 
>> 
>> I believe the root cause is a defect in the mechanism employed to skip 
>> passes as can be seen by the increased number of instructions and cycles 
>> here. In the proposed implementation, instructions is roughly constant as a 
>> function of bits. In the case where all passes must be performed (bits = 
>> 30), IPC is superior in `unrollOnePassHistogramsSkipLevelsSigned`.
>> 
>> 
>> Benchmark   
>> (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt Score 
>> Error  Units
>> RadixSortBenchmark.jdk:cycles   
>> 17  7 UNIFORM   0  100  avgt   34976971.877  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 17  7 UNIFORM   0  100  avgt   70121142.003  
>>#/op
>> RadixSortBenchmark.jdk:cycles   
>> 23  7 UNIFORM   0  100  avgt   32369970.385  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 23  7 UNIFORM   0  100  avgt   70201664.963  
>>#/op
>> RadixSortBenchmark.jdk:cycles   
>> 30  7 UNIFORM   0  100  avgt   30789736.602  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 30  7 UNIFORM   0  100  avgt   70180942.122  
>>#/op
>> RadixSortBenchmark.jdk:IPC  
>> 30  7 UNIFORM   0  100  avgt  2.279  
>>   insns/clk
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 17  7 UNIFORM   0  100  avgt   26983994.479  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 17  7 UNIFORM   0  100  avgt   62065304.827  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 23  7 UNIFORM   0  100  avgt   26161703.124  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 23  7 UNIFORM   0  100  avgt   63102683.167  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 30  7 UNIFORM   0  100  avgt   29780103.795  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 30  7 UNIFORM   0  100  avgt   74437097.025  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:IPC
>> 30  7 UNIFORM   0  100  avgt  2.500  
>>   insns/clk
>> 
>> 
>> 
>> The biggest difference in executed code appears to be that bounds checks are 
>> not eliminated in the first counting pass in the proposed implementation:
>> 
>> 
>> [Hottest Region 
>> 

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Wed, 12 May 2021 12:20:09 GMT, iaroslavski 
 wrote:

>> src/java.base/share/classes/java/util/DualPivotQuicksort.java line 47:
>> 
>>> 45:  * @author Doug Lea
>>> 46:  *
>>> 47:  * @version 2020.06.14
>> 
>> Vladimir, I would update to 2021.05.06 (+your hash)
>
> Laurent, the date in this class is not the date of our last commit,
> this date is the date when I have final idea regarding to Radix sort,
> therefore, I prefer to keep 2020.06.14

As you want, but you should maybe indicate which version corresponds to your
master code too.

It would help tracking changes between the forks (iaroslavski/sorting master).

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Sat, 8 May 2021 20:54:48 GMT, iaroslavski 
 wrote:

> Sorting:
> 
> - adopt radix sort for sequential and parallel sorts on int/long/float/double 
> arrays (almost random and length > 6K)
> - fix tryMergeRuns() to better handle case when the last run is a single 
> element
> - minor javadoc and comment changes
> 
> Testing:
> - add new data inputs in tests for sorting
> - add min/max/infinity values to float/double testing
> - add tests for radix sort

Congratulations,
what an amazing job!

DPQS is now 50% faster on average, and even 2.5 times faster on the rms input, my favorite!

Finally, OOME is now handled to let sorting work under low-memory conditions!

I will work on more benchmarks for the long/float/double types.

Laurent

Still waiting for individual OCA approval

src/java.base/share/classes/java/util/DualPivotQuicksort.java line 47:

> 45:  * @author Doug Lea
> 46:  *
> 47:  * @version 2020.06.14

Vladimir, I would update to 2021.05.06 (+your hash)

src/java.base/share/classes/java/util/DualPivotQuicksort.java line 288:

> 286: /*
> 287:  * Invoke radix sort on large array.
> 288:  */

I prefer testing (sorter == null) first, as it is always true for the
sequential sort, and avoiding the bits > ... test in this case.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: How to know if cpu supports unaligned memory accesses ?

2021-04-19 Thread Laurent Bourgès
Thanks, Alan !

Probably I should adopt VarHandle and use its int/long views... (for jdk11u).

Or adapt my arrays/off-heap buffers (struct-like) to use primitive objects
(JEP 401) in JDK 17!

It looks very promising, as the Marlin renderer uses arrays of structs, packed
in arrays or off-heap (edges, RLE tile cache).
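For reference, a minimal sketch of the VarHandle int view idea (my own example, not Marlin code): plain get/set on a byte-array view are valid at any byte offset, so unaligned access is handled portably by the JDK (only the atomic access modes require alignment).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public final class UnalignedViews {
    // An int view over a byte[]: the index is a byte offset, and plain
    // get/set work even when the offset is not 4-byte aligned.
    private static final VarHandle INT_VIEW =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        INT_VIEW.set(buf, 5, 0x11223344);           // offset 5: unaligned, still valid
        int v = (int) INT_VIEW.get(buf, 5);
        System.out.println(Integer.toHexString(v)); // prints 11223344
    }
}
```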

Cheers,
Laurent

Le dim. 18 avr. 2021 à 09:11, Alan Bateman  a
écrit :

>
> On 17/04/2021 19:02, Laurent Bourgès wrote:
> > Hi,
> > In the Marlin renderer & in other Unsafe use cases (pixel tile
> processing)
> > for java2d pipelines, I do use putInt()/putLong() primitives even if the
> > address is not aligned (to 4 or 8 bytes) to get faster processing of byte
> > buffer or arrays...
> >
> > Is it possible in java (jdk internals) to query cpu capabilities like
> > endianness, unaligned support ?
> >
> Since Marlin is in the JDK and using Unsafe already then can't it just
> using the isBigEndian and unalignedAccess methods? The Buffer
> implementation and a few places do the same.
>
> -Alan.
>


How to know if cpu supports unaligned memory accesses ?

2021-04-17 Thread Laurent Bourgès
Hi,
In the Marlin renderer & in other Unsafe use cases (pixel tile processing)
for java2d pipelines, I do use putInt()/putLong() primitives even if the
address is not aligned (to 4 or 8 bytes) to get faster processing of byte
buffer or arrays...

Is it possible in java (jdk internals) to query cpu capabilities like
endianness, unaligned support ?

For x86-64, unaligned access is about as fast as aligned, so it is not
necessary to add padding in the data layout, and smaller buffers are then
preferable.

I would like to write code like:
if (cpu_supports_unaligned) {
...
Use int/long to pack bytes...
} else {
Use byte (fallback)
}


Laurent


Re: 8252827: Caching Integer.toString just like Integer.valueOf

2021-04-17 Thread Laurent Bourgès
Hi,

I read the JBS bug and I interpret it as:
- IntegerCache provides Integer instances for [-128, 127] by default
- Integer.toString(int) could behave the same, or at least cache the most
probable values like [-1, 16], or use the IntegerCache range.

It looks trivial and could potentially benefit the JDK itself by reducing
memory garbage: is Integer.toString(int) widely used in the JDK codebase?

Finally, it can alternatively be implemented in the application's code.
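An application-level version could be as simple as this sketch (hypothetical class, mirroring the default IntegerCache range [-128, 127]):

```java
// Hypothetical application-side cache of Integer.toString results for the
// most probable small values; anything outside the range falls through.
public final class IntStrings {
    private static final int LOW = -128, HIGH = 127;
    private static final String[] CACHE = new String[HIGH - LOW + 1];
    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = Integer.toString(i + LOW);
        }
    }

    static String of(int v) {
        return (v >= LOW && v <= HIGH) ? CACHE[v - LOW] : Integer.toString(v);
    }
}
```

Repeated calls for cached values return the same String instance, so no garbage is produced for the common small-value case.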

My 2 cents,
Laurent

Le sam. 17 avr. 2021 à 11:06, Raffaello Giulietti <
raffaello.giulie...@gmail.com> a écrit :

>
>
> On 2021-04-17 07:07, David Holmes wrote:
> > On 17/04/2021 4:54 am, Raffaello Giulietti wrote:
> >> I guess the reporter meant to limit the cache range similarly to the
> >> one used for valueOf().
> >>
> >> I have no clue about the benefit/cost ratio for the proposed String
> >> cache. It really depends on usage, workload, etc. One can easily
> >> imagine both extreme scenarios but it's hard to tell how the average
> >> one would look.
> >>
> >> My post is only about either solving the issue by implementing the
> >> cache, which I can contribute to; or closing it because of lack of
> >> real-world need or interest.
> >
> > Caching for the sake of caching is not an objective in itself. Unless
> > the caching can be shown to solve a real problem, and the strategy for
> > managing the cache is well-defined, then I would just close the
> > enhancement request. (Historically whether an issue we don't have any
> > firm plans to address is just left open "forever" or closed, depends
> > very much on who does the bug triaging in that area. :) )
> >
> > Cheers,
> > David
> >
>
>
> Indeed, the impression is that many of the issues are probably open
> because it's unclear whether they should be addressed with some
> implementation or spec effort or whether they should be closed. Triaging
> is certainly a harder job than it appears at first sight ;-)
>
> It would be useful to have a kind of "suspended" or "limbo" resolution
> state on the JBS for issues like this one, so people searching for more
> compelling open ones would not encounter them.
>
> Personally, I would close this issue without a "fix"; or "suspend" it.
>
>
> Greetings
> Raffaello
>
>
>
> >>
> >> Greetings
> >> Raffaello
> >>
> >>
> >> On 2021-04-16 20:36, Roger Riggs wrote:
> >>> Hi,
> >>>
> >>> Is there any way to quantify the savings?
> >>> And what technique can be applied to limit the size of the cache.
> >>> The size of the small integer cache is somewhat arbitrary.
> >>>
> >>> Regards, Roger
> >>>
> >>> On 4/16/21 12:48 PM, Raffaello Giulietti wrote:
>  Hello,
> 
>  does the enhancement proposed in [1] make sense, both today and when
>  wrappers will be migrated to primitive classes?
>  If so, it should be rather simple to add it and I could prepare a PR.
>  If not, the issue might perhaps be closed.
> 
> 
>  Greetings
>  Raffaello
> 
>  
> 
>  [1] https://bugs.openjdk.java.net/browse/JDK-8252827
> >>>
>


Re: RFR(S): 8242848: Improve performance of InflaterOutputStream.write()

2020-04-23 Thread Laurent Bourgès
Thank you Volker for sharing your results.

It is amazing to improve such a widely used feature (zip -> jar, HTTP
compression, ...).

Congratulations.

Laurent
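For reference, the internal inflater buffer size discussed in this thread is already settable through a public constructor; a minimal round-trip sketch using the 16384 value proposed in JDK-8242864 (the current default is 512):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterOutputStream;

public class InflateDemo {

    /** Deflate-compresses 'original', then inflates it back via InflaterOutputStream. */
    static byte[] roundTrip(byte[] original) {
        try {
            // Compress.
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(compressed)) {
                dos.write(original);
            }
            // Decompress; the third argument is the internal buffer size
            // (512 by default; JDK-8242864 proposes raising it to 16384).
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (InflaterOutputStream ios =
                     new InflaterOutputStream(restored, new Inflater(), 16384)) {
                ios.write(compressed.toByteArray());
            }
            return restored.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "compress me ".repeat(1000).getBytes();
        System.out.println(java.util.Arrays.equals(data, roundTrip(data))); // true
    }
}
```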

On Thu, Apr 23, 2020 at 12:19, Volker Simonis
wrote:

> On Thu, Apr 23, 2020 at 10:56 AM Laurent Bourgès
>  wrote:
> >
> > Hi Volker,
> >
> > Could you give some benchmark results in the jbs bug to have an idea of
> the performance gain ?
>
> The results of a benchmark run are at the end of the microbenchmark
> which is attached to https://bugs.openjdk.java.net/browse/JDK-8242848
>
> I've attached them here for your convenience. The first run is before,
> the second run after applying this patch.
>
> E.g. for a user supplied buffer of size 16384, the time for inflation
> decreases from "2.603 ± 0.404  ms/op" to "2.187 ± 0.126  ms/op" which
> is an improvement of 16%. Obviously, increasing the default internal
> buffer size from 512 to 16384 (which I want to do in
> https://bugs.openjdk.java.net/browse/JDK-8242864) will get you even
> bigger speedups if you use the default buffer size :)
>
> /output/jdk-opt/images/jdk/bin/java
> --
> Benchmark(scale)  (size)  Mode  Cnt  Score
> Error  Units
> InflaterOutputStreamWrite.write1 512  avgt3  5.378 ±
> 0.301  ms/op
> InflaterOutputStreamWrite.write11024  avgt3  3.917 ±
> 0.165  ms/op
> InflaterOutputStreamWrite.write12048  avgt3  3.158 ±
> 0.097  ms/op
> InflaterOutputStreamWrite.write14096  avgt3  2.707 ±
> 0.138  ms/op
> InflaterOutputStreamWrite.write18192  avgt3  2.600 ±
> 0.399  ms/op
> InflaterOutputStreamWrite.write1   16384  avgt3  2.603 ±
> 0.404  ms/op
> InflaterOutputStreamWrite.write1   32768  avgt3  2.622 ±
> 0.211  ms/op
> InflaterOutputStreamWrite.write1   65536  avgt3  2.605 ±
> 0.170  ms/op
> InflaterOutputStreamWrite.write2 512  avgt3  4.015 ±
> 0.150  ms/op
> InflaterOutputStreamWrite.write21024  avgt3  3.225 ±
> 0.178  ms/op
> InflaterOutputStreamWrite.write22048  avgt3  2.745 ±
> 0.261  ms/op
> InflaterOutputStreamWrite.write24096  avgt3  2.614 ±
> 0.542  ms/op
> InflaterOutputStreamWrite.write28192  avgt3  2.593 ±
> 0.206  ms/op
> InflaterOutputStreamWrite.write2   16384  avgt3  2.606 ±
> 0.055  ms/op
> InflaterOutputStreamWrite.write2   32768  avgt3  2.611 ±
> 0.116  ms/op
> InflaterOutputStreamWrite.write2   65536  avgt3  2.617 ±
> 0.170  ms/op
> InflaterOutputStreamWrite.write4 512  avgt3  3.376 ±
> 0.599  ms/op
> InflaterOutputStreamWrite.write41024  avgt3  2.840 ±
> 0.155  ms/op
> InflaterOutputStreamWrite.write42048  avgt3  2.633 ±
> 0.550  ms/op
> InflaterOutputStreamWrite.write44096  avgt3  2.598 ±
> 0.166  ms/op
> InflaterOutputStreamWrite.write48192  avgt3  2.602 ±
> 0.054  ms/op
> InflaterOutputStreamWrite.write4   16384  avgt3  2.601 ±
> 0.039  ms/op
> InflaterOutputStreamWrite.write4   32768  avgt3  2.639 ±
> 0.020  ms/op
> InflaterOutputStreamWrite.write4   65536  avgt3  2.619 ±
> 0.260  ms/op
> InflaterOutputStreamWrite.write8 512  avgt3  2.882 ±
> 0.149  ms/op
> InflaterOutputStreamWrite.write81024  avgt3  2.695 ±
> 0.586  ms/op
> InflaterOutputStreamWrite.write82048  avgt3  2.644 ±
> 0.472  ms/op
> InflaterOutputStreamWrite.write84096  avgt3  2.616 ±
> 0.052  ms/op
> InflaterOutputStreamWrite.write88192  avgt3  2.616 ±
> 0.063  ms/op
> InflaterOutputStreamWrite.write8   16384  avgt3  2.611 ±
> 0.090  ms/op
> InflaterOutputStreamWrite.write8   32768  avgt3  2.633 ±
> 0.216  ms/op
> InflaterOutputStreamWrite.write8   65536  avgt3  2.634 ±
> 0.180  ms/op
>
> /priv/simonisv/output/jdk-opt-JDK-8242848/images/jdk/bin/java
> --
> Benchmark(scale)  (size)  Mode  Cnt  Score
> Error  Units
> InflaterOutputStreamWrite.write1 512  avgt3  5.388 ±
> 0.349  ms/op
> InflaterOutputStreamWrite.write11024  avgt3  3.799 ±
> 0.093  ms/op
> InflaterOutputStreamWrite.write12048  avgt3  2.994 ±
> 0.023  ms/op
> InflaterOutputStreamWrite.write14096  avgt3  2.583 ±
> 0.159  ms/op
> InflaterOutputStreamWrite.write18192  avgt3  2.325 ±
> 0.

Re: RFR(S): 8242848: Improve performance of InflaterOutputStream.write()

2020-04-23 Thread Laurent Bourgès
Hi Volker,

Could you post some benchmark results in the JBS bug, to give an idea of the
performance gain?

Thanks,
Laurent

On Thu, Apr 23, 2020 at 10:20, Langer, Christoph
wrote:

> Hi Volker,
>
> since it's not yet pushed, I also went over the change and I like it. +1 
>
> One little style nit caught my eye, which you could correct before
> pushing: The style of the if/else blocks in
> test/jdk/java/util/zip/DeflateIn_InflateOut.java in lines 48/49 and 59/60
> does not match the other if/else blocks in the file. You should probably
> have the else on the line of the closing bracket before...
>
> Thanks
> Christoph
>
>
> > -Original Message-
> > From: core-libs-dev  On Behalf
> > Of Volker Simonis
> > Sent: Mittwoch, 22. April 2020 22:09
> > To: Lance Andersen 
> > Cc: Java Core Libs ; Vyom Tewari
> > 
> > Subject: Re: RFR(S): 8242848: Improve performance of
> > InflaterOutputStream.write()
> >
> > On Tue, Apr 21, 2020 at 5:23 PM Lance Andersen
> >  wrote:
> > >
> > > Hi Volker,
> > >
> > > I think overall this looks OK.  I went through the older SCCS
> histories to see
> > if I could figure out why they were using 512 for the input length but
> could
> > not find anything that might shed some light for me.
> > >
> >
> > Hi Lance,
> >
> > thanks a lot for digging in the old sources to review my change. It's
> > great that we stil have people who can use SCCS :)
> >
> > > I am not sure you can guarantee that src.zip exists but others might
> be able
> > to weigh in here.  What we have been trying to do going forward is to
> have
> > the tests create  the zip files that it needs.  In some cases, we have
> created a
> > byte array within the test which represents the zip and just write it out
> > before the test begins.
> > >
> >
> > Yes, the dependency on an external file was not nice, so I rewrote the
> > benchmark to programmatically create a file which can be compressed by
> > a factor of ~6:
> >
> > http://cr.openjdk.java.net/~simonis/webrevs/2020/8242848.02/
> >
> > Notice that this new version only changes the microbenchmark, all the
> > other files are untouched.
> >
> > As everybody seemed to be happy with the change itself and the
> > regression test, I'm now waiting for your and Clae's final review of
> > the microbenchmark before pushing. Please let me know hat you think?
> >
> > Best regards,
> > Volker
> >
> > > I am hoping others with more history might also chime in case they are
> > aware of the history here.
> > >
> > > Thank you for helping improve the performance.
> > >
> > > Best
> > > Lance
> > >
> > > On Apr 17, 2020, at 6:49 AM, Volker Simonis 
> > wrote:
> > >
> > > Thanks everybody for looking at this change!
> > >
> > > Please find an updated webrev (with a new test and micro-benchmark)
> > > and my answers to your comments below:
> > >
> > > http://cr.openjdk.java.net/~simonis/webrevs/2020/8242848.01/
> > >
> > > On Thu, Apr 16, 2020 at 6:40 AM Vyom Tiwari 
> > wrote:
> > >
> > > Thanks for doing this, i think the below code change is not required .
> > >
> > > Please do let me know if i am not reading it correctly.
> > >
> > > if (inf.finished() || (len == 0)/* no more input */) {
> > >
> > > If you check the native code "inf.finished() will be true only of the
> > corresponding native call inflate(strm, Z_PARTIAL_FLUSH) returns
> > "Z_STREAM_END".
> > >
> > > After your code change  write may return even if finished() is not
> true.
> > >
> > >
> > > Yes, that's true, but that's what we must do if there's no more input
> > > available. Before my change this break on "len < 1" was in the "if
> > > (inf.needsInput())" branch.
> > >
> > > On Thu, Apr 16, 2020 at 8:22 AM Thomas Stüfe
> >  wrote:
> > >
> > > 252 // Check the decompressor
> > > 253 if (inf.finished() || (len == 0)/* no more input
> */) {
> > > 254 break;
> > > 255 }
> > >
> > > Not sure but I think this is wrong because now you bypass the followup
> > handling of inf.needsDirectory.
> > >
> > > Inflater.inflate() returns 0 if either needsInput or needsDirectory.
> We have
> > to ignore needsInput since we have no input anymore, but needsDirectory
> > has to be handled, no?
> > >
> > >
> > > You're absolutely right Thomas. Thanks for catching this! I've moved
> > > the check for "needsDictionary" in front of the "finished() || len ==
> > > 0" check.
> > >
> > > Unfortunately there is not very good test coverage for zip with preset
> > > dictionaries (jtreg and submit repo passed without problems). So I
> > > added a new test for this use case to "
> > > test/jdk/java/util/zip/DeflateIn_InflateOut.java".
> > >
> > > On Thu, Apr 16, 2020 at 8:48 AM Thomas Stüfe
> >  wrote:
> > >
> > >
> > > As for increasing the buffer size, it makes sense but IMHO needs a CSR
> and
> > a release note.
> > >
> > >
> > > I don't think so. This is an internal, implementation specific setting
> > > which has never been exposed or documented before so I don't 

Re: RFR: 8226297: Dual-pivot quicksort improvements

2019-08-07 Thread Laurent Bourgès
Hi Brent,

I made a manual comparison of your webrev.4 and my previous patch.
- I noticed plenty of javadoc / comment changes that seem totally
appropriate in Arrays class: ok for me.
- You also fixed the problem with the ForkJoinPool field: good.
- No change in the DualPivotQuickSort class except copyright header: OK
(you reverted reordering I suppose)

My former question was more about the process: how can incremental webrevs
be produced easily? It is easier for me to take a quick look than to inspect
the global patch, where the DPQS class was almost rewritten.

Thanks for your help,
Laurent

On Wed, Aug 7, 2019 at 01:33, Brent Christian
wrote:

> On 8/6/19 12:47 PM, Vladimir Yaroslavskiy wrote:
> >
> > I moved Object sorting related stuff after primitives sorting methods
> > to separate functionality logically.
>
> Sure, fine to keep that all together.  I can move that back:
> http://cr.openjdk.java.net/~bchristi/8226297/webrev04/
>
> > The order of methods in my version is:
> >
> > 1. primitives (sequential sorting)
> > - int
> > - long
> > - byte
> > - char
> > - short
> > - float
> > - double
>
> The order for sequential sorting of primitives in Arrays.java checked
> into the JDK is:
>
> - int
> - long
> * short
> * char
> * byte
> - float
> - double
>
> It simplifies the webrev for reviewing to keep that ordering, so that's
> what I've done.
>
> Thanks,
> -Brent
>
> > Tuesday, August 6, 2019, 21:35 +03:00, from Brent Christian
> > :
> >
> > Hi, Laurent
> >
> > I'm not sure what exactly is causing the problem, but here's my
> hunch:
> >
> > In Vladimir's version of Arrays.java:
> > MIN_ARRAY_SORT_GRAN
> > NaturalOrder
> > rangeCheck
> > got moved around, but were unchanged.
> >
> > In the interest of keeping the change as simple as possible, I
> restored
> > these to their original location, so they don't show up in my
> changes.
> > That could confuse things when comparing diffs.
> >
> > One idea would be to restore those elements back in their original
> > locations in your version, and re-generate your patch. I don't know
> if
> > that would be less work than just comparing raw files.
> >
> > Alternatively, if it would be easiest for those familiar with the
> > evolution of this fix to leave things where Vladimir had them, I can
> do
> > that.
> >
> > Thanks,
> > -Brent
> >
> > On 8/6/19 6:32 AM, Laurent Bourgès wrote:
> >  > Hi Brent,
> >  >
> >  > Thank you for sponsoring this patch.
> >  >
> >  > I tried to compare your webrev against my latest (diff patch
> > files) but
> >  > it gives me too many changes lines.
> >  >
> >  > Do you have another idea to see incremental changes only ?
> >  > (anyway I can compare raw files)
> >  >
> >  > Thanks,
> >  > Laurent
> >  >
> >  > Le lun. 5 août 2019 à 23:43, Brent Christian
> > mailto:brent.christ...@oracle.com>
> >  > <mailto:brent.christ...@oracle.com>> a écrit :
> >  >
> >  > Hi,
> >  >
> >  > Please review Vladimir Yaroslavskiy's changes to
> DualPivotQuickSort
> >  > (seen earlier[1] on this alias).  I will be sponsoring this
> change.
> >  >
> >  > I have a webrev against jdk-jdk here:
> >  > http://cr.openjdk.java.net/~bchristi/8226297/webrev03-rfr/
> >  >
> >  > (Note that I did a little re-ordering, and removed some
> superfluous
> >  > spacing changes, in order to simplify the webrev.  I've also
> included
> >  > Vladimir's FailedFloat[2] test case.)
> >  >
> >  > Information about benchmarking the changes was posted[3] recently.
> >  > An automated test run passes cleanly.
> >  >
> >  > Thanks!
> >  > -Brent
> >  > --
> >  > 1.
> >  >
> >
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061363.html
> >  >
> >  > 2.
> >  >
> >
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061513.html
> >  >
> >  > 3.
> >  >
> >
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061553.html
> >  >
> >
> >
> >
> > --
> > Vladimir Yaroslavskiy
>


-- 
-- 
Laurent Bourgès


Re: RFR: 8226297: Dual-pivot quicksort improvements

2019-08-06 Thread Laurent Bourgès
Hi Brent,

Thank you for sponsoring this patch.

I tried to compare your webrev against my latest (diff patch files), but it
gives me too many changed lines.

Do you have another idea to see only the incremental changes?
(anyway I can compare raw files)

Thanks,
Laurent

On Mon, Aug 5, 2019 at 23:43, Brent Christian
wrote:

> Hi,
>
> Please review Vladimir Yaroslavskiy's changes to DualPivotQuickSort
> (seen earlier[1] on this alias).  I will be sponsoring this change.
>
> I have a webrev against jdk-jdk here:
> http://cr.openjdk.java.net/~bchristi/8226297/webrev03-rfr/
>
> (Note that I did a little re-ordering, and removed some superfluous
> spacing changes, in order to simplify the webrev.  I've also included
> Vladimir's FailedFloat[2] test case.)
>
> Information about benchmarking the changes was posted[3] recently.
> An automated test run passes cleanly.
>
> Thanks!
> -Brent
> --
> 1.
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061363.html
>
> 2.
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061513.html
>
> 3.
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061553.html
>


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-08-01 Thread Laurent Bourgès
Hi Brent,

On Thu, Aug 1, 2019 at 02:32, Brent Christian
wrote:

> Thanks, Laurent.  I can sponsor this fix, get a RFR thread going for
> JDK-8226297, keep webrevs updated as the review progresses, etc.
>

Excellent, I will let you and Vladimir work on this review.

FYI, I have implemented DPQS (int[]) sorting 2 arrays (values + indices) in
the Marlin renderer. It could be generalized to any array type, and having
the index array is very important in many algorithms...


> However I've uncovered an issue with the fix that should be resolved
> before proceeding.
>
> Specifically, the new Arrays.COMMON_PARALLELISM static constant causes
> exceptions at startup under some circumstances:
>  * running with LANG=C on Linux[1]
>  * setting a property value with non-Ascii characters on Mac[2]
>
> java.util.Arrays is used a fair bit for String handling
> (java.lang.StringCoding, java.langStringLatin1, etc).  The problem is
> that early in startup, depending on the locale/language setup and/or
> property settings, java.util.Arrays can be called during initialization
> of system properties.
>
> During Arrays., COMMON_PARALLELISM is setup with a call to
> ForkJoinPool.getCommonPoolParallelism().  ForkJoinPool sets up
> some VarHandles, VarHandles leads to
> MethodHandleStatics., which reads some system properties.  But
> we're still initializing the system properties - 'props' is null, so NPE.
>

Chicken-and-egg problem. Maybe this field could be wrapped in a getter doing
lazy resolution...
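One standard way to do that lazy resolution is the initialization-on-demand holder idiom; a minimal sketch (class and method names are illustrative, not the actual JDK-8226297 code):

```java
import java.util.concurrent.ForkJoinPool;

public class LazyParallelism {
    // Initialization-on-demand holder: the nested class is not initialized
    // until commonParallelism() is first called, so no ForkJoinPool class
    // initialization happens during early JVM startup.
    private static final class Holder {
        static final int COMMON_PARALLELISM =
                ForkJoinPool.getCommonPoolParallelism();
    }

    static int commonParallelism() {
        return Holder.COMMON_PARALLELISM;
    }

    public static void main(String[] args) {
        System.out.println("common parallelism = " + commonParallelism());
    }
}
```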


> I think Arrays.java needs to remove the COMMON_PARALLELISM constant, and
> go back to making calls to ForkJoinPool.getCommonPoolParallelism().
>

I do not know why Vladimir changed that recently. Any good reason?


> I can do the update and post an updated webrev (unless further
> discussion is needed).
>

PS: I can help with benchmarking, as I developed a fair sort benchmark based
on JMH:
https://github.com/bourgesl/nearly-optimal-mergesort-code

JMH test code:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/src/main/java/edu/sorting/bench
I would be interested in improving the perf test code and possibly
integrating it into the OpenJDK JMH test suite... (later)

>
Goodbye & good luck,
Laurent


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-26 Thread Laurent Bourgès
Hi all,

I updated the DPQS patch thanks to the latest release from Vladimir
(2019.07.25):
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/dpqs-8226297.1/

"
Updated version of Arrays and DualPivotQuicksort files, the summary of
changes:

1. Corrected minimal threshold for parallelism argument (0 -> 1)
2. Added constant for common parallelism
3. Updated byte / char / short counting sort.
"

Incremental patch change:
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/diff_dpqs_1_vs_0.log

Cheers,
Laurent


On Wed, Jul 17, 2019 at 17:12, Laurent Bourgès
wrote:

> Hi,
>
> I integrated locally the complete DPQS 19.2 patch from Vladimir
> Yaroslavskiy with up-to-date jdk14 (client) repository:
> build OK and jtreg passed OK (java/util/Arrays).
>
> Here is the corresponding webrev for review:
> http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/webrev/
>
> Cheers,
> Laurent
>


Re: Benchmarking of new optimized Dual-Pivot Quicksort

2019-07-26 Thread Laurent Bourgès
Hi Vladimir,

First, thank you for these impressive results: 15% to 70% overall gains are
amazing and will make a big difference. Congratulations!

Could you publish the details of each test environment?
- HW: cpu (lscpu output)
- OS version
- JDK version + JVM settings

Ideally, you or I could write a shell script to collect the setup and write
a simple log file.
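As a partial alternative, some of that setup can be collected from inside the JVM itself; a minimal sketch (the lscpu output would still need an external command):

```java
public class EnvInfo {
    public static void main(String[] args) {
        // Basic environment details useful for a benchmark log file.
        System.out.println("os.name      = " + System.getProperty("os.name"));
        System.out.println("os.arch      = " + System.getProperty("os.arch"));
        System.out.println("os.version   = " + System.getProperty("os.version"));
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));
        System.out.println("processors   = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```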

PS: I suppose you tested DPQS only on Intel CPUs; could somebody test on PPC
or ARM CPUs as well?

Cheers,
Laurent

On Thu, Jul 25, 2019 at 16:04, Vladimir Yaroslavskiy
wrote:

> Hi all,
>
> With the help of Laurent Bourges I run benchmarking of new optimized
> Dual-Pivot Quicksort
> and summarized results. The tests were run for all types (int / long / ...
> / double), for both types:
> sequential and parallel sorting, for small, medium and large sizes. The
> comparison was done
> on several data types, such as: random, period, sorted, equal, stagger,
> shuffle.
>
> Here is the summary of total gains. The gain is defined as (jdk_time -
> dpqs_time) / jdk_time,
> where jdk_time is sorting time by the current jdk14 and dpqs_time is
> sorting time by new
> optimized Dual-Pivot Quicksort. To get stable and reproducible results,
> min time of many
> invocations is taken.
>
> You can find all detailed results there
> https://github.com/iaroslavski/sorting/tree/master/benchmarking/results
>
> Sources of benchmarking tests are there
> https://github.com/iaroslavski/sorting/tree/master/benchmarking/sources
>
> You
> can see not good performance for float / double types (parallel sorting).
> This behavior can be explained by existing bug in jdk, when NaNs and -0.0s
> are not processed properly. New sorting has total losses for small float /
> double
> arrays, few percents only. For all other cases new optimized sorting is
> winner.
>
>
> 
> [int]
>
> sequential, Length = 150, Gain: 0.15
> sequential, Length = 3, Gain: 0.28
> sequential, Length = 100, Gain: 0.31
> parallel 8, Length = 150, Gain: 0.14
> parallel 8, Length = 3, Gain: 0.15
> parallel 8, Length = 100, Gain: 0.14
>
> [long]
> sequential, Length = 150, Gain: 0.12
> sequential, Length = 3, Gain: 0.23
> sequential, Length = 100, Gain: 0.32
> parallel 8, Length = 150, Gain: 0.11
> parallel 8, Length = 3, Gain: 0.16
> parallel 8, Length = 100, Gain: 0.24
> parallel 88 processors, Length = 100, Gain: 0.05
>
> [byte]
> sequential, Length = 150, Gain: 0.06
> sequential, Length = 3, Gain: 0.36
> sequential, Length = 100, Gain: 0.37
> parallel 8, Length = 150, Gain: 0.13
> parallel 8, Length = 3, Gain: 0.73
> parallel 8, Length = 100, Gain: 0.65
>
> [char]
> sequential, Length = 150, Gain: 0.10
> sequential, Length = 3, Gain: 0.00
> sequential, Length = 100, Gain: 0.17
> parallel 8, Length = 150, Gain: 0.10
> parallel 8, Length = 3, Gain: 0.33
> parallel 8, Length = 100, Gain: 0.62
>
> [short]
> sequential, Length = 150, Gain: 0.14
> sequential, Length = 3, Gain: 0.10
> sequential, Length = 100, Gain: 0.18
> parallel 8, Length = 150, Gain: 0.13
> parallel 8, Length = 3, Gain: 0.41
> parallel 8, Length = 100, Gain: 0.65
>
> [float]
> sequential, Length = 150, Gain: -0.02
> sequential, Length = 3, Gain: 0.21
> sequential, Length = 100, Gain: 0.24
> parallel 8, Length = 150, Gain: -0.05
> parallel 8, Length = 3, Gain: -0.04
> parallel 8, Length = 100, Gain: -0.12
>
> [double]
> sequential, Length = 150, Gain: -0.03
> sequential, Length = 3, Gain: 0.19
> sequential, Length = 100, Gain: 0.23
> parallel 8, Length = 150, Gain: -0.05
> parallel 8, Length = 3, Gain: 0.05
> parallel 8, Length = 100, Gain: 0.15
>
> Thank you,
> Vladimir
>
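The gain metric defined in the quoted message, (jdk_time - dpqs_time) / jdk_time, as a quick sanity check; the times below are made-up illustrative values, not measured results:

```java
public class Gain {
    /** Relative speedup of the new sort versus the current JDK sort. */
    static double gain(double jdkTime, double dpqsTime) {
        return (jdkTime - dpqsTime) / jdkTime;
    }

    public static void main(String[] args) {
        // e.g. a run where jdk14 takes 10.0 ms and the new DPQS takes 8.5 ms
        System.out.println(gain(10.0, 8.5)); // 0.15, i.e. a 15% gain
    }
}
```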


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-17 Thread Laurent Bourgès
Hi,

I integrated locally the complete DPQS 19.2 patch from Vladimir
Yaroslavskiy with up-to-date jdk14 (client) repository:
build OK and jtreg passed OK (java/util/Arrays).

Here is the corresponding webrev for review:
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/webrev/

Cheers,
Laurent

On Fri, Jul 12, 2019 at 09:34, Laurent Bourgès
wrote:

> Hi,
> I asked alan to update the webrev, but I can publish it myself...
>
> Will do it asap,
> Stay tuned,
> Laurent
>
> On Tue, Jul 9, 2019 at 22:19, Brent Christian
> wrote:
>
>> Hi,
>>
>> Is the webrev incomplete?  It only contains the changes for
>> DualPivotQuicksort.java, and not for Arrays.java, etc.
>>
>> Thanks,
>> -Brent
>>
>> On 6/18/19 2:37 PM, Vladimir Yaroslavskiy wrote:
>> > Hi team,
>> >
>> > Please review the final optimized version of Dual-Pivot Quicksort
>> (ver.19.1),
>> > see http://cr.openjdk.java.net/~alanb/8226297/0/webrev
>> > (link to Jira task https://bugs.openjdk.java.net/browse/JDK-8226297)
>> >
>> > The new version was published here more than one year ago (on Jan 19,
>> 2018).
>> > Since that time Dual-Pivot Quicksort has been significantly improved
>> > and this work was done in collaboration with Doug Lea and Laurent
>> Bourges.
>> >
>> > You can find summary of all changes below.
>> >
>> > DualPivotQuicksort.java
>> > --
>> > * The 1-st and 5-th candidates are taken as pivots
>> >instead of 2-nd and 4-th
>> > * Pivots are chosen with another step
>> > * Pivot candidates are sorted by combination of
>> >5-element network sorting and insertion sort
>> > * New backwards partitioning was introduced
>> > * Partitioning is related to the golden ratio
>> > * Quicksort tuning parameters were updated
>> > * Thresholds for byte / char / short sorting were updated
>> > * Additional steps for float / double were fully rewritten
>> > * Parallel sorting is based on combination of merge sort
>> >and Dual-Pivot Quicksort
>> > * Pair insertion sort was optimized and became a part
>> >of mixed insertion sort
>> > * Mixed insertion sort was introduced: combination
>> >of simple insertion sort, pin insertion sort and
>> >pair insertion sort
>> > * Simple insertion sort is invoked on tiny array
>> > * Pin insertion sort is started on small array
>> > * Pair insertion sort is continued on remain part
>> > * Merging of runs is invoked on each iteration of Quicksort
>> > * Merging of runs was fully rewritten
>> > * Optimal partitioning of merging is used
>> > * Not more than one copy of part to buffer during merging
>> > * Merging parameters were updated
>> > * Initial capacity of runs is based on size
>> > * Max number of runs is constant
>> > * Optimized version of heap sort was introduced
>> > * Heap sort is used as a guard against quadratic time
>> >(int / long / float / double)
>> > * Parallel merging of runs was also provided
>> > * Parallel sorting, heap sort and merging of runs
>> >are not used in byte / char / short sorting
>> > * Counting sort for byte / char / short were optimized
>> > * Counting sort is used as a guard against quadratic time
>> >(byte / char / short)
>> > * Code style and javadoc improvements
>> >
>> > Sorting.java
>> > --
>> > * New test cases were added
>> > * Test cases are invoked for both: sequential and
>> >parallel sorting
>> > * Check for Object sorting was added
>> > * New tests were added against heap sort
>> > * Added test case against bug in parallel merge
>> >sort on float / double data
>> >
>> > ParallelSorting.java
>> > --
>> > * This class was removed, because Sorting class
>> >covers all cases
>> >
>> > SortingHelper.java
>> > ----------
>> > * The helper class provides access package-private
>> >methods of DualPivotQuicksort class during testing
>> >
>> > Arrays.java
>> > --
>> > * Calls of Dual-Pivot Quicksort were updated
>> > * Parallel sorting of primitives was switched to parallel
>> >Dual-Pivot Quicksort
>> > * Javadoc was updated, double spaces were removed
>> > * Format changes
>> >
>> > ArraysParallelSortHelpers.java
>> > --
>> > * Implementation of parallel sorting
>> >for primitives was removed
>> > * Javadoc was updated
>> >
>> > Thank you,
>> > Vladimir
>> >
>>
>

-- 
-- 
Laurent Bourgès


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-12 Thread Laurent Bourgès
Hi,
I asked Alan to update the webrev, but I can publish it myself...

Will do it ASAP,
Stay tuned,
Laurent

On Tue, Jul 9, 2019 at 22:19, Brent Christian
wrote:

> Hi,
>
> Is the webrev incomplete?  It only contains the changes for
> DualPivotQuicksort.java, and not for Arrays.java, etc.
>
> Thanks,
> -Brent
>
> On 6/18/19 2:37 PM, Vladimir Yaroslavskiy wrote:
> > Hi team,
> >
> > Please review the final optimized version of Dual-Pivot Quicksort
> (ver.19.1),
> > see http://cr.openjdk.java.net/~alanb/8226297/0/webrev
> > (link to Jira task https://bugs.openjdk.java.net/browse/JDK-8226297)
> >
> > The new version was published here more than one year ago (on Jan 19,
> 2018).
> > Since that time Dual-Pivot Quicksort has been significantly improved
> > and this work was done in collaboration with Doug Lea and Laurent
> Bourges.
> >
> > You can find summary of all changes below.
> >
> > DualPivotQuicksort.java
> > --
> > * The 1-st and 5-th candidates are taken as pivots
> >instead of 2-nd and 4-th
> > * Pivots are chosen with another step
> > * Pivot candidates are sorted by combination of
> >5-element network sorting and insertion sort
> > * New backwards partitioning was introduced
> > * Partitioning is related to the golden ratio
> > * Quicksort tuning parameters were updated
> > * Thresholds for byte / char / short sorting were updated
> > * Additional steps for float / double were fully rewritten
> > * Parallel sorting is based on combination of merge sort
> >and Dual-Pivot Quicksort
> > * Pair insertion sort was optimized and became a part
> >of mixed insertion sort
> > * Mixed insertion sort was introduced: combination
> >of simple insertion sort, pin insertion sort and
> >pair insertion sort
> > * Simple insertion sort is invoked on tiny array
> > * Pin insertion sort is started on small array
> > * Pair insertion sort is continued on remain part
> > * Merging of runs is invoked on each iteration of Quicksort
> > * Merging of runs was fully rewritten
> > * Optimal partitioning of merging is used
> > * Not more than one copy of part to buffer during merging
> > * Merging parameters were updated
> > * Initial capacity of runs is based on size
> > * Max number of runs is constant
> > * Optimized version of heap sort was introduced
> > * Heap sort is used as a guard against quadratic time
> >(int / long / float / double)
> > * Parallel merging of runs was also provided
> > * Parallel sorting, heap sort and merging of runs
> >are not used in byte / char / short sorting
> > * Counting sort for byte / char / short were optimized
> > * Counting sort is used as a guard against quadratic time
> >(byte / char / short)
> > * Code style and javadoc improvements
> >
> > Sorting.java
> > --
> > * New test cases were added
> > * Test cases are invoked for both: sequential and
> >parallel sorting
> > * Check for Object sorting was added
> > * New tests were added against heap sort
> > * Added test case against bug in parallel merge
> >sort on float / double data
> >
> > ParallelSorting.java
> > --
> > * This class was removed, because Sorting class
> >covers all cases
> >
> > SortingHelper.java
> > --
> > * The helper class provides access package-private
> >methods of DualPivotQuicksort class during testing
> >
> > Arrays.java
> > --
> > * Calls of Dual-Pivot Quicksort were updated
> > * Parallel sorting of primitives was switched to parallel
> >Dual-Pivot Quicksort
> > * Javadoc was updated, double spaces were removed
> > * Format changes
> >
> > ArraysParallelSortHelpers.java
> > --
> > * Implementation of parallel sorting
> >for primitives was removed
> > * Javadoc was updated
> >
> > Thank you,
> > Vladimir
> >
>


Re: Extending Java Arrays/Collection Sort API

2018-12-05 Thread Laurent Bourgès
John,

Any feedback?
We could discuss this during the next OpenJDK workshop, but I would prefer
to go slowly but surely.

Laurent

On Thu, Nov 29, 2018 at 17:52, Laurent Bourgès
wrote:

> Hi John & Brian,
>
> Thanks for the proposed approach, I agree we are in the design discussion;
> adding such API enhancements will take time, I said 13 to not hurry for 12
> (CSR...).
>
> John, you wrote me a very detailed letter, I will try to be more concise
> and focus on Array.sort() API.
>
> On Thu, Nov 29, 2018 at 02:12, John Rose wrote:
>
>>
>> >
>> > In this renderer, I needed a fast sort algorithm on small almost sorted
>> > arrays, but more important, to deal with 2 arrays: x values and their
>> > corresponding edge indices (pointer like). I derived my MergeSort class
>> to
>> > perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
>> > provide such API.
>>
>> There are two distinct requirements here:  1. Selecting or tweaking a
>> sorting
>> algorithms for particular expected patterns in the data set
>> (almost-sorted).
>> 2. Sorting two arrays in tandem, with one array supplying the sort key,
>> and the other serving as a passively permuted payload ("PPP").
>>
>> Re 1: Another pattern sometimes expected is range-bounded values,
>> which may prompt consideration of bin-sort.  Perhaps unique values.
>> Permutation vectors have both properties.
>>
>
> Yes, the discussion has 2 points:
> - provide more sorting algorithms in the public API
> - provide new sort variants handling indices
>
> For example, Julia has a simple sort API:
> https://docs.julialang.org/en/v0.6.2/stdlib/sort/
> It provides several basic sort functions (isort, qsort, msort) but the
> general sort() accepts optional custom comparison (less) & transformation
> functions.
> Moreover, another sortperm() function that returns a permutation vector of
> indices, it accepts an optional custom comparison (less).
>
>
>
>>
>> Re 2: In some cases, it's not two arrays that need the work.  Sometimes
>> it is one array of keys, plus an index set; this can be reduced to two
>> arrays
>> one of which is an int-array carrying indexes, but in some proposals you
>> see
>> such an int-array appearing only as an output, with an implicit input of
>> the iota-vector for the other array's indexes.
>>
>> More deeply, sometimes the index array is concrete, but the array of keys
>> is virtualized.  Canonical example (for me) is a set of indexes into a
>> data
>> set such as a file.  (These indexes are dense for compression algorithms
>> like B-W, or could be sparse for lexical analysis of bulk text in situ.)
>> To
>> handle this properly, a good old index array is just fine, as long as the
>> sort algorithm *also* accepts an int-comparator.
>>
>> There is no strong need to generalize to three arrays in tandem, or to
>> have separate algorithms for treating additional arrays as containing
>> secondary keys instead of PPP.  This is because given a two-array tandem
>> sort with PPP, plus an int-array sort with int-comparator, you can compose
>> many other shapes at a modest footprint cost by adding two index vector
>> to control the various sorts.
>>
>
> So you propose 2 new variants:
> - sort(int[]a, int[] b, low, high) where b values are permuted
> - sort(int[]a, low, high, Comparator) where b values are permuted
>
> I proposed the 2 array variants: sort( int[] values, int[] indices, low,
> high) as it helps when the user wants both values and indices sorted
> (easier use), but this can be reduced to sort(int[] indices, low, high,
> comparator) as the user can use the permutation vector (indices) to reorder
> values.
>
> I would prefer adding the more general solution sort(int[] indices, low,
> high, comparator) first.
>
> However, having extracted the values is really helpful for performance
> (locality) and simplifies the sorting algorithms but costs more data swaps.
> For example in Marlin, my comparator would be:
> Comparator::compare(i,j) {
> return Integer.compareTo(edges[i].x, edges[j].x);
> }
> However, my edge data are packed in an Unsafe buffer so it is more:
> Comparator::compare(i,j) {
> return Integer.compareTo(Unsafe.getInt(addr0 + i), Unsafe.getInt(addr0
> + j) );
> }
> I wonder if such lookups may be costly (random access + overheads) in
> contrary to having the values given as an extra int[].
>
> If you want, I can prototype later both solutions...
>
>
>>
>> Simplified sketch of generalized tandem array sort:
>>

Re: Extending Java Arrays/Collection Sort API

2018-11-29 Thread Laurent Bourgès
Hi John & Brian,

Thanks for the proposed approach, I agree we are in the design discussion;
adding such API enhancements will take time, I said 13 to not hurry for 12
(CSR...).

John, you wrote me a very detailed letter, I will try to be more concise
and focus on Array.sort() API.

On Thu, 29 Nov 2018 at 02:12, John Rose wrote:

>
> >
> > In this renderer, I needed a fast sort algorithm on small almost sorted
> > arrays, but more important, to deal with 2 arrays: x values and their
> > corresponding edge indices (pointer like). I derived my MergeSort class
> to
> > perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
> > provide such API.
>
> There are two distinct requirements here:  1. Selecting or tweaking a
> sorting
> algorithms for particular expected patterns in the data set
> (almost-sorted).
> 2. Sorting two arrays in tandem, with one array supplying the sort key,
> and the other serving as a passively permuted payload ("PPP").
>
> Re 1: Another pattern sometimes expected is range-bounded values,
> which may prompt consideration of bin-sort.  Perhaps unique values.
> Permutation vectors have both properties.
>

Yes, the discussion has 2 points:
- provide more sorting algorithms in the public API
- provide new sort variants handling indices

For example, Julia has a simple sort API:
https://docs.julialang.org/en/v0.6.2/stdlib/sort/
It provides several basic sort functions (isort, qsort, msort) but the
general sort() accepts optional custom comparison (less) & transformation
functions.
Moreover, another sortperm() function that returns a permutation vector of
indices, it accepts an optional custom comparison (less).



>
> Re 2: In some cases, it's not two arrays that need the work.  Sometimes
> it is one array of keys, plus an index set; this can be reduced to two
> arrays
> one of which is an int-array carrying indexes, but in some proposals you
> see
> such an int-array appearing only as an output, with an implicit input of
> the iota-vector for the other array's indexes.
>
> More deeply, sometimes the index array is concrete, but the array of keys
> is virtualized.  Canonical example (for me) is a set of indexes into a data
> set such as a file.  (These indexes are dense for compression algorithms
> like B-W, or could be sparse for lexical analysis of bulk text in situ.)
> To
> handle this properly, a good old index array is just fine, as long as the
> sort algorithm *also* accepts an int-comparator.
>
> There is no strong need to generalize to three arrays in tandem, or to
> have separate algorithms for treating additional arrays as containing
> secondary keys instead of PPP.  This is because given a two-array tandem
> sort with PPP, plus an int-array sort with int-comparator, you can compose
> many other shapes at a modest footprint cost by adding two index vectors
> to control the various sorts.
>

So you propose 2 new variants:
- sort(int[] a, int[] b, low, high) where b values are permuted
- sort(int[] a, low, high, Comparator) where b values are permuted

I proposed the 2-array variant: sort(int[] values, int[] indices, low,
high) as it helps when the user wants both values and indices sorted
(easier use), but this can be reduced to sort(int[] indices, low, high,
comparator) as the user can use the permutation vector (indices) to reorder
values.

I would prefer adding the more general solution sort(int[] indices, low,
high, comparator) first.

However, having extracted the values is really helpful for performance
(locality) and simplifies the sorting algorithms but costs more data swaps.
For example in Marlin, my comparator would be:
Comparator::compare(i,j) {
return Integer.compareTo(edges[i].x, edges[j].x);
}
However, my edge data are packed in an Unsafe buffer so it is more:
Comparator::compare(i,j) {
return Integer.compareTo(Unsafe.getInt(addr0 + i), Unsafe.getInt(addr0
+ j) );
}
I wonder if such lookups may be costly (random access + overheads) in
contrary to having the values given as an extra int[].

If you want, I can prototype later both solutions...


>
> Simplified sketch of generalized tandem array sort:
>
> 1. Let L be the common length of the N arrays, A[N][L].
> 2. Create a new int index array P initialized to iota(L).
> 3. Create an int-comparator C over 0 <= i,j < L that compares A[*][i] with
> A[*][j].
> 4. Sort P using C.  Now P is a permutation vector to be applied to each
> A[*].
> 5. Create a second index array Q such that Q[P[i]] = i for all i.  Q is
> P's inverse.
> 6. Tandem-sort each A[*] (as a PPP) using a fresh copy of Q for each array.
>
> There is more than one way to do this; the above sketch is intended to be
> straightforward.  The temp. space required is two length-L int arrays.
>
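
A runnable sketch of the six steps above, with one simplification: since the JDK
has no int[]-with-comparator primitive yet, the index vector is boxed for the
sort, and P is applied out-of-place to each payload array instead of
tandem-sorting fresh copies of the inverse Q in place. Class and method names
are illustrative, not proposed API:

```java
import java.util.Arrays;
import java.util.Comparator;

public class TandemSortSketch {

    // Steps 1-4: build the permutation vector P that sorts the key array,
    // boxing the iota vector because the JDK has no int[]-with-comparator sort.
    static int[] sortPermutation(int[] keys) {
        Integer[] boxed = new Integer[keys.length];
        for (int i = 0; i < boxed.length; i++) boxed[i] = i;   // iota vector
        Arrays.sort(boxed, Comparator.comparingInt((Integer i) -> keys[i]));
        int[] perm = new int[keys.length];
        for (int i = 0; i < perm.length; i++) perm[i] = boxed[i];
        return perm;
    }

    // Steps 5-6, simplified: apply P out-of-place to each payload array (PPP)
    // instead of tandem-sorting fresh copies of the inverse Q in place.
    static int[] applyPermutation(int[] a, int[] perm) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[perm[i]];
        return out;
    }

    public static void main(String[] args) {
        int[] x    = {30, 10, 20};   // keys
        int[] edge = {7, 8, 9};      // payload permuted in tandem
        int[] p = sortPermutation(x);
        System.out.println(Arrays.toString(applyPermutation(x, p)));    // [10, 20, 30]
        System.out.println(Arrays.toString(applyPermutation(edge, p))); // [8, 9, 7]
    }
}
```

The temporary footprint matches the sketch: one int permutation array per sort,
plus one output buffer per payload array (the boxing overhead is exactly what
the proposed int-comparator primitive would remove).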

I like your proposal, but I dislike object allocations so I would prefer
providing result arrays as arguments if possible.
In Marlin renderer, allocations are always done in its RendererContext +
XXXArrayCache (soft ref) to ensure no GC overhead.


>
> My point is that the "missing 

Re: Extending Java Arrays/Collection Sort API

2018-11-28 Thread Laurent Bourgès
Hi John & Brian,

Thank you for your feedback finally, we almost agree Java Sort API could be
improved, in jdk13 possibly.

>
> Doug is right that there is an enormous potential list of sort methods,
> and we can't include them all.  But I do miss the ability to extract
> indexes instead of sorting the array itself.
>
If you read my first email from 11.14, I asked whether you would agree to
expose a few other sort algorithms as a public sort API, useful in special
cases (isort, radix, merge sorts) when specific types of datasets need to
be managed.
Never mind: third-party libraries can provide advanced / other algorithms.

Vladimir Yaroslavskiy made huge progress on his DPQS 18.11, and I hope it
will soon be the overall winner on almost all data distributions at all
scales, providing a better Arrays.sort() implementation.


> Or, slightly more generally, sorting an int[] or long[] array with a
> comparator.
> Sometimes you don't want an object per sorted entity; you want some kind
> of oracle function that can compare two indexes into a non-concrete
> (virtualized) set of values; the sort operation would output the reordered
> ints.
>
> Example:  Sorting indexes into a large text file according to some
> algorithm
> such as Burrows-Wheeler.  You don't want to make all the substrings and
> put them into a String[] array, and you don't even want to make
> CharSequence
> objects that view into the text file.  You just want indexes into the text
> file to
> be sorted.
>
> IMO, the very simplest answer with significantly better semantics is:
>
>   void sort(int[] a, int fromIndex, int toIndex, IntBinaryOperator cmp);
>
> And maybe add parallelSort, and use a new IntComparator instead of
> repurposing IntBinaryOperator.
>
> Extracting the permutation vector of indexes from an array sorting
> operation
> would then look like this:
>
> public static <T> int[] sortToIndexes(T[] a, Comparator<? super T> c) {
> int[] perm = new int[a.length];  // starts as an iota vector
> for (int i = 0; i < a.length; i++)  perm[i] = i;
> sort(perm, 0, a.length, (i, j) -> c.compare(a[i], a[j]));  // NEW
> PRIMITIVE
> return perm;
> }
>
> But there are other use cases for comparator-based sorting of indexes,
> which start not with an iota vector, but with an array of index-like values
> which may index into something other than an array (like a text file,
> which is why 'long[] a' might be useful too).
>
> In Valhalla many such use cases can be covered by using a flat array of
> values which encapsulate the integer (or long) indexes, and sorting that.
> The array of wrapped primitives will look like Object[] and so the existing
> API point with a comparator would allow the flat array of ints (dressed as
> value Objects) to be sorted according to any ad hoc ordering rule,
> including
> treating them as indexes.
>
> So maybe the answer is "wait for Valhalla".  Still, the lack of a
> comparator
> on primitive arrays is an unnecessary limitation, which excludes
> index-based
> computations from the portfolio of our sort methods.
>

1. I have been waiting for Value types for several years, as the Marlin
renderer already uses arrays of structs in Java, i.e. edge data packed in
int[] and unsafe buffers to avoid bounds checks (like C). It is a good
candidate for evaluating this new feature.
I can help here, but not alone.

2. John, your proposal matches mine:
"alternatively, it could be expressed as Comparator
Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
a using the given comparator. For example, a[] could be edge indices sorted
according to their edge x position...
This comparator is specific as it compares anything (external storage) at
given indices i,j."

Let's go for 13...

If I may insist a bit on the 2-array variant, Marlin needs the x array
sorted for future traversal...
I do not know which is faster: sorting 2 arrays (indices swapped according
to x array order: twice the swaps) or using a comparator (lookups to compare
(x[i], x[j]) + x array reconstruction).
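
For reference, the "2x swaps" variant amounts to a sort that moves both
arrays in lockstep. A minimal insertion-sort sketch (illustrative names;
assuming a[] carries the x keys and b[] the edge indices):

```java
public class DualInsertionSort {

    // Sort a[low..high) ascending, permuting b identically (double swaps).
    static void sort(int[] a, int[] b, int low, int high) {
        for (int i = low + 1; i < high; i++) {
            int ka = a[i], kb = b[i];          // element being inserted
            int j = i - 1;
            while (j >= low && a[j] > ka) {    // shift both arrays right
                a[j + 1] = a[j];
                b[j + 1] = b[j];
                j--;
            }
            a[j + 1] = ka;                     // place key and payload
            b[j + 1] = kb;
        }
    }

    public static void main(String[] args) {
        int[] x     = {30, 10, 20};
        int[] edges = {0, 1, 2};
        sort(x, edges, 0, x.length);
        System.out.println(java.util.Arrays.toString(x));     // [10, 20, 30]
        System.out.println(java.util.Arrays.toString(edges)); // [1, 2, 0]
    }
}
```

Every data move is doubled, but there are no per-comparison lookups into
external storage and no reconstruction pass afterwards.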

Finally how can we manage such improvement in 13? CSR / RFE ?

John or Brian, could you lead that specification request (CSR) and formal
process ?
 I can propose implementations on my side, I already have a working DPQS
18.11 with 2 arrays on Marlin's github.

Regards,
Laurent


Re: Extending Java Arrays/Collection Sort API

2018-11-23 Thread Laurent Bourgès
Hi,

I am happy to announce that I succeeded in writing my own BentleyBasher
working like a swiss clock:
- auto-tune benchmark passes & hot loop to obtain high accuracy on
measurements ~ 2% (guaranteed), with proper variance estimators
- test > 10 sorters with small, large & very large int[] arrays including
DPQS11, 18.2, 18.11 ... including my variants sorting 2 arrays altogether.

I observed DPQS 18.11 working very well...
we are currently fixing a performance issue on small almost-sorted arrays (<
1000).

>
My first results & all code (benchmark + all sorters) are available in this
github repo:
- Ea results with stats:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
- source code:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/src/edu/sorting

It is my Sort Race 2018... in progress.

PS: plots & JMH cross-validation remain to be implemented.

Cheers,
Laurent


Re: Extending Java Arrays/Collection Sort API

2018-11-20 Thread Laurent Bourgès
Hi Doug,
That's a pleasure to read you.

On Tue, 20 Nov 2018 at 13:17, Doug Lea wrote:

> On 11/20/18 2:50 AM, Laurent Bourgès wrote:
> > Hi,
> > Any feedback on improving Java Sort API ?
>
> I think most considerations await progress on value types. I posted some
> ideas and links to code last summer on Valhalla list:
> http://mail.openjdk.java.net/pipermail/valhalla-dev/2018-August/thread.html


I already followed this thread and have been looking forward to Value types
for a long time.
I looked at your CVSorter class and it looks promising, I can try adapting
it to my needs and compare its own performance in my Sort Race 2018.

Even if Value types are coming, the Arrays/Collections sort() API will
remain and must still evolve to provide the best (overall) sort.
As said previously, adaptive algorithms already embed several sort
algorithms (insertion sort, merge sort, dual-pivot quicksort, radix sort),
so I proposed to wrap them as a public API to let the developer select the
best sort according to their needs or data sets.


> They didn't include APIs for sorting based on indexes, which I agree
> should be considered. However, until value types arrive, it would be
> hard to justify adding variants for each primitive type.
>

Thanks, you agree it is useful. For now I only focused on sort(int[] a,
int[] indices, low, high) for my own needs.
As Java arrays are still only integer based, I propose to only support
int[] indices.
I already implemented such a method in my DualPivotQuicksort201802Ext
(which uses pre-allocation):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/edu/sorting/DualPivotQuicksort201802Ext.java

static void sort(int[] a, int[] b, int low, int high, int[] auxA, int[] auxB,
int[] run) { ... }
note: int[] auxA, int[] auxB, int[] run are given for pre-allocation
purposes


>
> Almost every sorting algorithm is best for some data types and inputs.
> And there are a lot of algorithms. Among the hard choices for JDK is to
> pick a small a only a few (stable vs nonstable, parallelizable, etc),
> that together provide good choices for most usages. Even then, some
> usages (possibly including the one in your initial post) might be better
> off creating a custom solution. (As is true for all JDK
> Collections/Array-related components.)
>

Hence my proposal: Arrays/Collection sort() must be the overall winner (all
data distributions, all array lengths) ~ best compromise.

For particular needs, the JDK or a sort library could provide specific sort
implementations.
For that purpose, I have collected 15 (working) sort implementations:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/edu/sorting/perf/IntSorter.java

Best regards,
Laurent


> > What about an Arrays API extension:
> > - Arrays.sort(int[] a, int[] b, low, high) or
> > Indices int[] Arrays.sortIndices(int[] a, low high), a[] left
> untouched
> > - same with preallocated buffers & runs arrays
> > - alternatively, it could be expressed as Comparator
> > Arrays.sort(int[] a, low, high, Comparator cmp) that sorts
> > array a using the given comparator. For example, a[] could be edge
> > indices sorted according to their edge x position...
> > This comparator is specific as it compares anything (external
> > storage) at given indices i,j.
> >
>
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-20 Thread Laurent Bourgès
Hi Vladimir,

One short comment, see below:


On Thu, 15 Nov 2018 at 15:39, Vladimir Yaroslavskiy wrote:

> Hello, Tagir!
>
> I compared Radix sort with Dual-Pivot Quicksort and see
> that Radix sort is faster than DPQ (on 2M elements) on random data in
> twice times,
> but it works slower on repeated elements (in 3-7 times). For highly
> structured arrays merging sort is invoked in both cases. Other
> data types - almost the same.
>

Your test is not really fair: 2M ints ~ 8MB, so it is larger than the
typical CPU L3 cache size.
RadixSort does random accesses and is known to be faster when arrays
fit in L2/L3 caches.
If larger, then use radix on sub-parts + merge!
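
For reference, the classic LSD radix sort under discussion (byte-wise
counting passes; the scattered writes into the aux buffer are exactly the
random accesses that make it cache-sensitive) can be sketched as follows —
a generic textbook version, not Tagir's implementation:

```java
public class LsdRadixSort {

    // LSD radix sort for int[]: four counting-sort passes, one per byte.
    // The XOR with 0x80 on the top byte handles the sign bit so that
    // negative values order before non-negative ones.
    static void sort(int[] a) {
        int[] aux = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) {
                int digit = (v >>> shift) & 0xFF;
                if (shift == 24) digit ^= 0x80;    // flip sign bit on last pass
                count[digit + 1]++;
            }
            for (int i = 0; i < 256; i++)          // prefix sums -> offsets
                count[i + 1] += count[i];
            for (int v : a) {
                int digit = (v >>> shift) & 0xFF;
                if (shift == 24) digit ^= 0x80;
                aux[count[digit]++] = v;           // scattered write: cache-sensitive
            }
            System.arraycopy(aux, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] a = {5, -3, 42, -3, 0};
        sort(a);
        System.out.println(java.util.Arrays.toString(a)); // [-3, -3, 0, 5, 42]
    }
}
```

The per-element work is constant regardless of the input distribution, which
is why its performance is predictable on random data yet it gains nothing on
repeated or presorted inputs where adaptive DPQS shines.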


> Do we actually need to use RadixSort if it is not faster on all inputs?
>

I will do my own performance tests on int[] arrays with sizes in range [50
- 10] ...

Vladimir, I think Tagir proposed to tune his RadixSort to obtain a hybrid
/ adaptive radix sort, as you did in your DPQS.
I can help here.

Cheers,
Laurent


What do you think?
>
> Regards,
> Vladimir
>
> Wednesday, 14 November 2018, 19:17 +03:00 from Tagir Valeev:
>
> Hello, Laurent, Vladimir!
>
> I created a pull request containing my RadixSort implementation:
> https://github.com/bourgesl/nearly-optimal-mergesort-code/pull/1
> On my machine the results produced by Mergesorts.java are like this:
>
> Runs with individual timing (skips first 10 runs):
> adjusted reps: 110 + inner loop: 1
> avg-ms=102.6177(+/- 7 %), algo=PeekSort+iscutoff=24+onlyIncRuns=false,
> n=100 (55000110) (n=100, µ=102.6177, σ=7.4065714)
> avg-ms=95.62607(+/- 4 %),
> algo=TopDownMergesort+iscutoff=24+checkSorted=true, n=100
> (55000110) (n=100, µ=95.62607, σ=3.5302947)
> avg-ms=118.73089(+/- 4 %),
> algo=BottomUpMergesort+minRunLen=24+checkSorted=true, n=100
> (55000110) (n=100, µ=118.73089, σ=4.464908)
> avg-ms=108.36175(+/- 4 %), algo=MarlinSort, n=100 (55000110)
> (n=100, µ=108.36175, σ=4.5998554)
> avg-ms=99.68292(+/- 4 %), algo=MarlinMergeSort, n=100
> (55000110) (n=100, µ=99.68292, σ=3.9944465)
> avg-ms=75.43999(+/- 3 %), algo=DualPivotQuicksort2018, n=100
> (55000110) (n=100, µ=75.43999, σ=2.6127808)
> avg-ms=83.80406(+/- 6 %), algo=DualPivotQuicksort2018Ext, n=100
>  (55000110) (n=100, µ=83.80406, σ=4.734225)
> avg-ms=18.886326(+/- 4 %), algo=RadixSort, n=100 (55000110)
> (n=100, µ=18.886326, σ=0.75667334)
>
> As you can see RadixSort is much faster than anything. Well, probably
> there are some input patterns where it may lose (though it's
> performance is much more predictable and much less depend on input
> data than in DPQS), but I strongly believe that concentrating efforts
> on testing corner cases performance and integrating RadixSort into JDK
> would yield much better performance than DPQS, at least for int[]
> arrays. Also the implementation is much simpler.
>
> With best regards,
> Tagir Valeev.
> On Mon, Nov 12, 2018 at 5:09 PM Laurent Bourgès
>  wrote:
> >
> > Hi,
> >
> > Do you know if someone has written a complete JMH benchmark suite
> dedicated
> > to Arrays.sort() ?
> > with varying array size (trivial) but also testing lots of data
> > distributions: (see Vladimir's tests) and possibly all variants (int,
> long,
> > double, Object[] )
> >
> > It could be part of the standard OpenJDK JMH test suite...
> >
> > For now, I forked the nearly-optimal-mergesort repository on github:
> >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
> >
> > Cheers,
> > Laurent
> >
> > On Sat, 10 Nov 2018 at 12:49, Laurent Bourgès wrote:
> >
> > > Dear Vladimir & other Java sort experts,
> > >
> > > I made the port of the DPQS 2018.2 code last night to support a
> secondary
> > > array to be sorted and use preallocation (aux/run for merge sort):
> > >
> > >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
> > >
> > > I succeeded in using this variant in Marlin renderer (dev) and it is
> > > promising. I figured out a rendering bug in 1 test with 65535 random
> > > segments, certainly due to array swaps in mergesort (my side)...
> > >
> > > I will look again at my changes to check if I miss some permutation (a
> //
> > > b) or made any mistake on their indices... tricky.
> > >
> > > PS: Please share your updated webrev when possible to rebase my
> changes.
> > >
> > > Regards,
> > > Laurent
> > >
> > > Le ven. 9 nov. 2018 à 10:08, Laurent Bourgès <
>

Re: Extending Java Arrays/Collection Sort API

2018-11-19 Thread Laurent Bourgès
Hi,
Any feedback on improving Java Sort API ?

PS: I improved the Array benchmark accuracy a lot (confidence interval ~
5%). Here are EA results:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/results/basher-results-partial.out

Does anybody want to help me on this topic ?
Do you have a JMH test suite dedicated to array sorting ?

Cheers,
Laurent

On Wed, 14 Nov 2018 at 09:01, Laurent Bourgès wrote:

> Hi,
>
> I am a scientist and the author of the Marlin renderer, integrated in
> OpenJDK.
>
> In this renderer, I needed a fast sort algorithm on small almost sorted
> arrays, but more important, to deal with 2 arrays: x values and their
> corresponding edge indices (pointer like). I derived my MergeSort class to
> perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
> provide such API.
>
> 1. I started discussing about extending the Java Sort API with Vladimir
> Yaroslavskiy, the author of the famous Java DualPivotQuickSort, as he is
> currently improving it.
>
> His coming DPQS 2018 class provides lot of sort algorithms, I propose to
> make them public API through a facade method in Arrays class:
> insertionSort(), heapSort(), mergeSort(), quickSort()... Of course, his
> current code is dedicated to DPQS, but his algorithm implementation could
> be generalized to propose the Best implementations as a public Java API
> (stable & unstable sorts)
>
> For example in Marlin, I need trivial insertionSort() in the Helpers
> class, my MergeSort could be replaced by another best implementation.
> Let's not reinvent the wheel...
>
> 2. New Sort API to deal with indices...
>
> For now, there is no way to obtain the indices sorted on another value,
> see my introduction: edge indices sorted by their x positions.
> In data science, this is very useful and general.
>
> What about an Arrays API extension:
> - Arrays.sort(int[] a, int[] b, low, high) or
> Indices int[] Arrays.sortIndices(int[] a, low high), a[] left untouched
> - same with preallocated buffers & runs arrays
> - alternatively, it could be expressed as Comparator
> Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
> a using the given comparator. For example, a[] could be edge indices sorted
> according to their edge x position...
> This comparator is specific as it compares anything (external storage) at
> given indices i,j.
>
> Finally Marlin renderer would benefit from Value types (array of structs),
> as edge data are currently packed in int[] arrays, but this new Sort API
> seems to me general & needed enough to be discussed here.
>
> Best regards,
> Laurent
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-15 Thread Laurent Bourgès
Hi all Sorting experts,

I want to let you know I am building a Sorting benchmark suite myself based
on contributions of Sebastian Wild (forked github repo) & Vladimir on
github (MIT license):
https://github.com/bourgesl/nearly-optimal-mergesort-code

I hope to exploit JMH in a short future to obtain a reliable test suite on
lots of dataset types & lots of Sorting implementation & array sizes.

Once stabilized, its results can be exploited to run the Java Sort Race
2018.12...

If some C.S. researchers are interested, data papers could rely on this
tool to compare (fairly) algorithms.

PS: Early results (unknown confidence yet):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/results/basher-results-partial.out

Best regards,
FOSS rocks !


Laurent

On Thu, 15 Nov 2018 at 15:39, Vladimir Yaroslavskiy wrote:

> Hello, Tagir!
>
> I compared Radix sort with Dual-Pivot Quicksort and see
> that Radix sort is faster than DPQ (on 2M elements) on random data in
> twice times,
> but it works slower on repeated elements (in 3-7 times). For highly
> structured arrays merging sort is invoked in both cases. Other
> data types - almost the same.
>
> Do we actually need to use RadixSort if it is not faster on all inputs?
> What do you think?
>
> Regards,
> Vladimir
>
> Wednesday, 14 November 2018, 19:17 +03:00 from Tagir Valeev:
>
> Hello, Laurent, Vladimir!
>
> I created a pull request containing my RadixSort implementation:
> https://github.com/bourgesl/nearly-optimal-mergesort-code/pull/1
> On my machine the results produced by Mergesorts.java are like this:
>
> Runs with individual timing (skips first 10 runs):
> adjusted reps: 110 + inner loop: 1
> avg-ms=102.6177(+/- 7 %), algo=PeekSort+iscutoff=24+onlyIncRuns=false,
> n=100 (55000110) (n=100, µ=102.6177, σ=7.4065714)
> avg-ms=95.62607(+/- 4 %),
> algo=TopDownMergesort+iscutoff=24+checkSorted=true, n=100
> (55000110) (n=100, µ=95.62607, σ=3.5302947)
> avg-ms=118.73089(+/- 4 %),
> algo=BottomUpMergesort+minRunLen=24+checkSorted=true, n=100
> (55000110) (n=100, µ=118.73089, σ=4.464908)
> avg-ms=108.36175(+/- 4 %), algo=MarlinSort, n=100 (55000110)
> (n=100, µ=108.36175, σ=4.5998554)
> avg-ms=99.68292(+/- 4 %), algo=MarlinMergeSort, n=100
> (55000110) (n=100, µ=99.68292, σ=3.9944465)
> avg-ms=75.43999(+/- 3 %), algo=DualPivotQuicksort2018, n=100
> (55000110) (n=100, µ=75.43999, σ=2.6127808)
> avg-ms=83.80406(+/- 6 %), algo=DualPivotQuicksort2018Ext, n=100
>  (55000110) (n=100, µ=83.80406, σ=4.734225)
> avg-ms=18.886326(+/- 4 %), algo=RadixSort, n=100 (55000110)
> (n=100, µ=18.886326, σ=0.75667334)
>
> As you can see RadixSort is much faster than anything. Well, probably
> there are some input patterns where it may lose (though it's
> performance is much more predictable and much less depend on input
> data than in DPQS), but I strongly believe that concentrating efforts
> on testing corner cases performance and integrating RadixSort into JDK
> would yield much better performance than DPQS, at least for int[]
> arrays. Also the implementation is much simpler.
>
> With best regards,
> Tagir Valeev.
> On Mon, Nov 12, 2018 at 5:09 PM Laurent Bourgès
>  wrote:
> >
> > Hi,
> >
> > Do you know if someone has written a complete JMH benchmark suite
> dedicated
> > to Arrays.sort() ?
> > with varying array size (trivial) but also testing lots of data
> > distributions: (see Vladimir's tests) and possibly all variants (int,
> long,
> > double, Object[] )
> >
> > It could be part of the standard OpenJDK JMH test suite...
> >
> > For now, I forked the nearly-optimal-mergesort repository on github:
> >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
> >
> > Cheers,
> > Laurent
> >
> > On Sat, 10 Nov 2018 at 12:49, Laurent Bourgès wrote:
> >
> > > Dear Vladimir & other Java sort experts,
> > >
> > > I made the port of the DPQS 2018.2 code last night to support a
> secondary
> > > array to be sorted and use preallocation (aux/run for merge sort):
> > >
> > >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
> > >
> > > I succeeded in using this variant in Marlin renderer (dev) and it is
> > > promising. I figured out a rendering bug in 1 test with 65535 random
> > > segments, certainly due to array swaps in mergesort (my side)...
> > >
> > > I will look again at my changes to check if I miss some permutation (a
> //
> > > b) or made any mistake on their indices... tricky.
> >

Re:

2018-11-14 Thread Laurent Bourgès
Hi Vladimir,


>
>>
>> The test suite was described in this paper
>>
>> Jon L. Bentley, M. Douglas McILroy
>> “Engineering a Sort Function”, 1993
>>
>> I use Java version (a little bit extended), see attachment.
>> What you need is to specified sorting classes in IntSorter.java
>> and run BentleyBasher.
>>
>
> Thanks, what is the license of this code ? Files do not have any header...
>
> If you agree, I would like to host such test code on github, as an
> opensource repository.
>

What is your decision ?
Can I put such precious test code on github (MIT license) in my fork of
Sebastian Wild's nearly-optimal-mergesort?

https://github.com/bourgesl/nearly-optimal-mergesort-code/

Cheers,
Laurent


Extending Java Arrays/Collection Sort API

2018-11-14 Thread Laurent Bourgès
Hi,

I am a scientist and the author of the Marlin renderer, integrated in
OpenJDK.

In this renderer, I needed a fast sort algorithm on small almost-sorted
arrays, but more importantly, to deal with 2 arrays: x values and their
corresponding edge indices (pointer-like). I derived my MergeSort class to
perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
provide such an API.

1. I started discussing about extending the Java Sort API with Vladimir
Yaroslavskiy, the author of the famous Java DualPivotQuickSort, as he is
currently improving it.

His coming DPQS 2018 class provides a lot of sort algorithms; I propose to
make them public API through a facade method in Arrays class:
insertionSort(), heapSort(), mergeSort(), quickSort()... Of course, his
current code is dedicated to DPQS, but his algorithm implementation could
be generalized to propose the Best implementations as a public Java API
(stable & unstable sorts)

For example, in Marlin I need a trivial insertionSort() in the Helpers
class, and my MergeSort could be replaced by a better implementation.
Let's not reinvent the wheel...

2. New Sort API to deal with indices...

For now, there is no way to obtain the indices sorted on another value,
see my introduction: edge indices sorted by their x positions.
In data science, this is very useful and general.

What about an Arrays API extension:
- Arrays.sort(int[] a, int[] b, low, high) or
Indices int[] Arrays.sortIndices(int[] a, low high), a[] left untouched
- same with preallocated buffers & runs arrays
- alternatively, it could be expressed as Comparator
Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
a using the given comparator. For example, a[] could be edge indices sorted
according to their edge x position...
This comparator is specific as it compares anything (external storage) at
given indices i,j.

Finally Marlin renderer would benefit from Value types (array of structs),
as edge data are currently packed in int[] arrays, but this new Sort API
seems to me general & needed enough to be discussed here.

Best regards,
Laurent


Re:

2018-11-12 Thread Laurent Bourgès
Hi Vladimir,

Excellent, I will try your test code asap.

Le lun. 12 nov. 2018 à 12:25, Vladimir Yaroslavskiy  a
écrit :

> Hi Laurent,
>
> The test suite was described in this paper
>
> Jon L. Bentley, M. Douglas McILroy
> “Engineering a Sort Function”, 1993
>
> I use Java version (a little bit extended), see attachment.
> What you need is to specified sorting classes in IntSorter.java
> and run BentleyBasher.
>

Thanks, what is the license of this code ? Files do not have any header...

If you agree, I would like to host such test code on github, as an
opensource repository.

>
> Please let me know if you have questions/comments/improvements.
>
Sure, I will do,
Thanks a lot.

Laurent

> Понедельник, 12 ноября 2018, 14:01 +03:00 от Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Dear Vladimir,
>
> No information about JMH benchmarking.
>
>
> I like it as Alexey made the best framework to benchmark short operations,
> like sorting small arrays.
>
> For example, 1000 ints ~ 0.05ms on my i7 laptop at fixed freq = 2ghz.
>
>
> I use Bentley's test suite to compare algorithms.
>
>
> Is it publicly available ? If yes, where ?
>
> Cheers,
> Laurent
>
>
>
> Понедельник, 12 ноября 2018, 13:06 +03:00 от Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> Do you know if someone has written a complete JMH benchmark suite
> dedicated to Arrays.sort() ?
> with varying array size (trivial) but also testing lots of data
> distributions: (see Vladimir's tests) and possibly all variants (int, long,
> double, Object[] )
>
> It could be part of the standard OpenJDK JMH test suite...
>
> For now, I forked the nearly-optimal-mergesort repository on github:
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
>
> Cheers,
> Laurent
>
>
>
> --
> Vladimir Yaroslavskiy
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-12 Thread Laurent Bourgès
Dear Vladimir,

No information about JMH benchmarking.
>

I like it as Alexey made the best framework to benchmark short operations,
like sorting small arrays.

For example, 1000 ints ~ 0.05ms on my i7 laptop at fixed freq = 2ghz.

I use Bentley's test suite to compare algorithms.
>

Is it publicly available ? If yes, where ?

Cheers,
Laurent


> Понедельник, 12 ноября 2018, 13:06 +03:00 от Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> Do you know if someone has written a complete JMH benchmark suite
> dedicated to Arrays.sort() ?
> with varying array size (trivial) but also testing lots of data
> distributions: (see Vladimir's tests) and possibly all variants (int, long,
> double, Object[] )
>
> It could be part of the standard OpenJDK JMH test suite...
>
> For now, I forked the nearly-optimal-mergesort repository on github:
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
>
> Cheers,
> Laurent
>
>


Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-12 Thread Laurent Bourgès
Hi,

Do you know if someone has written a complete JMH benchmark suite dedicated
to Arrays.sort() ?
with varying array size (trivial) but also testing lots of data
distributions: (see Vladimir's tests) and possibly all variants (int, long,
double, Object[] )

It could be part of the standard OpenJDK JMH test suite...

For now, I forked the nearly-optimal-mergesort repository on github:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results

Cheers,
Laurent

Le sam. 10 nov. 2018 à 12:49, Laurent Bourgès  a
écrit :

> Dear Vladimir & other Java sort experts,
>
> I made the port of the DPQS 2018.2 code last night to support a secondary
> array to be sorted and use preallocation (aux/run for merge sort):
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
>
> I succeeded in using this variant in the Marlin renderer (dev) and it is
> promising. I figured out a rendering bug in 1 test with 65535 random
> segments, certainly due to array swaps in the merge sort (my side)...
>
> I will look again at my changes to check if I missed some permutation (a //
> b) or made any mistake on their indices... tricky.
>
> PS: Please share your updated webrev when possible to rebase my changes.
>
> Regards,
> Laurent
>
> Le ven. 9 nov. 2018 à 10:08, Laurent Bourgès 
> a écrit :
>
>> Hi Vladimir,
>>
>> Thank you for your attention, you are the Sort Master.
>>
>>
>> Le ven. 9 nov. 2018 à 09:02, Vladimir Yaroslavskiy 
>> a écrit :
>>
>>> Hi Laurent,
>>>
>>> The new version is still under review, there were a lot of suggestions
>>> and ideas from Doug Lea.
>>> I needed time to apply and check them. I'm going to send final version
>>> in few days.
>>>
>>
>> Excellent news, so it will ship in jdk12...
>>
>>>
>>> As to new method sort(a[], low, high, indices): do you mean this method
>>> in Arrays class?
>>> Are you going to extend Arrays API?
>>>
>>
>> I benchmarked my MergeSort and adding extra array implies many more
>> moves: thresholds should be adjusted and my sort is sometimes better as it
>> makes half moves (a <-> buffer swaps), so this complementary sort should
>> have its own tuned implementation (tests & benchmarks).
>>
>> I coded my sort as such need did not match any Arrays.sort() methods.
>> Having it publicly in jdk.base would be great for any other data handling
>> algorithm (science, AI ?)
>>
>> If you agree it is useful, I could try proposing an Arrays/Collection API
>> extension like :
>> sort([], low, high, int[]indices)
>>
>>
>>> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
>>> class (version is under review)
>>> has extra parameter - parallel context (Sorter sorter) which has
>>> required workbase[].
>>> If we run sorting sequentially, sorter parameter is null (no any
>>> objects) and temporary buffer
>>> will be created inside in tryMerge method if we go ahead with merging.
>>>
>>
>> I will have a look. For Marlin, parallel sort is not possible as
>> rendering shapes can be parallelized in the application (geoserver map
>> rendering).
>>
>>
>>> I will send new sources in few days and add you in cc, so you can look
>>> at it
>>> and find if new methods are acceptable for you. Then we can discuss
>>> required changes (if any).
>>>
>>
>> Perfect, I started adapting your code and will share soon the link to my
>> github repo.
>>
>>
>>> Thank you,
>>> Vladimir
>>>
>>
>> Thank you very much for your first thoughts,
>> Laurent
>>
>>
>>> Пятница, 9 ноября 2018, 10:44 +03:00 от Laurent Bourgès <
>>> bourges.laur...@gmail.com>:
>>>
>>> Hi,
>>>
>>> I am currently testing many sort algorithms to improve the Marlin
>>> renderer (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
>>> improving for OpenJDK12 .
>>>
>>> I created my MergeSort (top down, check for sorted parts, array / buffer
>>> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
>>> crossing + edge indices.
>>>  It is critical as edge crossings are sorted at every scanline, data are
>>> almost sorted/reversed, not really random.
>>>
>>> 3 questions:
>>> - why is this patch dormant ? I checked in openjdk12 repo and your
>>> changes were not integrated.
>>>

Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-10 Thread Laurent Bourgès
Dear Vladimir & other Java sort experts,

I made the port of the DPQS 2018.2 code last night to support a secondary
array to be sorted and use preallocation (aux/run for merge sort):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java

I succeeded in using this variant in the Marlin renderer (dev) and it is
promising. I figured out a rendering bug in 1 test with 65535 random
segments, certainly due to array swaps in the merge sort (my side)...

I will look again at my changes to check if I missed some permutation (a //
b) or made any mistake on their indices... tricky.

PS: Please share your updated webrev when possible to rebase my changes.

Regards,
Laurent

Le ven. 9 nov. 2018 à 10:08, Laurent Bourgès  a
écrit :

> Hi Vladimir,
>
> Thank you for your attention, you are the Sort Master.
>
>
> Le ven. 9 nov. 2018 à 09:02, Vladimir Yaroslavskiy  a
> écrit :
>
>> Hi Laurent,
>>
>> The new version is still under review, there were a lot of suggestions
>> and ideas from Doug Lea.
>> I needed time to apply and check them. I'm going to send final version in
>> few days.
>>
>
> Excellent news, so it will ship in jdk12...
>
>>
>> As to new method sort(a[], low, high, indices): do you mean this method
>> in Arrays class?
>> Are you going to extend Arrays API?
>>
>
> I benchmarked my MergeSort and adding extra array implies many more moves:
> thresholds should be adjusted and my sort is sometimes better as it makes
> half moves (a <-> buffer swaps), so this complementary sort should have its
> own tuned implementation (tests & benchmarks).
>
> I coded my sort as such need did not match any Arrays.sort() methods.
> Having it publicly in jdk.base would be great for any other data handling
> algorithm (science, AI ?)
>
> If you agree it is useful, I could try proposing an Arrays/Collection API
> extension like :
> sort([], low, high, int[]indices)
>
>
>> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
>> class (version is under review)
>> has extra parameter - parallel context (Sorter sorter) which has required
>> workbase[].
>> If we run sorting sequentially, sorter parameter is null (no any objects)
>> and temporary buffer
>> will be created inside in tryMerge method if we go ahead with merging.
>>
>
> I will have a look. For Marlin, parallel sort is not possible as rendering
> shapes can be parallelized in the application (geoserver map rendering).
>
>
>> I will send new sources in few days and add you in cc, so you can look at
>> it
>> and find if new methods are acceptable for you. Then we can discuss
>> required changes (if any).
>>
>
> Perfect, I started adapting your code and will share soon the link to my
> github repo.
>
>
>> Thank you,
>> Vladimir
>>
>
> Thank you very much for your first thoughts,
> Laurent
>
>
>> Пятница, 9 ноября 2018, 10:44 +03:00 от Laurent Bourgès <
>> bourges.laur...@gmail.com>:
>>
>> Hi,
>>
>> I am currently testing many sort algorithms to improve the Marlin
>> renderer (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
>> improving for OpenJDK12 .
>>
>> I created my MergeSort (top down, check for sorted parts, array / buffer
>> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
>> crossing + edge indices.
>>  It is critical as edge crossings are sorted at every scanline, data are
>> almost sorted/reversed, not really random.
>>
>> 3 questions:
>> - why is this patch dormant ? I checked in openjdk12 repo and your
>> changes were not integrated.
>> - I need a sort() method that works with 2 arrays: data + indices
>> (pointer like). Such sorted indices are useful to use for lookups c[idx[k]]
>> or to perform other array permutations...
>> Would you agree adding such new sort(a[], low, high, indices)
>> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
>> impl do not accept parameters to give any buffer[] and avoid allocations.
>> Would you agree adding such optional parameters (like workbase[]) ?
>>
>> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
>> perform array sort race & rendering benchmarks.
>>
>> Cheers,
>> Laurent
>>
>> Le ven. 19 janv. 2018 à 14:38, Vladimir Yaroslavskiy a écrit :
>>
>> Hi team,
>>
>> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
>> algorithm

Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-09 Thread Laurent Bourgès
Hi Vladimir,

Thank you for your attention, you are the Sort Master.


Le ven. 9 nov. 2018 à 09:02, Vladimir Yaroslavskiy  a
écrit :

> Hi Laurent,
>
> The new version is still under review, there were a lot of suggestions and
> ideas from Doug Lea.
> I needed time to apply and check them. I'm going to send final version in
> few days.
>

Excellent news, so it will ship in jdk12...

>
> As to new method sort(a[], low, high, indices): do you mean this method in
> Arrays class?
> Are you going to extend Arrays API?
>

I benchmarked my MergeSort, and adding an extra array implies many more
moves: thresholds should be adjusted, and my sort is sometimes better as it
makes half the moves (a <-> buffer swaps), so this complementary sort should
have its own tuned implementation (tests & benchmarks).

I coded my own sort as that need did not match any Arrays.sort() method.
Having it public in java.base would be great for any other data-handling
algorithm (science, AI ?)

If you agree it is useful, I could try proposing an Arrays/Collection API
extension like:
sort([], low, high, int[] indices)


> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
> class (version is under review)
> has extra parameter - parallel context (Sorter sorter) which has required
> workbase[].
> If we run sorting sequentially, sorter parameter is null (no any objects)
> and temporary buffer
> will be created inside in tryMerge method if we go ahead with merging.
>

I will have a look. For Marlin, parallel sort is not possible as rendering
shapes can be parallelized in the application (geoserver map rendering).


> I will send new sources in few days and add you in cc, so you can look at
> it
> and find if new methods are acceptable for you. Then we can discuss
> required changes (if any).
>

Perfect, I started adapting your code and will share soon the link to my
github repo.


> Thank you,
> Vladimir
>

Thank you very much for your first thoughts,
Laurent


> Пятница, 9 ноября 2018, 10:44 +03:00 от Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> I am currently testing many sort algorithms to improve the Marlin renderer
> (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
> improving for OpenJDK12 .
>
> I created my MergeSort (top down, check for sorted parts, array / buffer
> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
> crossing + edge indices.
>  It is critical as edge crossings are sorted at every scanline, data are
> almost sorted/reversed, not really random.
>
> 3 questions:
> - why is this patch dormant ? I checked in openjdk12 repo and your changes
> were not integrated.
> - I need a sort() method that works with 2 arrays: data + indices (pointer
> like). Such sorted indices are useful to use for lookups c[idx[k]] or to
> perform other array permutations...
> Would you agree adding such new sort(a[], low, high, indices)
> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
> impl do not accept parameters to give any buffer[] and avoid allocations.
> Would you agree adding such optional parameters (like workbase[]) ?
>
> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
> perform array sort race & rendering benchmarks.
>
> Cheers,
> Laurent
>
> Le ven. 19 janv. 2018 à 14:38, Vladimir Yaroslavskiy a écrit :
>
> Hi team,
>
> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
> I suggested several improvements of Dual-Pivot Quicksort, which
> were integrated into JDK 8.
>
> Now I have more optimized and faster version of Dual-Pivot Quicksort.
> I have been working on it for the last 5 years. Please, find the
> summary of changes below and sources / diff at webrev [1].
>
> All tests and benchmarking were run on the most recent build of JDK 10,
> jdk-10-ea+39. The new version shows the better performance on different
> inputs and guarantees n*log(n) on any data.
>
> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
> it contains one code for both parallel and sequential sorting algorithms.
>
> Suggested version is 10-20% faster on random data, 1.5-4 times faster
> on nearly structured arrays, 1.5-2 times faster on period inputs.
>
> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
> algorithm based on merge sort.
>
> Benchmarking on the test suite, suggested by Jon Bentley, shows the
> boost of performance in 1.4 times. This test suite contains several
> types of inputs, such as random data, nearly structured arrays, data
>

Re: The new optimized version of Dual-Pivot Quicksort

2018-11-09 Thread Laurent Bourgès
Hi,

One more thing: if you have any advice on my code, datasets, or algorithmic
approach, please give me your feedback/comments.

Here is the Marlin MergeSort:
http://hg.openjdk.java.net/jdk/jdk/file/190b77982361/src/java.desktop/share/classes/sun/java2d/marlin/MergeSort.java

Laurent

Le ven. 9 nov. 2018 à 08:44, Laurent Bourgès  a
écrit :

> Hi,
>
> I am currently testing many sort algorithms to improve the Marlin renderer
> (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
> improving for OpenJDK12 .
>
> I created my MergeSort (top down, check for sorted parts, array / buffer
> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
> crossing + edge indices.
>  It is critical as edge crossings are sorted at every scanline, data are
> almost sorted/reversed, not really random.
>
> 3 questions:
> - why is this patch dormant ? I checked in openjdk12 repo and your changes
> were not integrated.
> - I need a sort() method that works with 2 arrays: data + indices (pointer
> like). Such sorted indices are useful to use for lookups c[idx[k]] or to
> perform other array permutations...
> Would you agree adding such new sort(a[], low, high, indices)
> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
> impl do not accept parameters to give any buffer[] and avoid allocations.
> Would you agree adding such optional parameters (like workbase[]) ?
>
> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
> perform array sort race & rendering benchmarks.
>
> Cheers,
> Laurent
>
> Le ven. 19 janv. 2018 à 14:38, Vladimir Yaroslavskiy 
> a écrit :
>
>> Hi team,
>>
>> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
>> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
>> I suggested several improvements of Dual-Pivot Quicksort, which
>> were integrated into JDK 8.
>>
>> Now I have more optimized and faster version of Dual-Pivot Quicksort.
>> I have been working on it for the last 5 years. Please, find the
>> summary of changes below and sources / diff at webrev [1].
>>
>> All tests and benchmarking were run on the most recent build of JDK 10,
>> jdk-10-ea+39. The new version shows the better performance on different
>> inputs and guarantees n*log(n) on any data.
>>
>> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
>> it contains one code for both parallel and sequential sorting algorithms.
>>
>> Suggested version is 10-20% faster on random data, 1.5-4 times faster
>> on nearly structured arrays, 1.5-2 times faster on period inputs.
>>
>> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
>> algorithm based on merge sort.
>>
>> Benchmarking on the test suite, suggested by Jon Bentley, shows the
>> boost of performance in 1.4 times. This test suite contains several
>> types of inputs, such as random data, nearly structured arrays, data
>> with period and so on. Also there are several modifications of inputs
>> and parameters which are used in data building. We run sorting on all
>> combinations to compare two algorithms.
>>
>> Please let me know if you have any questions / comments.
>>
>> Summary of changes:
>>
>> DualPivotQuicksort class
>> 
>> * Pivots are chosen with another step, the 1-st and 5-th candidates
>>are taken as pivots instead of 2-nd and 4-th.
>> * Splitting into parts is related to the golden ratio
>> * Pivot candidates are sorted by combination of 5-element
>>network sorting + insertion sort
>> * New backwards partitioning is simpler and more efficient
>> * Quicksort tuning parameters were updated
>> * Merging sort is invoked on each iteration from Quicksort
>> * Additional steps for float/double were fully rewritten
>> * Heap sort is invoked on the leftmost part
>> * Heap sort is used as a guard against quadratic time
>> * Parallel sorting is based on Dual-Pivot Quicksort
>>instead of merge sort
>>
>> SortingAlgorithms class
>> ---
>> * Merging sort and pair insertion sort were moved from
>>DualPivotQuicksort class
>> * Pair insertion sort was simplified and optimized
>> * New nano insertion sort was introduced for tiny arrays
>> * Merging sort was fully rewritten
>> * Optimized merging partitioning is used
>> * Merging parameters were updated
>> * Merging of runs was fully rewritten
>> * Fast version of heap sort was introduced
>> * Parallel merging sort was also provided
>>
>> Sorting / ParallelSorting classes

Re: The new optimized version of Dual-Pivot Quicksort

2018-11-08 Thread Laurent Bourgès
Hi,

I am currently testing many sort algorithms to improve the Marlin renderer
(AA 2D shape rasterizer), which I integrated in OpenJDK 9 and am still
improving for OpenJDK 12.

I created my MergeSort (top-down, check for sorted parts, array / buffer
swap to minimize moves, isort on small sub-arrays) that sorts 2 int[]:
crossings + edge indices.
It is critical as edge crossings are sorted at every scanline; data are
almost sorted/reversed, not really random.

3 questions:
- why is this patch dormant ? I checked the openjdk12 repo and your changes
were not integrated.
- I need a sort() method that works with 2 arrays: data + indices (pointer-
like). Such sorted indices are useful for lookups c[idx[k]] or to
perform other array permutations...
Would you agree to adding such a new sort(a[], low, high, indices) ?
- Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
implementations do not accept parameters to supply a buffer[] and avoid
allocations.
Would you agree to adding such optional parameters (like workbase[]) ?
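The workbase[] idea could be sketched as follows (an illustrative merge sort only, not the DPQS code; the fallback-allocation behavior mirrors what the JDK does today when no buffer is supplied):

```java
final class BufferedSort {
    /** Sorts a[low, high) using a caller-supplied work buffer (ThreadLocal-friendly). */
    static void sort(int[] a, int low, int high, int[] work) {
        int n = high - low;
        if (n < 2) {
            return;
        }
        if (work == null || work.length < n) {
            work = new int[n]; // fall back to allocating when no buffer is given
        }
        mergeSort(a, low, high, work);
    }

    private static void mergeSort(int[] a, int low, int high, int[] work) {
        if (high - low < 2) {
            return;
        }
        int mid = (low + high) >>> 1;
        mergeSort(a, low, mid, work);
        mergeSort(a, mid, high, work);
        // merge a[low, mid) and a[mid, high) through work[0, high - low)
        int i = low, j = mid, k = 0;
        while (i < mid && j < high) {
            work[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i < mid)  { work[k++] = a[i++]; }
        while (j < high) { work[k++] = a[j++]; }
        System.arraycopy(work, 0, a, low, k);
    }
}
```

A renderer could then hold the work array in its per-thread context and pay the allocation cost once instead of on every sort call.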

I will experiment adapting the DualPivotQuickSort in Marlin renderer and
perform array sort race & rendering benchmarks.

Cheers,
Laurent

Le ven. 19 janv. 2018 à 14:38, Vladimir Yaroslavskiy  a
écrit :

> Hi team,
>
> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
> I suggested several improvements of Dual-Pivot Quicksort, which
> were integrated into JDK 8.
>
> Now I have more optimized and faster version of Dual-Pivot Quicksort.
> I have been working on it for the last 5 years. Please, find the
> summary of changes below and sources / diff at webrev [1].
>
> All tests and benchmarking were run on the most recent build of JDK 10,
> jdk-10-ea+39. The new version shows the better performance on different
> inputs and guarantees n*log(n) on any data.
>
> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
> it contains one code for both parallel and sequential sorting algorithms.
>
> Suggested version is 10-20% faster on random data, 1.5-4 times faster
> on nearly structured arrays, 1.5-2 times faster on period inputs.
>
> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
> algorithm based on merge sort.
>
> Benchmarking on the test suite, suggested by Jon Bentley, shows the
> boost of performance in 1.4 times. This test suite contains several
> types of inputs, such as random data, nearly structured arrays, data
> with period and so on. Also there are several modifications of inputs
> and parameters which are used in data building. We run sorting on all
> combinations to compare two algorithms.
>
> Please let me know if you have any questions / comments.
>
> Summary of changes:
>
> DualPivotQuicksort class
> 
> * Pivots are chosen with another step, the 1-st and 5-th candidates
>are taken as pivots instead of 2-nd and 4-th.
> * Splitting into parts is related to the golden ratio
> * Pivot candidates are sorted by combination of 5-element
>network sorting + insertion sort
> * New backwards partitioning is simpler and more efficient
> * Quicksort tuning parameters were updated
> * Merging sort is invoked on each iteration from Quicksort
> * Additional steps for float/double were fully rewritten
> * Heap sort is invoked on the leftmost part
> * Heap sort is used as a guard against quadratic time
> * Parallel sorting is based on Dual-Pivot Quicksort
>instead of merge sort
>
> SortingAlgorithms class
> ---
> * Merging sort and pair insertion sort were moved from
>DualPivotQuicksort class
> * Pair insertion sort was simplified and optimized
> * New nano insertion sort was introduced for tiny arrays
> * Merging sort was fully rewritten
> * Optimized merging partitioning is used
> * Merging parameters were updated
> * Merging of runs was fully rewritten
> * Fast version of heap sort was introduced
> * Parallel merging sort was also provided
>
> Sorting / ParallelSorting classes
> -
> * New test cases were added
> * Sources of these classes were unified
>
> Arrays class
> 
> * Calls of Dual-Pivot Quicksort were updated
> * Parallel sorting of primitives was switched to parallel
>Dual-Pivot Quicksort
> * Javadoc was modified
>
> ArraysParallelSortHelpers class
> ---
> * Old implementation of parallel sorting for primitives was removed
> * Javadoc was updated
>
> Thank you,
> Vladimir
>
> 
> [1] http://cr.openjdk.java.net/~alanb/DualPivotSortUpdate/webrev.01/
> 
>
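For illustration, the '5-element network sorting' of pivot candidates mentioned in the change summary above can be sketched as a 9-comparator network (a known optimal comparator count for n = 5; this is an illustrative network, not necessarily the exact sequence DPQS uses):

```java
final class Network5 {
    /** Sorts the 5 elements a[i .. i+4] with a 9-comparator sorting network. */
    static void sort5(int[] a, int i) {
        cmpSwap(a, i,     i + 1);
        cmpSwap(a, i + 3, i + 4);
        cmpSwap(a, i + 2, i + 4);
        cmpSwap(a, i + 2, i + 3);
        cmpSwap(a, i,     i + 3);
        cmpSwap(a, i,     i + 2);
        cmpSwap(a, i + 1, i + 4);
        cmpSwap(a, i + 1, i + 3);
        cmpSwap(a, i + 1, i + 2);
    }

    private static void cmpSwap(int[] a, int p, int q) {
        if (a[p] > a[q]) { int t = a[p]; a[p] = a[q]; a[q] = t; }
    }
}
```

A fixed comparator sequence like this is branch-predictable and data-independent, which is why it suits sorting a handful of pivot candidates.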


Re: Can @Stable (or something similar) be made accessible?

2018-01-12 Thread Laurent Bourgès
Hi,

I am the author of the Marlin renderer, integrated in both java.desktop &
javafx.graphics modules.

I wonder if such core annotations (@Stable, @ForceInline ...) could be
allowed in such OpenJDK modules in order to improve performance.

Marlin already uses jdk.internal.Unsafe, but @Stable looks promising as
Marlin uses (and recycles) lots of int/double arrays.

I have already made my fields (& constants) final, but that is not enough
to make HotSpot fully trust my code, is it ?

I think these annotations are confined to java.lang for now.
I will check that ASAP.

Cheers,
Laurent

Le 12 janv. 2018 9:50 AM, "Andrew Haley"  a écrit :

> On 12/01/18 04:33, Jason Greene wrote:
>
> > The internal @Stable facility provides the desired semantics and
> > precision, but it is heavily locked down, with privileged code
> > checks and export restrictions. Could this be made more accessible
> > (or perhaps a variant restricted to just final fields)? Informing
> > the compiler that a final field is a true lazy initialized constant,
> > with no store-to-final seems a pretty useful construct in general. I
> > can understand that the long term desire is that this shouldn’t be
> > necessary, and should be inferred [3], but at that point the
> > annotation is still useful as documentation and legacy
> > compatibility. If nothing else could it be allowed in non-privileged
> > code via some flag?
>
> I don't know of any way to do that without compromising the integrity
> of the JVM.  All that anybody would have to do to break the VM is to
> define a field as @Stable and then change the field.  I suppose you
> could say that Unsafe is just as bad and people still are allowed to
> use it, but there's understandable reluctance any further to transform
> a safe runtime into an unsafe one.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. 
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>


Re: Faster Math ?

2017-11-17 Thread Laurent Bourgès
net/browse/JDK-8145688
>   (From these you can track related issues.)
>
> Other Math functions are not intrinsic like cbrt (non-native) and acos
> (native). There is ongoing work to turn native implementations into Java
> implementations (i don’t know if there would be any follow up on
> intrinsification).
>
>https://bugs.openjdk.java.net/browse/JDK-8134780
>https://bugs.openjdk.java.net/browse/JDK-8171407
>
> Joe knows more.
>
> —
>
> As part of the Vector API effort we will likely need to investigate the
> support for less accurate but faster math functions. It’s too early to tell
> if something like a FastMath class will pop out of that, but FWIW i am
> sympathetic to that :-)
>
> I liked this tweet:
>
> https://twitter.com/FioraAeterna/status/926150700836405248
>
>life as a gpu compiler dev is basically just fielding repeated
> complaints that
>"fast math" isn't precise and "precise math" isn't fast
>
> as an indication of what we could be getting into :-)
>
> Paul.
>
> On 9 Nov 2017, at 01:00, Laurent Bourgès <bourges.laur...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
>>
>> Could you check if the current JDK uses C2 intrinsics or libfdm (native /
>> JNI overhead?) and tell me if such functions are already highly optimized
>> in jdk9 or 10 ?
>>
>> Some people have implemented their own fast Math like Apache Commons Math
>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>
>> I wonder if I should implement my own cbrt function (cubics) in pure java
>> as I do not need the highest accuracy but SPEED.
>>
>> Would it sound possible to have a JDK FastMath public API (lots faster but
>> less accurate?)
>>
>> Do you know if recent CPU (intel?) have dedicated instructions for such
>> math operations ?
>> Why not use it instead?
>> Maybe that's part of the new Vectorization API (panama) ?
>>
>> Cheers,
>> Laurent Bourges
>>
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi Paul,

Thank you so much for this complete summary !

I will still perform some benchmarks and could port the acos native code to
Java, as it is used by Marlin.

Anyway, I will backport the cbrt Java code into Marlin @ GitHub for JDK 8
users (GPL v2).
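For reference, a speed-over-accuracy cbrt in pure Java could look like this (an illustrative sketch only — not the JDK's StrictMath.cbrt and not JaFaMa's code): an initial guess from dividing the exponent bits by 3, refined by a few Newton iterations.

```java
final class FastCbrt {
    /** Fast approximate cube root: bit-level initial guess + 3 Newton steps. */
    static double fastCbrt(double x) {
        if (x == 0.0 || Double.isNaN(x) || Double.isInfinite(x)) {
            return x;
        }
        double a = Math.abs(x);
        // dividing the biased exponent (and mantissa bits) by 3 gives y ~ a^(1/3)
        long bits = Double.doubleToRawLongBits(a);
        double y = Double.longBitsToDouble(
                (bits - 0x3FF0000000000000L) / 3 + 0x3FF0000000000000L);
        // Newton's method on f(y) = y^3 - a
        for (int i = 0; i < 3; i++) {
            y -= (y * y * y - a) / (3.0 * y * y);
        }
        return Math.copySign(y, x);
    }
}
```

The bit trick gives a guess within a few percent, so three Newton steps already reach near-double precision for normal inputs; trading a step away buys speed at the cost of accuracy, which is exactly the FastMath trade-off discussed here.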

Thanks,
Laurent

Le 9 nov. 2017 18:19, "Paul Sandoz" <paul.san...@oracle.com> a écrit :

> Hi Laurent,
>
> A Java method is a candidate for intrinsification if it is annotated with
> @HotSpotIntrinsicCandidate. When running Java code you can also use the
> HotSpot flags -XX:+PrintCompilarion -XX:+PrintInlining to show methods that
> are intrinsic (JIT watch, as mentioned, is also excellent in this regard).
>
> I recommend cloning OpenJDK and browsing the source.
>
> Some of the math functions are intrinsic in the interpreter and all the
> runtime compilers to ensure consistent results across interpretation and
> compilation.
>
> Work was done by Intel to improve many of the math functions. See:
>
>   Update for x86 sin and cos in the math lib
>   https://bugs.openjdk.java.net/browse/JDK-8143353
>
>   Update for x86 pow in the math lib
>   https://bugs.openjdk.java.net/browse/JDK-8145688
>
>   (From these you can track related issues.)
>
> Other Math functions are not intrinsic like cbrt (non-native) and acos
> (native). There is ongoing work to turn native implementations into Java
> implementations (i don’t know if there would be any follow up on
> intrinsification).
>
>   https://bugs.openjdk.java.net/browse/JDK-8134780
>   https://bugs.openjdk.java.net/browse/JDK-8171407
>
> Joe knows more.
>
> —
>
> As part of the Vector API effort we will likely need to investigate the
> support for less accurate but faster math functions. It’s too early to tell
> if something like a FastMath class will pop out of that, but FWIW i am
> sympathetic to that :-)
>
> I liked this tweet:
>
> https://twitter.com/FioraAeterna/status/926150700836405248
>
>   life as a gpu compiler dev is basically just fielding repeated
> complaints that
>   "fast math" isn't precise and "precise math" isn't fast
>
> as an indication of what we could be getting into :-)
>
> Paul.
>
> > On 9 Nov 2017, at 01:00, Laurent Bourgès <bourges.laur...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
> >
> > Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
> > JNI overhead?) and tell me if such functions are already highly optimized
> > in jdk9 or 10 ?
> >
> > Some people have implemented their own fast Math like Apache Commons Math
> > or JaFaMa libraries that are 10x faster for acos / cbrt.
> >
> > I wonder if I should implement my own cbrt function (cubics) in pure java
> > as I do not need the highest accuracy but SPEED.
> >
> > Would it sound possible to have a JDK FastMath public API (lots faster
> but
> > less accurate?)
> >
> > Do you know if recent CPU (intel?) have dedicated instructions for such
> > math operations ?
> > Why not use it instead?
> > Maybe that's part of the new Vectorization API (panama) ?
> >
> > Cheers,
> > Laurent Bourges
>
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Thanks, Andrew.

I searched on the web and I understand now:
the fdlibm native library has been ported to Java code for JDK 9 (like the
JaFaMa library).

Cbrt changeset:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/7dc9726cfa82

I will anyway compare JDK 9 with the latest JaFaMa 2.2 to have an up-to-date
comparison.

Does somebody else need a faster but less accurate math library ?
JaFaMa has such alternative methods...

Preserving double-precision may be very costly in terms of performance and
sometimes such accuracy is not required.

Last question:
Is there a sin & cos function returning both values for the same angle ?
It is very useful to compute an exact Fourier transform...
It is called sinAndCos(double wrappers) in JaFaMa.
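For reference, java.lang.Math has no combined sin+cos call; a trivial helper (the method name below mimics JaFaMa's and is purely illustrative, not an existing JDK or JaFaMa signature) would look like:

```java
// SinCosDemo.java -- illustrative only: the JDK has no combined sin+cos API;
// this simply calls Math.sin and Math.cos and checks the identity.
public class SinCosDemo {

    /** Returns {sin(angle), cos(angle)} in one call (hypothetical helper). */
    public static double[] sinAndCos(double angle) {
        return new double[] { Math.sin(angle), Math.cos(angle) };
    }

    public static void main(String[] args) {
        double[] sc = sinAndCos(Math.PI / 3);
        // sanity check: sin^2 + cos^2 == 1 (up to rounding)
        if (Math.abs(sc[0] * sc[0] + sc[1] * sc[1] - 1.0) > 1e-12) {
            throw new AssertionError("identity violated");
        }
        System.out.println("sin=" + sc[0] + " cos=" + sc[1]);
    }
}
```

A real fused implementation would share the argument reduction between the two evaluations, which is where the speedup comes from.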

Cheers,
Laurent

On 9 Nov 2017 17:08, "Andrew Haley" <a...@redhat.com> wrote:

> On 09/11/17 15:02, Laurent Bourgès wrote:
> > --- testing cbrt(double) = pow(double, 1/3) ---
> > Loop on Math.pow(double, 1/3), args in [-10.0,10.0], took 0.739 s
> > Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.166 s
> > Loop on Math.pow(double, 1/3), args in [-0.7,0.7], took 0.746 s
> > Loop on FastMath.cbrt(double), args in [-0.7,0.7], took 0.166 s
> > Loop on Math.pow(double, 1/3), args in [-0.1,0.1], took 0.742 s
> > Loop on FastMath.cbrt(double), args in [-0.1,0.1], took 0.165 s
> > Loop on Math.pow(double, 1/3), args in all magnitudes, took 0.753 s
> > Loop on FastMath.cbrt(double), args in all magnitudes, took 0.244 s
> >
> > Conclusion:
> > - acos / asin / atan functions are quite slow: it confirms these are not
> > optimized by hotspot intrinsics.
> >
> > - cbrt() is slower than sqrt() : 1.1s vs 0.1 => 10x slower
> > - cbrt() is slower than pow(1/3) : 1.1s vs 0.7s => 50% slower
>
> No.  cbrt() is faster than pow(1/3) : 0.24 vs 0.75
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi,

Here are very basic benchmark results from (JaFaMa 2 - FastMathPerf) made
on my laptop (i7-6820HQ set @ 2 GHz + JDK 8):
--- testing asin(double) ---
Loop on Math.asin(double) took 6.675 s
Loop on FastMath.asin(double) took 0.162 s

--- testing acos(double) ---
Loop on Math.acos(double) took 6.332 s
Loop on FastMath.acos(double) took 0.16 s

--- testing atan(double) ---
Loop on Math.atan(double) took 0.766 s
Loop on FastMath.atan(double) took 0.167 s

--- testing sqrt(double) ---
Loop on Math.sqrt(double), args in [0.0,10.0], took 0.095 s
Loop on FastMath.sqrt(double), args in [0.0,10.0], took 0.097 s
Loop on Math.sqrt(double), args in [0.0,1.0E12], took 0.109 s
Loop on FastMath.sqrt(double), args in [0.0,1.0E12], took 0.093 s
Loop on Math.sqrt(double), args in all magnitudes (>=0), took 0.091 s
Loop on FastMath.sqrt(double), args in all magnitudes (>=0), took 0.092 s

--- testing cbrt(double) ---
Loop on Math.cbrt(double), args in [-10.0,10.0], took 1.152 s
Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.195 s
Loop on Math.cbrt(double), args in [-1.0E12,1.0E12], took 1.153 s
Loop on FastMath.cbrt(double), args in [-1.0E12,1.0E12], took 0.193 s
Loop on Math.cbrt(double), args in all magnitudes, took 1.154 s
Loop on FastMath.cbrt(double), args in all magnitudes, took 0.272 s

--- testing cbrt(double) = pow(double, 1/3) ---
Loop on Math.pow(double, 1/3), args in [-10.0,10.0], took 0.739 s
Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.166 s
Loop on Math.pow(double, 1/3), args in [-0.7,0.7], took 0.746 s
Loop on FastMath.cbrt(double), args in [-0.7,0.7], took 0.166 s
Loop on Math.pow(double, 1/3), args in [-0.1,0.1], took 0.742 s
Loop on FastMath.cbrt(double), args in [-0.1,0.1], took 0.165 s
Loop on Math.pow(double, 1/3), args in all magnitudes, took 0.753 s
Loop on FastMath.cbrt(double), args in all magnitudes, took 0.244 s

Conclusion:
- acos / asin / atan functions are quite slow: it confirms these are not
optimized by HotSpot intrinsics.

- cbrt() is slower than sqrt() : 1.1s vs 0.1 => 10x slower
- cbrt() is slower than pow(1/3) : 1.1s vs 0.7s => 50% slower
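As an aside, hand-rolled timing loops like the ones above are easily distorted by dead-code elimination; a minimal self-contained sketch that guards against it is below (class and method names are made up here, and JMH remains the proper tool for serious numbers):

```java
// BenchSketch.java -- a minimal, self-contained timing loop; the accumulated
// sum ("blackhole") keeps the JIT from eliminating the math calls.
// Numbers from such loops are still rough; JMH is the proper tool.
public class BenchSketch {

    public static double time(java.util.function.DoubleUnaryOperator f, int iters) {
        double sink = 0.0;                 // blackhole accumulator
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            sink += f.applyAsDouble(i * 1e-3);
        }
        long elapsed = System.nanoTime() - start;
        // unlikely branch that observes 'sink', so the JIT cannot prove it unused
        if (sink == Double.MIN_VALUE) System.out.println(sink);
        return elapsed / 1e6;              // milliseconds
    }

    public static void main(String[] args) {
        System.out.printf("cbrt: %.1f ms%n", time(Math::cbrt, 1_000_000));
        System.out.printf("pow(1/3): %.1f ms%n",
                time(x -> Math.pow(x, 1.0 / 3.0), 1_000_000));
    }
}
```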

Any plan to enhance these specific math operations ?

Laurent


2017-11-09 14:33 GMT+01:00 Laurent Bourgès <bourges.laur...@gmail.com>:

> I checked in the latest jdk master and both cbrt / acos are NOT intrinsics.
>
> However, cbrt(x) = pow(x, 1/3) so it may be optimized...
>
> Could someone tell me how cbrt() is concretely implemented ?
>
> In native fdlibm, there is no e_cbrt.c !
>
> Thanks for your help,
> Laurent
>
On 9 Nov 2017 10:52 AM, "Jonas Konrad" <m...@yawk.at> wrote:
>
>> Hey,
>>
>> Most functions in the Math class are intrinsic (
>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/
>> share/vm/classfile/vmSymbols.hpp#l664 ) and will use native instructions
>> where available. You can also test this yourself using jitwatch. There is
>> no native call overhead.
>>
>> The standard library does not currently include less accurate but faster
>> Math functions, maybe someone else can answer if that is something to be
>> considered.
>>
>> - Jonas Konrad
>>
>> On 11/09/2017 10:00 AM, Laurent Bourgès wrote:
>>
>>> Hi,
>>>
>>> The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
>>>
> >> Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
>>> JNI overhead?) and tell me if such functions are already highly optimized
>>> in jdk9 or 10 ?
>>>
>>> Some people have implemented their own fast Math like Apache Commons Math
>>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>>
>>> I wonder if I should implement my own cbrt function (cubics) in pure java
>>> as I do not need the highest accuracy but SPEED.
>>>
>>> Would it sound possible to have a JDK FastMath public API (lots faster
>>> but
>>> less accurate?)
>>>
>>> Do you know if recent CPU (intel?) have dedicated instructions for such
>>> math operations ?
>>> Why not use it instead?
>>> Maybe that's part of the new Vectorization API (panama) ?
>>>
>>> Cheers,
>>> Laurent Bourges
>>>
>>>


-- 
Laurent Bourgès


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
I checked in the latest jdk master and both cbrt / acos are NOT intrinsics.

However, cbrt(x) = pow(x, 1/3) so it may be optimized...
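One caveat on that identity: cbrt(x) and pow(x, 1/3) only agree for non-negative x, since Math.pow returns NaN for a negative base with a non-integer exponent. A quick check using only java.lang.Math:

```java
// CbrtVsPow.java -- cbrt(x) and pow(x, 1.0/3.0) agree only for x >= 0:
// Math.pow is specified to return NaN for a negative base with a
// non-integer exponent, while Math.cbrt handles negative arguments.
public class CbrtVsPow {
    public static void main(String[] args) {
        System.out.println(Math.cbrt(-8.0));           // -2.0
        System.out.println(Math.pow(-8.0, 1.0 / 3.0)); // NaN
        System.out.println(Math.cbrt(8.0));            // 2.0
    }
}
```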

Could someone tell me how cbrt() is concretely implemented ?

In native fdlibm, there is no e_cbrt.c !

Thanks for your help,
Laurent

On 9 Nov 2017 10:52 AM, "Jonas Konrad" <m...@yawk.at> wrote:

> Hey,
>
> Most functions in the Math class are intrinsic (
> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/
> share/vm/classfile/vmSymbols.hpp#l664 ) and will use native instructions
> where available. You can also test this yourself using jitwatch. There is
> no native call overhead.
>
> The standard library does not currently include less accurate but faster
> Math functions, maybe someone else can answer if that is something to be
> considered.
>
> - Jonas Konrad
>
> On 11/09/2017 10:00 AM, Laurent Bourgès wrote:
>
>> Hi,
>>
>> The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
>>
> >> Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
>> JNI overhead?) and tell me if such functions are already highly optimized
>> in jdk9 or 10 ?
>>
>> Some people have implemented their own fast Math like Apache Commons Math
>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>
>> I wonder if I should implement my own cbrt function (cubics) in pure java
>> as I do not need the highest accuracy but SPEED.
>>
>> Would it sound possible to have a JDK FastMath public API (lots faster but
>> less accurate?)
>>
>> Do you know if recent CPU (intel?) have dedicated instructions for such
>> math operations ?
>> Why not use it instead?
>> Maybe that's part of the new Vectorization API (panama) ?
>>
>> Cheers,
>> Laurent Bourges
>>
>>


Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi,

The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...

Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
JNI overhead?) and tell me if such functions are already highly optimized
in jdk9 or 10 ?

Some people have implemented their own fast Math like Apache Commons Math
or JaFaMa libraries that are 10x faster for acos / cbrt.

I wonder if I should implement my own cbrt function (cubics) in pure java
as I do not need the highest accuracy but SPEED.

Would it sound possible to have a JDK FastMath public API (lots faster but
less accurate?)

Do you know if recent CPU (intel?) have dedicated instructions for such
math operations ?
Why not use it instead?
Maybe that's part of the new Vectorization API (panama) ?

Cheers,
Laurent Bourges


Re: [OpenJDK 2D-Dev] JDK 9 RFR of JDK-8149896: Remove unnecessary values in FloatConsts and DoubleConsts

2016-02-16 Thread Laurent Bourgès
Hello,

Joe, I tested your changes in the Marlin renderer and it works well !


> A quick note on the 2d changes: several constants (and a copy of a
> package-private method from java.lang) were used to initialize a double
> value POWER_2_TO_32; this can be accomplished more directly using a
> hexadecimal floating-point literal.
> >
> > Please review the webrev
> >
> >http://cr.openjdk.java.net/~darcy/8149896.0/
> ...
> My favourite change is this one:
>
> -private static final double POWER_2_TO_32 = FloatMath.powerOfTwoD(32);
> +private static final double POWER_2_TO_32 = 0x1.0p32;
>
> and the removal of the method on FloatMath.
>

I agree it is simpler and exactly what I needed; I wasn't aware of such
literals:
https://blogs.oracle.com/darcy/entry/hexadecimal_floating_point_literals
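The literal works because 0x1.0p32 denotes 1.0 × 2^32 exactly; a minimal check (class name is illustrative):

```java
// HexLiteralDemo.java -- hexadecimal floating-point literal: 0x1.0p32 means
// 1.0 * 2^32, replacing the old powerOfTwoD(32) helper mentioned above.
public class HexLiteralDemo {
    public static final double POWER_2_TO_32 = 0x1.0p32;

    public static void main(String[] args) {
        if (POWER_2_TO_32 != 4294967296.0) throw new AssertionError();
        if (POWER_2_TO_32 != Math.pow(2, 32)) throw new AssertionError();
        System.out.println(POWER_2_TO_32); // 4.294967296E9
    }
}
```

The hex form is exact by construction (significand and binary exponent are spelled out), so there is no startup-time computation and no rounding question.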

(I am not a reviewer)

Thanks for the fix,
Laurent


Re: RFR: JDK-8066859 java/lang/ref/OOMEInReferenceHandler.java failed with java.lang.Exception: Reference Handler thread died

2015-05-07 Thread Laurent Bourgès
Peter,

I looked at Cleaner out of curiosity and it seems not to catch the OOME
from thunk.run() !

If oome1 is thrown by thunk.run at line 150 then it is caught at line 157
but your new try/catch block (oome2) only encapsulates the doPrivileged
block.

If this block also throws a new oome2 due to the first oome1 (no memory
left), it will work, but I would have preferred a more explicit solution and
checking oome1 first ...

My 2 cents (I am not a reviewer).

Laurent


Re: RFR: JDK-8066859 java/lang/ref/OOMEInReferenceHandler.java failed with java.lang.Exception: Reference Handler thread died

2015-05-07 Thread Laurent Bourgès
Peter,

Thanks for long and detailled answer.

I now know better why OOME should not happen. However any application may
also use phantom references, and the ReferenceHandler thread will call
Cleaner.run(), which could hit an OOME thrown from the application code
implementing thunk.run(). Am I right ?

 If this block also throws a new oome2 due to the first oome1 (no memory
left), it will work, but I would have preferred a more explicit solution and
checking oome1 first ...

I looked back at your patch and it is fine. However I wonder if it would be
possible to avoid any allocation in the catch(Throwable) block:
- preallocate the PrivilegedAction
- avoid new Error(x) to get its stack trace ? Do you know any trick, like the
ones in SharedSecrets, that could dump the stack without any allocation in
case of urgency ?
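The preallocation idea can be sketched as follows; this is an illustrative paraphrase (simplified: no PrivilegedAction here), not the actual JDK Cleaner code, and note that printStackTrace may still allocate internally:

```java
// OomSafeHandler.java -- sketch of the idea above (NOT the JDK's actual
// Cleaner code): preallocate everything the error path needs up front,
// so the catch block performs no 'new' of its own.
import java.io.PrintStream;

public class OomSafeHandler {
    // preallocated while memory is still available
    private static final PrintStream ERR = System.err;
    private static final String MSG = "cleanup action failed";

    public static void runSafely(Runnable thunk) {
        try {
            thunk.run();
        } catch (Throwable t) {
            // no allocation here: reuse the preallocated stream and message;
            // printStackTrace works on the already-constructed throwable
            // (though it may still allocate internally while formatting)
            ERR.println(MSG);
            t.printStackTrace(ERR);
        }
    }

    public static void main(String[] args) {
        runSafely(() -> { throw new RuntimeException("boom"); });
        System.out.println("still running");
    }
}
```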

 You have a point and I asked myself the same question. The question is
how to treat OOME thrown from thunk.run(). Current behavior is to exit()
JVM for any exception (Throwable). I maintained that semantics. I only
added a handler for OOME thrown in the handler of the 1st exception. I
might have just exit()-ed the VM if OOME is thrown, but leaving no trace
and just exiting VM would not help anyone diagnose what went wrong. So I
opted for keeping the VM running for a while by delaying the handling of
1st exception to better times. If better times never come, then the
application is probably stuck anyway.

Seems very good after a 2nd look.
However, it could loop for a while if no more memory is left ?
For example: oome1 => oome2 (catch) => throw x => oome2 (catch) => ...

 An alternative would be to catch OOME from thunk.run() and ignore it
(printing it out would be ugly if VM is left to run), but that would
silently ignore OOMEs thrown from thunk.run() and no one would notice that
Cleaner(s) might not have clean-ed up the resources they should.

I am a bit lost, but I like logging such exceptional cases; if no
allocation can happen, how can we ensure such cases are logged anyway ?
...

 Anyway. If none of the Cleaner.thunk's run() methods can throw any
exception, then my handling of OOME is redundant and a code-path never
taken. But I would still leave it there in case some new Cleaner use comes
along which is not verified yet...

Agreed. It is always better to handle such exceptional cases if you can at
least log them...

Best regards,
Laurent


Fwd: Re: Fwd: Improve large array allocation / gc intrinsics

2014-02-13 Thread Laurent Bourgès
Comments & opinions from core-libs, please ?

Laurent

-- Forwarded message --
From: Thomas Schatzl thomas.scha...@oracle.com
Date: 11 Feb 2014 11:28
Subject: Re: Fwd: Improve large array allocation / gc & intrinsics
To: Laurent Bourgès bourges.laur...@gmail.com
Cc: hotspot-...@openjdk.java.net

 Hi,

   just a few comments...

  From: Laurent Bourgès bourges.laur...@gmail.com
  Date: 10 Feb 2014 10:24
  Subject: Improve large array allocation / gc & intrinsics
  To: core-libs-dev core-libs-dev@openjdk.java.net, discuss 
  disc...@openjdk.java.net
  Cc:
 
   Dear all,
  
   I would like to propose a JDK9 RFE to improve JVM efficiency when
   dealing with large arrays (allocation + gc).
  
   In several scientific applications (and my patched java2d pisces),
   many large arrays are needed, created on the fly and it becomes very
   painful to recycle them using an efficient array cache (concurrency,
   cache size tuning, clear + cache eviction issues).

 Why do you think that a one-size-fits-all approach, or any library in
 the JDK, will not have the same issues to deal with? How do you know that
 a generic library in the JDK (as in your proposal) or hacking things
 into the VM can deal with this problem better than you who knows the
 application allocation patterns?

 Do you have a prototype (library) for that?

   In this case, the GC overhead leads to a big performance penalty
   (several hundred megabytes per seconds).

 I do not understand what the problem is. Typically I would not specify
 a performance (throughput?) penalty in megabytes per second.

 Do the GCs take too long, or do you feel there too much memory wastage
 somewhere?

   As such array caches are very efficient in an application context, I am
   wondering if that approach could be promoted at the JDK level itself:
  
   - provide a new array allocator for large arrays that can return
   larger arrays than expected (size = 4 or 8 multiples) using array
   caches (per thread ?) stored in a dedicated JVM large memory area

 The behavior you propose seems very particular to your application.
 Others may have completely different requirements. The mentioned
 per-thread caches do not seem to be problematic to do in a library
 either.

   (GC aware) and providing efficient cache eviction policies

 Did you every try one of the other available garbage collectors with
 your application? Both CMS and G1 never move large objects around, i.e.
 there is almost no direct GC overhead associated with them.

 Reclaiming them is almost zero cost in these collectors. Keeping them
 alive seems to be best handled by logic in a library.

 Can you give examples where the VM has significant advantages over a
 dedicated library, or a particular use case with measurements that show
 this could be the case?

   - may support both clean arrays (zero filled) and dirty arrays
   because some algorithms do not need zero-filled arrays.
  
   - improve JVM intrinsics (array clear, fill) to achieve maximum
   performance ie take into account the data alignment (4 or 8 multiples)

 I think the compiler already uses specialized methods for that, using
 the best instructions that are available. It should also already be
 able to detect fill loops, and vectorize them.

 Objects are always 8 byte aligned - I think you can force higher
 alignments by setting ObjectAlignmentInBytes or so.

 Otherwise these changes could be simply added, i.e. seems to not need
 any RFE.

   - upgrade GC to recycle such 'cached' arrays (clean), update usage
   statistics and manage cache eviction policy (avoid wasting memory)

 The GCs already automatically recycle the freed space. Everything else
 seems to be more complicated to implement at VM level than at library
 level, with the added drawback that you add a VM dependency.

   Please give me your feedback & opinion and evaluate if this feature
   seems possible to implement into the JDK (hotspot, gc, core-libs)...

 Thanks,
 Thomas




Fwd: Improve large array allocation / gc intrinsics

2014-02-11 Thread Laurent Bourgès
Please, could you give me your opinions on the following ideas for a JDK9
RFE ?

Is it worthwhile? Interesting, or totally stupid with the current hotspot / gc ?

How hard / long would it be to study and make a prototype ?

Any other ideas to improve array performance like improving again the array
bound check elimination or fill / clear intrinsics ...

Who would be interested by this topics ?

Laurent

-- Forwarded message --
From: Laurent Bourgès bourges.laur...@gmail.com
Date: 10 Feb 2014 10:24
Subject: Improve large array allocation / gc & intrinsics
To: core-libs-dev core-libs-dev@openjdk.java.net, discuss 
disc...@openjdk.java.net
Cc:

 Dear all,

 I would like to propose a JDK9 RFE to improve JVM efficiency when
 dealing with large arrays (allocation + gc).

 In several scientific applications (and my patched java2d pisces),
 many large arrays are needed, created on the fly and it becomes very
 painful to recycle them using an efficient array cache (concurrency,
 cache size tuning, clear + cache eviction issues).

 In this case, the GC overhead leads to a big performance penalty
 (several hundred megabytes per second).

 As such array caches are very efficient in an application context, I am
 wondering if that approach could be promoted at the JDK level itself:

 - provide a new array allocator for large arrays that can return
 larger arrays than expected (size = 4 or 8 multiples) using array
 caches (per thread ?) stored in a dedicated JVM large memory area (GC
 aware) and providing efficient cache eviction policies

 - may support both clean arrays (zero filled) and dirty arrays
 because some algorithms do not need zero-filled arrays.

 - improve JVM intrinsics (array clear, fill) to achieve maximum
 performance ie take into account the data alignment (4 or 8 multiples)

 - upgrade GC to recycle such 'cached' arrays (clean), update usage
 statistics and manage cache eviction policy (avoid wasting memory)

 Please give me your feedback & opinion and evaluate if this feature
 seems possible to implement into the JDK (hotspot, gc, core-libs) ...

 What is the procedure to create such JDK9 RFE ?

 Best regards,

 Laurent Bourgès



Improve large array allocation / gc intrinsics

2014-02-10 Thread Laurent Bourgès
Dear all,

I would like to propose a JDK9 RFE to improve JVM efficiency when
dealing with large arrays (allocation + gc).

In several scientific applications (and my patched java2d pisces),
many large arrays are needed, created on the fly and it becomes very
painful to recycle them using an efficient array cache (concurrency,
cache size tuning, clear + cache eviction issues).

In this case, the GC overhead leads to a big performance penalty
(several hundred megabytes per second).

As such array caches are very efficient in an application context, I am
wondering if that approach could be promoted at the JDK level itself:

- provide a new array allocator for large arrays that can return
larger arrays than expected (size = 4 or 8 multiples) using array
caches (per thread ?) stored in a dedicated JVM large memory area (GC
aware) and providing efficient cache eviction policies

- may support both clean arrays (zero filled) and dirty arrays
because some algorithms do not need zero-filled arrays.

- improve JVM intrinsics (array clear, fill) to achieve maximum
performance, i.e. take into account the data alignment (4 or 8 multiples)

- upgrade GC to recycle such 'cached' arrays (clean), update usage
statistics and manage cache eviction policy (avoid wasting memory)

Please give me your feedback & opinion and evaluate if this feature
seems possible to implement into the JDK (hotspot, gc, core-libs) ...
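For comparison, the library-level cache being discussed can be approximated with a per-thread, size-bucketed recycler; this is an illustrative sketch of the application-side technique (all names are made up), not the proposed JVM feature:

```java
// ArrayCacheSketch.java -- illustrative per-thread cache of the kind
// described above: size-bucketed int[] recycling, power-of-two lengths.
// A sketch of the application-side technique, not the proposed JVM feature.
public class ArrayCacheSketch {
    // one small bucket table per thread; index = ceil(log2(size))
    private static final ThreadLocal<int[][]> CACHE =
            ThreadLocal.withInitial(() -> new int[31][]);

    /** Returns an array of length >= minLen (possibly dirty, i.e. not zeroed). */
    public static int[] getDirty(int minLen) {
        int bucket = 32 - Integer.numberOfLeadingZeros(minLen - 1); // ceil log2
        int[][] cache = CACHE.get();
        int[] a = cache[bucket];
        if (a != null) {
            cache[bucket] = null;        // hand the cached instance out
            return a;
        }
        return new int[1 << bucket];     // rounded up to a power of two
    }

    /** Returns an array obtained from getDirty (power-of-two length) for reuse. */
    public static void recycle(int[] a) {
        int bucket = 31 - Integer.numberOfLeadingZeros(a.length); // exact log2
        CACHE.get()[bucket] = a;
    }

    public static void main(String[] args) {
        int[] a = getDirty(1000);
        recycle(a);
        int[] b = getDirty(1000);
        System.out.println(a == b);      // true: same instance reused
        System.out.println(a.length);    // 1024
    }
}
```

A production version would also need cache eviction and clearing policies, which is exactly the tuning burden the RFE wants to move into the JVM.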

What is the procedure to create such JDK9 RFE ?

Best regards,

Laurent Bourgès


Re: Conflicts between JVM application and j.u.l logging shutdown hooks

2013-12-12 Thread Laurent Bourgès
Alan,

Thanks for creating the bug 9005822 !

As I spotted in my initial email, both shutdown hook problems (JavaWS and
JUL) are due to the concurrent execution of shutdown hooks :

com.sun.imageio.stream.StreamCloser.java
101:  Runtime.getRuntime().addShutdownHook(streamCloser);

java.util.logging.LogManager.java
255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
For example, the JavaWS bug is caused by a closed jar file (unable to load
a class during shutdown) because (I guess) the StreamCloser closed all
opened jar files or JUL streams.

As I said, these 2 important hooks (StreamCloser and jul.Cleaner) should be
executed later, and the StreamCloser last.
As jul.Handlers can use sockets or files, it is important to first flush /
close the handlers (Cleaner) and then close any remaining open streams ...

I think this bug should be converted into a more general shutdown hook
issue:
- execute first application hooks (not JVM ones)
- fix priorities / order between all JVM hooks (~ 20 now)
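One application-side way to get deterministic ordering today (the single-hook workaround mentioned elsewhere in this thread) is to register one JVM shutdown hook and run sub-tasks by explicit priority inside it; a sketch with illustrative names:

```java
// OrderedHooks.java -- application-level workaround: register ONE real JVM
// shutdown hook and run sub-tasks in explicit priority order inside it
// (lower priority value runs first). Names are illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedHooks {
    // TreeMap iterates keys in ascending order => deterministic ordering
    private static final Map<Integer, List<Runnable>> TASKS = new TreeMap<>();

    public static synchronized void register(int priority, Runnable task) {
        TASKS.computeIfAbsent(priority, p -> new ArrayList<>()).add(task);
    }

    /** Runs all tasks, lowest priority first; one failure does not stop the rest. */
    public static synchronized void runAll() {
        for (List<Runnable> group : TASKS.values()) {
            for (Runnable t : group) {
                try { t.run(); } catch (Throwable ignored) { }
            }
        }
    }

    static {
        // the single real hook; ordering is handled entirely in runAll()
        Runtime.getRuntime().addShutdownHook(new Thread(OrderedHooks::runAll));
    }

    public static void main(String[] args) {
        register(10, () -> System.out.println("close app sockets"));
        register(20, () -> System.out.println("flush log handlers"));
        register(30, () -> System.out.println("close remaining streams"));
        // on JVM exit: sockets, then log handlers, then streams
    }
}
```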

Regards,
Laurent


Conflicts between JVM application and j.u.l logging shutdown hooks

2013-12-09 Thread Laurent Bourgès
Does anybody want to look at this logging issue I reported several months ago?

On 27 June 2013 at 11:57, Laurent Bourgès bourges.laur...@gmail.com
wrote:

 Dear members,

 I have a problem within an external library using one JVM shutdown hook
to perform resource cleanup (socket, lock file deletion ...) that seems
buggy in a Java Web Start environment: sometimes the JVM reports an
IllegalStateException happening during that shutdown hook.

 This library uses java.util.logging to log warnings and exceptions but
none is logged (console or java trace files).

 The 'lost' log messages is related to the LogManager's shutdown hook:
 // This private class is used as a shutdown hook.
 // It does a reset to close all open handlers.
 private class Cleaner extends Thread {

 private Cleaner() {
 /* Set context class loader to null in order to avoid
  * keeping a strong reference to an application classloader.
  */
 this.setContextClassLoader(null);
 }

 public void run() {
  // This is to ensure the LogManager.<clinit> is completed
 // before synchronized block. Otherwise deadlocks are
possible.
 LogManager mgr = manager;

 // If the global handlers haven't been initialized yet, we
 // don't want to initialize them just so we can close them!
 synchronized (LogManager.this) {
 // Note that death is imminent.
 deathImminent = true;
 initializedGlobalHandlers = true;
 }

 // Do a reset to close all active handlers.
 reset();
 }
 }

 Without any log handler, the log messages are lost during shutdown !
 I think it is annoying as it prevents me from getting logs and exceptions ...

 FYI, the ApplicationShutdownHooks class executes all 'application' hooks
concurrently:
 static void runHooks() {
  Collection<Thread> threads;
 synchronized(ApplicationShutdownHooks.class) {
 threads = hooks.keySet();
 hooks = null;
 }

 for (Thread hook : threads) {
 hook.start();
 }
 for (Thread hook : threads) {
 try {
 hook.join();
 } catch (InterruptedException x) { }
 }
 }
 So it is not possible to define hook priorities ...

 However, I looked at the JVM Shutdown class which supports hook
priorities and I think that the J.U.L Cleaner hook should run after all
application hooks.
 Here are the current Shutdown priorities:
 // The system shutdown hooks are registered with a predefined slot.
 // The list of shutdown hooks is as follows:
 // (0) Console restore hook
 // (1) Application hooks
 // (2) DeleteOnExit hook

 Moreover, as an RFE, it could be useful for applications to be able to
define hook priorities within an application:
 I did that several times registering only 1 shutdown hook and handling
correct ordering on my own ... but could be supported by the JDK itself.

 Finally, here are (below) the OpenJDK 8 usages of Runtime.addShutdownHook
[19 occurrences], which are more system hooks than application hooks:
 com.sun.imageio.stream
 StreamCloser.java
 StreamCloser
 addToQueue
 anonymous java.security.PrivilegedAction
 run
   101:  Runtime.getRuntime().addShutdownHook(streamCloser);
 java.util.logging
 LogManager.java
 LogManager
 LogManager
   255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
 sun.font
 SunFontManager.java
 SunFontManager
 createFont2D
 anonymous java.security.PrivilegedAction
 run
 2538:  Runtime.getRuntime().addShutdownHook(fileCloser);
 sun.rmi.server
 Activation.java
 Activation
 init
   240:  Runtime.getRuntime().addShutdownHook(shutdownHook);
 sun.tools.jstat

 sun.awt.shell
 Win32ShellFolderManager2.java
 Win32ShellFolderManager2
 ComInvoker
 ComInvoker
  anonymous java.security.PrivilegedAction<java.lang.Void>
 run
   493:  Runtime.getRuntime().addShutdownHook(
 sun.awt.windows
 WToolkit.java
 WToolkit
 registerShutdownHook
  anonymous java.security.PrivilegedAction<java.lang.Void>
 run
   285:  Runtime.getRuntime().addShutdownHook(shutdown);
 sun.java2d.d3d
 D3DScreenUpdateManager.java
 D3DScreenUpdateManager
 D3DScreenUpdateManager
 anonymous java.security.PrivilegedAction
 run
   112:  Runtime.getRuntime().addShutdownHook(shutdown);
 java.util.prefs
 FileSystemPreferences.java
 FileSystemPreferences
  anonymous java.security.PrivilegedAction<java.lang.Void>
 run
   439:  Runtime.getRuntime().addShutdownHook(new Thread() {
 sun.awt.X11
 XToolkit.java
 XToolkit
 init
  anonymous java.security.PrivilegedAction<java.lang.Void>
 run
   350:  Runtime.getRuntime().addShutdownHook(shutdownThread);
 sun.awt
 X11GraphicsDevice.java
 X11GraphicsDevice
 setDisplayMode
  anonymous java.security.PrivilegedAction<java.lang.Void>
 run
   445:  Runtime.getRuntime().addShutdownHook(t);
 java.util.prefs
 MacOSXPreferencesFile.java

Re: JVM application shutdown hooks

2013-08-07 Thread Laurent Bourgès
Dear core-libs members,

I finally succeeded in diagnosing my shutdown hook issue in a Java Web Start
environment: see Bug ID 9005822.

Could you please give me feedback on my questions related to the LogManager and
StreamCloser shutdown hooks ?


This library uses java.util.logging to log warnings and exceptions but none
 is logged (console or java trace files).

 The 'lost' log messages is related to the LogManager's shutdown hook:
 // This private class is used as a shutdown hook.
 // It does a reset to close all open handlers.
 private class Cleaner extends Thread {

 private Cleaner() {
 /* Set context class loader to null in order to avoid
  * keeping a strong reference to an application classloader.
  */
 this.setContextClassLoader(null);
 }

 public void run() {
 // This is to ensure the LogManager.<clinit> is completed
 // before synchronized block. Otherwise deadlocks are possible.
 LogManager mgr = manager;

 // If the global handlers haven't been initialized yet, we
 // don't want to initialize them just so we can close them!
 synchronized (LogManager.this) {
 // Note that death is imminent.
 deathImminent = true;
 initializedGlobalHandlers = true;
 }

 // Do a reset to close all active handlers.
 reset();
 }
 }

 I think it is annoying as it prevents me from getting logs and exceptions ...


Exactly what happened to me in a JavaWS env !! I had to patch the JSamp library
to add System.out and printStackTrace statements whereas there are plenty
of j.u.l. log statements !!

Please postpone that f*** hook after 'true' application hooks !!!


Moreover, as an RFE, it could be useful for applications to be able to
 define hook priorities within an application:
 I did that several times registering only 1 shutdown hook and handling
 correct ordering on my own ... but could be supported by the JDK itself.


Do you want me to submit a patch ? I think that the JVM specs should be
modified to change the addShutdownHook() signature => RFE ?


 Finally, here are (below) the Open JDK8 usages of Runtime.addShutdownHook
 [19 occurrences] which are more system hooks than application hooks:
 com.sun.imageio.stream
 StreamCloser.java
 StreamCloser
 addToQueue
 anonymous java.security.PrivilegedAction
 run
   101:  Runtime.getRuntime().addShutdownHook(streamCloser);
 java.util.logging
 LogManager.java
 LogManager
 LogManager
   255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
 sun.font
 SunFontManager.java
 SunFontManager
 createFont2D
 anonymous java.security.PrivilegedAction
 run
 2538:  Runtime.getRuntime().addShutdownHook(fileCloser);
 sun.rmi.server
 Activation.java
 Activation
 init
   240:  Runtime.getRuntime().addShutdownHook(shutdownHook);
 sun.tools.jstat

 sun.awt.shell
 Win32ShellFolderManager2.java
 Win32ShellFolderManager2
 ComInvoker
 ComInvoker
 anonymous java.security.PrivilegedActionjava.lang.Void
 run
   493:  Runtime.getRuntime().addShutdownHook(
 sun.awt.windows
 WToolkit.java
 WToolkit
 registerShutdownHook
 anonymous java.security.PrivilegedActionjava.lang.Void
 run
   285:  Runtime.getRuntime().addShutdownHook(shutdown);
 sun.java2d.d3d
 D3DScreenUpdateManager.java
 D3DScreenUpdateManager
 D3DScreenUpdateManager
 anonymous java.security.PrivilegedAction
 run
   112:  Runtime.getRuntime().addShutdownHook(shutdown);
 java.util.prefs
 FileSystemPreferences.java
 FileSystemPreferences
 anonymous java.security.PrivilegedActionjava.lang.Void
 run
   439:  Runtime.getRuntime().addShutdownHook(new Thread() {
 sun.awt.X11
 XToolkit.java
 XToolkit
 init
 anonymous java.security.PrivilegedActionjava.lang.Void
 run
   350:  Runtime.getRuntime().addShutdownHook(shutdownThread);
 sun.awt
 X11GraphicsDevice.java
 X11GraphicsDevice
 setDisplayMode
 anonymous java.security.PrivilegedActionjava.lang.Void
 run
   445:  Runtime.getRuntime().addShutdownHook(t);
 java.util.prefs
 MacOSXPreferencesFile.java
 MacOSXPreferencesFile
 timer
   356:  Runtime.getRuntime().addShutdownHook(flushThread);
 sun.font
 CFontManager.java
 CFontManager
 createFont2D
 anonymous java.security.PrivilegedActionjava.lang.Object
 run
   232:  Runtime.getRuntime().addShutdownHook(fileCloser);
 sun.lwawt
 LWToolkit.java
 LWToolkit
 init
 83:  Runtime.getRuntime().addShutdownHook(

 I hope these hooks already support concurrent execution (AWT?), but someone
 should check that there are no dependencies between them (the way systemd
 or rc scripts order theirs).


Could someone check that there are no dependencies between all these shutdown
hooks when they are executed in parallel, in random order?

Best regards,
Laurent


JVM application shutdown hooks

2013-06-27 Thread Laurent Bourgès

PS: as shutdown hooks are given by the application as Thread instances, I
also suspect a possible classloader leak. In a Java Web Start environment it
may be the cause of my problems, as I do not know whether the hook thread
still references the application classloader ...
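The LogManager.Cleaner constructor quoted at the top of this thread shows the JDK's own mitigation for exactly this leak; an application hook can do the same. A minimal sketch (the `SafeHook` name is illustrative):

```java
// Sketch mirroring the LogManager.Cleaner constructor quoted above: clear the
// inherited context classloader so the registered hook Thread does not pin
// the application classloader (e.g. under Java Web Start).
public class SafeHook extends Thread {
    public SafeHook() {
        // Drop the inherited context classloader to avoid a classloader leak.
        setContextClassLoader(null);
    }

    @Override
    public void run() {
        // application cleanup would go here
        System.out.println("cleanup done");
    }
}
```

Note that the thread still strongly references its own class (and thus its defining loader), so this only removes the *context* classloader reference; a hook defined by the application's own loader can still pin it.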

Laurent Bourgès


Re: RFR (XS) CR 8014233: java.lang.Thread should be @Contended

2013-05-10 Thread Laurent Bourgès
Peter,

you're absolutely right: I was thinking about thread-local values (the object
instances), not the ThreadLocal keys!
I find the ThreadLocal name confusing, as it names the key rather than the
values!

Several times I have wondered whether false sharing can happen between my
thread-local values (i.e. per-thread context objects) and any other object,
including other threads' contexts.

Is the GC (old gen) able to place objects in a thread-dedicated area? That
would avoid any false sharing between the object graphs dedicated to each
thread, i.e. thread isolation.

I think the TLAB does this for allocation of short-lived objects, but for the
old generation (long-lived objects) it is not the case: maybe G1 could
provide different partitioning and perhaps take thread affinity into account?

Laurent

2013/5/9 Peter Levart peter.lev...@gmail.com


 On 05/09/2013 04:59 PM, Laurent Bourgès wrote:

 Hi all,

 A stupid question:
 any ThreadLocal subclass should be marked @Contended to be sure that false
 sharing never happens between ThreadLocal instance and any other object on
 the heap ?


 Hi Laurent,

 ThreadLocal object is just a key (into a ThreadLocalMap). It's usually not
 subclassed to add any state but to override initialValue method.
 ThreadLocal contains a single final field 'threadLocalHashCode', which is
 read at every call to ThreadLocal.get() (usually by multiple threads). This
 can contend with a frequent write of a field in some other object, placed
 into it's proximity, yes, but I don't think we should put @Contended on
 every class that has frequently read fields. @Contended should be reserved
 for classes with fields that are frequently written, if I understand the
 concept correctly.

 Regards, Peter
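Peter's point that a ThreadLocal is usually subclassed only to override initialValue can be illustrated with a small example (names are illustrative, not from the discussion):

```java
// The ThreadLocal instance is only a key into each Thread's ThreadLocalMap;
// subclassing it merely supplies the initial per-thread value (JDK 8 also
// offers ThreadLocal.withInitial for the same purpose).
public class PerThreadCounter {
    private static final ThreadLocal<int[]> COUNTER = new ThreadLocal<int[]>() {
        @Override
        protected int[] initialValue() {
            return new int[1];   // each thread gets its own independent cell
        }
    };

    public static int increment() {
        return ++COUNTER.get()[0];
    }
}
```

The mutable state (the int[] cell) lives in the per-thread map, which is why contention concerns apply to the values, not to the shared key object with its single final hash field.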


 Laurent

  2013/5/9 Peter Levart peter.lev...@gmail.com

 Hi Aleksey,

 Wouldn't it be even better if just threadLocalRandom* fields were
 annotated with @Contended("ThreadLocal") ?
 Some fields within the Thread object are accessed from non-local threads.
 I don't know how frequently, but isolating just threadLocalRandom* fields
 from all possible false-sharing scenarios would seem even better, no?

 Regards, Peter


 On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:

 Hi,

 This is from our backlog after JDK-8005926. After ThreadLocalRandom
 state was merged into Thread, we now have to deal with the false sharing
 induced by heavily-updated fields in Thread. TLR was padded before, and
 it should make sense to make Thread bear @Contended annotation to
 isolate its fields in the same manner.

 The webrev is here:
 http://cr.openjdk.java.net/~shade/8014233/webrev.00/

 Testing:
   - microbenchmarks (see below)
   - JPRT cycle against jdk8-tl

 The extended rationale for the change follows.

 If we look at the current Thread layout, we can see the TLR state is
 buried within the Thread instance. TLR state are by far the mostly
 updated fields in Thread now:

  Running 64-bit HotSpot VM.
 Using compressed references with 3-bit shift.
 Objects are 8 bytes aligned.

 java.lang.Thread
offset  size type description
 012  (assumed to be the object
 header + first field alignment)
12 4  int Thread.priority
16 8 long Thread.eetop
24 8 long Thread.stackSize
32 8 long Thread.nativeParkEventPointer
40 8 long Thread.tid
48 8 long Thread.threadLocalRandomSeed
56 4  int Thread.threadStatus
60 4  int Thread.threadLocalRandomProbe
 64 4  int Thread.threadLocalRandomSecondarySeed
68 1  boolean Thread.single_step
69 1  boolean Thread.daemon
70 1  boolean Thread.stillborn
71 1  (alignment/padding gap)
72 4   char[] Thread.name
76 4   Thread Thread.threadQ
80 4 Runnable Thread.target
84 4  ThreadGroup Thread.group
88 4  ClassLoader Thread.contextClassLoader
 92 4 AccessControlContext Thread.inheritedAccessControlContext
96 4   ThreadLocalMap Thread.threadLocals
   100 4   ThreadLocalMap Thread.inheritableThreadLocals
   104 4   Object Thread.parkBlocker
   108 4Interruptible Thread.blocker
   112 4   Object Thread.blockerLock
   116 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
   120(object boundary, size
 estimate)
   VM reports 120 bytes per instance


 Assuming current x86 hardware with 64-byte cache line sizes and current
 class layout, we can see the trailing fields

Re: RFR (XS) CR 8014233: java.lang.Thread should be @Contended

2013-05-09 Thread Laurent Bourgès
Hi all,

A stupid question:
any ThreadLocal subclass should be marked @Contended to be sure that false
sharing never happens between ThreadLocal instance and any other object on
the heap ?

Laurent

2013/5/9 Peter Levart peter.lev...@gmail.com

 Hi Aleksey,

 Wouldn't it be even better if just threadLocalRandom* fields were
 annotated with @Contended("ThreadLocal") ?
 Some fields within the Thread object are accessed from non-local threads.
 I don't know how frequently, but isolating just threadLocalRandom* fields
 from all possible false-sharing scenarios would seem even better, no?

 Regards, Peter


 On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:

 Hi,

 This is from our backlog after JDK-8005926. After ThreadLocalRandom
 state was merged into Thread, we now have to deal with the false sharing
 induced by heavily-updated fields in Thread. TLR was padded before, and
 it should make sense to make Thread bear @Contended annotation to
 isolate its fields in the same manner.

 The webrev is here:
 
 http://cr.openjdk.java.net/~shade/8014233/webrev.00/

 Testing:
   - microbenchmarks (see below)
   - JPRT cycle against jdk8-tl

 The extended rationale for the change follows.

 If we look at the current Thread layout, we can see the TLR state is
 buried within the Thread instance. TLR state are by far the mostly
 updated fields in Thread now:

  Running 64-bit HotSpot VM.
 Using compressed references with 3-bit shift.
 Objects are 8 bytes aligned.

 java.lang.Thread
offset  size type description
 012  (assumed to be the object
 header + first field alignment)
12 4  int Thread.priority
16 8 long Thread.eetop
24 8 long Thread.stackSize
32 8 long Thread.nativeParkEventPointer
40 8 long Thread.tid
48 8 long Thread.threadLocalRandomSeed
56 4  int Thread.threadStatus
60 4  int Thread.threadLocalRandomProbe
 64 4  int Thread.threadLocalRandomSecondarySeed
68 1  boolean Thread.single_step
69 1  boolean Thread.daemon
70 1  boolean Thread.stillborn
71 1  (alignment/padding gap)
72 4   char[] Thread.name
76 4   Thread Thread.threadQ
80 4 Runnable Thread.target
84 4  ThreadGroup Thread.group
88 4  ClassLoader Thread.contextClassLoader
 92 4 AccessControlContext Thread.inheritedAccessControlContext
96 4   ThreadLocalMap Thread.threadLocals
   100 4   ThreadLocalMap Thread.inheritableThreadLocals
   104 4   Object Thread.parkBlocker
   108 4Interruptible Thread.blocker
   112 4   Object Thread.blockerLock
116 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
   120(object boundary, size estimate)
   VM reports 120 bytes per instance


 Assuming current x86 hardware with 64-byte cache line sizes and current
 class layout, we can see the trailing fields in Thread are providing
 enough insulation from the false sharing with an adjacent object. Also,
 the Thread itself is large enough so that two TLRs belonging to
 different threads will not collide.

 However the leading fields are not enough: we have a few words which can
 occupy the same cache line, but belong to another object. This is where
 things can get worse in two ways: a) the TLR update can make the field
 access in adjacent object considerably slower; and much worse b) the
 update in the adjacent field can disturb the TLR state, which is
 critical for j.u.concurrent performance relying heavily on fast TLR.

 To illustrate both points, there is a simple benchmark driven by JMH
 (http://openjdk.java.net/projects/code-tools/jmh/):

 http://cr.openjdk.java.net/~shade/8014233/threadbench.zip

 On my 2x2 i5-2520M Linux x86_64 laptop, running latest jdk8-tl and
 Thread with/without @Contended that microbenchmark yields the following
 results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:

 Accessing ThreadLocalRandom.current().nextInt():
baseline:932 +-  4 ops/usec
@Contended:  927 +- 10 ops/usec

 Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
baseline:454 +-  2 ops/usec
@Contended:  490 +-  3 ops/usec

 One might note the $uncaughtExceptionHandler is the trailing field in
 the Thread, 
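Before @Contended, the usual defense against this kind of false sharing was manual padding of the hot field. A sketch of the idiom (illustrative only, not the proposed patch, which uses the annotation instead):

```java
// Manual padding, the pre-@Contended idiom: unused long fields around the hot
// field push any neighbouring object's fields onto other cache lines, at the
// cost of roughly 112 extra bytes per instance on a 64-byte-cache-line CPU.
public class PaddedCounter {
    long p01, p02, p03, p04, p05, p06, p07;    // leading pad (~56 bytes)
    private volatile long value;               // the heavily written field
    long q01, q02, q03, q04, q05, q06, q07;    // trailing pad (~56 bytes)

    public long increment() {
        return ++value;   // demo only: a volatile ++ is not atomic under contention
    }

    public long get() {
        return value;
    }
}
```

The annotation achieves the same isolation without the hand-written fields, and without relying on the JIT not eliminating apparently dead padding.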

Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-17 Thread Laurent Bourgès
Jim,

thanks for taking an interest in my efforts!
As I had received almost no feedback, I was quite disappointed and was
starting to think that improving Pisces was not considered important ...

Here are ductus results and comparison (open office format):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ductus_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/compareRef_Patch.ods

test             threads ops Tavg    Tmed    stdDev  rms     Med+Stddev min     max
boulder_17       1       20  73,92%  69,34%  27,98%  69,34%  69,14%     69,81%  146,89%
boulder_17       2       20  110,86% 110,48% 613,80% 112,01% 125,43%    94,71%  136,35%
boulder_17       4       20  135,28% 135,86% 226,61% 136,46% 141,85%    125,14% 111,32%
shp_alllayers_47 1       20  82,25%  82,49%  47,50%  82,48%  82,30%     82,64%  78,08%
shp_alllayers_47 2       20  115,87% 115,67% 315,45% 115,85% 119,89%    109,33% 128,71%
shp_alllayers_47 4       20  218,59% 218,76% 169,38% 218,65% 216,45%    220,17% 206,17%

Ductus vs Patch:         1 thread: 80,74%   2 threads: 120,69%  4 threads: 205,92%
Reminder (Ref vs Patch): 1 thread: 237,71%  2 threads: 271,68%  4 threads: 286,15%

Note: I only have 2 cores + HT on my laptop and do not test with more
threads (64 like andrea).

2013/4/16 Jim Graham james.gra...@oracle.com

 If I'm reading this correctly, your patch is faster even for a single
 thread?  That's great news.


Not yet, but Ductus is now only 20% faster than my patch with one thread,
and it is 20% and 2x slower with 2 and 4 threads respectively:
I still hope to beat it by applying a few more optimizations:
- Renderer.ScanLine iterator / Renderer.endRendering can be improved
- avoid a few more array zero fills (resets) if possible
- add statistics to better choose initial array sizes ...
- use SunGraphics2D to hold a hard reference (temporarily) on the
RendererContext (to avoid many ThreadLocal.get() calls)
- cache eviction (WeakReference or SoftReference) ?
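The "hold a hard reference temporarily" item above is simply hoisting the ThreadLocal.get() out of the hot loop. A sketch with hypothetical names (`RendererContext` here is a stand-in, not the actual Pisces class):

```java
// Sketch with hypothetical names: fetch the per-thread context once per
// rendering operation and keep a hard local reference, instead of paying for
// a ThreadLocal.get() on every shape or tile.
public class RenderLoop {
    static final class RendererContext {
        long shapesRendered;                     // stands in for scratch state
    }

    private static final ThreadLocal<RendererContext> CTX =
            ThreadLocal.withInitial(RendererContext::new);

    public static long renderAll(int shapeCount) {
        RendererContext ctx = CTX.get();         // single lookup per operation
        for (int i = 0; i < shapeCount; i++) {
            ctx.shapesRendered++;                // hot path: plain field access
        }
        return ctx.shapesRendered;
    }
}
```

Storing that reference in SunGraphics2D for the duration of the operation, as proposed, is the same idea across a wider scope than a single method.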

Why not use divide and conquer (a thread pool) to boost single-thread
rendering when the machine has more CPU cores?
It would be helpful if the AATileGenerator had access to SunGraphics2D to
get rendering hints or to share variables (cache ...)

For the moment I did not modify the algorithms themselves, but I will do so
to enhance clipping (dasher segments / stroker) ...


 One of the problems we've had with replacing Ductus is that it has been
 faster in a single thread situation than the open source versions we've
 created.  One of its drawbacks is that it had been designed to take
 advantage of some AA-accelerating hardware that never came to be.  With the
 accelerator it would have been insanely fast, but hardware went in a
 different direction.  The problem was that this early design goal caused
 the entire library to be built around an abstraction layer that allowed for
 a single tile producer internally (because there would be only one -
 insanely fast - hardware chip available) and the software version of the
 abstraction layer thus had a lot of native static data structures
 (there's only one of me, right?) that prevented MT access.  It was probably
 solvable, but I'd be happier if someone could come up with a faster
 rasterizer, imagining that there must have been some sort of advancements
 in the nearly 2 decades since the original was written.

 If I'm misinterpreting and single thread is still slower than Ductus (or
 if it is still slower on some other platforms), then frowny face.


Not yet: still slower than Ductus by 20%, but 2x faster than the current
Pisces!


 Also, this is with the Java version, right?


Yes, my patch is pure Java, provided as a webrev:
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/


 We got a decent 2x speedup in FX by porting the version of Open Pisces
 that you started with to C code (all except on Linux where we couldn't find
 the right gcc options to make it faster than Hotspot).  So, we have yet to
 investigate a native version in the JDK which might provide even more
 gains...


Personally I prefer working on Java code: HotSpot performs many
optimizations for free, there are no raw pointers to deal with and, more
importantly, the concurrency primitives (thread locals, collections) are
readily available!

Laurent



 On 4/15/13 3:01 AM, Laurent Bourgès wrote:

 Jim, Andrea,

 I updated MapBench to provide test statistics (avg, median, stddev, rms,
 med + stddev, min, max) and CSV output (tab separator):
 http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/



 Here are the results (OpenJDK8 Ref vs Patched):
 http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_det.log
 http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_det.log

 test        threads ops Tavg     Tmed     stdDev    rms      Med+Stddev min      max
 boulder_17  1       20  180,22%  181,08%  1186,01%  181,17%  185,92%    176,35%  170,36%
 boulder_17  2       20  

Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-15 Thread Laurent Bourgès
Jim, Andrea,

I updated MapBench to provide test statistics (avg, median, stddev, rms,
med + stddev, min, max) and CSV output (tab separator):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/


Here are the results (OpenJDK8 Ref vs Patched):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_det.log

test             threads ops Tavg    Tmed    stdDev   rms     Med+Stddev min     max
boulder_17       1       20  180,22% 181,08% 1186,01% 181,17% 185,92%    176,35% 170,36%
boulder_17       2       20  183,15% 183,80% 162,68%  183,78% 183,17%    174,01% 169,89%
boulder_17       4       20  216,62% 218,03% 349,31%  218,87% 226,68%    172,15% 167,54%
shp_alllayers_47 1       20  243,90% 244,86% 537,92%  244,87% 246,39%    240,64% 231,00%
shp_alllayers_47 2       20  286,42% 287,07% 294,87%  287,07% 287,23%    277,19% 272,23%
shp_alllayers_47 4       20  303,08% 302,15% 168,19%  301,90% 295,90%    462,70% 282,41%

PATCH:
test             threads ops Tavg     Tmed     stdDev  rms      Med+Stddev min      max
boulder_17       1       20  110,196  109,244  0,529   109,246  109,773    108,197  129,327
boulder_17       2       40  127,916  127,363  3,899   127,423  131,262    125,262  151,561
boulder_17       4       80  213,085  212,268  14,988  212,796  227,256    155,512  334,407
shp_alllayers_47 1       20  1139,452 1134,858 5,971   1134,873 1140,829   1125,859 1235,746
shp_alllayers_47 2       40  1306,889 1304,598 28,157  1304,902 1332,755   1280,49  1420,351
shp_alllayers_47 4       80  2296,487 2303,81  112,816 2306,57  2416,626   1390,31  2631,455

REF:
test             threads ops Tavg     Tmed     stdDev  rms      Med+Stddev min      max
boulder_17       1       20  198,591  197,816  6,274   197,916  204,091    190,805  220,319
boulder_17       2       40  234,272  234,09   6,343   234,176  240,433    217,967  257,485
boulder_17       4       80  461,579  462,8    52,354  465,751  515,153    267,712  560,254
shp_alllayers_47 1       20  2779,133 2778,823 32,119  2779,009 2810,943   2709,285 2854,557
shp_alllayers_47 2       40  3743,255 3745,111 83,027  3746,031 3828,138   3549,364 3866,612
shp_alllayers_47 4       80  6960,23  6960,948 189,75  6963,533 7150,698   6432,945 7431,541
Linux 64 server vm
JVM: -Xms128m -Xmx128m (low mem)

Laurent

2013/4/14 Andrea Aime andrea.a...@geo-solutions.it

 On Tue, Apr 9, 2013 at 3:02 PM, Laurent Bourgès bourges.laur...@gmail.com
  wrote:

 Dear Java2D members,

 Could someone review the following webrev concerning Java2D Pisces to
 enhance its performance and reduce its memory footprint (RendererContext
 stored in thread local or concurrent queue):
 http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

 FYI I fixed file headers in this patch and signed my OCA 3 weeks ago.

 Remaining work:
 - cleanup (comments ...)
 - statistics to perform auto-tuning
 - cache / memory cleanup (SoftReference ?): use hints or System
 properties to adapt it to use cases
 - another problem: fix clipping performance in Dasher / Stroker for
 segments out of bounds

 Could somebody support me ? ie help me working on these tasks or just to
 discuss on Pisces algorithm / implementation ?


 Hi,
 I would like to express my support for this patch.
 Given that micro-benchmarks have already been run, I took the patch for a
 spin in a large, real world benchmark instead,
 the OSGeo WMS Shootout 2010 benchmark, for which you can see the results
 here:

 http://www.slideshare.net/gatewaygeomatics.com/wms-performance-shootout-2010

 The presentation is long, but suffice it to say all Java based
 implementations took quite the beating due to the
 poor scalability of Ductus with antialiased rendering of vector data (for
 an executive summary just look
 at slide 27 and slide 66, where GeoServer, Oracle MapViewer and
 Constellation SDI were the
 Java based ones)

 I took the same tests and run them again on my machine (different hardware
 than the tests, don't try to compare
 the absolute values), using Oracle JDK 1.7.0_17, OpenJDK 8 (a checkout a
 couple of weeks old) and the
 same, but with Laurent's patches applied.
 Here are the results, throughput (in maps generated per second) with the
 load generator (JMeter) going
 up from one client to 64 concurrent clients:

 Threads  JDK 1.7.0_17  OpenJDK 8, vanilla  OpenJDK 8 + Pisces improvements  Pisces gain, %
 1        13,97         12,43               13,03                            4,75%
 2        22,08         20,60               20,77                            0,81%
 4        34,36         33,15               33,36                            0,62%
 8        39,39         40,51               41,71                            2,96%
 16       40,61         44,57               46,98                            5,39%
 32       41,41         44,73               48,16                            7,66%
 64       37,09         42,19               45,28                            7,32%

 Well, first of all, congratulations to the JDK developers, don't know what
 changed in JDK 8, but
 GeoServer seems to like it quite a bit :-).
 That said, Laurent's patch also gives a visible boost, especially when
 several concurrent clients are
 asking for the maps.

 Mind, as I said, this is no micro-benchmark; it is a real application load
 doing a lot of I/O
 (from the operating system file cache), other processing before the data
 reaches the rendering
 pipeline, and then has to PNG encode the output BufferedImage (PNG
 encoding being rather
 expensive), so getting this speedup from just a change in the rendering
 pipeline is significant.

 Long story short... personally I'd be very

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-11 Thread Laurent Bourgès
Jim and Sergey,

1/ Here are a few benchmarks (based on MapBench again) running several
modified versions of AAShapePipe:
http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/
- ref:
1 threads and 20 loops per thread, time: 3742 ms
2 threads and 20 loops per thread, time: 4756 ms
4 threads and 20 loops per thread, time: 8528 ms

1 threads and 20 loops per thread, time: 56264 ms
2 threads and 20 loops per thread, time: 75566 ms
4 threads and 20 loops per thread, time: 141546 ms

- int4:
1 threads and 20 loops per thread, time: 3653 ms
2 threads and 20 loops per thread, time: 4684 ms
4 threads and 20 loops per thread, time: 8291 ms

1 threads and 20 loops per thread, time: 55950 ms
2 threads and 20 loops per thread, time: 74796 ms
4 threads and 20 loops per thread, time: 139924 ms

- byte[]:
1 threads and 20 loops per thread, time: 3795 ms
2 threads and 20 loops per thread, time: 4605 ms
4 threads and 20 loops per thread, time: 8246 ms

1 threads and 20 loops per thread, time: 54961 ms
2 threads and 20 loops per thread, time: 72768 ms
4 threads and 20 loops per thread, time: 139430 ms

- int4 / byte[] / rectangle cached in TileState:
1 threads and 20 loops per thread, time: 3610 ms
2 threads and 20 loops per thread, time: 4481 ms
4 threads and 20 loops per thread, time: 8225 ms

1 threads and 20 loops per thread, time: 54651 ms
2 threads and 20 loops per thread, time: 74516 ms
4 threads and 20 loops per thread, time: 140153 ms

So you may be right: results vary depending on the optimizations (int4,
byte[], or all of them)!
Maybe I should test the different versions against the optimized Pisces
renderer ...

Here is an updated patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/


2/ Thanks for your comments: actually a refactoring is possible to use a
(shared) TileState instance replacing the int[] bbox and Rectangle bbox:
- RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)

It would be very interesting here to propose an extensible tile state, maybe
created by the rendering engine so that it can cache other data?

- Rectangle and Rectangle2D are only used as the shape s and device
rectangle given to CompositePipe.startSequence():
public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
int[] abox);

Changing this interface may become difficult:
AlphaColorPipe.java:
41:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
OK, [s, dev, abox] unused

AlphaPaintPipe.java
81:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
create a paint context:
PaintContext paintContext =
sg.paint.createContext(sg.getDeviceColorModel(),
   devR,
   s.getBounds2D(),
   sg.cloneTransform(),
   sg.getRenderingHints());

GeneralCompositePipe.java:
62:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
abox unused and create a paint context:
PaintContext paintContext =
sg.paint.createContext(model, devR, s.getBounds2D(),
   sg.cloneTransform(),
   hints);

SpanClipRenderer.java
68:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
Forward to another composite pipe
return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));

It would be possible to use TileState in the PaintContext interface and fix
the implementations, but it may become a tricky change (an API change).

What do you think ?
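A minimal sketch of the TileState idea, with purely illustrative names and fields (the real class would be shaped by the pipeline's needs):

```java
// Hypothetical sketch of the TileState suggestion: one per-thread object
// that carries the tile bounding box (replacing the int[4] 'abox' parameter)
// and reusable scratch data through the whole tile-generation sequence.
public class TileState {
    private int x0, y0, x1, y1;                      // tile bounding box
    final byte[] alphaTile = new byte[32 * 32];      // reusable alpha buffer

    public void setBBox(int x0, int y0, int x1, int y1) {
        this.x0 = x0; this.y0 = y0;
        this.x1 = x1; this.y1 = y1;
    }

    public int width()  { return x1 - x0; }
    public int height() { return y1 - y0; }

    // one instance per thread, fetched once per rendering sequence
    public static final ThreadLocal<TileState> LOCAL =
            ThreadLocal.withInitial(TileState::new);
}
```

Passing one such object through the pipeline would replace both the int[4] box and the separate scratch arrays, and gives the rendering engine a natural place to cache more per-sequence data later.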

Laurent

2013/4/11 Jim Graham james.gra...@oracle.com

 I'm pretty familiar with all of this code and there aren't any places that
 save the tile array that I remember.  The embedded code that Pisces was
 taken from had some caching of alpha arrays, but we didn't use or keep that
 when we converted it for use in the JDK...

 It occurs to me that since you are collecting the various pieces of
 information into an object to store in the thread local storage, perhaps we
 should convert to a paradigm where an entire Tile Generation sequence uses
 that object TileState? as its main way to communicate info around the
 various stages.  Thus, you don't really need an int[4] to store the 4
 parameters, they could be stored directly in the TileState object. This
 would require more sweeping changes to the pipeline, but it might make the
 code a bit more readable (and make the hits to convert over more moot as
 they would be improving readability and give more focus to the
 relationships between all of the various bits of data).  Then it simply
 becomes a matter of managing the lifetime and allocation of the TileState
 objects which is a minor update to the newly refactored code.

 ...jim

 On 4/10/13 3:59 PM, Sergey Bylokhov wrote:

 On 4/10/13 11:46 PM, Laurent Bourgès wrote:

 I see that some methods which take it as argument doesn't use them. And
 most of the time we pass AATileGenerator and abox[] to the 

Fwd: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Anthony, Mandy,

here is the 4th patch:
http://jmmc.fr/~bourgesl/share/webrev-8010297.4/

It only contains awt / swing / net classes that use PlatformLogger (no code
modification).

Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-11 Thread Laurent Bourgès
Last idea: I will enhance Andrea's MapBench benchmark to report per-thread
statistics: number of loops, avg, min, max, stddev.

I suspect the total bench time is not very representative, as the thread
pool can distribute the workload differently on each run; per-thread
statistics will allow better timing comparisons between bench runs.
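The per-thread statistics mentioned are standard; a minimal sketch of computing them from one thread's loop timings (illustrative class name, population stddev assumed):

```java
import java.util.Arrays;

// Minimal sketch: the per-thread statistics the benchmark would report
// (avg, median, stddev), computed from one thread's loop timings.
public class BenchStats {
    public static double avg(double[] t) {
        double sum = 0;
        for (double v : t) sum += v;
        return sum / t.length;
    }

    public static double median(double[] t) {
        double[] s = t.clone();        // sort a copy, keep input order intact
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static double stddev(double[] t) {
        double m = avg(t), acc = 0;
        for (double v : t) acc += (v - m) * (v - m);
        return Math.sqrt(acc / t.length);   // population stddev
    }
}
```

Reporting median together with stddev, as in the tables above, makes it easier to spot runs skewed by a few slow outliers (e.g. GC pauses).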

Regards,
Laurent

2013/4/11 Laurent Bourgès bourges.laur...@gmail.com

 Jim and Sergey,

 1/ Here are few benchmarks (based on mapBench again) running several
 modified versions of AAShapePipe:
 http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/
 - ref:
 1 threads and 20 loops per thread, time: 3742 ms
 2 threads and 20 loops per thread, time: 4756 ms
 4 threads and 20 loops per thread, time: 8528 ms

 1 threads and 20 loops per thread, time: 56264 ms
 2 threads and 20 loops per thread, time: 75566 ms
 4 threads and 20 loops per thread, time: 141546 ms

 - int4:
 1 threads and 20 loops per thread, time: 3653 ms
 2 threads and 20 loops per thread, time: 4684 ms
 4 threads and 20 loops per thread, time: 8291 ms

 1 threads and 20 loops per thread, time: 55950 ms
 2 threads and 20 loops per thread, time: 74796 ms
 4 threads and 20 loops per thread, time: 139924 ms

 - byte[]:
 1 threads and 20 loops per thread, time: 3795 ms
 2 threads and 20 loops per thread, time: 4605 ms
 4 threads and 20 loops per thread, time: 8246 ms

 1 threads and 20 loops per thread, time: 54961 ms
 2 threads and 20 loops per thread, time: 72768 ms
 4 threads and 20 loops per thread, time: 139430 ms

 - int4 / byte[] / rectangle cached in TileState:
 1 threads and 20 loops per thread, time: 3610 ms
 2 threads and 20 loops per thread, time: 4481 ms
 4 threads and 20 loops per thread, time: 8225 ms

 1 threads and 20 loops per thread, time: 54651 ms
 2 threads and 20 loops per thread, time: 74516 ms
 4 threads and 20 loops per thread, time: 140153 ms

 So you may be right, results are varying depending on the optimizations
 (int4, byte or all) !
 Maybe I should test different versions on optimized pisces renderer ...

 Here is an updated patch:
 http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/


 2/ Thanks for your comments: actually a refactoring is possible to use a
 (shared) TileState instance replacing int[] bbox, rectangle bbox):
 - RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)

 it is very interesting here to propose an extensible tile state: maybe
 created by the renderer engine to cache other data ?

 - Rectangle and Rectangle2D are only used as the shape s and device
 rectangle given to CompositePipe.startSequence():
 public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
 int[] abox);

 Changing this interface may become difficult:
 AlphaColorPipe.java:

 41:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
 OK, [s, dev, abox] unused

 AlphaPaintPipe.java

 81:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
 devR,
 create a paint context:
 PaintContext paintContext =
 sg.paint.createContext(sg.getDeviceColorModel(),
devR,
s.getBounds2D(),
sg.cloneTransform(),
sg.getRenderingHints());

 GeneralCompositePipe.java:

 62:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
 devR,
 abox unused and create a paint context:
 PaintContext paintContext =
 sg.paint.createContext(model, devR, s.getBounds2D(),
sg.cloneTransform(),
hints);

 SpanClipRenderer.java

 68:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
 devR,
 Forward to another composite pipe
 return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));

 It could be possible to use TileState into PaintContext interface / fix
 implementations but it may become a tricky change (API change).

 What do you think ?

 Laurent


 2013/4/11 Jim Graham james.gra...@oracle.com

 I'm pretty familiar with all of this code and there aren't any places
 that save the tile array that I remember.  The embedded code that Pisces
 was taken from had some caching of alpha arrays, but we didn't use or keep
 that when we converted it for use in the JDK...

 It occurs to me that since you are collecting the various pieces of
 information into an object to store in the thread local storage, perhaps we
 should convert to a paradigm where an entire Tile Generation sequence uses
 that object TileState? as its main way to communicate info around the
 various stages.  Thus, you don't really need an int[4] to store the 4
 parameters, they could be stored directly in the TileState object. This
 would require more sweeping changes to the pipeline, but it might make the
 code a bit more readable (and make the hits to convert over more moot as
 they would be improving readability

Re: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Mandy,

I'd really like to see if we can avoid the boilerplate if
(isLoggable(...)) logger.fine(), to help ease of development, and I
filed an RFE (8012006).

Agreed, but there is no easy way to get both clear code and performance:
- maybe the JDK 8 Supplier overloads can help
- or a shorter syntax: logger.isDebug(), or testing the effective level
directly: if (logger.isDebug)
- an annotation syntax, @logCheck(), but then it is not straightforward to
guess the correct level of the log statement ...

I don't understand if I should fix it or not ?

src/solaris/classes/sun/awt/X11/XListPeer.java
 Nit: line 1906 you remove isLoggable call here.  Was it intentional (as it
 doesn't call concatenate any string?)?  I think it's better to use the
 pattern consistently.


it's a mistake (cookie).


 Approved and no need to regenerate a new webrev if you fix the above nit.


To fix it, I need to send you files as a new webrev ?

Laurent


Re: [AWT Dev] [Swing Dev] sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Anthony,

Here is the updated webrev:
http://jmmc.fr/~bourgesl/share/webrev-8010297.5/


Laurent

2013/4/11 Mandy Chung mandy.ch...@oracle.com

  On 4/11/13 8:43 AM, Laurent Bourgès wrote:


 I don't understand if I should fix it or not ?

   src/solaris/classes/sun/awt/X11/XListPeer.java
 Nit: line 1906 you remove isLoggable call here.  Was it intentional (as
 it doesn't call concatenate any string?)?  I think it's better to use the
 pattern consistently.


 it's a mistake (cookie).


 Please fix it.




  Approved and no need to regenerate a new webrev if you fix the above nit.


 To fix it, I need to send you files as a new webrev ?


 Anthony is going to sponsor for you and I think he asks for a webrev.  So
 please send the latest webrev with this fix then.

 Mandy


 Laurent





Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Andrea,
I am running benchmarks on my laptop (i7 - 2 core 2.8Ghz + HT = 4 virtual
cpus) on linux 64 (fedora 14).
Note: I always use cpufrequtils to set the cpu governor to performance and
use fixed frequency = 2.8Ghz:
[bourgesl@jmmc-laurent ~]$ uname -a
Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP Wed
Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

2013/4/10 Andrea Aime andrea.a...@geo-solutions.it

 On Tue, Apr 9, 2013 at 7:34 PM, Laurent Bourgès bourges.laur...@gmail.com
  wrote:

 Also, this should be tested on multiple platforms, preferably Linux,
 Windows and Mac to see how it is affected by differences in the platform
 runtimes and threading (hopefully minimal).

 It appears more difficult for me: I can use at work a mac 10.8 and I can
 run Windows XP within virtual box (but it is not very representative).


 I believe I can run MapBench on my Linux 64bit box during the next
 weekend, that would add a platform, and one were the
 server side behavior is enabled by default. And hopefully run the other
 benchmark as well.


I also run j2DBench but I can try also Java2D.demos to perform regression
tests.



 Laurent, have you made any changes to MapBench since I've sent it to you?


Yes I fixed a bit (cached BasicStroke, reused BufferedImage / Graphics) and
added explicit GC before tests (same initial conditions):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Look at MapBench-src.zip
(http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip)
for test changes.

Thanks for your efforts,
Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Dear Jim,

2013/4/9 Jim Graham james.gra...@oracle.com


 The allocations will always show up on a heap profiler, I don't know of
 any way of having them not show up if they are stack allocated, but I don't
 think that stack allocation is the issue here - small allocations come out
 of a fast generation that costs almost nothing to allocate from and nearly
 nothing to clean up.  They are actually getting allocated and GC'd, but the
 process is optimized.

 The only way to tell is to benchmark and see which changes make a
 difference and which are in the noise (or, in some odd counter-intuitive
 cases, counter-productive)...

 ...jim


I admit I like GC because it saves Java from dealing with pointers the way
C/C++ does; however, I prefer that GC clean real garbage (application data)
rather than wasted memory:
I prefer not to rely on GC when I can avoid wasting memory that gives GC more
work, i.e. reduce useless garbage (save the planet)!

Moreover, GC and/or thread-local allocation (TLAB) seem to have more
overhead than you think ("a fast generation that costs almost nothing to
allocate from and nearly nothing to clean up").

Here are my micro-benchmark results related to int[4] allocation, where I
mimic the AAShapePipe.fillParallelogram() method (ns/op, decimal comma):

Threads | Patch | Ref   | Gain
  1     |  5,96 |  8,27 | 138,76%
  2     |  7,31 | 14,96 | 204,65%
  3     | 10,65 | 20,40 | 191,55%
  4     | 15,44 | 29,83 | 193,20%
Test environment:
Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
JVM settings:
-XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m  -Xmx128m

Benchmark code (using Peter Levart microbench classes):
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

My conclusion is: "almost nothing" is not zero (allocation + cleanup), and it
is very noticeable in the multi-threaded tests.

I admit that I use a dirty int[4] array (its contents are not cleaned up),
which is not strictly necessary: maybe part of the performance gain comes
from that.


Finally here is the output with  -XX:+PrintTLAB flag:
TLAB: gc thread: 0x7f105813d000 [id: 4053] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 20
waste  1,2% gc: 323712B slow: 600B fast: 0B
TLAB: gc thread: 0x7f105813a800 [id: 4052] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste
7,9% gc: 745568B slow: 176B fast: 0B
TLAB: gc thread: 0x7f1058138800 [id: 4051] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 15
waste  3,1% gc: 618464B slow: 448B fast: 0B
TLAB: gc thread: 0x7f1058136800 [id: 4050] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste
0,0% gc: 0B slow: 232B fast: 0B
TLAB: gc thread: 0x7f1058009000 [id: 4037] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 1 waste
27,5% gc: 369088B slow: 0B fast: 0B
TLAB totals: thrds: 5  refills: 50 max: 20 slow allocs: 0 max 0 waste:
3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B

I would have expected that TLAB could recycle all the useless int[4] arrays
as fast as possible, i.e. waste = 100% ???

Is there any bug in TLAB (core-libs)?
Should I send such an issue to the hotspot team?

Test using ThreadLocal AAShapePipeContext:
{
AAShapePipeContext ctx = getThreadContext();
int abox[] = ctx.abox;

// use array:
// mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,
dx1, dy1, dx2, dy2, 0, 0, clip, abox);
abox[0] = 7;
abox[1] = 11;
abox[2] = 13;
abox[3] = 17;

// mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
devNull1.yield(abox);

if (!useThreadLocal) {
restoreContext(ctx);
}
}

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
 JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
#   4 threads, Tavg = 13,84 ns/op (σ =   0,23 ns/op), Total ops
=   2889056179 [13,93 (717199825), 13,87 (720665624), 13,48
(741390545), 14,09 (709800185)]
#   4 threads, Tavg = 14,25 ns/op (σ =   0,57 ns/op), Total ops
=   2811615084 [15,21 (658351236), 14,18 (706254551), 13,94
(718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ...
cleanup done.
# Measure:
1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total ops =
1678357614 [ 5,96 (1678357614)]
2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total ops =
2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total ops =
2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total ops =
2589897733 [17,05 (586353618), 19,23 

Re: [AWT Dev] [Swing Dev] sun.awt.X11 logs still using String + (waste)

2013-04-10 Thread Laurent Bourgès
Anthony or Mandy,
Could you ask the JMX / Security groups to review my patch for me?

I am currently not registered to these mailing lists.

Do you ask me to split the patch in two part: PlatformLogger on one side
and Logger on the other side ?

Laurent

2013/4/9 Mandy Chung mandy.ch...@oracle.com

  On 4/9/13 12:37 AM, Laurent Bourgès wrote:

 Mandy,

 first I would like to have the given patch applied to OpenJDK 8 (+ JDK7u)
 as it fixes many problems:
 - string concatenations
 - object creations (Object[] or other objects given as params)
 - method calls in log statements

  This is the patch you refer to:
 http://jmmc.fr/~bourgesl/share/webrev-8010297.3/

 I agree that we should separate the fix to reduce the memory usage from
 the fix to convert JUL to PlatformLogger.  I skimmed on your patch -
 awt/swing uses PlatformLogger and your change looks fine.  You have also
 got Anthony's approval.  Other component security, jmx, snmp, etc are using
 JUL.  I would suggest to fix the use of PlatformLogger in your first patch
 and AWT/Swing is the main area that 8010297 is concerned about so that you
 can resolve 8010297 soon.

 If you want to move ahead to fix use of JUL, you can send your patch to
 serviceability-...@openjdk.java.net and security-...@openjdk.java.net for
 the other areas to review and sponsor.

 Hope this helps.
 Mandy


 In a second step, I can help somebody migrating JUL usages to
 PlatformLogger but it requires PlatformLogger API changes (logp to give
 class and method names)...

 Other comments below:


 Peter's idea is a good one to add a couple of convenient methods to take
 Object parameters that will avoid the creation of array instances.  I'd
 also like to know what variants being used in the jdk to determine the pros
 and cons of the alternatives (this issue also applies to java.util.logging).


 50% of given object parameters are created via method calls so it will not
 be enough.

 Moreover, I prefer to always have isLoggable() calls, to avoid having to
 look for and understand special cases (more than 2 args or no string concat
 ...); besides, it would be easier to implement code checks testing for
 missing isLoggable() calls vs conditional and specific checks (string
 concat, more than 2 args, method calls ...)

 Finally, I prefer having shorter and clearer method names like isFine(),
 isFinest() instead of isLoggable(PlatformLogger.FINE) that is quite ugly
 ...


  It was intentional to have PlatformLogger define only the useful methods
 necessary for jdk use.   The goal of PlatformLogger was to provide a
 light-weight utility for jdk to log debugging messages and eliminate the
 dependency to java.util.logging and avoid the unnecessary j.u.logging
 initialization (as disabled by default).   It was not a goal for
 PlatformLogger to be an exact mirror as java.util.logging.Logger.  My
 advice is to tune PlatformLogger to provide API that helps the platform
 implementation while avoid bloating the API.


 Agreed.

 Maybe JUL Logger.logp(classname, methodname) usages should be rewritten to
 use PlatformLogger.getCallerInfo() instead:
 // Returns the caller's class and method's name; best effort
 // if cannot infer, return the logger's name.
 private String getCallerInfo() {
 String sourceClassName = null;
 String sourceMethodName = null;

 JavaLangAccess access = SharedSecrets.getJavaLangAccess();
 Throwable throwable = new Throwable();
 int depth = access.getStackTraceDepth(throwable);

 String logClassName = "sun.util.logging.PlatformLogger";
 boolean lookingForLogger = true;
 for (int ix = 0; ix < depth; ix++) {
 // Calling getStackTraceElement directly prevents the VM
 // from paying the cost of building the entire stack frame.
 StackTraceElement frame =
 access.getStackTraceElement(throwable, ix);
 String cname = frame.getClassName();
 if (lookingForLogger) {
 // Skip all frames until we have found the first logger frame.
 if (cname.equals(logClassName)) {
 lookingForLogger = false;
 }
 } else {
 if (!cname.equals(logClassName)) {
 // We've found the relevant frame.
 sourceClassName = cname;
 sourceMethodName = frame.getMethodName();
 break;
 }
 }
 }

 if (sourceClassName != null) {
 return sourceClassName + " " + sourceMethodName;
 } else {
 return name;
 }
 }
 }



 Since you are touching some jdk files that use java.util.logging, would
 you like to contribute to convert them to use PlatformLogger:
http://bugs.sun.com/bugdatabase

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Peter,

1/ I modified your TestRunner class to print total ops and perform explicit
GC before runTests:
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

2/ I applied your advice but it does not change much:

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
  JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # ContextGetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 13,84 ns/op (σ =   0,23
ns/op), Total ops =   2889056179 [13,93 (717199825), 13,87
(720665624), 13,48 (741390545), 14,09 (709800185)]
 #   4 threads, Tavg = 14,25 ns/op (σ =   0,57
ns/op), Total ops =   2811615084 [15,21 (658351236), 14,18
(706254551), 13,94 (718202949), 13,74 (728806348)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total
ops =   1678357614 [ 5,96 (1678357614)]
 2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total
ops =   2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
 3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total
ops =   2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
 4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total
ops =   2589897733 [17,05 (586353618), 19,23 (519345153), 17,88
(559401974), 10,81 (924796988)]

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:-TieredCompilation -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
  JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # GetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 31,56 ns/op (σ =   0,43
ns/op), Total ops =   1267295706 [31,30 (319512554), 31,02
(32229), 32,12 (311334550), 31,82 (314155269)]
 #   4 threads, Tavg = 30,75 ns/op (σ =   1,81
ns/op), Total ops =   1302123211 [32,21 (310949394), 32,37
(309275124), 27,87 (359125007), 31,01 (322773686)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  8,36 ns/op (σ =   0,00 ns/op), Total
ops =   1196238323 [ 8,36 (1196238323)]
 2 threads, Tavg = 14,95 ns/op (σ =   0,04 ns/op), Total
ops =   1337648720 [15,00 (666813210), 14,91 (670835510)]
 3 threads, Tavg = 20,65 ns/op (σ =   0,99 ns/op), Total
ops =   1453119707 [19,57 (511100480), 21,97 (455262170), 20,54
(486757057)]
 4 threads, Tavg = 30,76 ns/op (σ =   0,54 ns/op), Total
ops =   1301090278 [31,51 (317527231), 30,79 (324921525), 30,78
(325063322), 29,99 (333578200)]
 #
  JVM END

3/ I tried several heap settings: without Xms/Xmx ... but it has almost no
effect.

Should I play with TLAB resize / initial size? Or a different GC collector
(G1 ...)?

Can anybody explain to me what the PrintTLAB output means?

Laurent

2013/4/10 Peter Levart peter.lev...@gmail.com

  Hi Laurent,

 Could you disable tiered compilation for performance tests? Tiered
 compilation is usually a source of jitter in the results. Pass
 -XX:-TieredCompilation to the VM.

 Regards, Peter



 On 04/10/2013 10:58 AM, Laurent Bourgès wrote:

  Dear Jim,

 2013/4/9 Jim Graham james.gra...@oracle.com


 The allocations will always show up on a heap profiler, I don't know of
 any way of having them not show up if they are stack allocated, but I don't
 think that stack allocation is the issue here - small allocations come out
 of a fast generation that costs almost nothing to allocate from and nearly
 nothing to clean up.  They are actually getting allocated and GC'd, but the
 process is optimized.

 The only way to tell is to benchmark and see which changes make a
 difference and which are in the noise (or, in some odd counter-intuitive
 cases, counter-productive)...

 ...jim


 I advocate I like GC because it avoids in Java dealing with pointers like
 C/C++ does; however, I prefer GC clean real garbage (application...) than
 wasted memory:
 I prefer not count on GC when I can avoid wasting memory that gives GC
 more work = reduce useless garbage (save the planet) !

 Moreover, GC and / or Thread local allocation (TLAB) seems to have more
 overhead than you think = fast

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Sergey,

I am not an expert here but just my 50 cents.
 This optimization shall take place only if it is really hotspot. But if it
 is a really hotspot - probably it would be better to remove these
 array/object allocation at all and use plane bytes?


Java2D calls AAShapePipe for each shape (line, rectangle ...) being rendered,
so it is a hotspot for me on big drawings, as cost depends on the drawing
complexity (for example, Andrea's MapBench can produce maps having more than
100 000 shapes per image ...)


 I see that some methods which take it as argument doesn't use them. And
 most of the time we pass AATileGenerator and abox[] to the same methods, so
 it could be merged?


For now I did not want to modify the AAShapePipe signatures: abox[] is
filled by AATileGenerator implementations (ductus, pisces, others) in order
to have the shape bounds and render only tiles covering this area.



 Also I suggest to use jmh for java microbenchmarks.
 http://openjdk.java.net/projects/code-tools/jmh
 So your test will be:
 http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java


Thanks,
I will try it asap

Laurent




 --
 Best regards, Sergey.




Re: [AWT Dev] [Swing Dev] sun.awt.X11 logs still using String + (waste)

2013-04-09 Thread Laurent Bourgès
Mandy,

first I would like to have the given patch applied to OpenJDK 8 (+ JDK7u)
as it fixes many problems:
- string concatenations
- object creations (Object[] or other objects given as params)
- method calls in log statements

In a second step, I can help somebody migrating JUL usages to
PlatformLogger but it requires PlatformLogger API changes (logp to give
class and method names)...

Other comments below:


 Peter's idea is a good one to add a couple of convenient methods to take
 Object parameters that will avoid the creation of array instances.  I'd
 also like to know what variants being used in the jdk to determine the pros
 and cons of the alternatives (this issue also applies to java.util.logging).


50% of given object parameters are created via method calls so it will not
be enough.

Moreover, I prefer to always have isLoggable() calls, to avoid having to look
for and understand special cases (more than 2 args or no string concat ...);
besides, it would be easier to implement code checks testing for missing
isLoggable() calls vs conditional and specific checks (string concat, more
than 2 args, method calls ...)

Finally, I prefer having shorter and clearer method names like isFine(),
isFinest() instead of isLoggable(PlatformLogger.FINE) that is quite ugly
...


 It was intentional to have PlatformLogger define only the useful methods
 necessary for jdk use.   The goal of PlatformLogger was to provide a
 light-weight utility for jdk to log debugging messages and eliminate the
 dependency to java.util.logging and avoid the unnecessary j.u.logging
 initialization (as disabled by default).   It was not a goal for
 PlatformLogger to be an exact mirror as java.util.logging.Logger.  My
 advice is to tune PlatformLogger to provide API that helps the platform
 implementation while avoid bloating the API.


Agreed.

Maybe JUL Logger.logp(classname, methodname) usages should be rewritten to
use PlatformLogger.getCallerInfo() instead:
// Returns the caller's class and method's name; best effort
// if cannot infer, return the logger's name.
private String getCallerInfo() {
String sourceClassName = null;
String sourceMethodName = null;

JavaLangAccess access = SharedSecrets.getJavaLangAccess();
Throwable throwable = new Throwable();
int depth = access.getStackTraceDepth(throwable);

String logClassName = "sun.util.logging.PlatformLogger";
boolean lookingForLogger = true;
for (int ix = 0; ix < depth; ix++) {
// Calling getStackTraceElement directly prevents the VM
// from paying the cost of building the entire stack frame.
StackTraceElement frame =
access.getStackTraceElement(throwable, ix);
String cname = frame.getClassName();
if (lookingForLogger) {
// Skip all frames until we have found the first logger frame.
if (cname.equals(logClassName)) {
lookingForLogger = false;
}
} else {
if (!cname.equals(logClassName)) {
// We've found the relevant frame.
sourceClassName = cname;
sourceMethodName = frame.getMethodName();
break;
}
}
}

if (sourceClassName != null) {
return sourceClassName + " " + sourceMethodName;
} else {
return name;
}
}
}



 Since you are touching some jdk files that use java.util.logging, would
 you like to contribute to convert them to use PlatformLogger:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7054233

 It's perfectly fine to convert only some of them if not all.


I want to help, but as you said it should be done in collaboration because it
requires API changes (JMX, RMI ...); I prefer to do it after this patch is
reviewed and submitted, in the coming weeks.



 Jim Gish is the owner of logging and I will check with him if he has cycles
 to work with you on this.


Great.

Cheers,
Laurent


[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-09 Thread Laurent Bourgès
Dear Java2D members,

Could someone review the following webrev concerning Java2D Pisces to
enhance its performance and reduce its memory footprint (RendererContext
stored in thread local or concurrent queue):
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

FYI I fixed file headers in this patch and signed my OCA 3 weeks ago.

Remaining work:
- cleanup (comments ...)
- statistics to perform auto-tuning
- cache / memory cleanup (SoftReference ?): use hints or System properties
to adapt it to use cases
- another problem: fix clipping performance in Dasher / Stroker for
segments out of bounds

Could somebody support me? I.e. help me work on these tasks, or just discuss
the Pisces algorithm / implementation?

Best regards,
Laurent Bourgès


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-09 Thread Laurent Bourgès
Dear Jim,

I admit I only looked at the NetBeans memory profiler's output: no more
megabytes allocated!

The main question is: how can I know how GC / hotspot deals with such small
allocations? Is there any JVM flag that shows real allocations, as
jmap -histo does?


 Quick questions - which benchmarks were run before/after?  I see a lot of
 benchmark running in your Pisces improvement thread, but none here.


Agreed; I can try running J2DBench on this fix only. I generally run
Andrea's MapBench as it appears more complex and uses multiple threads.


 Also, this should be tested on multiple platforms, preferably Linux,
 Windows and Mac to see how it is affected by differences in the platform
 runtimes and threading (hopefully minimal).


It appears more difficult for me: I can use at work a mac 10.8 and I can
run Windows XP within virtual box (but it is not very representative).

Don't you have any test platform at Oracle to perform such tests /
benchmarks?


 Finally, Hotspot is supposed to deal very well for small thread-local
 allocations like the int[4] and Rectangle2D that you optimized.  Was it
 necessary to cache those at all?  I'm sure the statistics for the
 allocations show up in a memory profile, but that doesn't mean it is
 costing us anything - ideally such small allocations are as fast as free
 and having to deal with caching them in a context will actually lose
 performance.  It may be that the tile caching saved enough that it might
 have masked unnecessary or detrimental changes for the smaller objects...


I repeat my question: how can I know at runtime how hotspot optimizes the
AAShapePipe code (allocations ...)? Can hotspot do stack allocation? Is it
explained somewhere (allocation size threshold)?

Maybe verbose:gc output may help ?

Finally I spent a lot of time on pisces renderer and running MapBench to
show performance gains.

Thanks for your interesting feedback,

Laurent

On 4/5/2013 5:20 AM, Laurent Bourgès wrote:

 Dear java2d members,

 I figured out some troubles in java2d.pipe.AAShapePipe related to both
 concurrency & memory usage:
 - concurrency issue related to static theTile field: only 1 tile is cached
 so a new byte[] is created for other threads at each call to renderTile()
 - excessive memory usage (byte[] for tile, int[] and rectangle): at each
 call to renderPath / renderTiles, several small objects are created (never
 cached) that leads to hundreds megabytes that GC must deal with

 Here are profiling screenshots:
 - 4 threads drawing on their own buffered image (MapBench test):
 http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png

 - excessive int[] / Rectangle creation:
 http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
 http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png

 Here is the proposed patch:
 http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/

 I applied a simple solution: use a ThreadLocal or ConcurrentLinkedQueue
 (see the useThreadLocal flag) to cache one AAShapePipeContext per thread (2K
 max).
 As its memory footprint is very small, I recommend using ThreadLocal.

 Is it necessary to use Soft/Weak reference to avoid excessive memory usage
 for such cache ?

 Is there any class dedicated to such cache (ThreadLocal with cache
 eviction or ConcurrentLinkedQueue using WeakReference ?) ?
 I think it could be very useful at the JDK level to have such a feature (i.e.
 a generic GC-friendly cache)
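As a rough answer to the Soft/Weak reference question above, a per-thread cache along these lines is possible (a sketch with invented names, not the actual patch code): ThreadLocal yields one context per thread, and the SoftReference lets the GC reclaim the cached context under memory pressure instead of pinning it forever.

```java
import java.lang.ref.SoftReference;

public final class RendererContextCache {
    static final class Context {
        final int[] abox = new int[4];   // reused bounding-box array
    }

    // One softly-referenced context per thread.
    private static final ThreadLocal<SoftReference<Context>> HOLDER =
            ThreadLocal.withInitial(() -> new SoftReference<>(new Context()));

    static Context get() {
        Context ctx = HOLDER.get().get();
        if (ctx == null) {               // softly-referenced context was collected
            ctx = new Context();
            HOLDER.set(new SoftReference<>(ctx));
        }
        return ctx;
    }

    public static void main(String[] args) {
        // The same thread always gets the same instance back.
        System.out.println(get() == get());   // true
    }
}
```

The trade-off: a plain ThreadLocal never releases the memory, while the SoftReference variant may occasionally pay a re-allocation after a GC under memory pressure.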

 Regards,
 Laurent



Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-08 Thread Laurent Bourgès
Andrea,

Any feedback from anybody else ?

Here are J2D Bench results:
http://jmmc.fr/~bourgesl/share/java2d-pisces/j2DBench/

Depending on the test, performance gains vary from 20% to 100%!

I think it could be nice if you can perform tests (regression and
benchmarks using MapBench, J2DBench, Java2D demos).
I still have to enhance the cleanup / tuning of my code (stats, initial array
sizes ...) and cache eviction (memory cleanup) using weak references or
manual cleanup using timestamps ...

To test for regressions, I propose that you enhance your MapBench class to
save image results (randomly during tests) and compare them after the test
run (image diff), even when multiple threads are in use.
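The proposed image comparison could be as simple as a pixel-wise diff (an illustrative sketch, not MapBench code; the class and method names are invented):

```java
import java.awt.image.BufferedImage;

public final class ImageDiff {
    // Counts pixels whose ARGB values differ between two same-sized images.
    static int countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("image size mismatch");
        }
        int diff = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    diff++;
                }
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        BufferedImage a = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        BufferedImage b = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        b.setRGB(3, 3, 0xFF00FF00);       // introduce one differing pixel
        System.out.println(countDifferingPixels(a, b)); // 1
    }
}
```

A zero count on a reference image rendered before and after the patch is a simple pass/fail regression check, including in multi-threaded runs.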

If you apply my patch, take care of the following main settings:
useThreadLocal should be disabled in a web container (to avoid too much
memory waste): a RendererContext represents ~ 2Mb:
rowAARLE = new int[INITIAL_Y_ARRAY][INITIAL_ARRAY]; // ~2Mb

PiscesConst:
/** use ThreadLocal or ConcurrentLinkedQueue to get the thread
RendererContext */
static final boolean useThreadLocal = true;

Initial RendererContext array capacity:
// +1 to avoid recycling such shared arrays
static final int INITIAL_ARRAY = 256 + 1;
static final int INITIAL_Y_ARRAY = 2048 + 1;
static final int INITIAL_LARGE_ARRAY = 16384 + 1; // large

Laurent

2013/4/7 Andrea Aime andrea.a...@geo-solutions.it

 On Fri, Apr 5, 2013 at 4:32 PM, Laurent Bourgès bourges.laur...@gmail.com
  wrote:

 Dear all,

 Here is my first pisces (beta) patch (webrev):
 http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

 I succeed in fixing memory usage by having only 1 pisces instance
 (Renderer, stroker, iterator ...) per RendererContext (GC friendly).

 Impressive results between small and large drawings:


 Good stuff. Is there anything I can do to help? I do have an up to date
 version of JDK 8 sources on my disk, maybe I could
 try to apply the patch and take it for a spin in a real GeoServer and see
 how it goes.

 By the way, wondering, is there any other benchmark to try out?
 A possibly interesting one is this, where the author re-coded some
 selected methods of Graphics2D for maximum performance
 (but sometimes lower antialiasing quality) and created a applet to compare
 the results in a number of synthetic test cases:
 http://www.randelshofer.ch/oop/graphics/

 Cheers
 Andrea


 --
 ==
 Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
 information.
 ==

 Ing. Andrea Aime
 @geowolf
 Technical Lead

 GeoSolutions S.A.S.
 Via Poggio alle Viti 1187
 55054  Massarosa (LU)
 Italy
 phone: +39 0584 962313
 fax: +39 0584 1660272
 mob: +39  339 8844549

 http://www.geo-solutions.it
 http://twitter.com/geosolutions_it

 ---



Re: [AWT Dev] [Swing Dev] sun.awt.X11 logs still using String + (waste)

2013-04-08 Thread Laurent Bourgès
Peter, Mandy,

I think it would be great to make PlatformLogger very similar to the Logger
API:
I mean JUL Logger already has a convenient method taking 1 param, so it may
be great to have the same API in PlatformLogger too, and maybe extend it to
2 or 3 params ...

Peter, could you look at the API differences between Logger /
PlatformLogger to make PlatformLogger API more similar to JUL Logger ?

Laurent

2013/4/8 Peter Levart peter.lev...@gmail.com

  On 04/07/2013 07:11 PM, Laurent Bourgès wrote:

 Hi

 I started fixing all PlatformLogger bad usages and I am also fixing
 many JUL Logger usages without isLoggable ...
 That represents a lot of changes and a very boring job.

 I think I will send a webrev tomorrow.


 Hi Laurent,

 Since you're already deep in the task, you might have a rough feeling what
 part of such unguarded log statements are of the following type:

 logger.[fine,info,...]("a message with {0} and {1} placeholders",
 someArg1, someArg2);

 where someArg1, ... are some values that are already present in the
 context of the logging statement and don't have to be computed just for
 satisfying the logging (not even boxing). Such usages could be effectively
 optimized by adding some overloaded methods to PlatformLogger that take 1,
 2, 3, ... arguments:

 class PlatformLogger {

 ...

 public void finest(String msg, Object param1) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1);
 }
 }

 public void finest(String msg, Object param1, Object param2) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1, param2);
 }
 }

 public void finest(String msg, Object param1, Object param2, Object
 param3) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1, param2, param3);
 }
 }

 ...

 This would effectively guard creation of the arguments array with an
 isLoggable check for some common usages, eliminating the need to explicitly
 guard such statements. Of course, the user would have to be aware of when
 such unguarded logging statement is without overhead and when not...

 How do you feel about such API extension?


 Regards, Peter



  Laurent
 Le 4 avr. 2013 14:08, Laurent Bourgès bourges.laur...@gmail.com a
 écrit :

 Ok, I can provide an updated patch after finding / fixing all usages.

 Probably in 1 or 2 days,

 Laurent

 2013/4/4 Anthony Petrov anthony.pet...@oracle.com

 Yes, this makes sense. Do you want to update your fix for 8010297 to
 include these changes as well?

 --
 best regards,
 Anthony


 On 04/04/13 15:47, Laurent Bourgès wrote:

 Dear all,

  I just realized there is another problem with PlatformLogger log
 statements:
 XBaseWindow:
  public boolean grabInput() {
 grabLog.fine("Grab input on {0}", this);
 ...
 }

 This calls the PlatformLogger.fine( varargs):
  public void fine(String msg, Object... params) {
  logger.doLog(FINE, msg, params);
  }

 Doing so, the JVM creates a new Object[] instance to provide params as
 varargs.

 I would recommend using isLoggable() test to avoid such waste if the log
 is disabled (fine, finer, finest ...).

 Do you want me to provide a new patch related to this problem ?

 Does somebody have an idea to automatically analyze the JDK code and
 detect missing isLoggable statements ...

 Regards,
 Laurent

 2013/4/3 Laurent Bourgès bourges.laur...@gmail.com
  mailto:bourges.laur...@gmail.com


 Anthony,

 could you tell me once it is in the OpenJDK8 repository ?
 I would then perform again profiling tests to check if there is no
 more missing isLoggable() statements.

 Once JMX and other projects switch to PlatformLogger, I could check
 again.

 Maybe I could write a small java code checker (pmd rule) to test if
 there is missing isLoggable() statements wrapping PlatformLogger log
 statements. Any idea about how to reuse java parser to do so ?

 Regards,

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com
  mailto:anthony.pet...@oracle.com


 Looks fine to me as well. Thanks for fixing this, Laurent.

 Let's wait a couple more days in case Net or Swing folks want to
 review the fix. After that I'll push it to the repository.

 --
 best regards,
 Anthony


 On 4/2/2013 15:35, Laurent Bourgès wrote:

 Here is the updated patch:
  http://jmmc.fr/~bourgesl/__share/webrev-8010297.2/
 http://jmmc.fr/%7Ebourgesl/share/webrev-8010297.2/


 Fixed inconsistencies between FINE / FINER log statements:
 - XScrollbarPeer
 - XWindowPeer

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com
 mailto:anthony.pet...@oracle.com
  mailto:anthony.petrov@oracle.__com

 mailto:anthony.pet...@oracle.com

Re: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-08 Thread Laurent Bourgès
Anthony,

here is an updated patch concerning both PlatformLogger and Logger 'bad'
usages:
http://jmmc.fr/~bourgesl/share/webrev-8010297.3/

It now impacts awt, swing, JMX, security, network, and Apache XML.

Thanks for the review,
Laurent

2013/4/8 Laurent Bourgès bourges.laur...@gmail.com

 Peter, Mandy,

 I think it would be great to make PlatformLogger very similar to the Logger
 API: JUL Logger already has a convenience method with 1 parameter, so it
 would be good to have the same API in PlatformLogger too, and maybe extend
 it to 2 or 3 params ...

 Peter, could you look at the API differences between Logger /
 PlatformLogger to make PlatformLogger API more similar to JUL Logger ?

 Laurent


 2013/4/8 Peter Levart peter.lev...@gmail.com

  On 04/07/2013 07:11 PM, Laurent Bourgès wrote:

 Hi

 I started fixing all PlatformLogger bad usages and I am also fixing
 many JUL Logger usages missing isLoggable ...
 That represents a lot of changes and a very tedious job.

 I expect to send a webrev tomorrow.


 Hi Laurent,

 Since you're already deep in the task, you might have a rough feeling
 what part of such unguarded log statements are of the following type:

 logger.[fine,info,...]("a message with {0} and {1} placeholders",
 someArg1, someArg2);

 where someArg1, ... are some values that are already present in the
 context of the logging statement and don't have to be computed just for
 satisfying the logging (not even boxing). Such usages could be effectively
 optimized by adding some overloaded methods to PlatformLogger that take 1,
 2, 3, ... arguments:

 class PlatformLogger {

 ...

 public void finest(String msg, Object param1) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1);
 }
 }

 public void finest(String msg, Object param1, Object param2) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1, param2);
 }
 }

 public void finest(String msg, Object param1, Object param2, Object param3) {
 if (isLoggable(Level.FINEST)) {
 loggerProxy.doLog(Level.FINEST, msg, param1, param2, param3);
 }
 }

 ...
 }

 This would effectively guard creation of the arguments array with an
 isLoggable check for some common usages, eliminating the need to explicitly
 guard such statements. Of course, the user would have to be aware of when
 such unguarded logging statement is without overhead and when not...

 How do you feel about such API extension?


 Regards, Peter
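Peter's proposed overloads can be sketched in isolation. The stand-alone demo below uses hypothetical names (it is not the real sun.util.logging.PlatformLogger) to show how a fixed-arity overload avoids the hidden Object[] that a varargs call always builds at the call site, even when the level is disabled:

```java
// Minimal stand-in for the proposed API; names are hypothetical,
// not the real sun.util.logging.PlatformLogger.
public class OverloadSketch {
    static final int FINEST = 300;
    static final int INFO = 800;
    static int currentLevel = INFO;   // FINEST is disabled
    static int hiddenArrays = 0;      // models the Object[] built at varargs call sites

    static boolean isLoggable(int level) { return level >= currentLevel; }

    // Varargs form: the caller always builds an Object[] (and boxes primitives)
    // before isLoggable is ever consulted; the counter makes that visible.
    static void finest(String msg, Object... params) {
        hiddenArrays++;
        if (isLoggable(FINEST)) { /* loggerProxy.doLog(...) would run here */ }
    }

    // Proposed fixed-arity overload: no array is created, and the level
    // check happens before anything else.
    static void finest(String msg, Object param1) {
        if (isLoggable(FINEST)) { /* loggerProxy.doLog(...) would run here */ }
    }

    public static void main(String[] args) {
        finest("point = {0}, {1}", 1, 2); // only the varargs overload applies
        finest("window = {0}", "w1");     // fixed-arity overload wins: no allocation
        System.out.println("hidden arrays: " + hiddenArrays); // prints 1
    }
}
```

Overload resolution picks the fixed-arity method whenever the argument count matches, so existing one-argument call sites would benefit without source changes.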




Re: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-07 Thread Laurent Bourgès
Hi

I started fixing all PlatformLogger bad usages and I am also fixing many
JUL Logger usages missing isLoggable ...
That represents a lot of changes and a very tedious job.

I expect to send a webrev tomorrow.

Laurent

Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-06 Thread Laurent Bourgès
Mandy,

I believe the awt/net code started logging in 1.4 and 1.5 using j.u.logging,
 which was a really early adopter, before the performance overhead was considered.

 I filed a bug 8011557: improve the logging documentation to advice on
 performance consideration as we may want to mention this in the tutorial or
 other docs as well.  This should belong to java.util.logging instead of
 sun.util.logging.


It's a good idea to update both javadocs (JUL, PlatformLogger) and maybe the
Java code conventions for OpenJDK code as well.

I insist on pointing out this performance issue in the PlatformLogger class
too, because it belongs to the JDK and I expect it to be as fast as possible!

Moreover, it would be great if the OpenJDK build could run code quality
checks and report such incorrect JUL / PlatformLogger usages. Any idea how
to do so?



 Having a second thought, Supplier may not address the memory overhead
 concern you raise.  Worth considering any API improvement to address both
 the runtime and memory concern.


I looked at the JDK 8 Logger doc and Supplier may help (but maybe the underlying
implementation may be slow or create objects too):
"A set of methods alternatively take a msgSupplier instead of a msg
argument. These methods take a Supplier<String>
(http://download.java.net/jdk8/docs/api/java/util/function/Supplier.html)
function which is invoked to construct the desired log message
only when the message actually is to be logged based on the effective log
level, thus eliminating unnecessary message construction. For example, if
the developer wants to log system health status for diagnosis, with the
String-accepting version, the code would look like:

    class DiagnosisMessages {
      static String systemHealthStatus() {
        // collect system health information
        ...
      }
    }
    ...
    logger.log(Level.FINER, DiagnosisMessages.systemHealthStatus());

With the above code, the health status is collected unnecessarily even
when the log level FINER is disabled. With the Supplier-accepting version
as below, the status will only be collected when the log level FINER is
enabled:

    logger.log(Level.FINER, DiagnosisMessages::systemHealthStatus);
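The laziness is easy to verify with the real java.util.logging API (JDK 8): the supplier below is invoked only if the level is enabled. The logger name and status message are illustrative, and the counter just makes the difference observable:

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class SupplierDemo {
    static int statusCalls = 0;

    static String systemHealthStatus() {
        statusCalls++;               // an expensive collection step would happen here
        return "all systems nominal";
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setLevel(Level.INFO); // FINER is disabled

        // Eager form: the status string is built even though nothing is logged.
        logger.log(Level.FINER, systemHealthStatus());

        // Supplier form: the method reference is never invoked at FINER.
        logger.log(Level.FINER, SupplierDemo::systemHealthStatus);

        System.out.println("statusCalls = " + statusCalls); // prints "statusCalls = 1"
    }
}
```

Only the eager call pays the cost; the Supplier overload checks isLoggable before touching the lambda.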




 It would also be helpful if IDE and code analysis tool can hint the
 developers of such pattern.


NetBeans has a warning about "string concatenation in logger" statements, but
it does not propose to fix it by adding the proper isLoggable check: maybe we
should submit an RFE to the NetBeans project?
http://wiki.netbeans.org/Java_Hints#Logging

"String concatenation in logger: It is not performance efficient to
concatenate strings in logger messages. It is better to use a template
message with placeholders that are replaced by concrete values only when
the message is really going to be logged." (Since NetBeans 6.9)

Cheers,
Laurent


 Mandy
 P.S. I'll be pushing the changeset today.




Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-05 Thread Laurent Bourgès
Mandy,

I would like to add a few performance notes to the PlatformLogger javadoc:

NOTE: For performance reasons, PlatformLogger usages should take care to
avoid useless / wasted object creation and method calls related to
*disabled* log statements:
always wrap log statements at the levels FINEST, FINER, FINE and CONFIG
with an isLoggable(level) check.

Bad practices:
- string concatenation:
log.fine("message" + value); // means new
StringBuilder("message").append(String.valueOf(value)).toString(): 2 objects
created and value.toString() called
- varargs:
log.fine("message {0}", this); // creates an Object[]

Best practices:
if (log.isLoggable(PlatformLogger.FINE)) {
log.fine("message" + value);
}

if (log.isLoggable(PlatformLogger.FINE)) {
log.fine("message {0}", this);
}
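The difference between the two forms is easy to demonstrate with a tiny self-contained sketch (a hypothetical Log-like stand-in, not PlatformLogger itself): when the level is disabled, the guarded call never concatenates and never calls toString().

```java
public class GuardDemo {
    static final int FINE = 500;
    static final int INFO = 800;
    static int currentLevel = INFO;              // FINE is disabled

    static boolean isLoggable(int level) { return level >= currentLevel; }

    // Stand-in for log.fine(String): the message argument is already built
    // by the caller before this method can check the level.
    static void fine(String msg) {
        if (isLoggable(FINE)) System.out.println(msg);
    }

    // A value whose toString() is costly; we count invocations.
    static class Component {
        static int toStringCalls = 0;
        @Override public String toString() {
            toStringCalls++;
            return "Component[x=0,y=0]";
        }
    }

    public static void main(String[] args) {
        Component c = new Component();

        // Bad: the concatenation calls c.toString() although FINE is off.
        fine("painting " + c);

        // Good: neither the StringBuilder nor toString() run when disabled.
        if (isLoggable(FINE)) {
            fine("painting " + c);
        }

        System.out.println("toString calls: " + Component.toStringCalls); // prints 1
    }
}
```

The unguarded call pays for the StringBuilder and the toString() on every invocation; the guarded call pays only for a level comparison.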


What is your opinion ?

Thanks for the given explanations and I hope that this patch will be
submitted soon to JDK8 repository.

Laurent

2013/4/4 Mandy Chung mandy.ch...@oracle.com

 Alan - can you review this change?

 I have changed Level.valueOf(int) to return the nearest Level as Peter
 suggests using binary search:

 http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.01/
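The "nearest Level" lookup described here can be sketched as a lower-bound binary search over the level values. The enum below is a stand-in with the usual j.u.l. integer values, not the real PlatformLogger.Level:

```java
// Hedged sketch: return the first Level whose intValue() is >= the requested
// value, via binary search over the ascending level values.
public class LevelLookup {
    enum Level {
        FINEST(300), FINER(400), FINE(500), CONFIG(700),
        INFO(800), WARNING(900), SEVERE(1000);

        final int value;
        Level(int value) { this.value = value; }
    }

    private static final Level[] LEVELS = Level.values(); // ascending by value

    static Level valueOf(int value) {
        int lo = 0, hi = LEVELS.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;          // overflow-safe midpoint
            if (LEVELS[mid].value < value) lo = mid + 1; else hi = mid;
        }
        return LEVELS[lo]; // first level with intValue >= value (SEVERE if none)
    }

    public static void main(String[] args) {
        System.out.println(valueOf(500)); // FINE (exact match)
        System.out.println(valueOf(600)); // CONFIG (nearest level above a custom 600)
    }
}
```

A custom j.u.l. level such as 600 therefore maps to CONFIG, so loggability checks still behave sensibly instead of throwing.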

 I want to push the changeset tomorrow since we need this fix before the TL
 integration.

 Several questions were brought up and I'm answering them in one shot:
 1. The PlatformLogger static final fields have to retain the same since
 existing code can call:
 int level = PlatformLogger.FINE;
 logger.setLevel(level);

 There is also native code accessing PlatformLogger fields (will need to
 investigate more).  Once we fix this type of incompatibility references, we
 can change them.  Or we could remove these static final constants
 completely but it's less preferable since it touches many of the JDK files.
  I would keep these static fields as a shortcut.

 2. New convenient methods (isInfoLoggable, isWarningLoggable) to minimize
 the dependency to the constants

 It's not a problem to have the dependencies.  This is an issue this time
 since we were not aware of such dependency.  The isLoggable method is
 simple enough.

 3. 3 methods with two versions (one with int and one with Level parameter)

 As I said, I'll file a bug to remove the 3 deprecated methods in jdk8. I'm
 certain I can do so but just take some time to synchronize the changes.

 4. It's true that you can set a PlatformLogger with a custom level via
 PlatformLogger API.  But you can access a platform logger using
 java.util.logging API.

Logger.getLogger("sun.awt.someLogger").setLevel(MyLevel.CUSTOM_LEVEL);

 PlatformLogger.getLevel() has to return some thing.  Laurent suggests to
 remove (deprecate) PlatformLogger.getLevel() or level() method.  I have to
 understand how the FX code uses getLevel().  We have to keep it due to the
 regression and for testing.  For API perspective, having a method to find
 what level it's at is reasonable.  Since we now have a clean solution to
 deal with custom level, I don't see any issue keeping it.

 5. JavaFX 8 is likely to switch to build with JDK8 in a few weeks (really
 soon).

 6. Level.intValue() public or not

 It mirrors java.util.logging.Level but may be able to do without it.
  Let's leave it public for now until FX converts to use the new code.  I
 can clean this up at the same time.

 Mandy



 On 4/3/13 9:52 PM, Mandy Chung wrote:

 Peter, Laurent,

 History and details are described below.

 Webrev at:
 http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.00/

 I add back the getLevel(int), setLevel(int) and isLoggable(int) methods
 but marked deprecated and also revert the static final variables to resolve
 the regression. They can be removed when JavaFX transitions to use the
 Level enums (I'll file one issue for FX to use PlatformLogger.Level and one
 for core-libs to remove the deprecated methods).  The performance overhead
 is likely small since it's direct mapping from int to the Level enum that
 was used in one of your previous performance measurement.

 Laurent - you have a patch to add isLoggable calls in the awt/swing code.
 Can you check if there is any noticeable performance difference?

 I also take the opportunity to reconsider what JavaLoggerProxy.getLevel()
 should return when it's a custom Level. Use of logging is expected not to
 cause any fatal error to the running application.  The previous patch
 throwing IAE needs to be fixed.  I propose to return the first
 Level.intValue() >= the custom level value, since the level value is used to
 evaluate if it's loggable.

 History and details:
 JavaFX 8 has converted to use sun.util.logging.PlatformLogger (
 https://javafx-jira.kenai.com/browse/RT-24458).
 I was involved in the early discussion but wasn't aware of the decision
 made.  Thanks to Alan for catching this regression out 

AAShapePipe concurrency memory waste

2013-04-05 Thread Laurent Bourgès
Dear java2d members,

I have found some issues in java2d.pipe.AAShapePipe related to both
concurrency and memory usage:
- concurrency issue related to static theTile field: only 1 tile is cached
so a new byte[] is created for other threads at each call to renderTile()
- excessive memory usage (byte[] for tile, int[] and rectangle): at each
call to renderPath / renderTiles, several small objects are created (never
cached) that leads to hundreds megabytes that GC must deal with

Here are profiling screenshots:
- 4 threads drawing on their own buffered image (MapBench test):
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png

- excessive int[] / Rectangle creation:
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png

Here is the proposed patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/

I applied a simple solution: use a ThreadLocal or a ConcurrentLinkedQueue
(see useThreadLocal flag) to cache one AAShapePipeContext per thread (2K
max).
As its memory footprint is very small, I recommend using ThreadLocal.

Is it necessary to use Soft/Weak reference to avoid excessive memory usage
for such cache ?

Is there any class dedicated to such cache (ThreadLocal with cache eviction
or ConcurrentLinkedQueue using WeakReference ?) ?
I think it could be very useful at the JDK level to have such a feature
(i.e. a generic, GC-friendly cache).
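The ThreadLocal variant discussed above can be sketched generically. The RenderContext below is hypothetical (not the actual AAShapePipeContext), and a SoftReference lets the GC reclaim the per-thread buffers under memory pressure:

```java
import java.lang.ref.SoftReference;

// Generic per-thread context cache: each thread reuses its own scratch
// buffers instead of allocating them on every rendering call.
public class ContextCache {
    static class RenderContext {
        final byte[] tile = new byte[32 * 32]; // per-thread scratch tile
        final int[] bbox = new int[4];         // reused bounding box
    }

    private static final ThreadLocal<SoftReference<RenderContext>> CACHE =
            new ThreadLocal<>();

    static RenderContext get() {
        SoftReference<RenderContext> ref = CACHE.get();
        RenderContext ctx = (ref != null) ? ref.get() : null;
        if (ctx == null) {                     // first use, or cleared by the GC
            ctx = new RenderContext();
            CACHE.set(new SoftReference<>(ctx));
        }
        return ctx;
    }

    public static void main(String[] args) {
        // The same thread gets the same context back: no per-call allocation.
        RenderContext a = get();
        System.out.println(a == get()); // prints true (holding a strong ref to a)
    }
}
```

Wrapping the context in a SoftReference trades a null check per call for the ability to shed the cached buffers of idle threads when memory is tight.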

Regards,
Laurent


Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-05 Thread Laurent Bourgès
Mandy,

I agree it should be well known; but I fixed several cases in awt/net code
where isLoggable() calls were missing.

I think it is quite cheap to remind good practices in the PlatformLogger /
jul Logger javadoc ...

PS: maybe some quality control tools could check such missing tests (PMD
can do it)...

I believe this was mentioned somewhere in j.u.logging.  A better solution
 may be to take java.util.function.Supplier parameter that constructs the
 log message lazily (see
 http://download.java.net/jdk8/docs/api/java/util/logging/Logger.html#fine(java.util.function.Supplier).
 I can file a RFE to investigate the use of Supplier as in j.u.l.Logger.


Very interesting feature, but I am not yet familiar with such JDK 8 features
(lambdas) ... it seems nice, but maybe the syntax will be more complicated /
verbose than our usual log statements:
log.fine(...)

Laurent







Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-04 Thread Laurent Bourgès
Mandy, Peter,

here are my comments:

On 04/04/2013 06:52 AM, Mandy Chung wrote:

 Peter, Laurent,

 History and details are described below.

 Webrev at:
 http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.00/

 I add back the getLevel(int), setLevel(int) and isLoggable(int) methods
 but marked deprecated and also revert the static final variables to resolve
 the regression. They can be removed when JavaFX transitions to use the
 Level enums (I'll file one issue for FX to use PlatformLogger.Level and one
 for core-libs to remove the deprecated methods).  The performance overhead
 is likely small since it's direct mapping from int to the Level enum that
 was used in one of your previous performance measurement.

 Looks OK to me; the Level valueOf(int level) method is back and is basically
an efficient switch case that performs well (the performance overhead is
minor). I hope other projects will be rebuilt soon to remove such workarounds.


 I think that it is not strictly necessary to restore the PlatformLogger
 static final fields back to int type. They are compile-time constants and
 so are already baked into the classes compiled with older version of
 PlatformLogger. Am I right? The only problem I see if fields are kept as
 Level type is when JFX is compiled with JDK8 and then run on JDK7... But
 changing them temporarily back to int is a conservative approach and I
 support it.


I think you're right: static constants are copied into class bytecode;
maybe the old API (int level in method signatures) could be kept and marked
deprecated but the class will become quite big (double API: one with int
and one with Level) !!



 Laurent - you have a patch to add isLoggable calls in the awt/swing code.
 Can you check if there is any noticeable performance difference?

 The awt patch consists in adding missing isLoggable statements to avoid
object creation (string concatenations) and method calls
(Component.toString() ...).

Maybe I should also check that calls to log(msg, varargs) are using
isLoggable() tests to avoid Object[] creation due to varargs: what is your
opinion ?



 I also take the opportunity to reconsider what JavaLoggerProxy.getLevel()
 should return when it's a custom Level. Use of logging is expected not to
 cause any fatal error to the running application.  The previous patch
 throwing IAE needs to be fixed.  I propose to return the first
 Level.intValue() >= the custom level value, since the level value is used to
 evaluate if it's loggable.


 That's a good compromise.


I think custom loggers are illegal in the PlatformLogger API and must be
rejected. Maybe comments should explain that design decision; throwing an
IllegalStateException is OK for me.




 History and details:
 JavaFX 8 has converted to use sun.util.logging.PlatformLogger (
 https://javafx-jira.kenai.com/browse/RT-24458). I was involved in the
 early discussion but wasn't aware of the decision made.  Thanks to Alan for
 catching this regression out before it's integrated to jdk8.  jfxrt.jar is
 cobundled with jdk8 during the installer build step.  My build doesn't
 build installer and thus we didn't see this regression.

 I ran jdeps on jdk8/lib/ext/jfxrt.jar (windows-i586) that shows 112
 references to PlatformLogger and on jdk7/lib/jfxrt.jar that shows no
 reference to sun.util.logging.

 I have a method finder tool (planning to include it in jdk8) to search for
 use of PlatformLogger.getLevel and it did find 2 references from
 jdk8/lib/ext/jfxrt.jar.

 Personally I doubt the getLevel() method is useful: why not use isLoggable()
instead? Maybe getLevel() should become @deprecated too.



 JavaFX 8 is going to upgrade to use JDK 8 for JavaFX build (
 https://javafx-jira.kenai.com/browse/RT-27794) soon (currently it's built
 with JDK 7).  As soon as JavaFX code are changed to reference
 PlatformLogger.Level enum instead, we can remove the deprecated methods and
 change the PlatformLogger constants.

 Great; any idea about their roadmap? Do you think other projects also depend
on PlatformLogger (JMX ...) and may be impacted?


 What do you think of deprecating the PlatformLogger constants altogether
 (regardless of their type). The code using them could be migrated to use
 PlatformLogger.Level enum members directly and if 
 PlatformLogger.Level.INFO looks to verbose, static imports can help or
 there could be some more helper methods added to PlatformLogger that would
 minimize the need to access the constants like:

 isInfoLoggable();
 isWarningLoggable();
 ...


It starts to mimic the log4j / slf4j / logback syntax, which I love, but it
would change so many files that I think it is a waste of time for now.

I vote for adding such useful shortcuts to the new API but not changing the
other methods.

Cheers,
Laurent


Re: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-04 Thread Laurent Bourgès
Dear all,

I just realized there is another problem with PlatformLogger log statements:
XBaseWindow:
public boolean grabInput() {
grabLog.fine("Grab input on {0}", this);
...
}

This calls the varargs PlatformLogger.fine(String, Object...):
public void fine(String msg, Object... params) {
logger.doLog(FINE, msg, params);
}

Doing so, the JVM creates a new Object[] instance to provide params as
varargs.

I would recommend using an isLoggable() test to avoid such waste when the log
is disabled (fine, finer, finest ...).

Do you want me to provide a new patch related to this problem ?

Does somebody have an idea to automatically analyze the JDK code and detect
missing isLoggable statements ...

Regards,
Laurent

2013/4/3 Laurent Bourgès bourges.laur...@gmail.com

 Anthony,

 could you tell me once it is in the OpenJDK8 repository ?
 I would then perform again profiling tests to check if there is no more
 missing isLoggable() statements.

 Once JMX and other projects switch to PlatformLogger, I could check again.

 Maybe I could write a small java code checker (pmd rule) to test if there
 is missing isLoggable() statements wrapping PlatformLogger log statements.
 Any idea about how to reuse java parser to do so ?

 Regards,

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com

 Looks fine to me as well. Thanks for fixing this, Laurent.

 Let's wait a couple more days in case Net or Swing folks want to review
 the fix. After that I'll push it to the repository.

 --
 best regards,
 Anthony


 On 4/2/2013 15:35, Laurent Bourgès wrote:

 Here is the updated patch:
 http://jmmc.fr/~bourgesl/share/webrev-8010297.2/

 Fixed inconsistencies between FINE / FINER log statements:
 - XScrollbarPeer
 - XWindowPeer

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com


 1. Sergey: I believe this is for purposes of better formating the
 log output and making it more readable by separating or highlighting
 some sections. I don't think this should be changed.

 2. Laurent: can you please address this issue and send us a new
 patch?

 --
 best regards,
 Anthony


 On 4/1/2013 16:08, Sergey Bylokhov wrote:


 Hi, Anthony
 Only two comments:
 1 Why we need some special text in the log output like *** and
 ###
 2 XScrollbarPeer.java:

 +if (log.isLoggable(PlatformLogger.FINEST)) {

 +log.finer("KeyEvent on scrollbar: " + event);
 +}



 On 4/1/13 12:20 PM, Anthony Petrov wrote:

 Awt, Swing, Net engineers,

 Could anyone review the fix please? For your convenience:

 Bug: http://bugs.sun.com/view_bug.do?bug_id=8010297

 Fix:
 http://cr.openjdk.java.net/~__**anthony/8-55-isLoggable-__**
 8010297.0/http://cr.openjdk.java.net/%7E__anthony/8-55-isLoggable-__8010297.0/
 http://cr.openjdk.java.net/%**7Eanthony/8-55-isLoggable-**
 8010297.0/http://cr.openjdk.java.net/%7Eanthony/8-55-isLoggable-8010297.0/
 

 -- best regards,
 Anthony

 On 3/22/2013 2:26, Anthony Petrov wrote:

 Hi Laurent,

 The fix looks great to me. Thank you very much.

 We still need at least one review, though. Hopefully
 net-dev@ and/or swing-dev@ folks might help us out a
 bit.

 -- best regards,
 Anthony

 On 3/20/2013 15:10, Anthony Petrov wrote:

 Hi Laurent,

 Thanks for the patch. I've filed a bug at:
 
 http://bugs.sun.com/view_bug.do?bug_id=8010297
 (should be available in a day or two)

 and published a webrev generated from your patch at:
 http://cr.openjdk.java.net/~anthony/8-55-isLoggable-8010297.0/


 I'm also copying swing-dev@ and net-dev@ because the
 fix affects those areas too. I myself will review
 the fix a bit later but am sending it now for other
 folks to take a look at it.

 On 3/19/2013 15:29, Laurent Bourgès wrote:

 I am

Re: AWT Dev Swing Dev sun.awt.X11 logs still using String + (waste)

2013-04-04 Thread Laurent Bourgès
Ok, I can provide an updated patch after finding / fixing all usages.

Probably in 1 or 2 days,

Laurent

2013/4/4 Anthony Petrov anthony.pet...@oracle.com

 Yes, this makes sense. Do you want to update your fix for 8010297 to
 include these changes as well?

 --
 best regards,
 Anthony


 On 04/04/13 15:47, Laurent Bourgès wrote:

 Dear all,

 I just realized there is another problem with PlatformLogger log
 statements:
 XBaseWindow:
  public boolean grabInput() {
  grabLog.fine(Grab input on {0}, this);
 ...
 }

 This calls the PlatformLogger.fine( varargs):
  public void fine(String msg, Object... params) {
  logger.doLog(FINE, msg, params);
  }

 Doing so, the JVM creates a new Object[] instance to provide params as
 varargs.

 I would recommend using isLoggable() test to avoid such waste if the log
 is disabled (fine, finer, finest ...).
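As a sketch of the guard pattern being proposed here — using java.util.logging, which exposes the same isLoggable idiom as the internal PlatformLogger (class, method and logger names below are hypothetical, not the actual XBaseWindow code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GrabLogDemo {
    private static final Logger grabLog = Logger.getLogger("awt.grab");

    // Hypothetical stand-in for the window object being logged.
    static String describeWindow() { return "XBaseWindow[id=42]"; }

    static void grabInput() {
        // Unguarded: grabLog.log(Level.FINE, "Grab input on {0}", describeWindow())
        // always allocates the varargs Object[] even when FINE is disabled.
        // Guarded: the allocation (and any argument computation) is skipped:
        if (grabLog.isLoggable(Level.FINE)) {
            grabLog.log(Level.FINE, "Grab input on {0}", describeWindow());
        }
    }

    public static void main(String[] args) {
        grabInput(); // FINE is disabled by default: no log output, no Object[] allocated
    }
}
```

The same shape applies to PlatformLogger.fine(String, Object...) call sites: the isLoggable() test costs one int comparison, while the unguarded call costs an array allocation per invocation.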

 Do you want me to provide a new patch related to this problem ?

 Does somebody have an idea to automatically analyze the JDK code and
 detect missing isLoggable statements ...

 Regards,
 Laurent

 2013/4/3 Laurent Bourgès bourges.laur...@gmail.com


 Anthony,

 could you tell me once it is in the OpenJDK8 repository ?
 I would then run my profiling tests again to check that there are no
 more missing isLoggable() statements.

 Once JMX and other projects switch to PlatformLogger, I could check
 again.

 Maybe I could write a small Java code checker (PMD rule) to test whether
 there are missing isLoggable() statements wrapping PlatformLogger log
 statements. Any idea how to reuse a Java parser to do so ?

 Regards,

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com


 Looks fine to me as well. Thanks for fixing this, Laurent.

 Let's wait a couple more days in case Net or Swing folks want to
 review the fix. After that I'll push it to the repository.

 --
 best regards,
 Anthony


 On 4/2/2013 15:35, Laurent Bourgès wrote:

 Here is the updated patch:
 
 http://jmmc.fr/~bourgesl/share/webrev-8010297.2/


 Fixed inconsistencies between FINE / FINER log statements:
 - XScrollbarPeer
 - XWindowPeer

 Laurent

 2013/4/2 Anthony Petrov anthony.pet...@oracle.com


  1. Sergey: I believe this is for purposes of better
 formating the
  log output and making it more readable by separating or
 highlighting
  some sections. I don't think this should be changed.

  2. Laurent: can you please address this issue and send
 us a new patch?

  --
  best regards,
  Anthony


  On 4/1/2013 16:08, Sergey Bylokhov wrote:


  Hi, Anthony
  Only two comments:
  1 Why we need some special text in the log output
 like *** and
 ###
  2 XScrollbarPeer.java:

  +if (log.isLoggable(PlatformLogger.FINEST)) {
  +log.finer("KeyEvent on scrollbar: " + event);
  +}



  On 4/1/13 12:20 PM, Anthony Petrov wrote:

  Awt, Swing, Net engineers,

  Could anyone review the fix please? For your
 convenience:

  Bug: http://bugs.sun.com/view_bug.do?bug_id=8010297

  Fix: http://cr.openjdk.java.net/~anthony/8-55-isLoggable-8010297.0/

Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-03 Thread Laurent Bourgès
Thanks for your valueable feedback!

Here is the current status of my patch alpha version:
 http://jmmc.fr/~bourgesl/share/java2d-pisces/

 There is still a lot to be done: clean-up, stats, pisces class instance
 recycling (renderer, stroker ...) and of course sizing correctly initial
 arrays (dirty or clean) in the RendererContext (thread local storage).
 For performance reasons, I am using now RendererContext members first
 (cache for rowAARLE for example) before using ArrayCaches (dynamic arrays).


 Thank you Laurent, those are some nice speedups.

I think it can still be improved: I hope to make it as fast as ductus or
maybe faster (I have several ideas for aggressive optimizations), but the main
improvement consists in reusing memory (like C / C++ does) to avoid wasted
memory / GC overhead in a concurrent environment.


 About the thread local storage, that is a sensible choice for highly
 concurrent systems, at the same time, web containers normally complain about
 orphaned thread locals created by an application and not cleaned up.
 Not sure if ones created at the core libs level get special treatment, but
 in general, I guess it would be nice to have some way to clean them up.


You're right that's why my patch is not ready !

I chose ThreadLocal for simplicity and clarity but I see several issues:
1/ Web container: ThreadLocal must be clean up when stopping an application
to avoid memory leaks (application becomes unloadable due to classloader
leaks)
2/ ThreadLocal access is the fastest way to get the RendererContext as it
does not require any lock (unsynchronized); As I get the RendererContext
once per rendering request, I think the ThreadLocal can be replaced by a
thread-safe ConcurrentLinkedQueue<RendererContext>, but it may become a
performance bottleneck
3/ Using a ConcurrentLinkedQueue<RendererContext> requires an efficient /
proper cache eviction to free memory (Weak or Soft references ?) or using
statistics (last usage timestamp, usage counts)

Any other idea (core-libs) to have an efficient thread context in a web
container ?
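A minimal sketch of option 2/ above: a lock-free pool of contexts backed by a ConcurrentLinkedQueue. The RendererContext class here is a simplified hypothetical stand-in (the real pisces context holds many more buffers), and no eviction policy is implemented:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

class RendererContextPool {
    // Hypothetical stand-in for the real pisces RendererContext.
    static final class RendererContext {
        final float[] float6 = new float[6]; // scratch buffer reused per rendering request
    }

    private static final ConcurrentLinkedQueue<RendererContext> POOL =
            new ConcurrentLinkedQueue<>();

    // Acquire a context once per rendering request; lock-free, so this is
    // unlikely to be a bottleneck compared to a synchronized pool.
    static RendererContext acquire() {
        RendererContext ctx = POOL.poll();
        return (ctx != null) ? ctx : new RendererContext();
    }

    // Return the context so any thread can reuse it. A real implementation
    // would add eviction (usage timestamps, soft references) to bound the pool.
    static void release(RendererContext ctx) {
        POOL.offer(ctx);
    }
}
```

Unlike ThreadLocal, contexts here are not tied to container threads, so a stopped web application cannot leak them through orphaned thread locals.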

I'm not familiar with the API, but is there any way to clean them up when
 the graphics2d gets disposed of?


The RenderingEngine is instantiated by the JVM once, and I do not see in the
RenderingEngine interface any way to perform callbacks for warmup / cleanup
... nor any access to the Graphics RenderingHints (another RFE for tuning
purposes)


 A web application has no guarantee to see the same thread ever again
 during his life, so thread locals have to be cleaned right away.


I argue that ThreadLocal can lead to wasted memory, as only a few concurrent
threads can really use their RendererContext instance while others can
simply answer web requests => let's use a
ConcurrentLinkedQueue<RendererContext> with proper cache eviction.



 Either that, or see if there is any way to store the array caches in a
 global structure backed by a concurrent collection to reduce/eliminate
 contention.


Yes, it is a interesting alternative to benchmark.

Regards,
Laurent


Re: hg: jdk8/tl/jdk: 8010309: Improve PlatformLogger.isLoggable performance by direct mapping from an integer to Level

2013-04-03 Thread Laurent Bourgès
Does JavaFX belong to the OpenJDK projects
(openjfx/8: http://hg.openjdk.java.net/openjfx/8/)?

How do I build the complete OpenJDK (javafx, java web start ...) ?

Laurent

2013/4/3 Alan Bateman alan.bate...@oracle.com

  On 28/03/2013 20:16, mandy.ch...@oracle.com wrote:

 Changeset: e433ed08b733
 Author:mchung
 Date:  2013-03-28 13:14 -0700
 URL:   http://hg.openjdk.java.net/jdk8/tl/jdk/rev/e433ed08b733

 8010309: Improve PlatformLogger.isLoggable performance by direct mapping from 
 an integer to Level
 Reviewed-by: mchung
 Contributed-by: peter.lev...@gmail.com, bourges.laur...@gmail.com

 ! src/share/classes/sun/util/logging/PlatformLogger.java
 ! test/sun/util/logging/PlatformLoggerTest.java

  It seems that FX doesn't like this good work.

   Caused by: java.lang.NoSuchMethodError: 
 sun.util.logging.PlatformLogger.getLevel()I
   at com.sun.javafx.css.parser.CSSParser.<clinit>(CSSParser.java:164)
   at 
 com.sun.javafx.css.StyleManager.loadStylesheetUnPrivileged(StyleManager.java:854)
   at com.sun.javafx.css.StyleManager.loadStylesheet(StyleManager.java:674)
   at 
 com.sun.javafx.css.StyleManager.setDefaultUserAgentStylesheet(StyleManager.java:1050)
   at 
 com.sun.javafx.css.StyleManager.setDefaultUserAgentStylesheet(StyleManager.java:1020)
   at com.sun.javafx.application.PlatformImpl$10.run(PlatformImpl.java:525)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
 com.sun.javafx.application.PlatformImpl.setPlatformUserAgentStylesheet(PlatformImpl.java:522)
   at 
 com.sun.javafx.application.PlatformImpl.setDefaultPlatformUserAgentStylesheet(PlatformImpl.java:474)
   at javafx.scene.control.Control.<clinit>(Control.java:82)
   at helloworld.HelloWorld.start(HelloWorld.java:14)
   at com.sun.javafx.application.LauncherImpl$5.run(LauncherImpl.java:772)
   at com.sun.javafx.application.PlatformImpl$6.run(PlatformImpl.java:260)
   at 
 com.sun.javafx.application.PlatformImpl$5$1.run(PlatformImpl.java:223)
   at 
 com.sun.javafx.application.PlatformImpl$5$1.run(PlatformImpl.java:220)
   at java.security.AccessController.doPrivileged(Native Method)
   at com.sun.javafx.application.PlatformImpl$5.run(PlatformImpl.java:220)
   at 
 com.sun.glass.ui.InvokeLaterDispatcher$Future.run(InvokeLaterDispatcher.java:94)
   at com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
   at 
 com.sun.glass.ui.win.WinApplication.access$300(WinApplication.java:39)
   at com.sun.glass.ui.win.WinApplication$3$1.run(WinApplication.java:101)
   ... 1 more






Re: RFR JDK-7143928 : (coll) Optimize for Empty ArrayList and HashMap

2013-04-02 Thread Laurent Bourgès
 ---

  604 Arrays.fill(elementData, newSize, size, null);

 In performance-critical code I would avoid Arrays.fill because it adds a
 bit of overhead (unless it's intrinsified, which I don't think it is).


Last week, I sent few benchmarks I did on array cleaning (zero fill)
comparing Arrays.fill, System.arraycopy, Unsafe.setMemory ...
Arrays.fill is the winner (always faster than arraycopy, which uses native
code) by 10 - 20%!
I suspect aggressive hotspot optimizations (native code ?) because I agree
Arrays.fill looks like a stupid for-loop !

Does somebody have clues explaining the Arrays.fill performance ?

Is there an alternative way to clear an array giving the best possible
performance (like c memset) ?
it could be useful to have in JDK a native array fill implementation
(System.fill ?) if it gives better performance.
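A rough timing sketch of the comparison discussed above. This is not a rigorous benchmark (JMH would be preferable, and results depend on JIT warmup and the platform); it only shows how the two clearing strategies were compared:

```java
import java.util.Arrays;

class FillBench {
    // Clear the array `iters` times with Arrays.fill; returns elapsed nanoseconds.
    static long timeFill(int[] a, int iters) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) Arrays.fill(a, 0);
        return System.nanoTime() - t0;
    }

    // Same clearing done with a plain for-loop.
    static long timeLoop(int[] a, int iters) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            for (int j = 0; j < a.length; j++) a[j] = 0;
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int[] a = new int[10_000];
        // Warm up both paths so the JIT compiles them before measuring.
        timeFill(a, 50_000);
        timeLoop(a, 50_000);
        System.out.println("fill: " + timeFill(a, 50_000)
                + " ns, loop: " + timeLoop(a, 50_000) + " ns");
    }
}
```

HotSpot can recognize simple zero-fill loops and replace them with vectorized fills, which would explain the two paths being close; any 10 - 20% gap needs a proper harness to confirm.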

Laurent


Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-03-30 Thread Laurent Bourgès
Jim,

There are finally only a few growable arrays (edges, curves, rowAARLE), and
now I have a working Pisces code (J2DBench passes OK) that performs better
than the current one (2x - 3x faster on dasher or big shapes) using only a few
megabytes (Xmx32m) ...

Moreover, these arrays could be created once per thread (thread local) to
avoid GC (low footprint) and enhance performance (no GC / array resizing or
copying) with an initial size of 4K or 16K, representing only 16K x 4 = 64 KB!

The only array that causes trouble is the growable
PiscesCache.rowAARLE[Y][tuples], whose Y size depends on the shape / clip
bounds and whose tuple count depends on the crossing numbers at the current row Y.

For now I am using int[2048][20] but it could be tuned ...

Finally, I figured out several hard-coded values relative to AA (3x3
samples, error level ...) that could be set as rendering hints. For example
I would like to tune the number of samples to 8x8 or 2x2 ... depending on
the use case.

I expect to send an alpha version of my patch to illustrate my talks.

2013/3/30 Jim Graham james.gra...@oracle.com

 Other thoughts - using chained buckets of edges instead of one single long
 list.  It would be easier to keep a pool of buckets (each holding, say, 256
 edges?) than a one-size-fits-all pool of arrays.  Then all you have to do
 is keep high water marks on the number of simultaneously used buckets in
 order to tune the cache for a given application.

 It would make the code that manages pointers to edges a little more
 complicated, though...


Good idea, but maybe difficult for me to implement. As I said, this array
may be less of a problem than rowAARLE.

Of course, I need some real life tests to be able to perform better
diagnostics: maybe I could add some instrumentation / statistics gathered
while running in order to tune Pisces automatically depending on the
user workload.

Regards,
Laurent

On 3/29/2013 6:53 AM, Laurent Bourgès wrote:

 Phil,

 I agree it is a complex issue to improve memory usage while maintaining
 performance at the JDK level: applications can use java2d pisces in very
 different contexts: Swing app (client with only EDT thread), server-side
 application (multi thread headless) ...

 For the moment, I spent a lot of my time understanding the different
 classes in java2d.pisces and analyzing memory usage / performance ... using
 J2DBench (all graphics tests).

 In my Swing application, pisces produces a lot of waste (GC) but on server
 side, the GC overhead can be more important if several threads use pisces.

 Pisces uses memory differently:
 - fixed arrays (dasher, stroker)
 - dynamic arrays (edges ...) rowAARLE (very big one for big shapes)

 For the moment I am trying to avoid memory waste (pooling or kept
 reference) without any memory constraint (no eviction) but I agree it is an
 important aspect for server-side applications.

 To avoid concurrency issues, I use a ThreadLocal context named
 RendererContext to keep a few temporary arrays (float6 and a BIG rowAARLE
 instance), but there are also dynamic IntArrayCache and FloatArrayCache which
 have several pools divided in buckets (256, 1024, 4096, 16384, 32768)
 containing only few instances.

 To have best performance, I studied pisces code to clear only the used
 array parts when recycling or using dirty arrays (only clear
 rowAARLE[...][1]).

 I think Andrea's proposal is interesting to maybe put some system
 properties to give hints (low memory footprint, use cache or not ...).

 2013/3/28 Phil Race philip.r...@oracle.com

  Maintaining a pool of objects might be an appropriate thing for an
 applications,
 but its a lot trickier for the platform as the application's usage pattern
 or intent
 is largely unknown. Weak references or soft references might be of use but
 weak references usually go away even at the next incremental GC and soft
 references tend to not go away at all until you run out of heap.


 Agreed; for the moment, pool eviction policy is not implemented but kept in
 mind.
 FYI: each RendererContext (per thread) has its own array pools (not shared)
 that could have different caching policies:
 For instance, AWT / EDT (repaint) could use a large cache although other
 threads do not use array caching at all.


  You may well be right that always doubling the array size may be too
 simplistic,
 but it would need some analysis of the code and its usage to see how much
 better we can do.



 There are two parts:
 - initial array size for dynamic arrays: difficult to estimate but for now
 set to very low capacity (8 / 50 ...) to avoid memory waste for rectangle /
 line shapes. In my patch, I have defined MIN_ARRAY_SIZE = 128 (array pool)
 to avoid too much resizing as I am doing array recycling.
 - grow: I use x4 instead of x2 to avoid array copies.

 Laurent



 2013/3/28 Phil Race philip.r...@oracle.com

  Maintaining a pool of objects might be an appropriate thing for an
 applications,
 but its a lot trickier for the platform as the application's usage

Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-03-30 Thread Laurent Bourgès
Andrea,

thanks for these estimates / stats.

I think such use case could benefit a lot from my patch (low memory
footprint) but thread-safe cached / reused arrays/ LineIteratorData /
StrokerData / DasherData.

As I said to Jim, there are two sort of problems:
- memory handling (growable arrays) has to be smart (auto-tune)
- clipping issues (dasher, stroker) and spatial resolution (no cpu/memory
waste)


2013/3/30 Andrea Aime andrea.a...@geo-solutions.it

 On Fri, Mar 29, 2013 at 3:16 PM, Laurent Bourgès 
 bourges.laur...@gmail.com wrote:

 Andrea,

 It could be very interesting if you could provide some concrete example
 about your map server: typical image height / width, shape complexity
 (number, size, path sizes ...).


 It's all over the place. Image size: from 256x256 to 1x1, number
 of shapes, from a few hundreds to a few millions, path sizes, from say 10
 points
 to thousands (the code already makes sure there are no two consequent
 points sitting in the same pixel before trying to draw them),
 stroke wise it can be simple or using dasharrays.



 Ideally if you could write a test class (sample) computing a single image
 it would be very helpful for me to compare my patched pisces vs current one.


 I don't have anything ready, the existing code loads data from spatial
 database, sets up styles from XML descriptions, turns each spatial db
 geomety in java shapes
 and so on. But I guess I could concoct something close enough.
 I'm writing some code to gather statistics about the shapes that are
 actually painted (scrolling over the path iterators) and then I'll try to
 make a randomized
 shape painter that gets close to those stats.


Great news ! I am waiting your test code to perform few benchmarks.

I am using J2DBench but it does not perform such complex drawing (complex
shapes ...) or multi threading tests (parallel drawings on buffered images)

Laurent


Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-03-29 Thread Laurent Bourgès
Phil,

I agree it is a complex issue to improve memory usage while maintaining
performance at the JDK level: applications can use java2d pisces in very
different contexts: Swing app (client with only EDT thread), server-side
application (multi thread headless) ...

For the moment, I spent a lot of my time understanding the different
classes in java2d.pisces and analyzing memory usage / performance ... using
J2DBench (all graphics tests).

In my Swing application, pisces produces a lot of waste (GC) but on server
side, the GC overhead can be more important if several threads use pisces.

Pisces uses memory differently:
- fixed arrays (dasher, stroker)
- dynamic arrays (edges ...) rowAARLE (very big one for big shapes)

For the moment I am trying to avoid memory waste (pooling or kept
reference) without any memory constraint (no eviction) but I agree it is an
important aspect for server-side applications.

To avoid concurrency issues, I use a ThreadLocal context named
RendererContext to keep a few temporary arrays (float6 and a BIG rowAARLE
instance), but there are also dynamic IntArrayCache and FloatArrayCache which
have several pools divided in buckets (256, 1024, 4096, 16384, 32768)
containing only few instances.

To have best performance, I studied pisces code to clear only the used
array parts when recycling or using dirty arrays (only clear
rowAARLE[...][1]).
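A minimal sketch of the bucketed array cache described above, with the bucket sizes from the text (256 .. 32768). The structure and names are hypothetical: the actual patch adds pooling per RendererContext, eviction and statistics, and returns "dirty" arrays that the caller clears only partially:

```java
import java.util.ArrayDeque;

class IntArrayCache {
    private static final int[] BUCKET_SIZES = {256, 1024, 4096, 16384, 32768};

    @SuppressWarnings("unchecked")
    private final ArrayDeque<int[]>[] buckets = new ArrayDeque[BUCKET_SIZES.length];

    IntArrayCache() {
        for (int i = 0; i < buckets.length; i++) buckets[i] = new ArrayDeque<>();
    }

    // Smallest bucket whose capacity covers `length`, or -1 if too large to cache.
    private static int bucketFor(int length) {
        for (int i = 0; i < BUCKET_SIZES.length; i++) {
            if (length <= BUCKET_SIZES[i]) return i;
        }
        return -1;
    }

    int[] getArray(int length) {
        int b = bucketFor(length);
        if (b >= 0) {
            int[] a = buckets[b].poll();
            // Recycled arrays are dirty: the caller clears only the used part.
            return (a != null) ? a : new int[BUCKET_SIZES[b]];
        }
        return new int[length]; // oversized request: allocate, never cached
    }

    void putArray(int[] array) {
        int b = bucketFor(array.length);
        if (b >= 0 && array.length == BUCKET_SIZES[b]) buckets[b].offer(array);
    }
}
```

Because each RendererContext owns its own cache, no synchronization is needed on getArray/putArray.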

I think Andrea's proposal is interesting to maybe put some system
properties to give hints (low memory footprint, use cache or not ...).

2013/3/28 Phil Race philip.r...@oracle.com

 Maintaining a pool of objects might be an appropriate thing for an
 applications,
 but its a lot trickier for the platform as the application's usage pattern
 or intent
 is largely unknown. Weak references or soft references might be of use but
 weak references usually go away even at the next incremental GC and soft
 references tend to not go away at all until you run out of heap.


Agreed; for the moment, pool eviction policy is not implemented but kept in
mind.
FYI: each RendererContext (per thread) has its own array pools (not shared)
that could have different caching policies:
For instance, AWT / EDT (repaint) could use a large cache although other
threads do not use array caching at all.


 You may well be right that always doubling the array size may be too
 simplistic,
 but it would need some analysis of the code and its usage to see how much
 better we can do.


There are two parts:
- initial array size for dynamic arrays: difficult to estimate but for now
set to very low capacity (8 / 50 ...) to avoid memory waste for rectangle /
line shapes. In my patch, I have defined MIN_ARRAY_SIZE = 128 (array pool)
to avoid too much resizing as I am doing array recycling.
- grow: I use x4 instead of x2 to avoid array copies.
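The two parts above (minimum initial capacity, x4 growth via Arrays.copyOf) can be sketched as follows; MIN_ARRAY_SIZE follows the text, but the class structure is hypothetical:

```java
import java.util.Arrays;

class GrowableIntArray {
    static final int MIN_ARRAY_SIZE = 128; // minimum capacity, as in the text

    int[] data = new int[MIN_ARRAY_SIZE];
    int used;

    void add(int value) {
        if (used == data.length) {
            // Grow by x4 instead of x2: fewer reallocations and copies for
            // fast-growing edge lists, at the price of larger over-allocation.
            data = Arrays.copyOf(data, data.length * 4);
        }
        data[used++] = value;
    }
}
```

With x4 growth, reaching N elements costs O(log4 N) copies instead of O(log2 N), which matters when edge arrays grow by thousands of entries per shape.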

Laurent



2013/3/28 Phil Race philip.r...@oracle.com

 Maintaining a pool of objects might be an appropriate thing for an
 applications,
 but its a lot trickier for the platform as the application's usage pattern
 or intent
 is largely unknown. Weak references or soft references might be of use but
 weak references usually go away even at the next incremental GC and soft
 references tend to not go away at all until you run out of heap.

 You may well be right that always doubling the array size may be too
 simplistic,
 but it would need some analysis of the code and its usage to see how much
 better we can do.


 Apparently, Arrays.fill is always faster (size in 10 ... 10 000) !
  I suspect hotspot to optimize its code and use native functions, isn't
 it ???

 I suppose there is some hotspot magic involved to recognise and intrinsify
 this
 method, since the source code looks like a plain old for loop.

 -phil.



 On 3/26/2013 4:00 AM, Laurent Bourgès wrote:

 Dear all,

 First I joined recently the openJDK contributors, and I plan to fix
 java2D pisces code in my spare time.

 I have a full time job on Aspro2: http://www.jmmc.fr/aspro; it is an
 application to prepare astronomical observations at VLTI / CHARA and is
 widely used in our community (200 users): it provides scientific computations
 (observability, model images using complex numbers ...) and zoomable plots
 thanks to jFreeChart.

 Aspro2 is known to be very efficient (computation parallelization) and I
 am often doing profiling using netbeans profiler or visualVM.

 To fix huge memory usages by java2d.pisces, I started implementing an
 efficient ArrayCache (int[] and float[]) (in a thread local to avoid
 concurrency problems):
 - arrays in sizes between 10 and 10 000 (more small arrays used than big
 ones)
 - resizing support (Arrays.copyOf) without wasting arrays
 - reentrance i.e. many arrays are used at the same time (java2D Pisces
 stroke / dash creates many segments to render)
 - GC / Heap friendly ie support cache eviction and avoid consuming too
 much memory

 I know object pooling is known to be not efficient with recent VM (GC is
 better) but I think it is counter productive to create so
