Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-25 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

Well explained, Vladimir.

I propose to adopt a cap of max_heap / 128, at minimum, to stay below a 1% footprint.

The shift factors would then be 7 for int and 8 for long, at minimum.

For a 1 GB heap, that gives 8 MB, i.e. 2M ints or 1M longs.
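The proposed cap can be sketched as follows (my reading of the proposal, with illustrative helper names; this is not JDK code):

```java
// Sketch of the proposed cap: limit the extra buffer to max_heap / 128,
// i.e. below 1% of the heap. Helper names are illustrative.
public class BufferCap {

    static long maxBufferBytes(long maxHeapBytes) {
        return maxHeapBytes >> 7;                                // max_heap / 128
    }

    static long maxInts(long maxHeapBytes) {
        return maxBufferBytes(maxHeapBytes) >> 2;                // 4 bytes per int
    }

    static long maxLongs(long maxHeapBytes) {
        return maxBufferBytes(maxHeapBytes) >> 3;                // 8 bytes per long
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.println(maxBufferBytes(oneGb));               // 8388608 bytes = 8 MB
        System.out.println(maxInts(oneGb));                      // 2097152 = 2M ints
        System.out.println(maxLongs(oneGb));                     // 1048576 = 1M longs
    }
}
```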

Laurent

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-15 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

I checked the latest code, line 128:

Max length of the additional buffer, limited to max_heap / 64 or 256M elements
(2 GB max):

private static final int MAX_BUFFER_LENGTH =
(int) Math.min(Runtime.getRuntime().maxMemory() >> 6, 256L << 20);

So the current upper limit on the buffer length is max_heap_bytes / 64 elements
(i.e. max_heap / 8 in bytes for longs), or 2 GB if the heap is huge.
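The quoted limit can be restated as a pure function of the heap size for experimentation (`maxBufferLength` is an illustrative name; the real field reads `Runtime.getRuntime().maxMemory()` directly):

```java
// The quoted limit as a pure function of the max heap size (for experimentation;
// the actual JDK field is computed once from Runtime.getRuntime().maxMemory()).
public class MaxBufferLength {

    static int maxBufferLength(long maxHeapBytes) {
        // length in elements: max_heap / 64, capped at 256M elements
        return (int) Math.min(maxHeapBytes >> 6, 256L << 20);
    }

    public static void main(String[] args) {
        System.out.println(maxBufferLength(1L << 30));   // 16777216 elements for a 1 GB heap
        System.out.println(maxBufferLength(64L << 30));  // 268435456: capped for huge heaps
    }
}
```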

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-15 Thread Laurent Bourgès
On Fri, 13 May 2022 17:48:50 GMT, Piotr Tarsa  wrote:

> allocating extra buffers and catching OOME when sorting primitives is rather 
> unsatisfactory. you're not giving a reliable option for sorting under low 
> memory conditions. IMO at least the single-threaded primitives (ints, longs, 
> floats, etc all non-objects) sorting should be frugal when it comes to memory 
> usage.
> 
> just my 2 cents.

DPQS uses several sorting algorithms; merge sort and the new radix sort need an
extra buffer, unlike insertion sort, mixed insertion sort, and single- and
dual-pivot quicksort.
In this PR an upper limit on heap usage, min(max heap / 16, 128mb), is applied
to reduce the memory footprint.
Moreover, catching OOME now lets DPQS fall back to in-place but slower
algorithms if the extra buffer cannot be allocated.
The upper limit could possibly be made configurable via a system property if
this is really critical.
To sum up, allocations under low memory are now under control in this PR.
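The fallback described above can be sketched like this (an illustration, not the actual DPQS code; `Arrays.sort` stands in for both the buffered and the in-place algorithms):

```java
import java.util.Arrays;

// Sketch of the allocate-or-fall-back pattern: try to allocate the extra
// buffer, and on OutOfMemoryError fall back to an in-place algorithm.
public class OomeFallback {

    static int[] tryAllocate(int length) {
        try {
            return new int[length];
        } catch (OutOfMemoryError oome) {
            return null;                       // caller takes the in-place, slower path
        }
    }

    static void sort(int[] a) {
        int[] buffer = tryAllocate(a.length);
        if (buffer == null) {
            Arrays.sort(a);                    // stand-in for the in-place, no-alloc sort
        } else {
            // stand-in for the buffered merge/radix path
            System.arraycopy(a, 0, buffer, 0, a.length);
            Arrays.sort(buffer);
            System.arraycopy(buffer, 0, a, 0, a.length);
        }
    }
}
```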

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v12]

2022-05-06 Thread Laurent Bourgès
On Mon, 14 Mar 2022 19:12:29 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Improved mixed insertion sort
>   * Optimized insertion sort
>   * Improved merging sort
>   * Optimized sorting tests

I am improving the test code (JMH + robust estimators) to get more robust
results on small arrays (20 - 1000).
I hope to have new results within 1 or 2 weeks.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v10]

2022-01-12 Thread Laurent Bourgès
On Mon, 29 Nov 2021 21:16:32 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Updated javadoc
>   * Optimized insertion sort threshold
>   * Refactored parallel sorting section
>   * Improved step for pivot candidates
>   * Changed condition for Radix sort

Please review this PR (it has been ready for months), or at least leave comments, please!

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v10]

2021-12-12 Thread Laurent Bourgès
On Mon, 29 Nov 2021 21:16:32 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   * Updated javadoc
>   * Optimized insertion sort threshold
>   * Refactored parallel sorting section
>   * Improved step for pivot candidates
>   * Changed condition for Radix sort

Vladimir,
I updated the benchmarking code; here are the complete test results (system
sort vs DPQS 21.5 vs 21.11):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/DPQS-sort-1k-1M-stats.out

Full results:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/sort-1k-1M-stats.out

All good on 1k, 10k, 100k and 1M integer arrays.
Congratulations!
🥳

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: 8265891: (ch) InputStream returned by Channels.newInputStream should override transferTo [v13]

2021-12-06 Thread Laurent Bourgès
On Mon, 6 Dec 2021 07:07:03 GMT, Markus KARG  wrote:

>> Markus KARG has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Draft: Eliminated duplicate code using lambda expressions
>>  - Draft: Use blocking mode also for target channel
>
> Please keep this PR open. More commits will follow on the next weeks.

Looks promising, keep moving, @mkarg !

-

PR: https://git.openjdk.java.net/jdk/pull/4263


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-28 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

As jdk18 Ramp Down 0 is approaching and jdk19 opens soon, maybe this PR review
should be planned for jdk19 build 1?

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-17 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

Core-libs reviewers, could you please have a look before jdk18 RDP0?

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v9]

2021-11-15 Thread Laurent Bourgès
On Mon, 15 Nov 2021 16:20:07 GMT, iaroslavski  wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Updated comments for partitioning

As I am not an official OpenJDK reviewer, I cannot approve, but I do support this change.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8277095 : Empty streams create too many objects

2021-11-15 Thread Laurent Bourgès
On Fri, 5 Nov 2021 12:53:46 GMT, kabutz  wrote:

> This is a draft proposal for how we could improve stream performance for the 
> case where the streams are empty. Empty collections are common-place. If we 
> iterate over them with an Iterator, we would have to create one small 
> Iterator object (which could often be eliminated) and if it is empty we are 
> done. However, with Streams we first have to build up the entire pipeline, 
> until we realize that there is no work to do. With this example, we change 
> Collection#stream() to first check if the collection is empty, and if it is, 
> we simply return an EmptyStream. We also have EmptyIntStream, EmptyLongStream 
> and EmptyDoubleStream. We have taken great care for these to have the same 
> characteristics and behaviour as the streams returned by Stream.empty(), 
> IntStream.empty(), etc. 
> 
> Some of the JDK tests fail with this, due to ClassCastExceptions (our 
> EmptyStream is not an AbstractPipeline) and AssertionError, since we can call 
> some methods repeatedly on the stream without it failing. On the plus side, 
> creating a complex stream on an empty stream gives us upwards of 50x increase 
> in performance due to a much smaller object allocation rate. This PR includes 
> the code for the change, unit tests and also a JMH benchmark to demonstrate 
> the improvement.

Looks like a good idea to reduce the pipeline overhead!
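The idea above can be sketched as a static helper (a simplification: the PR returns dedicated EmptyStream/EmptyIntStream classes with the same characteristics as `Stream.empty()`, rather than `Stream.empty()` itself):

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Stream;

// Sketch: short-circuit stream creation for empty collections, so no
// pipeline is built over the collection's spliterator when there is no work.
public class EmptyStreamShortcut {

    static <T> Stream<T> stream(Collection<T> c) {
        return c.isEmpty() ? Stream.empty() : c.stream();
    }

    public static void main(String[] args) {
        System.out.println(stream(List.of()).count());      // 0
        System.out.println(stream(List.of(1, 2, 3)).count()); // 3
    }
}
```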

-

PR: https://git.openjdk.java.net/jdk/pull/6275


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v6]

2021-10-08 Thread Laurent Bourgès
On Wed, 6 Oct 2021 21:21:29 GMT, iaroslavski 
 wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Added more comments

It would be possible to use an upper limit on the array size to disable radix
sort and reduce the memory footprint and GC overhead...
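For context, here is a minimal LSD radix sort for ints (a sketch, not the JDK implementation): it needs a scratch buffer the same length as the input, which is exactly the footprint such a size limit would bound.

```java
import java.util.Arrays;

// Minimal LSD radix sort for int[]: four 8-bit counting passes, with the
// sign bit of the top byte flipped so negative values order correctly.
public class SimpleRadixSort {

    static void radixSort(int[] a) {
        int[] b = new int[a.length];               // full-length extra buffer
        for (int shift = 0; shift < 32; shift += 8) {
            int xor = (shift == 24) ? 0x80 : 0;    // flip sign bit on the top byte
            int[] count = new int[256];
            for (int v : a) {
                count[((v >>> shift) & 0xFF) ^ xor]++;
            }
            for (int i = 1; i < 256; i++) {
                count[i] += count[i - 1];          // prefix sums -> end positions
            }
            for (int i = a.length - 1; i >= 0; i--) {   // backward pass keeps stability
                b[--count[((a[i] >>> shift) & 0xFF) ^ xor]] = a[i];
            }
            int[] tmp = a; a = b; b = tmp;         // swap roles for the next pass
        }
        // 4 passes = an even number of swaps, so the sorted data ends up
        // back in the array the caller passed in.
    }

    public static void main(String[] args) {
        int[] a = {5, -3, 0, Integer.MIN_VALUE, Integer.MAX_VALUE, -1};
        radixSort(a);
        System.out.println(Arrays.toString(a));
    }
}
```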

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) [v6]

2021-10-08 Thread Laurent Bourgès
On Wed, 6 Oct 2021 21:21:29 GMT, iaroslavski 
 wrote:

>> Sorting:
>> 
>> - adopt radix sort for sequential and parallel sorts on 
>> int/long/float/double arrays (almost random and length > 6K)
>> - fix tryMergeRuns() to better handle case when the last run is a single 
>> element
>> - minor javadoc and comment changes
>> 
>> Testing:
>> - add new data inputs in tests for sorting
>> - add min/max/infinity values to float/double testing
>> - add tests for radix sort
>
> iaroslavski has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)
>   
>   Added more comments

I want to give my point of view on the new memory requirements:
- as radix sort is used for length > 4k, a temporary buffer (size = array
length) will be needed more often, so more young-gen GC passes will happen,
and maybe full GCs if this buffer is really big (half the heap?)
- if OOME is encountered when the buffer is allocated, then DPQS is used
(in-place, no allocation), so the new sorting is more robust than DPQS in jdk17

Sorting tests could be run with verbose GC to ascertain the increased GC work
and latency...

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Fri, 14 May 2021 17:50:11 GMT, Piotr Tarsa 
 wrote:

>> I made a JMH test on jdk16 to test count4 (xor) performance:
>> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/results/count_xor
>> 
>> 
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayXorBenchmark.arrayAndOriginal 100 avgt 20 684561,951 ± 2177,120 
>> ns/op
>> ArrayXorBenchmark.arrayXorOriginal 100 avgt 20 777255,465 ± 1815,136 
>> ns/op
>> ArrayXorBenchmark.arrayXor_Masked 100 avgt 20 814163,377 ± 2665,436 ns/op
>> ArrayXorBenchmark.arrayXor_Unsafe 100 avgt 20 707273,922 ± 2994,380 ns/op
>> 
>> The masked xor does not get optimized by C2 either.
>> 
>> Using Unsafe is better, see:
>> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/src/main/java/edu/sorting/bench/ArrayXorBenchmark.java
>> 
>> If you want, I could make another radixsort() using on or off heap count 
>> buffers and Unsafe, as I did in Marlin to avoid bound checks...
>
> I think an uncontroversial and efficient solution would be to replace
> 
> count4[(a[i] >>> 24) ^ 0x80]--;
> 
> with
> 
> count4[(a[i] >>> 24) & 0xFF]--;
> 
> and after finishing counting, first half of `count4` should be swapped with 
> second half of `count4`.

FYI I made another radixsortNew() using Unsafe to access all the count arrays:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/dbfbd731ffd798defc75528cc1db04063bdb4619/src/edu/sorting/DualPivotQuicksort202105.java#L795

But it is only 2% faster in radix-sort-benchmark (1M arrays):
https://github.com/bourgesl/radix-sort-benchmark/blob/main/results/2021-ref/cmp-16-1M-full.log

It does not look worth the extra complexity for a few percent, I think.
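Piotr's suggestion quoted above can be verified with a small sketch: counting with `(v >>> 24) & 0xFF` and then swapping the two halves of the histogram yields exactly the same table as counting with `(v >>> 24) ^ 0x80` directly.

```java
// Demonstrates the equivalence behind Piotr's suggestion: masked counting
// followed by a half-swap equals xor-based counting, because i ^ 0x80 maps
// [0,127] <-> [128,255].
public class HistogramHalves {

    static int[] countXor(int[] data) {
        int[] c = new int[256];
        for (int v : data) c[(v >>> 24) ^ 0x80]++;
        return c;
    }

    static int[] countMaskedThenSwap(int[] data) {
        int[] c = new int[256];
        for (int v : data) c[(v >>> 24) & 0xFF]++;
        int[] swapped = new int[256];
        System.arraycopy(c, 0, swapped, 128, 128);   // first half -> second half
        System.arraycopy(c, 128, swapped, 0, 128);   // second half -> first half
        return swapped;
    }
}
```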

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 20 May 2021 21:21:26 GMT, Nils Eliasson  wrote:

>> A small update of the XorINodes type calculation makes the bound check go 
>> away. Testing now.
>
> That bounds check shouldn't be there, it's an obvious case and will be fixed: 
> https://github.com/openjdk/jdk/pull/4136

Thanks for fixing hotspot C2 (xor) so quickly !

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Fri, 14 May 2021 07:41:19 GMT, Richard Startin 
 wrote:

>> Great analysis on C2, Richard.
>> 
>> maybe (x ^ 0x80) & 0xFF would help C2 to eliminate bound checks...
>
> I don't know Laurent, I find the handling of signed order over-complicated. 
> Subtracting `Integer.MIN_VALUE` is really cheap...

I made a JMH test on jdk16 to test count4 (xor) performance:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/results/count_xor


Benchmark (size) Mode Cnt Score Error Units
ArrayXorBenchmark.arrayAndOriginal 100 avgt 20 684561,951 ± 2177,120 ns/op
ArrayXorBenchmark.arrayXorOriginal 100 avgt 20 777255,465 ± 1815,136 ns/op
ArrayXorBenchmark.arrayXor_Masked 100 avgt 20 814163,377 ± 2665,436 ns/op
ArrayXorBenchmark.arrayXor_Unsafe 100 avgt 20 707273,922 ± 2994,380 ns/op

The masked xor does not get optimized by C2 either.

Using Unsafe is better, see:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/src/main/java/edu/sorting/bench/ArrayXorBenchmark.java

If you want, I could make another radixsort() using on- or off-heap count
buffers and Unsafe, as I did in Marlin to avoid bounds checks...

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 13 May 2021 09:53:37 GMT, Richard Startin 
 wrote:

>> Laurent, the date in this class is not the date of our last commit,
>> this date is the date when I have final idea regarding to Radix sort,
>> therefore, I prefer to keep 2020.06.14
>
> Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I 
> believe this work derives from an unsigned radix sort I implemented on 
> 10/04/2021 
> https://github.com/richardstartin/radix-sort-benchmark/commit/ab4da230e1d0ac68e5ee2cee38d71c7e7d50f49b#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226
>  which has numerous structural similarities to this work:
> * Producing all four histograms in one pass
> * Skipping passes based on detecting the total in the histogram
> * Bailing out of the skip detection if a nonzero value not equal to the total 
> is encountered
> * Manually unrolling the LSD radix sort loop in order to avoid array copies
> 
> My implementation from 10th April is below for reference:
> 
>   public static void unrollOnePassHistogramsSkipLevels(int[] data) {
> int[] histogram1 = new int[257];
> int[] histogram2 = new int[257];
> int[] histogram3 = new int[257];
> int[] histogram4 = new int[257];
> 
> for (int value : data) {
>   ++histogram1[(value & 0xFF) + 1];
>   ++histogram2[((value >>> 8) & 0xFF) + 1];
>   ++histogram3[((value >>> 16) & 0xFF) + 1];
>   ++histogram4[(value >>> 24) + 1];
> }
> boolean skipLevel1 = canSkipLevel(histogram1, data.length);
> boolean skipLevel2 = canSkipLevel(histogram2, data.length);
> boolean skipLevel3 = canSkipLevel(histogram3, data.length);
> boolean skipLevel4 = canSkipLevel(histogram4, data.length);
> 
> if (skipLevel1 && skipLevel2 && skipLevel3 && skipLevel4) {
>   return;
> }
> int[] copy = new int[data.length];
> 
> int[] source = data;
> int[] dest = copy;
> 
> if (!skipLevel1) {
>   for (int i = 1; i < histogram1.length; ++i) {
> histogram1[i] += histogram1[i - 1];
>   }
>   for (int value : source) {
> dest[histogram1[value & 0xFF]++] = value;
>   }
>   if (!skipLevel2 || !skipLevel3 || !skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel2) {
>   for (int i = 1; i < histogram2.length; ++i) {
> histogram2[i] += histogram2[i - 1];
>   }
>   for (int value : source) {
> dest[histogram2[(value >>> 8) & 0xFF]++] = value;
>   }
>   if (!skipLevel3 || !skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel3) {
>   for (int i = 1; i < histogram3.length; ++i) {
> histogram3[i] += histogram3[i - 1];
>   }
>   for (int value : data) {
> dest[histogram3[(value >>> 16) & 0xFF]++] = value;
>   }
>   if (!skipLevel4) {
> int[] tmp = dest;
> dest = source;
> source = tmp;
>   }
> }
> 
> if (!skipLevel4) {
>   for (int i = 1; i < histogram4.length; ++i) {
> histogram4[i] += histogram4[i - 1];
>   }
>   for (int value : source) {
> dest[histogram4[value >>> 24]++] = value;
>   }
> }
> if (dest != data) {
>   System.arraycopy(dest, 0, data, 0, data.length);
> }
>   }
> 
>   private static boolean canSkipLevel(int[] histogram, int dataSize) {
> for (int count : histogram) {
>   if (count == dataSize) {
> return true;
>   } else if (count > 0) {
> return false;
>   }
> }
> return true;
>   }
> 
> 
> Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with 
> me about doing so. On 25/04/2021 there was a new implementation of 
> `DualPivotQuicksort` with a signed radix sort but the same structural 
> similarities, and with the same method and variable names in places 
> https://github.com/bourgesl/radix-sort-benchmark/commit/90ff7e427da0fa49f374bff0241fb2487bd87bde#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609
> 
> 
> // TODO add javadoc
> private static void radixSort(Sorter sorter, int[] a, int low, int high) {
> int[] b;
> // LBO: prealloc (high - low) +1 element:
> if (sorter == null || (b = sorter.b) == null || b.length < (high - 
> low)) {
> // System.out.println("alloc b: " + (high - low));
> b = new int[high - low];
> }
> 
> int[] count1, count2, count3, count4;
> if (sorter != null) {
> sorter.resetRadixBuffers();
> count1 = sorter.count1;
> count2 = sorter.count2;
> count3 = sorter.count3;
> count4 = sorter.count4;
> } else {
> // System.out.println("alloc radix buffers(4x256)");
> count1 = new int[256];
> count2 = new int[256];
> count3 = new int[256];
>  

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Thu, 13 May 2021 21:44:38 GMT, Richard Startin 
 wrote:

>> For what it's worth, I benchmarked this implementation radix sort ([adapted 
>> here to fit in to my 
>> harness](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR681-R710782))
>>  against a [signed 
>> variant](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR396-R478)
>>  of what I have claimed this work was derived from and the proposed 
>> implementation does not perform favourably on uniformly random data:
>> 
>> 
>> 
>> Benchmark (bits)  (padding)  
>> (scenario)  (seed)   (size)  Mode  Cnt  Score Error  Units
>> RadixSortBenchmark.jdk17  7 
>> UNIFORM   0  100  avgt5  11301.950 ± 113.691  us/op
>> RadixSortBenchmark.jdk23  7 
>> UNIFORM   0  100  avgt5  11792.351 ±  60.757  us/op
>> RadixSortBenchmark.jdk30  7 
>> UNIFORM   0  100  avgt5  11184.616 ±  67.094  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  17  7 
>> UNIFORM   0  100  avgt5   9564.626 ±  69.497  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  23  7 
>> UNIFORM   0  100  avgt5   9432.085 ±  58.983  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned  30  7 
>> UNIFORM   0  100  avgt5  10772.975 ±  51.848  us/op
>> 
>> 
>> 
>> I believe the root cause is a defect in the mechanism employed to skip 
>> passes as can be seen by the increased number of instructions and cycles 
>> here. In the proposed implementation, instructions is roughly constant as a 
>> function of bits. In the case where all passes must be performed (bits = 
>> 30), IPC is superior in `unrollOnePassHistogramsSkipLevelsSigned`.
>> 
>> 
>> Benchmark   
>> (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt Score 
>> Error  Units
>> RadixSortBenchmark.jdk:cycles   
>> 17  7 UNIFORM   0  100  avgt   34976971.877  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 17  7 UNIFORM   0  100  avgt   70121142.003  
>>#/op
>> RadixSortBenchmark.jdk:cycles   
>> 23  7 UNIFORM   0  100  avgt   32369970.385  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 23  7 UNIFORM   0  100  avgt   70201664.963  
>>#/op
>> RadixSortBenchmark.jdk:cycles   
>> 30  7 UNIFORM   0  100  avgt   30789736.602  
>>#/op
>> RadixSortBenchmark.jdk:instructions 
>> 30  7 UNIFORM   0  100  avgt   70180942.122  
>>#/op
>> RadixSortBenchmark.jdk:IPC  
>> 30  7 UNIFORM   0  100  avgt  2.279  
>>   insns/clk
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 17  7 UNIFORM   0  100  avgt   26983994.479  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 17  7 UNIFORM   0  100  avgt   62065304.827  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 23  7 UNIFORM   0  100  avgt   26161703.124  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 23  7 UNIFORM   0  100  avgt   63102683.167  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles 
>> 30  7 UNIFORM   0  100  avgt   29780103.795  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions   
>> 30  7 UNIFORM   0  100  avgt   74437097.025  
>>#/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:IPC
>> 30  7 UNIFORM   0  100  avgt  2.500  
>>   insns/clk
>> 
>> 
>> 
>> The biggest difference in executed code appears to be that bounds checks are 
>> not eliminated in the first counting pass in the proposed implementation:
>> 
>> 
>> [Hottest Region 
>> 1]..

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Wed, 12 May 2021 12:20:09 GMT, iaroslavski 
 wrote:

>> src/java.base/share/classes/java/util/DualPivotQuicksort.java line 47:
>> 
>>> 45:  * @author Doug Lea
>>> 46:  *
>>> 47:  * @version 2020.06.14
>> 
>> Vladimir, I would update to 2021.05.06 (+your hash)
>
> Laurent, the date in this class is not the date of our last commit,
> this date is the date when I have final idea regarding to Radix sort,
> therefore, I prefer to keep 2020.06.14

As you want, but you should maybe indicate which version corresponds to your
master code too.

It would help to track changes between forks (iaroslavski/sorting master).

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Laurent Bourgès
On Sat, 8 May 2021 20:54:48 GMT, iaroslavski 
 wrote:

> Sorting:
> 
> - adopt radix sort for sequential and parallel sorts on int/long/float/double 
> arrays (almost random and length > 6K)
> - fix tryMergeRuns() to better handle case when the last run is a single 
> element
> - minor javadoc and comment changes
> 
> Testing:
> - add new data inputs in tests for sorting
> - add min/max/infinity values to float/double testing
> - add tests for radix sort

Congratulations,
what an amazing job!

DPQS is now 50% faster on average, and 2.5 times faster on rms data (my favorite)!

Finally, OOME is now handled to let sort work under low-memory conditions!

I will work on more benchmarks for the long/float/double types.

Laurent

Still waiting for individual OCA approval

src/java.base/share/classes/java/util/DualPivotQuicksort.java line 47:

> 45:  * @author Doug Lea
> 46:  *
> 47:  * @version 2020.06.14

Vladimir, I would update to 2021.05.06 (+your hash)

src/java.base/share/classes/java/util/DualPivotQuicksort.java line 288:

> 286: /*
> 287:  * Invoke radix sort on large array.
> 288:  */

I prefer testing (sorter == null) first, as it is always true for sequential
sort, to avoid testing bits > ... in that case.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: How to know if cpu supports unaligned memory accesses ?

2021-04-19 Thread Laurent Bourgès
Thanks, Alan !

I should probably adopt VarHandles and use their int/long views... (for
jdk11u).

Or adapt my arrays/off-heap buffers (struct-like) using primitive objects
(JEP 401) for jdk17!

It looks very promising, as the Marlin renderer uses arrays of structs,
packed in arrays or off-heap (edges, RLE tile cache).
Cheers,
Laurent

Le dim. 18 avr. 2021 à 09:11, Alan Bateman  a
écrit :

>
> On 17/04/2021 19:02, Laurent Bourgès wrote:
> > Hi,
> > In the Marlin renderer & in other Unsafe use cases (pixel tile
> processing)
> > for java2d pipelines, I do use putInt()/putLong() primitives even if the
> > address is not aligned (to 4 or 8 bytes) to get faster processing of byte
> > buffer or arrays...
> >
> > Is it possible in java (jdk internals) to query cpu capabilities like
> > endianness, unaligned support ?
> >
> Since Marlin is in the JDK and using Unsafe already then can't it just
> using the isBigEndian and unalignedAccess methods? The Buffer
> implementation and a few places do the same.
>
> -Alan.
>


How to know if cpu supports unaligned memory accesses ?

2021-04-17 Thread Laurent Bourgès
Hi,
In the Marlin renderer and in other Unsafe use cases (pixel tile processing)
for Java2D pipelines, I use putInt()/putLong() primitives even if the
address is not aligned (to 4 or 8 bytes) to get faster processing of byte
buffers or arrays...

Is it possible in Java (JDK internals) to query CPU capabilities like
endianness and unaligned-access support?

For x86-64, unaligned ~ aligned, so it is not necessary to add padding in
the data layout, and it is better to have smaller buffers.

I would like to write code like:

    if (cpu_supports_unaligned) {
        // use int/long to pack bytes...
    } else {
        // use byte (fallback)
    }
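One portable way (JDK 9+) to get this without Unsafe is a byte-array view VarHandle: its plain get/set access modes support unaligned offsets, and `ByteOrder.nativeOrder()` answers the endianness question. A sketch:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Reads/writes an int at an arbitrary (possibly unaligned) byte offset of a
// byte[] via a view VarHandle; plain get/set allow misaligned access.
public class UnalignedAccess {

    static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static void putInt(byte[] buf, int byteOffset, int value) {
        INT.set(buf, byteOffset, value);             // plain set: unaligned offset is fine
    }

    static int getInt(byte[] buf, int byteOffset) {
        return (int) INT.get(buf, byteOffset);
    }

    public static void main(String[] args) {
        System.out.println(ByteOrder.nativeOrder()); // portable endianness query
        byte[] buf = new byte[8];
        putInt(buf, 3, 0x11223344);                  // odd offset on purpose
        System.out.println(Integer.toHexString(getInt(buf, 3)));
    }
}
```

Note that volatile/atomic access modes of a view VarHandle do require alignment; only the plain modes are alignment-agnostic.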


Laurent


Re: 8252827: Caching Integer.toString just like Integer.valueOf

2021-04-17 Thread Laurent Bourgès
Hi,

I read the JBS bug and interpret it as:
- IntegerCache provides Integer instances for [-128, 127] by default
- Integer.toString(int) could behave the same, or at least cache the most
probable values like [-1, 16], or reuse the IntegerCache range.

It looks trivial and could potentially benefit the JDK itself by reducing
memory garbage: is Integer.toString(int) widely used in the JDK codebase?

Alternatively, it can be implemented in application code.
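One way to do that in application code (a sketch mirroring the IntegerCache range [-128, 127]; the class name is illustrative):

```java
// Application-side cache of Integer.toString results for the small-int range,
// analogous to java.lang.Integer's IntegerCache for valueOf().
public final class IntStringCache {

    private static final int LOW = -128, HIGH = 127;
    private static final String[] CACHE = new String[HIGH - LOW + 1];

    static {
        for (int i = LOW; i <= HIGH; i++) {
            CACHE[i - LOW] = Integer.toString(i);   // precompute once
        }
    }

    public static String toString(int i) {
        return (i >= LOW && i <= HIGH) ? CACHE[i - LOW] : Integer.toString(i);
    }
}
```

Within the cached range, repeated calls return the same String instance, so no garbage is produced.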

My 2 cents,
Laurent

Le sam. 17 avr. 2021 à 11:06, Raffaello Giulietti <
raffaello.giulie...@gmail.com> a écrit :

>
>
> On 2021-04-17 07:07, David Holmes wrote:
> > On 17/04/2021 4:54 am, Raffaello Giulietti wrote:
> >> I guess the reporter meant to limit the cache range similarly to the
> >> one used for valueOf().
> >>
> >> I have no clue about the benefit/cost ratio for the proposed String
> >> cache. It really depends on usage, workload, etc. One can easily
> >> imagine both extreme scenarios but it's hard to tell how the average
> >> one would look.
> >>
> >> My post is only about either solving the issue by implementing the
> >> cache, which I can contribute to; or closing it because of lack of
> >> real-world need or interest.
> >
> > Caching for the sake of caching is not an objective in itself. Unless
> > the caching can be shown to solve a real problem, and the strategy for
> > managing the cache is well-defined, then I would just close the
> > enhancement request. (Historically whether an issue we don't have any
> > firm plans to address is just left open "forever" or closed, depends
> > very much on who does the bug triaging in that area. :) )
> >
> > Cheers,
> > David
> >
>
>
> Indeed, the impression is that many of the issues are probably open
> because it's unclear whether they should be addressed with some
> implementation or spec effort or whether they should be closed. Triaging
> is certainly a harder job than it appears at first sight ;-)
>
> It would be useful to have a kind of "suspended" or "limbo" resolution
> state on the JBS for issues like this one, so people searching for more
> compelling open ones would not encounter them.
>
> Personally, I would close this issue without a "fix"; or "suspend" it.
>
>
> Greetings
> Raffaello
>
>
>
> >>
> >> Greetings
> >> Raffaello
> >>
> >>
> >> On 2021-04-16 20:36, Roger Riggs wrote:
> >>> Hi,
> >>>
> >>> Is there any way to quantify the savings?
> >>> And what technique can be applied to limit the size of the cache.
> >>> The size of the small integer cache is somewhat arbitrary.
> >>>
> >>> Regards, Roger
> >>>
> >>> On 4/16/21 12:48 PM, Raffaello Giulietti wrote:
>  Hello,
> 
>  does the enhancement proposed in [1] make sense, both today and when
>  wrappers will be migrated to primitive classes?
>  If so, it should be rather simple to add it and I could prepare a PR.
>  If not, the issue might perhaps be closed.
> 
> 
>  Greetings
>  Raffaello
> 
>  
> 
>  [1] https://bugs.openjdk.java.net/browse/JDK-8252827
> >>>
>


Re: RFR(S): 8242848: Improve performance of InflaterOutputStream.write()

2020-04-23 Thread Laurent Bourgès
Thank you Volker for sharing your results.

It is amazing to improve such a widely used feature (zip -> jar, HTTP
compression, ...).

Congratulations.

Laurent

Le jeu. 23 avr. 2020 à 12:19, Volker Simonis  a
écrit :

> On Thu, Apr 23, 2020 at 10:56 AM Laurent Bourgès
>  wrote:
> >
> > Hi Volker,
> >
> > Could you give some benchmark results in the jbs bug to have an idea of
> the performance gain ?
>
> The results of a benchmark run are at the end of the microbenchmark
> which is attached to https://bugs.openjdk.java.net/browse/JDK-8242848
>
> I've attached them here for your convenience. The first run is before,
> the second run after applying this patch.
>
> E.g. for a user supplied buffer of size 16384, the time for inflation
> decreases from "2.603 ± 0.404  ms/op" to "2.187 ± 0.126  ms/op" which
> is an improvement of 16%. Obviously, increasing the default internal
> buffer size from 512 to 16384 (which I want to do in
> https://bugs.openjdk.java.net/browse/JDK-8242864) will get you even
> bigger speedups if you use the default buffer size :)
>
> /output/jdk-opt/images/jdk/bin/java
> --
> Benchmark                        (scale)  (size)  Mode  Cnt  Score   Error  Units
> InflaterOutputStreamWrite.write        1     512  avgt    3  5.378 ± 0.301  ms/op
> InflaterOutputStreamWrite.write        1    1024  avgt    3  3.917 ± 0.165  ms/op
> InflaterOutputStreamWrite.write        1    2048  avgt    3  3.158 ± 0.097  ms/op
> InflaterOutputStreamWrite.write        1    4096  avgt    3  2.707 ± 0.138  ms/op
> InflaterOutputStreamWrite.write        1    8192  avgt    3  2.600 ± 0.399  ms/op
> InflaterOutputStreamWrite.write        1   16384  avgt    3  2.603 ± 0.404  ms/op
> InflaterOutputStreamWrite.write        1   32768  avgt    3  2.622 ± 0.211  ms/op
> InflaterOutputStreamWrite.write        1   65536  avgt    3  2.605 ± 0.170  ms/op
> InflaterOutputStreamWrite.write        2     512  avgt    3  4.015 ± 0.150  ms/op
> InflaterOutputStreamWrite.write        2    1024  avgt    3  3.225 ± 0.178  ms/op
> InflaterOutputStreamWrite.write        2    2048  avgt    3  2.745 ± 0.261  ms/op
> InflaterOutputStreamWrite.write        2    4096  avgt    3  2.614 ± 0.542  ms/op
> InflaterOutputStreamWrite.write        2    8192  avgt    3  2.593 ± 0.206  ms/op
> InflaterOutputStreamWrite.write        2   16384  avgt    3  2.606 ± 0.055  ms/op
> InflaterOutputStreamWrite.write        2   32768  avgt    3  2.611 ± 0.116  ms/op
> InflaterOutputStreamWrite.write        2   65536  avgt    3  2.617 ± 0.170  ms/op
> InflaterOutputStreamWrite.write        4     512  avgt    3  3.376 ± 0.599  ms/op
> InflaterOutputStreamWrite.write        4    1024  avgt    3  2.840 ± 0.155  ms/op
> InflaterOutputStreamWrite.write        4    2048  avgt    3  2.633 ± 0.550  ms/op
> InflaterOutputStreamWrite.write        4    4096  avgt    3  2.598 ± 0.166  ms/op
> InflaterOutputStreamWrite.write        4    8192  avgt    3  2.602 ± 0.054  ms/op
> InflaterOutputStreamWrite.write        4   16384  avgt    3  2.601 ± 0.039  ms/op
> InflaterOutputStreamWrite.write        4   32768  avgt    3  2.639 ± 0.020  ms/op
> InflaterOutputStreamWrite.write        4   65536  avgt    3  2.619 ± 0.260  ms/op
> InflaterOutputStreamWrite.write        8     512  avgt    3  2.882 ± 0.149  ms/op
> InflaterOutputStreamWrite.write        8    1024  avgt    3  2.695 ± 0.586  ms/op
> InflaterOutputStreamWrite.write        8    2048  avgt    3  2.644 ± 0.472  ms/op
> InflaterOutputStreamWrite.write        8    4096  avgt    3  2.616 ± 0.052  ms/op
> InflaterOutputStreamWrite.write        8    8192  avgt    3  2.616 ± 0.063  ms/op
> InflaterOutputStreamWrite.write        8   16384  avgt    3  2.611 ± 0.090  ms/op
> InflaterOutputStreamWrite.write        8   32768  avgt    3  2.633 ± 0.216  ms/op
> InflaterOutputStreamWrite.write        8   65536  avgt    3  2.634 ± 0.180  ms/op
>
> /priv/simonisv/output/jdk-opt-JDK-8242848/images/jdk/bin/java
> --
> Benchmark                        (scale)  (size)  Mode  Cnt  Score   Error  Units
> InflaterOutputStreamWrite.write        1     512  avgt    3  5.388 ± 0.349  ms/op
> InflaterOutputStreamWrite.write        1    1024  avgt    3  3.799 ± 0.093  ms/op
> InflaterOutputStreamWrite.write        1    2048  avgt    3  2.994 ± 0.023  ms/op
> InflaterOutputStreamWrite.write        1    4096  avgt    3  2.583 ± 0.159  ms/op
> InflaterOutputStreamWrite.write        1    8192  avgt    3  2.325 ±

Re: RFR(S): 8242848: Improve performance of InflaterOutputStream.write()

2020-04-23 Thread Laurent Bourgès
Hi Volker,

Could you give some benchmark results in the JBS bug, to give an idea of the
performance gain?

Thanks,
Laurent

Le jeu. 23 avr. 2020 à 10:20, Langer, Christoph 
a écrit :

> Hi Volker,
>
> since it's not yet pushed, I also went over the change and I like it. +1 😊
>
> One little style nit caught my eye, which you could correct before
> pushing: The style of the if/else blocks in
> test/jdk/java/util/zip/DeflateIn_InflateOut.java in lines 48/49 and 59/60
> does not match the other if/else blocks in the file. You should probably
> have the else on the line of the closing bracket before...
>
> Thanks
> Christoph
>
>
> > -Original Message-
> > From: core-libs-dev  On Behalf
> > Of Volker Simonis
> > Sent: Mittwoch, 22. April 2020 22:09
> > To: Lance Andersen 
> > Cc: Java Core Libs ; Vyom Tewari
> > 
> > Subject: Re: RFR(S): 8242848: Improve performance of
> > InflaterOutputStream.write()
> >
> > On Tue, Apr 21, 2020 at 5:23 PM Lance Andersen
> >  wrote:
> > >
> > > Hi Volker,
> > >
> > > I think overall this looks OK.  I went through the older SCCS
> histories to see
> > if I could figure out why they were using 512 for the input length but
> could
> > not find anything that might shed some light for me.
> > >
> >
> > Hi Lance,
> >
> > thanks a lot for digging in the old sources to review my change. It's
> > great that we stil have people who can use SCCS :)
> >
> > > I am not sure you can guarantee that src.zip exists but others might
> be able
> > to weigh in here.  What we have been trying to do going forward is to
> have
> > the tests create  the zip files that it needs.  In some cases, we have
> created a
> > byte array within the test which represents the zip and just write it out
> > before the test begins.
> > >
> >
> > Yes, the dependency on an external file was not nice, so I rewrote the
> > benchmark to programmatically create a file which can be compressed by
> > a factor of ~6:
> >
> > http://cr.openjdk.java.net/~simonis/webrevs/2020/8242848.02/
> >
> > Notice that this new version only changes the microbenchmark, all the
> > other files are untouched.
> >
> > As everybody seemed to be happy with the change itself and the
> > regression test, I'm now waiting for your and Clae's final review of
> > the microbenchmark before pushing. Please let me know what you think?
> >
> > Best regards,
> > Volker
> >
> > > I am hoping others with more history might also chime in case they are
> > aware of the history here.
> > >
> > > Thank you for helping improve the performance.
> > >
> > > Best
> > > Lance
> > >
> > > On Apr 17, 2020, at 6:49 AM, Volker Simonis 
> > wrote:
> > >
> > > Thanks everybody for looking at this change!
> > >
> > > Please find an updated webrev (with a new test and micro-benchmark)
> > > and my answers to your comments below:
> > >
> > > http://cr.openjdk.java.net/~simonis/webrevs/2020/8242848.01/
> > >
> > > On Thu, Apr 16, 2020 at 6:40 AM Vyom Tiwari 
> > wrote:
> > >
> > > Thanks for doing this, i think the below code change is not required .
> > >
> > > Please do let me know if i am not reading it correctly.
> > >
> > > if (inf.finished() || (len == 0)/* no more input */) {
> > >
> > > If you check the native code "inf.finished() will be true only of the
> > corresponding native call inflate(strm, Z_PARTIAL_FLUSH) returns
> > "Z_STREAM_END".
> > >
> > > After your code change  write may return even if finished() is not
> true.
> > >
> > >
> > > Yes, that's true, but that's what we must do if there's no more input
> > > available. Before my change this break on "len < 1" was in the "if
> > > (inf.needsInput())" branch.
> > >
> > > On Thu, Apr 16, 2020 at 8:22 AM Thomas Stüfe
> >  wrote:
> > >
> > > 252 // Check the decompressor
> > > 253 if (inf.finished() || (len == 0)/* no more input
> */) {
> > > 254 break;
> > > 255 }
> > >
> > > Not sure but I think this is wrong because now you bypass the followup
> > handling of inf.needsDictionary.
> > >
> > Inflater.inflate() returns 0 if either needsInput or needsDictionary.
> We have
> > to ignore needsInput since we have no input anymore, but needsDictionary
> > has to be handled, no?
> > >
> > >
> > > You're absolutely right Thomas. Thanks for catching this! I've moved
> > > the check for "needsDictionary" in front of the "finished() || len ==
> > > 0" check.
> > >
> > > Unfortunately there is not very good test coverage for zip with preset
> > > dictionaries (jtreg and submit repo passed without problems). So I
> > > added a new test for this use case to "
> > > test/jdk/java/util/zip/DeflateIn_InflateOut.java".
> > >
> > > On Thu, Apr 16, 2020 at 8:48 AM Thomas Stüfe
> >  wrote:
> > >
> > >
> > > As for increasing the buffer size, it makes sense but IMHO needs a CSR
> and
> > a release note.
> > >
> > >
> > > I don't think so. This is an internal, implementation specific setting
> > > which has never been exposed or documented before so I don't se

Re: RFR: 8226297: Dual-pivot quicksort improvements

2019-08-07 Thread Laurent Bourgès
Hi Brent,

I made a manual comparison of your webrev.4 and my previous patch.
- I noticed plenty of javadoc / comment changes that seem totally
appropriate in Arrays class: ok for me.
- You also fixed the problem with the ForkJoinPool field: good.
- No change in the DualPivotQuickSort class except copyright header: OK
(you reverted reordering I suppose)

My former question was more about the process: how to produce incremental
webrevs easily? It is easier for me to take a quick look than to inspect
the global patch, where the DPQS class was almost rewritten.

Thanks for your help,
Laurent

Le mer. 7 août 2019 à 01:33, Brent Christian  a
écrit :

> On 8/6/19 12:47 PM, Vladimir Yaroslavskiy wrote:
> >
> > I moved Object sorting related stuff after primitives sorting methods
> > to separate functionality logically.
>
> Sure, fine to keep that all together.  I can move that back:
> http://cr.openjdk.java.net/~bchristi/8226297/webrev04/
>
> > The order of methods in my version is:
> >
> > 1. primitives (sequential sorting)
> > - int
> > - long
> > - byte
> > - char
> > - short
> > - float
> > - double
>
> The order for sequential sorting of primitives in Arrays.java checked
> into the JDK is:
>
> - int
> - long
> * short
> * char
> * byte
> - float
> - double
>
> It simplifies the webrev for reviewing to keep that ordering, so that's
> what I've done.
>
> Thanks,
> -Brent
>
> > Tuesday, August 6, 2019, 21:35 +03:00 from Brent Christian:
> >
> > Hi, Laurent
> >
> > I'm not sure what exactly is causing the problem, but here's my
> hunch:
> >
> > In Vladimir's version of Arrays.java:
> > MIN_ARRAY_SORT_GRAN
> > NaturalOrder
> > rangeCheck
> > got moved around, but were unchanged.
> >
> > In the interest of keeping the change as simple as possible, I
> restored
> > these to their original location, so they don't show up in my
> changes.
> > That could confuse things when comparing diffs.
> >
> > One idea would be to restore those elements back in their original
> > locations in your version, and re-generate your patch. I don't know
> if
> > that would be less work than just comparing raw files.
> >
> > Alternatively, if it would be easiest for those familiar with the
> > evolution of this fix to leave things where Vladimir had them, I can
> do
> > that.
> >
> > Thanks,
> > -Brent
> >
> > On 8/6/19 6:32 AM, Laurent Bourgès wrote:
> >  > Hi Brent,
> >  >
> >  > Thank you for sponsoring this patch.
> >  >
> >  > I tried to compare your webrev against my latest (diff patch
> > files) but
> >  > it gives me too many changes lines.
> >  >
> >  > Do you have another idea to see incremental changes only ?
> >  > (anyway I can compare raw files)
> >  >
> >  > Thanks,
> >  > Laurent
> >  >
> >  > Le lun. 5 août 2019 à 23:43, Brent Christian
> >  > <brent.christ...@oracle.com> a écrit :
> >  >
> >  > Hi,
> >  >
> >  > Please review Vladimir Yaroslavskiy's changes to
> DualPivotQuickSort
> >  > (seen earlier[1] on this alias).  I will be sponsoring this
> change.
> >  >
> >  > I have a webrev against jdk-jdk here:
> >  > http://cr.openjdk.java.net/~bchristi/8226297/webrev03-rfr/
> >  >
> >  > (Note that I did a little re-ordering, and removed some
> superfluous
> >  > spacing changes, in order to simplify the webrev.  I've also
> included
> >  > Vladimir's FailedFloat[2] test case.)
> >  >
> >  > Information about benchmarking the changes was posted[3] recently.
> >  > An automated test run passes cleanly.
> >  >
> >  > Thanks!
> >  > -Brent
> >  > --
> >  > 1.
> >  >
> >
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061363.html
> >  >
> >  > 2.
> >  >
> >
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061513.html
> >  >
> >  > 3.
> >  >
> >
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061553.html
> >  >
> >
> >
> >
> > --
> > Vladimir Yaroslavskiy
>


-- 
-- 
Laurent Bourgès


Re: RFR: 8226297: Dual-pivot quicksort improvements

2019-08-06 Thread Laurent Bourgès
Hi Brent,

Thank you for sponsoring this patch.

I tried to compare your webrev against my latest (diff patch files), but it
gives me too many changed lines.

Do you have another idea to see the incremental changes only?
(Anyway, I can compare raw files.)

Thanks,
Laurent

Le lun. 5 août 2019 à 23:43, Brent Christian  a
écrit :

> Hi,
>
> Please review Vladimir Yaroslavskiy's changes to DualPivotQuickSort
> (seen earlier[1] on this alias).  I will be sponsoring this change.
>
> I have a webrev against jdk-jdk here:
> http://cr.openjdk.java.net/~bchristi/8226297/webrev03-rfr/
>
> (Note that I did a little re-ordering, and removed some superfluous
> spacing changes, in order to simplify the webrev.  I've also included
> Vladimir's FailedFloat[2] test case.)
>
> Information about benchmarking the changes was posted[3] recently.
> An automated test run passes cleanly.
>
> Thanks!
> -Brent
> --
> 1.
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061363.html
>
> 2.
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061513.html
>
> 3.
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061553.html
>


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-31 Thread Laurent Bourgès
Hi Brent,

Le jeu. 1 août 2019 à 02:32, Brent Christian  a
écrit :

> Thanks, Laurent.  I can sponsor this fix, get a RFR thread going for
> JDK-8226297, keep webrevs updated as the review progresses, etc.
>

Excellent, I will let you and Vladimir work on this review.

FYI, I have implemented a DPQS variant (int[]) in the Marlin renderer that
sorts 2 arrays (values + indices). It could be generalized to any array type;
having the index array is very important in many algorithms...
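As an illustration of the idea (not the actual Marlin code; names are hypothetical), a two-array sort can mirror every move of the value array into the index array, here shown as a simplified insertion sort, which is also fast on almost-sorted inputs:

```java
public class TandemInsertionSort {
    // Insertion sort on x[lo..hi], mirroring every move into idx so that
    // idx[k] keeps tracking the original position of x[k].
    static void sort(int[] x, int[] idx, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int xi = x[i], ii = idx[i];
            int j = i - 1;
            while (j >= lo && x[j] > xi) {
                // shift both the value and its companion index
                x[j + 1] = x[j];
                idx[j + 1] = idx[j];
                j--;
            }
            x[j + 1] = xi;
            idx[j + 1] = ii;
        }
    }
}
```

The same mirroring can be applied inside a merge sort or quicksort; the point is that every swap/move is duplicated on the companion array.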


> However I've uncovered an issue with the fix that should be resolved
> before proceeding.
>
> Specifically, the new Arrays.COMMON_PARALLELISM static constant causes
> exceptions at startup under some circumstances:
>  * running with LANG=C on Linux[1]
>  * setting a property value with non-Ascii characters on Mac[2]
>
> java.util.Arrays is used a fair bit for String handling
> (java.lang.StringCoding, java.langStringLatin1, etc).  The problem is
> that early in startup, depending on the locale/language setup and/or
> property settings, java.util.Arrays can be called during initialization
> of system properties.
>
> During Arrays.<clinit>, COMMON_PARALLELISM is setup with a call to
> ForkJoinPool.getCommonPoolParallelism().  ForkJoinPool sets up
> some VarHandles, VarHandles leads to
> MethodHandleStatics.<clinit>, which reads some system properties.  But
> we're still initializing the system properties - 'props' is null, so NPE.
>

A chicken-and-egg problem. Maybe this field could be wrapped in a getter doing
lazy resolution...
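Such lazy resolution is typically done with the initialization-on-demand holder idiom; a hedged sketch (class names are hypothetical, not the proposed Arrays.java change):

```java
import java.util.concurrent.ForkJoinPool;

public class SortSupport {
    // The nested holder class is only initialized on first use of
    // commonParallelism(), so ForkJoinPool (and its VarHandle setup)
    // is never touched during early system-property initialization.
    private static final class ParallelismHolder {
        static final int COMMON_PARALLELISM =
                ForkJoinPool.getCommonPoolParallelism();
    }

    static int commonParallelism() {
        return ParallelismHolder.COMMON_PARALLELISM;
    }
}
```

The JVM guarantees the holder's static initializer runs exactly once, lazily and thread-safely, which is why this idiom avoids the startup ordering problem described above.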


> I think Arrays.java needs to remove the COMMON_PARALLELISM constant, and
> go back to making calls to ForkJoinPool.getCommonPoolParallelism().
>

I do not know why Vladimir changed that recently. Any good reason?


> I can do the update and post an updated webrev (unless further
> discussion is needed).
>

PS: I can help with benchmarking, as I developed a fair sort benchmark based
on JMH:
https://github.com/bourgesl/nearly-optimal-mergesort-code

JMH test code:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/src/main/java/edu/sorting/bench
I would be interested in improving the perf test code and possibly
integrating it into the OpenJDK JMH test suite... (later)

>
Goodbye & good luck,
Laurent


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-26 Thread Laurent Bourgès
Hi all,

I updated the DPQS patch based on the latest release from Vladimir
(2019.07.25):
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/dpqs-8226297.1/

"
Updated version of Arrays and DualPivotQuicksort files, the summary of
changes:

1. Corrected minimal threshold for parallelism argument (0 -> 1)
2. Added constant for common parallelism
3. Updated byte / char / short counting sort.
"

Incremental patch change:
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/diff_dpqs_1_vs_0.log

Cheers,
Laurent


Le mer. 17 juil. 2019 à 17:12, Laurent Bourgès 
a écrit :

> Hi,
>
> I integrated locally the complete DPQS 19.2 patch from Vladimir
> Yaroslavskiy with up-to-date jdk14 (client) repository:
> build OK and jtreg passed OK (java/util/Arrays).
>
> Here is the corresponding webrev for review:
> http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/webrev/
>
> Cheers,
> Laurent
>


Re: Benchmarking of new optimized Dual-Pivot Quicksort

2019-07-26 Thread Laurent Bourgès
Hi Vladimir,

First, thank you for these impressive results: 15% to 70% overall gains are
amazing and will make a big difference. Congratulations!

Could you publish the different test environments?
- HW: cpu (lscpu output)
- OS version
- JDK version + JVM settings

Ideally, you or I could write a shell script to collect the setup and write a
simple log file.
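A minimal sketch of such a collection script (file name and layout are my own assumptions):

```shell
#!/bin/sh
# Collect the benchmark environment (OS, CPU, JDK) into a small log file.
LOG=sort-bench-env.log
{
  echo "=== uname ==="
  uname -a
  echo "=== cpu ==="
  # lscpu is Linux-only; fall back gracefully elsewhere.
  if command -v lscpu >/dev/null 2>&1; then lscpu; else echo "lscpu not available"; fi
  echo "=== java ==="
  # java -version prints to stderr.
  if command -v java >/dev/null 2>&1; then java -version 2>&1; else echo "java not on PATH"; fi
} > "$LOG"
echo "environment written to $LOG"
```

Attaching that log alongside each result set would make the published numbers reproducible.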

PS: I suppose you tested DPQS only on Intel CPUs; could somebody test on PPC
or ARM CPUs as well?

Cheers,
Laurent

Le jeu. 25 juil. 2019 à 16:04, Vladimir Yaroslavskiy  a
écrit :

> Hi all,
>
> With the help of Laurent Bourges I ran benchmarks of the new optimized
> Dual-Pivot Quicksort
> and summarized the results. The tests were run for all types (int / long / ...
> / double), for both modes:
> sequential and parallel sorting, and for small, medium and large sizes. The
> comparison was done
> on several data patterns, such as: random, period, sorted, equal, stagger,
> shuffle.
>
> Here is the summary of total gains. The gain is defined as (jdk_time -
> dpqs_time) / jdk_time,
> where jdk_time is sorting time by the current jdk14 and dpqs_time is
> sorting time by new
> optimized Dual-Pivot Quicksort. To get stable and reproducible results,
> min time of many
> invocations is taken.
>
> You can find all detailed results there
> https://github.com/iaroslavski/sorting/tree/master/benchmarking/results
>
> Sources of benchmarking tests are there
> https://github.com/iaroslavski/sorting/tree/master/benchmarking/sources
>
> You
> can see less favorable performance for float / double types (parallel sorting).
> This behavior can be explained by an existing bug in the JDK, where NaNs and
> -0.0s are not processed properly. The new sorting has total losses for small
> float / double
> arrays, of a few percent only. For all other cases the new optimized sorting is
> the winner.
>
>
> 
> [int]
>
> sequential, Length = 150, Gain: 0.15
> sequential, Length = 3, Gain: 0.28
> sequential, Length = 100, Gain: 0.31
> parallel 8, Length = 150, Gain: 0.14
> parallel 8, Length = 3, Gain: 0.15
> parallel 8, Length = 100, Gain: 0.14
>
> [long]
> sequential, Length = 150, Gain: 0.12
> sequential, Length = 3, Gain: 0.23
> sequential, Length = 100, Gain: 0.32
> parallel 8, Length = 150, Gain: 0.11
> parallel 8, Length = 3, Gain: 0.16
> parallel 8, Length = 100, Gain: 0.24
> parallel 88 processors, Length = 100, Gain: 0.05
>
> [byte]
> sequential, Length = 150, Gain: 0.06
> sequential, Length = 3, Gain: 0.36
> sequential, Length = 100, Gain: 0.37
> parallel 8, Length = 150, Gain: 0.13
> parallel 8, Length = 3, Gain: 0.73
> parallel 8, Length = 100, Gain: 0.65
>
> [char]
> sequential, Length = 150, Gain: 0.10
> sequential, Length = 3, Gain: 0.00
> sequential, Length = 100, Gain: 0.17
> parallel 8, Length = 150, Gain: 0.10
> parallel 8, Length = 3, Gain: 0.33
> parallel 8, Length = 100, Gain: 0.62
>
> [short]
> sequential, Length = 150, Gain: 0.14
> sequential, Length = 3, Gain: 0.10
> sequential, Length = 100, Gain: 0.18
> parallel 8, Length = 150, Gain: 0.13
> parallel 8, Length = 3, Gain: 0.41
> parallel 8, Length = 100, Gain: 0.65
>
> [float]
> sequential, Length = 150, Gain: -0.02
> sequential, Length = 3, Gain: 0.21
> sequential, Length = 100, Gain: 0.24
> parallel 8, Length = 150, Gain: -0.05
> parallel 8, Length = 3, Gain: -0.04
> parallel 8, Length = 100, Gain: -0.12
>
> [double]
> sequential, Length = 150, Gain: -0.03
> sequential, Length = 3, Gain: 0.19
> sequential, Length = 100, Gain: 0.23
> parallel 8, Length = 150, Gain: -0.05
> parallel 8, Length = 3, Gain: 0.05
> parallel 8, Length = 100, Gain: 0.15
>
> Thank you,
> Vladimir
>
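For reference, the gain metric defined in the quoted message can be computed like this (a hypothetical helper, not part of the benchmark sources):

```java
public class SortGain {
    // Gain as defined above: (jdk_time - dpqs_time) / jdk_time.
    static double gain(double jdkTime, double dpqsTime) {
        return (jdkTime - dpqsTime) / jdkTime;
    }

    // Min of many invocations is taken to get stable, reproducible timings.
    static double minTime(double[] runTimes) {
        double min = Double.POSITIVE_INFINITY;
        for (double t : runTimes) {
            min = Math.min(min, t);
        }
        return min;
    }
}
```

For example, jdk_time = 10.0 and dpqs_time = 8.5 give a gain of 0.15, matching the 15% figures reported above.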


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-17 Thread Laurent Bourgès
Hi,

I integrated locally the complete DPQS 19.2 patch from Vladimir
Yaroslavskiy with up-to-date jdk14 (client) repository:
build OK and jtreg passed OK (java/util/Arrays).

Here is the corresponding webrev for review:
http://cr.openjdk.java.net/~lbourges/dpqs/dpqs-8226297/webrev/

Cheers,
Laurent

Le ven. 12 juil. 2019 à 09:34, Laurent Bourgès 
a écrit :

> Hi,
> I asked alan to update the webrev, but I can publish it myself...
>
> Will do it asap,
> Stay tuned,
> Laurent
>
> Le mar. 9 juil. 2019 à 22:19, Brent Christian 
> a écrit :
>
>> Hi,
>>
>> Is the webrev incomplete?  It only contains the changes for
>> DualPivotQuicksort.java, and not for Arrays.java, etc.
>>
>> Thanks,
>> -Brent
>>
>> On 6/18/19 2:37 PM, Vladimir Yaroslavskiy wrote:
>> > Hi team,
>> >
>> > Please review the final optimized version of Dual-Pivot Quicksort
>> (ver.19.1),
>> > see http://cr.openjdk.java.net/~alanb/8226297/0/webrev
>> > (link to Jira task https://bugs.openjdk.java.net/browse/JDK-8226297)
>> >
>> > The new version was published here more than one year ago (on Jan 19,
>> 2018).
>> > Since that time Dual-Pivot Quicksort has been significantly improved
>> > and this work was done in collaboration with Doug Lea and Laurent
>> Bourges.
>> >
>> > You can find summary of all changes below.
>> >
>> > DualPivotQuicksort.java
>> > --
>> > * The 1-st and 5-th candidates are taken as pivots
>> >instead of 2-nd and 4-th
>> > * Pivots are chosen with another step
>> > * Pivot candidates are sorted by combination of
>> >5-element network sorting and insertion sort
>> > * New backwards partitioning was introduced
>> > * Partitioning is related to the golden ratio
>> > * Quicksort tuning parameters were updated
>> > * Thresholds for byte / char / short sorting were updated
>> > * Additional steps for float / double were fully rewritten
>> > * Parallel sorting is based on combination of merge sort
>> >and Dual-Pivot Quicksort
>> > * Pair insertion sort was optimized and became a part
>> >of mixed insertion sort
>> > * Mixed insertion sort was introduced: combination
>> >of simple insertion sort, pin insertion sort and
>> >pair insertion sort
>> > * Simple insertion sort is invoked on tiny array
>> > * Pin insertion sort is started on small array
>> > * Pair insertion sort is continued on remain part
>> > * Merging of runs is invoked on each iteration of Quicksort
>> > * Merging of runs was fully rewritten
>> > * Optimal partitioning of merging is used
>> > * Not more than one copy of part to buffer during merging
>> > * Merging parameters were updated
>> > * Initial capacity of runs is based on size
>> > * Max number of runs is constant
>> > * Optimized version of heap sort was introduced
>> > * Heap sort is used as a guard against quadratic time
>> >(int / long / float / double)
>> > * Parallel merging of runs was also provided
>> > * Parallel sorting, heap sort and merging of runs
>> >are not used in byte / char / short sorting
>> > * Counting sort for byte / char / short were optimized
>> > * Counting sort is used as a guard against quadratic time
>> >(byte / char / short)
>> > * Code style and javadoc improvements
>> >
>> > Sorting.java
>> > --
>> > * New test cases were added
>> > * Test cases are invoked for both: sequential and
>> >parallel sorting
>> > * Check for Object sorting was added
>> > * New tests were added against heap sort
>> > * Added test case against bug in parallel merge
>> >sort on float / double data
>> >
>> > ParallelSorting.java
>> > --
>> > * This class was removed, because Sorting class
>> >covers all cases
>> >
>> > SortingHelper.java
>> > ----------
>> > * The helper class provides access to package-private
>> >    methods of the DualPivotQuicksort class during testing
>> >
>> > Arrays.java
>> > --
>> > * Calls of Dual-Pivot Quicksort were updated
>> > * Parallel sorting of primitives was switched to parallel
>> >Dual-Pivot Quicksort
>> > * Javadoc was updated, double spaces were removed
>> > * Format changes
>> >
>> > ArraysParallelSortHelpers.java
>> > --
>> > * Implementation of parallel sorting
>> >for primitives was removed
>> > * Javadoc was updated
>> >
>> > Thank you,
>> > Vladimir
>> >
>>
>
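To illustrate the counting-sort guard mentioned in the summary, a minimal sketch for byte arrays (simplified; not the actual DualPivotQuicksort code):

```java
public class ByteCountingSort {
    // O(n + 256) sort used as a guard against quadratic behavior on byte[].
    static void sort(byte[] a) {
        int[] count = new int[256];
        for (byte b : a) {
            count[b & 0xFF]++;            // histogram, indexed by unsigned value
        }
        int i = 0;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            for (int c = count[v & 0xFF]; c > 0; c--) {
                a[i++] = (byte) v;        // emit values in signed order
            }
        }
    }
}
```

Because the key range is fixed (256 values), the running time is linear in the array length regardless of the input distribution, which is exactly why it works as a worst-case guard.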

-- 
-- 
Laurent Bourgès


Re: The final optimized version of Dual-Pivot Quicksort (ver.19.1)

2019-07-12 Thread Laurent Bourgès
Hi,
I asked alan to update the webrev, but I can publish it myself...

Will do it asap,
Stay tuned,
Laurent

Le mar. 9 juil. 2019 à 22:19, Brent Christian 
a écrit :

> Hi,
>
> Is the webrev incomplete?  It only contains the changes for
> DualPivotQuicksort.java, and not for Arrays.java, etc.
>
> Thanks,
> -Brent
>
> On 6/18/19 2:37 PM, Vladimir Yaroslavskiy wrote:
> > Hi team,
> >
> > Please review the final optimized version of Dual-Pivot Quicksort
> (ver.19.1),
> > see http://cr.openjdk.java.net/~alanb/8226297/0/webrev
> > (link to Jira task https://bugs.openjdk.java.net/browse/JDK-8226297)
> >
> > The new version was published here more than one year ago (on Jan 19,
> 2018).
> > Since that time Dual-Pivot Quicksort has been significantly improved
> > and this work was done in collaboration with Doug Lea and Laurent
> Bourges.
> >
> > You can find summary of all changes below.
> >
> > DualPivotQuicksort.java
> > --
> > * The 1-st and 5-th candidates are taken as pivots
> >instead of 2-nd and 4-th
> > * Pivots are chosen with another step
> > * Pivot candidates are sorted by combination of
> >5-element network sorting and insertion sort
> > * New backwards partitioning was introduced
> > * Partitioning is related to the golden ratio
> > * Quicksort tuning parameters were updated
> > * Thresholds for byte / char / short sorting were updated
> > * Additional steps for float / double were fully rewritten
> > * Parallel sorting is based on combination of merge sort
> >and Dual-Pivot Quicksort
> > * Pair insertion sort was optimized and became a part
> >of mixed insertion sort
> > * Mixed insertion sort was introduced: combination
> >of simple insertion sort, pin insertion sort and
> >pair insertion sort
> > * Simple insertion sort is invoked on tiny array
> > * Pin insertion sort is started on small array
> > * Pair insertion sort is continued on remain part
> > * Merging of runs is invoked on each iteration of Quicksort
> > * Merging of runs was fully rewritten
> > * Optimal partitioning of merging is used
> > * Not more than one copy of part to buffer during merging
> > * Merging parameters were updated
> > * Initial capacity of runs is based on size
> > * Max number of runs is constant
> > * Optimized version of heap sort was introduced
> > * Heap sort is used as a guard against quadratic time
> >(int / long / float / double)
> > * Parallel merging of runs was also provided
> > * Parallel sorting, heap sort and merging of runs
> >are not used in byte / char / short sorting
> > * Counting sort for byte / char / short were optimized
> > * Counting sort is used as a guard against quadratic time
> >(byte / char / short)
> > * Code style and javadoc improvements
> >
> > Sorting.java
> > --
> > * New test cases were added
> > * Test cases are invoked for both: sequential and
> >parallel sorting
> > * Check for Object sorting was added
> > * New tests were added against heap sort
> > * Added test case against bug in parallel merge
> >sort on float / double data
> >
> > ParallelSorting.java
> > --
> > * This class was removed, because Sorting class
> >covers all cases
> >
> > SortingHelper.java
> > --
> > * The helper class provides access to package-private
> >    methods of the DualPivotQuicksort class during testing
> >
> > Arrays.java
> > --
> > * Calls of Dual-Pivot Quicksort were updated
> > * Parallel sorting of primitives was switched to parallel
> >Dual-Pivot Quicksort
> > * Javadoc was updated, double spaces were removed
> > * Format changes
> >
> > ArraysParallelSortHelpers.java
> > --
> > * Implementation of parallel sorting
> >for primitives was removed
> > * Javadoc was updated
> >
> > Thank you,
> > Vladimir
> >
>


Re: Extending Java Arrays/Collection Sort API

2018-12-05 Thread Laurent Bourgès
John,

Any feedback?
We could discuss this during the next OpenJDK workshop, but I would prefer
to go slowly but surely.

Laurent

Le jeu. 29 nov. 2018 à 17:52, Laurent Bourgès  a
écrit :

> Hi John & Brian,
>
> Thanks for the proposed approach; I agree we are in the design discussion.
> Adding such API enhancements will take time: I said JDK 13 so as not to hurry
> for 12 (CSR...).
>
> John, you wrote me a very detailed letter; I will try to be more concise
> and focus on the Arrays.sort() API.
>
> Le jeu. 29 nov. 2018 à 02:12, John Rose  a écrit :
>
>>
>> >
> > In this renderer, I needed a fast sort algorithm on small almost-sorted
> > arrays, but more importantly, one able to deal with 2 arrays: x values and
> > their corresponding edge indices (pointer-like). I derived my MergeSort
> class
> > to perform 2 swaps (x & edge ptr), as Arrays.sort() and TimSort did not
> > provide such an API.
>>
>> There are two distinct requirements here:  1. Selecting or tweaking a
>> sorting
>> algorithms for particular expected patterns in the data set
>> (almost-sorted).
>> 2. Sorting two arrays in tandem, with one array supplying the sort key,
>> and the other serving as a passively permuted payload ("PPP").
>>
>> Re 1: Another pattern sometimes expected is range-bounded values,
>> which may prompt consideration of bin-sort.  Perhaps unique values.
>> Permutation vectors have both properties.
>>
>
> Yes, the discussion has 2 points:
> - provide more sorting algorithms in the public API
> - provide new sort variants handling indices
>
> For example, Julia has a simple sort API:
> https://docs.julialang.org/en/v0.6.2/stdlib/sort/
> It provides several basic sort functions (isort, qsort, msort) but the
> general sort() accepts optional custom comparison (less) & transformation
> functions.
> Moreover, another sortperm() function that returns a permutation vector of
> indices, it accepts an optional custom comparison (less).
>
>
>
>>
>> Re 2: In some cases, it's not two arrays that need the work.  Sometimes
>> it is one array of keys, plus an index set; this can be reduced to two
>> arrays
>> one of which is an int-array carrying indexes, but in some proposals you
>> see
>> such an int-array appearing only as an output, with an implicit input of
>> the iota-vector for the other array's indexes.
>>
>> More deeply, sometimes the index array is concrete, but the array of keys
>> is virtualized.  Canonical example (for me) is a set of indexes into a
>> data
>> set such as a file.  (These indexes are dense for compression algorithms
>> like B-W, or could be sparse for lexical analysis of bulk text in situ.)
>> To
>> handle this properly, a good old index array is just fine, as long as the
>> sort algorithm *also* accepts an int-comparator.
>>
>> There is no strong need to generalize to three arrays in tandem, or to
>> have separate algorithms for treating additional arrays as containing
>> secondary keys instead of PPP.  This is because given a two-array tandem
>> sort with PPP, plus an int-array sort with int-comparator, you can compose
>> many other shapes at a modest footprint cost by adding two index vector
>> to control the various sorts.
>>
>
> So you propose 2 new variants:
> - sort(int[]a, int[] b, low, high) where b values are permuted
> - sort(int[]a, low, high, Comparator) where b values are permuted
>
> I proposed the 2 array variants: sort( int[] values, int[] indices, low,
> high) as it helps when the user wants both values and indices sorted
> (easier use), but this can be reduced to sort(int[] indices, low, high,
> comparator) as the user can use the permutation vector (indices) to reorder
> values.
>
> I would prefer adding the more general solution sort(int[] indices, low,
> high, comparator) first.
>
> However, having extracted the values is really helpful for performance
> (locality) and simplifies the sorting algorithms but costs more data swaps.
> For example in Marlin, my comparator would be:
> Comparator::compare(i,j) {
> return Integer.compareTo(edges[i].x, edges[j].x);
> }
> However, my edge data are packed in an Unsafe buffer so it is more:
> Comparator::compare(i,j) {
> return Integer.compareTo(Unsafe.getInt(addr0 + i), Unsafe.getInt(addr0
> + j) );
> }
> I wonder if such lookups may be costly (random access + overheads) in
> contrary to having the values given as an extra int[].
>
> If you want, I can prototype later both solutions...
>
>
>>
>> Simplified sketch of generalized tandem array sort:
>>

Re: Extending Java Arrays/Collection Sort API

2018-11-29 Thread Laurent Bourgès
Hi John & Brian,

Thanks for the proposed approach, I agree we are in the design discussion;
adding such API enhancements will take time, I said 13 to not hurry for 12
(CSR...).

John, you wrote me a very detailed letter, I will try to be more concise
and focus on Array.sort() API.

On Thu, Nov 29, 2018 at 02:12, John Rose wrote:

>
> >
> > In this renderer, I needed a fast sort algorithm on small almost sorted
> > arrays, but more important, to deal with 2 arrays: x values and their
> > corresponding edge indices (pointer like). I derived my MergeSort class
> to
> > perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
> > provide such API.
>
> There are two distinct requirements here:  1. Selecting or tweaking a
> sorting
> algorithms for particular expected patterns in the data set
> (almost-sorted).
> 2. Sorting two arrays in tandem, with one array supplying the sort key,
> and the other serving as a passively permuted payload ("PPP").
>
> Re 1: Another pattern sometimes expected is range-bounded values,
> which may prompt consideration of bin-sort.  Perhaps unique values.
> Permutation vectors have both properties.
>

Yes, the discussion has 2 points:
- provide more sorting algorithms in the public API
- provide new sort variants handling indices

For example, Julia has a simple sort API:
https://docs.julialang.org/en/v0.6.2/stdlib/sort/
It provides several basic sort functions (isort, qsort, msort) but the
general sort() accepts optional custom comparison (less) & transformation
functions.
Moreover, there is another function, sortperm(), that returns a permutation
vector of indices; it accepts an optional custom comparison (less).



>
> Re 2: In some cases, it's not two arrays that need the work.  Sometimes
> it is one array of keys, plus an index set; this can be reduced to two
> arrays
> one of which is an int-array carrying indexes, but in some proposals you
> see
> such an int-array appearing only as an output, with an implicit input of
> the iota-vector for the other array's indexes.
>
> More deeply, sometimes the index array is concrete, but the array of keys
> is virtualized.  Canonical example (for me) is a set of indexes into a data
> set such as a file.  (These indexes are dense for compression algorithms
> like B-W, or could be sparse for lexical analysis of bulk text in situ.)
> To
> handle this properly, a good old index array is just fine, as long as the
> sort algorithm *also* accepts an int-comparator.
>
> There is no strong need to generalize to three arrays in tandem, or to
> have separate algorithms for treating additional arrays as containing
> secondary keys instead of PPP.  This is because given a two-array tandem
> sort with PPP, plus an int-array sort with int-comparator, you can compose
> many other shapes at a modest footprint cost by adding two index vector
> to control the various sorts.
>

So you propose 2 new variants:
- sort(int[] a, int[] b, low, high), where b values are permuted in tandem
- sort(int[] a, low, high, Comparator), where a is sorted using the given comparator

I proposed the 2-array variant sort(int[] values, int[] indices, low,
high), as it helps when the user wants both values and indices sorted
(easier use); but this can be reduced to sort(int[] indices, low, high,
comparator), as the user can use the permutation vector (indices) to
reorder the values.

I would prefer adding the more general solution sort(int[] indices, low,
high, comparator) first.

However, having extracted the values is really helpful for performance
(locality) and simplifies the sorting algorithms but costs more data swaps.
For example in Marlin, my comparator would be:
Comparator::compare(i, j) {
return Integer.compare(edges[i].x, edges[j].x);
}
However, my edge data are packed in an Unsafe buffer, so it is more:
Comparator::compare(i, j) {
return Integer.compare(Unsafe.getInt(addr0 + i), Unsafe.getInt(addr0
+ j));
}
I wonder if such lookups may be costly (random access + overheads), as
opposed to having the values given as an extra int[].

If you want, I can prototype later both solutions...


>
> Simplified sketch of generalized tandem array sort:
>
> 1. Let L be the common length of the N arrays, A[N][L].
> 2. Create a new int index array P initialized to iota(L).
> 3. Create an int-comparator C over 0<=i,j<L that compares A[*][i] with A[*][j].
> 4. Sort P using C.  Now P is a permutation vector to be applied to each
> A[*].
> 5. Create a second index array Q such that Q[P[i]] = i for all i.  Q is
> P's inverse.
> 6. Tandem-sort each Q[*] (as a PPP) using a fresh copy of Q for each array.
>
> There is more than one way to do this; the above sketch is intended to be
> straightforward.  The temp. space required is two length-L int arrays.
>

I like your proposal, but I dislike object allocations, so I would prefer
providing result arrays as arguments if possible.
In the Marlin renderer, allocations are always done in its RendererContext +
XXXArrayCache (soft references) to ensure no GC overhead.
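For concreteness, John's permutation-vector sketch could look like this with today's API; this is an illustrative stand-in (class and method names are hypothetical), using boxed Integer indexes because the JDK has no primitive int sort with a comparator, and an extra output array instead of the in-place step 5/6 inverse:

```java
import java.util.Arrays;
import java.util.Comparator;

public class TandemSortSketch {

    // Steps 2-4 of the sketch: build P = iota(L), then sort it so that
    // keys[P[i]] is ascending.
    static int[] sortedPermutation(int[] keys) {
        Integer[] p = new Integer[keys.length];
        for (int i = 0; i < p.length; i++) p[i] = i;
        Arrays.sort(p, Comparator.comparingInt(i -> keys[i]));
        int[] perm = new int[keys.length];
        for (int i = 0; i < perm.length; i++) perm[i] = p[i];
        return perm;
    }

    // Apply the permutation to a payload array (the "PPP" role).
    static int[] permute(int[] a, int[] perm) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[perm[i]];
        return out;
    }

    public static void main(String[] args) {
        int[] x = {30, 10, 20};    // keys (e.g. edge x positions)
        int[] edge = {7, 8, 9};    // payload (e.g. edge indices)
        int[] perm = sortedPermutation(x);
        int[] xs = permute(x, perm);
        int[] es = permute(edge, perm);
        if (!Arrays.equals(xs, new int[]{10, 20, 30})) throw new AssertionError();
        if (!Arrays.equals(es, new int[]{8, 9, 7})) throw new AssertionError();
        System.out.println("ok");
    }
}
```

An allocation-free variant would take perm and out as caller-provided arguments, in the RendererContext style.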


>
> My point is that the "missing prim

Re: Extending Java Arrays/Collection Sort API

2018-11-28 Thread Laurent Bourgès
Hi John & Brian,

Thank you for your feedback; finally, we almost agree the Java Sort API
could be improved, possibly in JDK 13.

>
> Doug is right that there is an enormous potential list of sort methods,
> and we can't include them all.  But I do miss the ability to extract
> indexes instead of sorting the array itself.
>
If you read my first email 11.14, I asked whether you would agree to expose
a few other sort algorithms, to have a public sort API useful in special
cases (isort, radix, merge sorts) when needed to manage specific types of
datasets.
Never mind: third-party libraries can provide advanced / other algorithms.

Vladimir Yaroslavskiy made huge progress on his DPQS 18.11, and I hope it
will soon be the overall winner on almost all data distributions at all
scales, providing a better Arrays.sort() implementation.


> Or, slightly more generally, sorting an int[] or long[] array with a
> comparator.
> Sometimes you don't want an object per sorted entity; you want some kind
> of oracle function that can compare two indexes into a non-concrete
> (virtualized) set of values; the sort operation would output the reordered
> ints.
>
> Example:  Sorting indexes into a large text file according to some
> algorithm
> such as Burrows-Wheeler.  You don't want to make all the substrings and
> put them into a String[] array, and you don't even want to make
> CharSequence
> objects that view into the text file.  You just want indexes into the text
> file to
> be sorted.
>
> IMO, the very simplest answer with significantly better semantics is:
>
>   void sort(int[] a, int fromIndex, int toIndex, IntBinaryOperator cmp);
>
> And maybe add parallelSort, and use a new IntComparator instead of
> repurposing IntBinaryOperator.
>
> Extracting the permutation vector of indexes from an array sorting
> operation
> would then look like this:
>
> public static <T> int[] sortToIndexes(T[] a, Comparator<? super T> c) {
> int[] perm = new int[a.length];  // starts as an iota vector
> for (int i = 0; i < a.length; i++)  perm[i] = i;
> sort(perm, 0, a.length, (i, j) -> c.compare(a[i], a[j]));  // NEW
> PRIMITIVE
> return perm;
> }
>
> But there are other use cases for comparator-based sorting of indexes,
> which start not with an iota vector, but with an array of index-like values
> which may index into something other than an array (like a text file,
> which is why 'long[] a' might be useful too).
>
> In Valhalla many such use cases can be covered by using a flat array of
> values which encapsulate the integer (or long) indexes, and sorting that.
> The array of wrapped primitives will look like Object[] and so the existing
> API point with a comparator would allow the flat array of ints (dressed as
> value Objects) to be sorted according to any ad hoc ordering rule,
> including
> treating them as indexes.
>
> So maybe the answer is "wait for Valhalla".  Still, the lack of a
> comparator
> on primitive arrays is an unnecessary limitation, which excludes
> index-based
> computations from the portfolio of our sort methods.
>
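John's text-file example above can be sketched with today's boxed API: sorting suffix start indexes of a text, comparing characters in place without materializing substrings. The class name is illustrative only, and boxed Integer indexes stand in for the missing primitive sort-with-comparator:

```java
import java.util.Arrays;

public class SuffixIndexSort {

    // Sort suffix start indexes of text lexicographically; the comparator
    // reads chars directly from the text instead of building String objects.
    static Integer[] sortedSuffixIndexes(String text) {
        int n = text.length();
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) -> {
            int a = i, b = j;
            while (a < n && b < n) {
                int c = Character.compare(text.charAt(a), text.charAt(b));
                if (c != 0) return c;
                a++; b++;
            }
            return Integer.compare(n - i, n - j); // shorter suffix first
        });
        return idx;
    }

    public static void main(String[] args) {
        // Suffixes of "banana" in lexicographic order start at 5,3,1,0,4,2.
        Integer[] r = sortedSuffixIndexes("banana");
        if (!Arrays.equals(r, new Integer[]{5, 3, 1, 0, 4, 2})) throw new AssertionError();
        System.out.println("ok");
    }
}
```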

1. I have been waiting for value types for several years, as the Marlin
renderer already uses arrays of structs in Java, i.e. edge data packed in
int[] and Unsafe buffers, to avoid bounds checks (like C). It is a good
candidate to evaluate this new feature.
I can help here, but not alone.

2. John, your proposal matches mine:
"alternatively, it could be expressed as Comparator
Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
a using the given comparator. For example, a[] could be edge indices sorted
according to their edge x position...
This comparator is specific as it compares anything (external storage) at
given indices i,j."

Let's go for 13...

If I may insist a bit on the 2-array variant: Marlin needs the x array
sorted for future traversal...
I do not know which is faster: sorting 2 arrays (indices swapped according
to the x array order: 2x the swaps) or using a comparator (lookups to
compare (x[i], x[j]) + x array reconstruction).
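The "2x swaps" variant can be sketched as a minimal tandem insertion sort (illustrative only, not Marlin's actual code): every move applied to the key array is mirrored on the payload array.

```java
public class TandemInsertionSort {

    // Insertion sort of a[low..high) that keeps b[] permuted in lockstep.
    static void sort(int[] a, int[] b, int low, int high) {
        for (int i = low + 1; i < high; i++) {
            int ka = a[i], kb = b[i];
            int j = i - 1;
            while (j >= low && a[j] > ka) {
                a[j + 1] = a[j];
                b[j + 1] = b[j];   // payload moves in lockstep: the 2nd swap
                j--;
            }
            a[j + 1] = ka;
            b[j + 1] = kb;
        }
    }

    public static void main(String[] args) {
        int[] x = {20, 10, 30};   // keys (edge x positions)
        int[] e = {1, 2, 3};      // payload (edge indices)
        sort(x, e, 0, 3);
        if (x[0] != 10 || e[0] != 2) throw new AssertionError();
        System.out.println("ok");
    }
}
```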

Finally how can we manage such improvement in 13? CSR / RFE ?

John or Brian, could you lead that specification request (CSR) and formal
process ?
 I can propose implementations on my side; I already have a working DPQS
18.11 with 2 arrays on Marlin's GitHub.

Regards,
Laurent


Re: Extending Java Arrays/Collection Sort API

2018-11-23 Thread Laurent Bourgès
Hi,

I am happy to announce that I succeeded in writing my own BentleyBasher
working like a swiss clock:
- auto-tune benchmark passes & hot loop to obtain high accuracy on
measurements ~ 2% (guaranteed), with proper variance estimators
- test > 10 sorters with small, large & very large int[] arrays including
DPQS11, 18.2, 18.11 ... including my variants sorting 2 arrays altogether.

I observed DPQS 18.11 working very well...
we are currently fixing a performance issue on small almost-sorted arrays (<
1000).

>
My first results & all code (benchmark + all sorters) are available in this
github repo:
- EA results with stats:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
- source code:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/src/edu/sorting

It is my Sort Race 2018... in progress.

PS: plots & JMH cross-validation remains to be implemented.

Cheers,
Laurent


Re: Extending Java Arrays/Collection Sort API

2018-11-20 Thread Laurent Bourgès
Hi Doug,
That's a pleasure to read you.

On Tue, Nov 20, 2018 at 13:17, Doug Lea wrote:

> On 11/20/18 2:50 AM, Laurent Bourgès wrote:
> > Hi,
> > Any feedback on improving Java Sort API ?
>
> I think most considerations await progress on value types. I posted some
> ideas and links to code last summer on Valhalla list:
> http://mail.openjdk.java.net/pipermail/valhalla-dev/2018-August/thread.html


I already followed this thread, and I have been looking forward to value
types for a long time.
I looked at your CVSorter class and it looks promising; I can try adapting
it to my needs and compare its performance in my Sort Race 2018.

Even if Value types are coming, the Arrays/Collections sort() API will
remain and must still evolve to provide the best (overall) sort.
As said previously, adaptive algorithms already embed several Sort
algorithms (insertion sort, merge sort, dual pivot quick sort, radix sort)
so I proposed to wrap them as a public API to let the developer select the
best sort according to its needs or data sets.


> They didn't include APIs for sorting based on indexes, which I agree
> should be considered. However, until value types arrive, it would be
> hard to justify adding variants for each primitive type.
>

Thanks, you agree it is useful. For now I only focused on sort(int[] a,
int[] indices, low, high) for my own needs.
As Java arrays are still only int-indexed, I propose to support only
int[] indices.
I already implemented such method in my DualPivotQuicksort201802Ext (that
uses pre-allocation):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/edu/sorting/DualPivotQuicksort201802Ext.java

static void sort(int[] a, int[] b, int low, int high, int[] auxA, int[] auxB,
int[] run) { ... }
note: int[] auxA, int[] auxB, int[] run are given for pre-allocation
purposes
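The pre-allocation idea can be illustrated with a merge sort whose scratch buffer is owned by the caller and reused across calls; this is a sketch of the calling convention only, not the DualPivotQuicksort201802Ext code linked above:

```java
public class PreallocMergeSort {

    // Top-down merge sort of a[low..high); aux is caller-provided scratch
    // (aux.length must be >= high), so repeated sorts allocate nothing.
    static void sort(int[] a, int low, int high, int[] aux) {
        if (high - low < 2) return;
        int mid = (low + high) >>> 1;
        sort(a, low, mid, aux);
        sort(a, mid, high, aux);
        // stable merge of a[low..mid) and a[mid..high) through aux
        System.arraycopy(a, low, aux, low, high - low);
        int i = low, j = mid;
        for (int k = low; k < high; k++) {
            if (i < mid && (j >= high || aux[i] <= aux[j])) a[k] = aux[i++];
            else a[k] = aux[j++];
        }
    }

    public static void main(String[] args) {
        int[] aux = new int[8];          // allocated once, reused across calls
        int[] a = {5, 3, 4, 1};
        sort(a, 0, a.length, aux);
        if (!java.util.Arrays.equals(a, new int[]{1, 3, 4, 5})) throw new AssertionError();
        System.out.println("ok");
    }
}
```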


>
> Almost every sorting algorithm is best for some data types and inputs.
> And there are a lot of algorithms. Among the hard choices for the JDK is to
> pick a small set, only a few (stable vs. nonstable, parallelizable, etc.),
> that together provide good choices for most usages. Even then, some
> usages (possibly including the one in your initial post) might be better
> off creating a custom solution. (As is true for all JDK
> Collections/Array-related components.)
>

So is my proposal: Arrays/Collection sort() must be the overall winner (all
data distributions, all array lengths) ~ best compromise.

For particular needs, the JDK or a sort library could provide specific sort
implementations.
To that purpose, I have collected 15 sort (working) implementations:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/edu/sorting/perf/IntSorter.java

Best regards,
Laurent


> > What about an Arrays API extension:
> > - Arrays.sort(int[] a, int[] b, low, high) or
> > Indices int[] Arrays.sortIndices(int[] a, low high), a[] left
> untouched
> > - same with preallocated buffers & runs arrays
> > - alternatively, it could be expressed as Comparator
> > Arrays.sort(int[] a, low, high, Comparator cmp) that sorts
> > array a using the given comparator. For example, a[] could be edge
> > indices sorted according to their edge x position...
> > This comparator is specific as it compares anything (external
> > storage) at given indices i,j.
> >
>
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-20 Thread Laurent Bourgès
Hi Vladimir,

One short comment, see below:


On Thu, Nov 15, 2018 at 15:39, Vladimir Yaroslavskiy wrote:

> Hello, Tagir!
>
> I compared Radix sort with Dual-Pivot Quicksort and see
> that Radix sort is faster than DPQ (on 2M elements) on random data in
> twice times,
> but it works slower on repeated elements (in 3-7 times). For highly
> structured arrays merging sort is invoked in both cases. Other
> data types - almost the same.
>

Your test is not really fair: 2M ints ~ 8MB, so it is larger than a typical
CPU L3 cache.
Radix sort does random accesses and is known to be faster when arrays
fit in the L2/L3 caches.
If larger, then use radix on sub-parts + merge!
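For reference, the kind of radix sort under discussion can be sketched as a plain LSD radix sort for int[] (8-bit digits, four stable counting passes, sign bit flipped on the top byte); this is an illustrative implementation, not Vladimir's or Tagir's code:

```java
import java.util.Arrays;

public class RadixSortSketch {

    static void sort(int[] a) {
        int[] src = a, dst = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            boolean signPass = (shift == 24);   // top byte carries the sign
            int[] count = new int[257];
            for (int v : src) {
                int d = (v >>> shift) & 0xFF;
                if (signPass) d ^= 0x80;        // order negatives first
                count[d + 1]++;
            }
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];
            for (int v : src) {                 // stable scatter by digit
                int d = (v >>> shift) & 0xFF;
                if (signPass) d ^= 0x80;
                dst[count[d]++] = v;
            }
            int[] t = src; src = dst; dst = t;  // ping-pong buffers
        }
        // four passes (an even number): sorted data ends up back in a
    }

    public static void main(String[] args) {
        int[] a = {3, -1, 2, -5, 0};
        sort(a);
        if (!Arrays.equals(a, new int[]{-5, -1, 0, 2, 3})) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Each pass streams the whole array twice, which is where the cache-size argument above comes from.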


> Do we actually need to use RadixSort if it is not faster on all inputs?
>

I will do my own performance tests on int[] arrays with sizes in range [50
- 10] ...

Vladimir, I think Tagir proposed to tune his radix sort to obtain a hybrid
/ adaptive radix sort, as you did in your DPQS.
I can help here.

Cheers,
Laurent


What do you think?
>
> Regards,
> Vladimir
>
> Wednesday, November 14, 2018, 19:17 +03:00 from Tagir Valeev:
>
> Hello, Laurent, Vladimir!
>
> I created a pull request containing my RadixSort implementation:
> https://github.com/bourgesl/nearly-optimal-mergesort-code/pull/1
> On my machine the results produced by Mergesorts.java are like this:
>
> Runs with individual timing (skips first 10 runs):
> adjusted reps: 110 + inner loop: 1
> avg-ms=102.6177(+/- 7 %), algo=PeekSort+iscutoff=24+onlyIncRuns=false,
> n=100 (55000110) (n=100, µ=102.6177, σ=7.4065714)
> avg-ms=95.62607(+/- 4 %),
> algo=TopDownMergesort+iscutoff=24+checkSorted=true, n=100
> (55000110) (n=100, µ=95.62607, σ=3.5302947)
> avg-ms=118.73089(+/- 4 %),
> algo=BottomUpMergesort+minRunLen=24+checkSorted=true, n=100
> (55000110) (n=100, µ=118.73089, σ=4.464908)
> avg-ms=108.36175(+/- 4 %), algo=MarlinSort, n=100 (55000110)
> (n=100, µ=108.36175, σ=4.5998554)
> avg-ms=99.68292(+/- 4 %), algo=MarlinMergeSort, n=100
> (55000110) (n=100, µ=99.68292, σ=3.9944465)
> avg-ms=75.43999(+/- 3 %), algo=DualPivotQuicksort2018, n=100
> (55000110) (n=100, µ=75.43999, σ=2.6127808)
> avg-ms=83.80406(+/- 6 %), algo=DualPivotQuicksort2018Ext, n=100
>  (55000110) (n=100, µ=83.80406, σ=4.734225)
> avg-ms=18.886326(+/- 4 %), algo=RadixSort, n=100 (55000110)
> (n=100, µ=18.886326, σ=0.75667334)
>
> As you can see RadixSort is much faster than anything. Well, probably
> there are some input patterns where it may lose (though its
> performance is much more predictable and depends much less on the input
> data than DPQS), but I strongly believe that concentrating efforts
> on testing corner cases performance and integrating RadixSort into JDK
> would yield much better performance than DPQS, at least for int[]
> arrays. Also the implementation is much simpler.
>
> With best regards,
> Tagir Valeev.
> On Mon, Nov 12, 2018 at 5:09 PM Laurent Bourgès
>  wrote:
> >
> > Hi,
> >
> > Do you know if someone has written a complete JMH benchmark suite
> dedicated
> > to Arrays.sort() ?
> > with varying array size (trivial) but also testing lots of data
> > distributions: (see Vladimir's tests) and possibly all variants (int,
> long,
> > double, Object[] )
> >
> > It could be part of the standard OpenJDK JMH test suite...
> >
> > For now, I forked the nearly-optimal-mergesort repository on github:
> >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
> >
> > Cheers,
> > Laurent
> >
> > On Sat, Nov 10, 2018 at 12:49, Laurent Bourgès wrote:
> >
> > > Dear Vladimir & other Java sort experts,
> > >
> > > I made the port of the DPQS 2018.2 code last night to support a
> secondary
> > > array to be sorted and use preallocation (aux/run for merge sort):
> > >
> > >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
> > >
> > > I succeeded in using this variant in the Marlin renderer (dev) and it is
> > > promising. I figured out a rendering bug in 1 test with 65535 random
> > > segments, certainly due to array swaps in mergesort (my side)...
> > >
> > > I will look again at my changes to check if I miss some permutation (a
> //
> > > b) or made any mistake on their indices... tricky.
> > >
> > > PS: Please share your updated webrev when possible to rebase my
> changes.
> > >
> > > Regards,
> > > Laurent
> > >
> > > Le ven. 9 nov. 2018 à 10:08, Laurent Bourgè

Re: Extending Java Arrays/Collection Sort API

2018-11-19 Thread Laurent Bourgès
Hi,
Any feedback on improving Java Sort API ?

PS: I greatly improved the array benchmark accuracy (confidence interval ~
5%). Here are EA results:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/results/basher-results-partial.out

Does anybody want to help me on this topic ?
Do you have a JMH test suite dedicated to array sorting ?

Cheers,
Laurent

On Wed, Nov 14, 2018 at 09:01, Laurent Bourgès wrote:

> Hi,
>
> I am a scientist and the author of the Marlin renderer, integrated in
> OpenJDK.
>
> In this renderer, I needed a fast sort algorithm on small almost sorted
> arrays, but more important, to deal with 2 arrays: x values and their
> corresponding edge indices (pointer like). I derived my MergeSort class to
> perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
> provide such API.
>
> 1. I started discussing about extending the Java Sort API with Vladimir
> Yaroslavskiy, the author of the famous Java DualPivotQuickSort, as he is
> currently improving it.
>
> His coming DPQS 2018 class provides lot of sort algorithms, I propose to
> make them public API through a facade method in Arrays class:
> insertionSort(), heapSort(), mergeSort(), quickSort()... Of course, his
> current code is dedicated to DPQS, but his algorithm implementation could
> be generalized to propose the Best implementations as a public Java API
> (stable & unstable sorts)
>
> For example in Marlin, I need trivial insertionSort() in the Helpers
> class, my MergeSort could be replaced by another best implementation.
> Let's not reinvent the wheel...
>
> 2. New Sort API to deal with indices...
>
> For now, there is no means to obtain the indices sorted on another value,
> see my introduction: edge indices sorted by their x positions.
> In data science, this is very useful and general.
>
> What about an Arrays API extension:
> - Arrays.sort(int[] a, int[] b, low, high) or
> Indices int[] Arrays.sortIndices(int[] a, low high), a[] left untouched
> - same with preallocated buffers & runs arrays
> - alternatively, it could be expressed as Comparator
> Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
> a using the given comparator. For example, a[] could be edge indices sorted
> according to their edge x position...
> This comparator is specific as it compares anything (external storage) at
> given indices i,j.
>
> Finally Marlin renderer would benefit from Value types (array of structs),
> as edge data are currently packed in int[] arrays, but this new Sort API
> seems to me general & needed enough to be discussed here.
>
> Best regards,
> Laurent
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-15 Thread Laurent Bourgès
Hi all Sorting experts,

I want to let you know I am building a Sorting benchmark suite myself based
on contributions of Sebastian Wild (forked github repo) & Vladimir on
github (MIT license):
https://github.com/bourgesl/nearly-optimal-mergesort-code

I hope to exploit JMH in the near future to obtain a reliable test suite
over lots of dataset types, sorting implementations & array sizes.

Once stabilized, its results can be exploited to run the Java Sort Race
2018.12...

If some C.S. researchers are interested, data papers could rely on this
tool to compare (fairly) algorithms.

PS: Early results (unknown confidence yet):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/results/basher-results-partial.out

Best regards,
FOSS rocks !

😎
Laurent

On Thu, Nov 15, 2018 at 15:39, Vladimir Yaroslavskiy wrote:

> Hello, Tagir!
>
> I compared Radix sort with Dual-Pivot Quicksort and see
> that Radix sort is faster than DPQ (on 2M elements) on random data in
> twice times,
> but it works slower on repeated elements (in 3-7 times). For highly
> structured arrays merging sort is invoked in both cases. Other
> data types - almost the same.
>
> Do we actually need to use RadixSort if it is not faster on all inputs?
> What do you think?
>
> Regards,
> Vladimir
>
> Wednesday, November 14, 2018, 19:17 +03:00 from Tagir Valeev:
>
> Hello, Laurent, Vladimir!
>
> I created a pull request containing my RadixSort implementation:
> https://github.com/bourgesl/nearly-optimal-mergesort-code/pull/1
> On my machine the results produced by Mergesorts.java are like this:
>
> Runs with individual timing (skips first 10 runs):
> adjusted reps: 110 + inner loop: 1
> avg-ms=102.6177(+/- 7 %), algo=PeekSort+iscutoff=24+onlyIncRuns=false,
> n=100 (55000110) (n=100, µ=102.6177, σ=7.4065714)
> avg-ms=95.62607(+/- 4 %),
> algo=TopDownMergesort+iscutoff=24+checkSorted=true, n=100
> (55000110) (n=100, µ=95.62607, σ=3.5302947)
> avg-ms=118.73089(+/- 4 %),
> algo=BottomUpMergesort+minRunLen=24+checkSorted=true, n=100
> (55000110) (n=100, µ=118.73089, σ=4.464908)
> avg-ms=108.36175(+/- 4 %), algo=MarlinSort, n=100 (55000110)
> (n=100, µ=108.36175, σ=4.5998554)
> avg-ms=99.68292(+/- 4 %), algo=MarlinMergeSort, n=100
> (55000110) (n=100, µ=99.68292, σ=3.9944465)
> avg-ms=75.43999(+/- 3 %), algo=DualPivotQuicksort2018, n=100
> (55000110) (n=100, µ=75.43999, σ=2.6127808)
> avg-ms=83.80406(+/- 6 %), algo=DualPivotQuicksort2018Ext, n=100
>  (55000110) (n=100, µ=83.80406, σ=4.734225)
> avg-ms=18.886326(+/- 4 %), algo=RadixSort, n=100 (55000110)
> (n=100, µ=18.886326, σ=0.75667334)
>
> As you can see RadixSort is much faster than anything. Well, probably
> there are some input patterns where it may lose (though its
> performance is much more predictable and depends much less on the input
> data than DPQS), but I strongly believe that concentrating efforts
> on testing corner cases performance and integrating RadixSort into JDK
> would yield much better performance than DPQS, at least for int[]
> arrays. Also the implementation is much simpler.
>
> With best regards,
> Tagir Valeev.
> On Mon, Nov 12, 2018 at 5:09 PM Laurent Bourgès
>  wrote:
> >
> > Hi,
> >
> > Do you know if someone has written a complete JMH benchmark suite
> dedicated
> > to Arrays.sort() ?
> > with varying array size (trivial) but also testing lots of data
> > distributions: (see Vladimir's tests) and possibly all variants (int,
> long,
> > double, Object[] )
> >
> > It could be part of the standard OpenJDK JMH test suite...
> >
> > For now, I forked the nearly-optimal-mergesort repository on github:
> >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
> >
> > Cheers,
> > Laurent
> >
> > On Sat, Nov 10, 2018 at 12:49, Laurent Bourgès wrote:
> >
> > > Dear Vladimir & other Java sort experts,
> > >
> > > I made the port of the DPQS 2018.2 code last night to support a
> secondary
> > > array to be sorted and use preallocation (aux/run for merge sort):
> > >
> > >
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
> > >
> > > I succeeded in using this variant in the Marlin renderer (dev) and it is
> > > promising. I figured out a rendering bug in 1 test with 65535 random
> > > segments, certainly due to array swaps in mergesort (my side)...
> > >
> > > I will look again at my changes to check if I miss some permutation (a
> //
> > > b) or made any mistake on thei

Re:

2018-11-14 Thread Laurent Bourgès
Hi Vladimir,


>
>>
>> The test suite was described in this paper
>>
>> Jon L. Bentley, M. Douglas McILroy
>> “Engineering a Sort Function”, 1993
>>
>> I use Java version (a little bit extended), see attachment.
>> What you need is to specified sorting classes in IntSorter.java
>> and run BentleyBasher.
>>
>
> Thanks, what is the license of this code ? Files do not have any header...
>
> If you agree, I would like to host such test code on github, as an
> opensource repository.
>

What is your decision ?
 Can I put such precious test code on GitHub (MIT license) in my fork of
Sebastian Wild's nearly-optimal-mergesort ?

https://github.com/bourgesl/nearly-optimal-mergesort-code/

Cheers,
Laurent


Extending Java Arrays/Collection Sort API

2018-11-14 Thread Laurent Bourgès
Hi,

I am a scientist and the author of the Marlin renderer, integrated in
OpenJDK.

In this renderer, I needed a fast sort algorithm for small, almost sorted
arrays, but more importantly, one able to deal with 2 arrays: x values and
their corresponding edge indices (pointer-like). I derived my MergeSort
class to perform 2 swaps (x & edge ptr), as Arrays.sort() and TimSort did
not provide such an API.

1. I started discussing about extending the Java Sort API with Vladimir
Yaroslavskiy, the author of the famous Java DualPivotQuickSort, as he is
currently improving it.

His upcoming DPQS 2018 class provides lots of sort algorithms; I propose to
make them public API through facade methods in the Arrays class:
insertionSort(), heapSort(), mergeSort(), quickSort()... Of course, his
current code is dedicated to DPQS, but his algorithm implementations could
be generalized to offer the best implementations as a public Java API
(stable & unstable sorts)

For example in Marlin, I need trivial insertionSort() in the Helpers class,
my MergeSort could be replaced by another best implementation.
Let's not reinvent the wheel...

2. New Sort API to deal with indices...

For now, there is no means to obtain the indices sorted on another value;
see my introduction: edge indices sorted by their x positions.
In data science, this is very useful and general.

What about an Arrays API extension:
- Arrays.sort(int[] a, int[] b, low, high) or
Indices int[] Arrays.sortIndices(int[] a, low high), a[] left untouched
- same with preallocated buffers & runs arrays
- alternatively, it could be expressed as Comparator
Arrays.sort(int[] a, low, high, Comparator cmp) that sorts array
a using the given comparator. For example, a[] could be edge indices sorted
according to their edge x position...
This comparator is specific as it compares anything (external storage) at
given indices i,j.

Finally Marlin renderer would benefit from Value types (array of structs),
as edge data are currently packed in int[] arrays, but this new Sort API
seems to me general & needed enough to be discussed here.

Best regards,
Laurent


Re:

2018-11-12 Thread Laurent Bourgès
Hi Vladimir,

Excellent, I will try your test code asap.

On Mon, 12 Nov 2018 at 12:25, Vladimir Yaroslavskiy wrote:

> Hi Laurent,
>
> The test suite was described in this paper
>
> Jon L. Bentley, M. Douglas McILroy
> “Engineering a Sort Function”, 1993
>
> I use Java version (a little bit extended), see attachment.
> What you need is to specify sorting classes in IntSorter.java
> and run BentleyBasher.
>

Thanks. What is the license of this code? The files do not have any header...

If you agree, I would like to host such test code on GitHub, as an
open-source repository.

>
> Please let me know if you have questions/comments/improvements.
>
Sure, I will do,
Thanks a lot.

Laurent

> Monday, 12 November 2018, 14:01 +03:00 from Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Dear Vladimir,
>
> No information about JMH benchmarking.
>
>
> I like it as Alexey made the best framework to benchmark short operations,
> like sorting small arrays.
>
> For example, 1000 ints ~ 0.05ms on my i7 laptop at fixed freq = 2ghz.
>
>
> I use Bentley's test suite to compare algorithms.
>
>
> Is it publicly available ? If yes, where ?
>
> Cheers,
> Laurent
>
>
>
> Monday, 12 November 2018, 13:06 +03:00 from Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> Do you know if someone has written a complete JMH benchmark suite
> dedicated to Arrays.sort() ?
> with varying array size (trivial) but also testing lots of data
> distributions: (see Vladimir's tests) and possibly all variants (int, long,
> double, Object[] )
>
> It could be part of the standard OpenJDK JMH test suite...
>
> For now, I forked the nearly-optimal-mergesort repository on github:
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
>
> Cheers,
> Laurent
>
>
>
> --
> Vladimir Yaroslavskiy
>


Re: Re[4]: The new optimized version of Dual-Pivot Quicksort

2018-11-12 Thread Laurent Bourgès
Dear Vladimir,

> No information about JMH benchmarking.

I like it as Alexey made the best framework to benchmark short operations,
like sorting small arrays.

For example, 1000 ints take ~0.05 ms on my i7 laptop at a fixed frequency of 2 GHz.

> I use Bentley's test suite to compare algorithms.

Is it publicly available? If so, where?

Cheers,
Laurent


> Monday, 12 November 2018, 13:06 +03:00 from Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> Do you know if someone has written a complete JMH benchmark suite
> dedicated to Arrays.sort() ?
> with varying array size (trivial) but also testing lots of data
> distributions: (see Vladimir's tests) and possibly all variants (int, long,
> double, Object[] )
>
> It could be part of the standard OpenJDK JMH test suite...
>
> For now, I forked the nearly-optimal-mergesort repository on github:
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results
>
> Cheers,
> Laurent
>
>


Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-12 Thread Laurent Bourgès
Hi,

Do you know if someone has written a complete JMH benchmark suite dedicated
to Arrays.sort()? It would vary the array size (trivial) but also test lots
of data distributions (see Vladimir's tests) and possibly all variants (int,
long, double, Object[]).

It could be part of the standard OpenJDK JMH test suite...

For now, I forked the nearly-optimal-mergesort repository on github:
https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/results

Cheers,
Laurent
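Generators for a few of the data distributions mentioned above might look like this (hypothetical helpers; Bentley's suite covers many more input shapes and modifications):

```java
import java.util.Random;

final class SortInputs {
    /** Uniformly random values. */
    static int[] random(int n, long seed) {
        Random r = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = r.nextInt();
        return a;
    }

    /** Sorted input perturbed by a few random swaps. */
    static int[] nearlySorted(int n, int swaps, long seed) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        Random r = new Random(seed);
        for (int s = 0; s < swaps; s++) {
            int i = r.nextInt(n), j = r.nextInt(n);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
        return a;
    }

    /** Periodic input: 0, 1, ..., period-1, 0, 1, ... */
    static int[] periodic(int n, int period) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i % period;
        return a;
    }
}
```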

On Sat, 10 Nov 2018 at 12:49, Laurent Bourgès wrote:

> Dear Vladimir & other Java sort experts,
>
> I made the port of the DPQS 2018.2 code last night to support a secondary
> array to be sorted and use preallocation (aux/run for merge sort):
>
> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java
>
> I succeeded in using this variant in the Marlin renderer (dev) and it is
> promising. I figured out a rendering bug in one test with 65535 random
> segments, certainly due to array swaps in mergesort (my side)...
>
> I will look again at my changes to check if I miss some permutation (a //
> b) or made any mistake on their indices... tricky.
>
> PS: Please share your updated webrev when possible to rebase my changes.
>
> Regards,
> Laurent
>
> On Fri, 9 Nov 2018 at 10:08, Laurent Bourgès wrote:
>
>> Hi Vladimir,
>>
>> Thank you for your attention, you are the Sort Master.
>>
>>
>>> On Fri, 9 Nov 2018 at 09:02, Vladimir Yaroslavskiy wrote:
>>
>>> Hi Laurent,
>>>
>>> The new version is still under review, there were a lot of suggestions
>>> and ideas from Doug Lea.
>>> I needed time to apply and check them. I'm going to send final version
>>> in few days.
>>>
>>
>> Excellent news, so it will ship in jdk12...
>>
>>>
>>> As to new method sort(a[], low, high, indices): do you mean this method
>>> in Arrays class?
>>> Are you going to extend Arrays API?
>>>
>>
>> I benchmarked my MergeSort and adding extra array implies many more
>> moves: thresholds should be adjusted and my sort is sometimes better as it
>> makes half moves (a <-> buffer swaps), so this complementary sort should
>> have its own tuned implementation (tests & benchmarks).
>>
>> I coded my sort as such need did not match any Arrays.sort() methods.
>> Having it publicly in jdk.base would be great for any other data handling
>> algorithm (science, AI ?)
>>
>> If you agree it is useful, I could try proposing an Arrays/Collection API
>> extension like :
>> sort([], low, high, int[]indices)
>>
>>
>>> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
>>> class (version is under review)
>>> has extra parameter - parallel context (Sorter sorter) which has
>>> required workbase[].
>>> If we run sorting sequentially, sorter parameter is null (no any
>>> objects) and temporary buffer
>>> will be created inside in tryMerge method if we go ahead with merging.
>>>
>>
>> I will have a look. For Marlin, parallel sort is not possible as
>> rendering shapes can be parallelized in the application (geoserver map
>> rendering).
>>
>>
>>> I will send new sources in few days and add you in cc, so you can look
>>> at it
>>> and find if new methods are acceptable for you. Then we can discuss
>>> required changes (if any).
>>>
>>
>> Perfect, I started adapting your code and will share soon the link to my
>> github repo.
>>
>>
>>> Thank you,
>>> Vladimir
>>>
>>
>> Thank you very much for your first thoughts,
>> Laurent
>>
>>
>>> Friday, 9 November 2018, 10:44 +03:00 from Laurent Bourgès <
>>> bourges.laur...@gmail.com>:
>>>
>>> Hi,
>>>
>>> I am currently testing many sort algorithms to improve the Marlin
>>> renderer (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
>>> improving for OpenJDK12 .
>>>
>>> I created my MergeSort (top down, check for sorted parts, array / buffer
>>> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
>>> crossing + edge indices.
>>>  It is critical as edge crossings are sorted at every scanline, data are
>>> almost sorted/reversed, not really random.
>>>
>>> 3 questions:
>>> - why is this patch dormant ? I checked in openjdk12 repo and your
>>> changes were not integrated.

Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-10 Thread Laurent Bourgès
Dear Vladimir & other Java sort experts,

I made the port of the DPQS 2018.2 code last night to support a secondary
array to be sorted and use preallocation (aux/run for merge sort):
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/src/wildinter/net/mergesort/DualPivotQuicksort2018Ext.java

I succeeded in using this variant in the Marlin renderer (dev) and it is
promising. I figured out a rendering bug in one test with 65535 random
segments, certainly due to array swaps in mergesort (my side)...

I will look again at my changes to check if I missed some permutation (a //
b) or made any mistake with their indices... tricky.

PS: Please share your updated webrev when possible to rebase my changes.

Regards,
Laurent

On Fri, 9 Nov 2018 at 10:08, Laurent Bourgès wrote:

> Hi Vladimir,
>
> Thank you for your attention, you are the Sort Master.
>
>
> On Fri, 9 Nov 2018 at 09:02, Vladimir Yaroslavskiy wrote:
>
>> Hi Laurent,
>>
>> The new version is still under review, there were a lot of suggestions
>> and ideas from Doug Lea.
>> I needed time to apply and check them. I'm going to send final version in
>> few days.
>>
>
> Excellent news, so it will ship in jdk12...
>
>>
>> As to new method sort(a[], low, high, indices): do you mean this method
>> in Arrays class?
>> Are you going to extend Arrays API?
>>
>
> I benchmarked my MergeSort and adding extra array implies many more moves:
> thresholds should be adjusted and my sort is sometimes better as it makes
> half moves (a <-> buffer swaps), so this complementary sort should have its
> own tuned implementation (tests & benchmarks).
>
> I coded my sort as such need did not match any Arrays.sort() methods.
> Having it publicly in jdk.base would be great for any other data handling
> algorithm (science, AI ?)
>
> If you agree it is useful, I could try proposing an Arrays/Collection API
> extension like :
> sort([], low, high, int[]indices)
>
>
>> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
>> class (version is under review)
>> has extra parameter - parallel context (Sorter sorter) which has required
>> workbase[].
>> If we run sorting sequentially, sorter parameter is null (no any objects)
>> and temporary buffer
>> will be created inside in tryMerge method if we go ahead with merging.
>>
>
> I will have a look. For Marlin, parallel sort is not possible as rendering
> shapes can be parallelized in the application (geoserver map rendering).
>
>
>> I will send new sources in few days and add you in cc, so you can look at
>> it
>> and find if new methods are acceptable for you. Then we can discuss
>> required changes (if any).
>>
>
> Perfect, I started adapting your code and will share soon the link to my
> github repo.
>
>
>> Thank you,
>> Vladimir
>>
>
> Thank you very much for your first thoughts,
> Laurent
>
>
>> Friday, 9 November 2018, 10:44 +03:00 from Laurent Bourgès <
>> bourges.laur...@gmail.com>:
>>
>> Hi,
>>
>> I am currently testing many sort algorithms to improve the Marlin
>> renderer (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
>> improving for OpenJDK12 .
>>
>> I created my MergeSort (top down, check for sorted parts, array / buffer
>> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
>> crossing + edge indices.
>>  It is critical as edge crossings are sorted at every scanline, data are
>> almost sorted/reversed, not really random.
>>
>> 3 questions:
>> - why is this patch dormant ? I checked in openjdk12 repo and your
>> changes were not integrated.
>> - I need a sort() method that works with 2 arrays: data + indices
>> (pointer like). Such sorted indices are useful to use for lookups c[idx[k]]
>> or to perform other array permutations...
>> Would you agree adding such new sort(a[], low, high, indices)
>> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
>> impl do not accept parameters to give any buffer[] and avoid allocations.
>> Would you agree adding such optional parameters (like workbase[]) ?
>>
>> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
>> perform array sort race & rendering benchmarks.
>>
>> Cheers,
>> Laurent
>>
>> On Fri, 19 Jan 2018 at 14:38, Vladimir Yaroslavskiy wrote:
>>
>> Hi team,
>>
>> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
>> alg

Re: Re[2]: The new optimized version of Dual-Pivot Quicksort

2018-11-09 Thread Laurent Bourgès
Hi Vladimir,

Thank you for your attention, you are the Sort Master.


On Fri, 9 Nov 2018 at 09:02, Vladimir Yaroslavskiy wrote:

> Hi Laurent,
>
> The new version is still under review, there were a lot of suggestions and
> ideas from Doug Lea.
> I needed time to apply and check them. I'm going to send final version in
> few days.
>

Excellent news, so it will ship in jdk12...

>
> As to new method sort(a[], low, high, indices): do you mean this method in
> Arrays class?
> Are you going to extend Arrays API?
>

I benchmarked my MergeSort, and adding an extra array implies many more
moves: thresholds should be adjusted, and my sort is sometimes better as it
makes half the moves (a <-> buffer swaps), so this complementary sort should
have its own tuned implementation (tests & benchmarks).

I coded my own sort because the requirement did not match any Arrays.sort()
method. Having it publicly in java.base would be great for any other
data-handling algorithm (science, AI?).

If you agree it is useful, I could try proposing an Arrays/Collections API
extension like:
sort(a[], low, high, int[] indices)


> About new parameters workbase[]: method sort(...) in DualPivotQuicksort
> class (version is under review)
> has extra parameter - parallel context (Sorter sorter) which has required
> workbase[].
> If we run sorting sequentially, the sorter parameter is null (no extra
> objects) and a temporary buffer
> will be created inside the tryMerge method if we go ahead with merging.
>

I will have a look. For Marlin, parallel sort is not possible as rendering
shapes can be parallelized in the application (geoserver map rendering).


> I will send new sources in few days and add you in cc, so you can look at
> it
> and find if new methods are acceptable for you. Then we can discuss
> required changes (if any).
>

Perfect, I started adapting your code and will soon share the link to my
GitHub repo.


> Thank you,
> Vladimir
>

Thank you very much for your first thoughts,
Laurent


> Friday, 9 November 2018, 10:44 +03:00 from Laurent Bourgès <
> bourges.laur...@gmail.com>:
>
> Hi,
>
> I am currently testing many sort algorithms to improve the Marlin renderer
> (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
> improving for OpenJDK12 .
>
> I created my MergeSort (top down, check for sorted parts, array / buffer
> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
> crossing + edge indices.
>  It is critical as edge crossings are sorted at every scanline, data are
> almost sorted/reversed, not really random.
>
> 3 questions:
> - why is this patch dormant ? I checked in openjdk12 repo and your changes
> were not integrated.
> - I need a sort() method that works with 2 arrays: data + indices (pointer
> like). Such sorted indices are useful to use for lookups c[idx[k]] or to
> perform other array permutations...
> Would you agree adding such new sort(a[], low, high, indices)
> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
> impl do not accept parameters to give any buffer[] and avoid allocations.
> Would you agree adding such optional parameters (like workbase[]) ?
>
> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
> perform array sort race & rendering benchmarks.
>
> Cheers,
> Laurent
>
> On Fri, 19 Jan 2018 at 14:38, Vladimir Yaroslavskiy wrote:
>
> Hi team,
>
> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
> I suggested several improvements of Dual-Pivot Quicksort, which
> were integrated into JDK 8.
>
> Now I have more optimized and faster version of Dual-Pivot Quicksort.
> I have been working on it for the last 5 years. Please, find the
> summary of changes below and sources / diff at webrev [1].
>
> All tests and benchmarking were run on the most recent build of JDK 10,
> jdk-10-ea+39. The new version shows the better performance on different
> inputs and guarantees n*log(n) on any data.
>
> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
> it contains one code for both parallel and sequential sorting algorithms.
>
> Suggested version is 10-20% faster on random data, 1.5-4 times faster
> on nearly structured arrays, 1.5-2 times faster on period inputs.
>
> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
> algorithm based on merge sort.
>
> Benchmarking on the test suite, suggested by Jon Bentley, shows the
> boost of performance in 1.4 times. This test suite contains several
> types of inputs, such as random data, nearly structured a

Re: The new optimized version of Dual-Pivot Quicksort

2018-11-09 Thread Laurent Bourgès
Hi,

One more thing: if you have any advice on my code, datasets, or algorithmic
approach, please give me your feedback/comments.

Here is the Marlin MergeSort:
http://hg.openjdk.java.net/jdk/jdk/file/190b77982361/src/java.desktop/share/classes/sun/java2d/marlin/MergeSort.java

Laurent

On Fri, 9 Nov 2018 at 08:44, Laurent Bourgès wrote:

> Hi,
>
> I am currently testing many sort algorithms to improve the Marlin renderer
> (AA 2D shape rasterizer), I integrated since OpenJDK9 and am still
> improving for OpenJDK12 .
>
> I created my MergeSort (top down, check for sorted parts, array / buffer
> swap to minimize moves, isort on small sub arrays) that sort 2 int[]:
> crossing + edge indices.
>  It is critical as edge crossings are sorted at every scanline, data are
> almost sorted/reversed, not really random.
>
> 3 questions:
> - why is this patch dormant ? I checked in openjdk12 repo and your changes
> were not integrated.
> - I need a sort() method that works with 2 arrays: data + indices (pointer
> like). Such sorted indices are useful to use for lookups c[idx[k]] or to
> perform other array permutations...
> Would you agree adding such new sort(a[], low, high, indices)
> - Marlin uses a ThreadLocal context to avoid any allocation. JDK sort()
> impl do not accept parameters to give any buffer[] and avoid allocations.
> Would you agree adding such optional parameters (like workbase[]) ?
>
> I will experiment adapting the DualPivotQuickSort in Marlin renderer and
> perform array sort race & rendering benchmarks.
>
> Cheers,
> Laurent
>
> On Fri, 19 Jan 2018 at 14:38, Vladimir Yaroslavskiy wrote:
>
>> Hi team,
>>
>> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
>> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
>> I suggested several improvements of Dual-Pivot Quicksort, which
>> were integrated into JDK 8.
>>
>> Now I have more optimized and faster version of Dual-Pivot Quicksort.
>> I have been working on it for the last 5 years. Please, find the
>> summary of changes below and sources / diff at webrev [1].
>>
>> All tests and benchmarking were run on the most recent build of JDK 10,
>> jdk-10-ea+39. The new version shows the better performance on different
>> inputs and guarantees n*log(n) on any data.
>>
>> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
>> it contains one code for both parallel and sequential sorting algorithms.
>>
>> Suggested version is 10-20% faster on random data, 1.5-4 times faster
>> on nearly structured arrays, 1.5-2 times faster on period inputs.
>>
>> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
>> algorithm based on merge sort.
>>
>> Benchmarking on the test suite, suggested by Jon Bentley, shows the
>> boost of performance in 1.4 times. This test suite contains several
>> types of inputs, such as random data, nearly structured arrays, data
>> with period and so on. Also there are several modifications of inputs
>> and parameters which are used in data building. We run sorting on all
>> combinations to compare two algorithms.
>>
>> Please let me know if you have any questions / comments.
>>
>> Summary of changes:
>>
>> DualPivotQuicksort class
>> ------------------------
>> * Pivots are chosen with another step, the 1-st and 5-th candidates
>>are taken as pivots instead of 2-nd and 4-th.
>> * Splitting into parts is related to the golden ratio
>> * Pivot candidates are sorted by combination of 5-element
>>network sorting + insertion sort
>> * New backwards partitioning is simpler and more efficient
>> * Quicksort tuning parameters were updated
>> * Merging sort is invoked on each iteration from Quicksort
>> * Additional steps for float/double were fully rewritten
>> * Heap sort is invoked on the leftmost part
>> * Heap sort is used as a guard against quadratic time
>> * Parallel sorting is based on Dual-Pivot Quicksort
>>instead of merge sort
>>
>> SortingAlgorithms class
>> ---
>> * Merging sort and pair insertion sort were moved from
>>DualPivotQuicksort class
>> * Pair insertion sort was simplified and optimized
>> * New nano insertion sort was introduced for tiny arrays
>> * Merging sort was fully rewritten
>> * Optimized merging partitioning is used
>> * Merging parameters were updated
>> * Merging of runs was fully rewritten
>> * Fast version of heap sort was introduced
>> * Parallel merging sort was also provided
>>
>> Sorting / ParallelSorting 

Re: The new optimized version of Dual-Pivot Quicksort

2018-11-08 Thread Laurent Bourgès
Hi,

I am currently testing many sort algorithms to improve the Marlin renderer
(the AA 2D shape rasterizer) that I integrated in OpenJDK 9 and am still
improving for OpenJDK 12.

I created my MergeSort (top-down, checks for sorted parts, array/buffer
swaps to minimize moves, insertion sort on small sub-arrays) to sort 2
int[]: crossings + edge indices.
It is critical, as edge crossings are sorted at every scanline and the data
are almost sorted/reversed, not really random.

3 questions:
- Why is this patch dormant? I checked the openjdk12 repo and your changes
were not integrated.
- I need a sort() method that works with 2 arrays: data + indices
(pointer-like). Such sorted indices are useful for lookups c[idx[k]] or to
perform other array permutations...
Would you agree to adding such a new sort(a[], low, high, indices)?
- Marlin uses a ThreadLocal context to avoid any allocation. The JDK sort()
implementations do not accept parameters to supply a buffer[] and avoid
allocations. Would you agree to adding such optional parameters (like
workbase[])?
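The ThreadLocal recycling pattern behind the third question can be sketched as follows (illustration only; SortContext is a hypothetical name, not Marlin's actual code):

```java
final class SortContext {
    private static final ThreadLocal<SortContext> CTX =
            ThreadLocal.withInitial(SortContext::new);

    private int[] buffer = new int[0];

    /** Returns a per-thread work buffer of at least minLength elements. */
    static int[] workBuffer(int minLength) {
        SortContext ctx = CTX.get();
        if (ctx.buffer.length < minLength) {
            ctx.buffer = new int[minLength]; // grow once, then recycle
        }
        return ctx.buffer;
    }
}
```

A sort accepting an optional workbase[] parameter could take such a buffer instead of allocating internally.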

I will experiment adapting the DualPivotQuickSort in Marlin renderer and
perform array sort race & rendering benchmarks.

Cheers,
Laurent

On Fri, 19 Jan 2018 at 14:38, Vladimir Yaroslavskiy wrote:

> Hi team,
>
> In Sept 2009 Josh Bloch, Jon Bentley and I introduced new sorting
> algorithm, Dual-Pivot Quicksort, for primitives in JDK 7 and later
> I suggested several improvements of Dual-Pivot Quicksort, which
> were integrated into JDK 8.
>
> Now I have more optimized and faster version of Dual-Pivot Quicksort.
> I have been working on it for the last 5 years. Please, find the
> summary of changes below and sources / diff at webrev [1].
>
> All tests and benchmarking were run on the most recent build of JDK 10,
> jdk-10-ea+39. The new version shows the better performance on different
> inputs and guarantees n*log(n) on any data.
>
> The new implementation of Dual-Pivot Quicksort is 'all-in-one' version:
> it contains one code for both parallel and sequential sorting algorithms.
>
> Suggested version is 10-20% faster on random data, 1.5-4 times faster
> on nearly structured arrays, 1.5-2 times faster on period inputs.
>
> Parallel Dual-Pivot Quicksort is 1.5-3 times faster than current
> algorithm based on merge sort.
>
> Benchmarking on the test suite, suggested by Jon Bentley, shows the
> boost of performance in 1.4 times. This test suite contains several
> types of inputs, such as random data, nearly structured arrays, data
> with period and so on. Also there are several modifications of inputs
> and parameters which are used in data building. We run sorting on all
> combinations to compare two algorithms.
>
> Please let me know if you have any questions / comments.
>
> Summary of changes:
>
> DualPivotQuicksort class
> ------------------------
> * Pivots are chosen with another step, the 1-st and 5-th candidates
>are taken as pivots instead of 2-nd and 4-th.
> * Splitting into parts is related to the golden ratio
> * Pivot candidates are sorted by combination of 5-element
>network sorting + insertion sort
> * New backwards partitioning is simpler and more efficient
> * Quicksort tuning parameters were updated
> * Merging sort is invoked on each iteration from Quicksort
> * Additional steps for float/double were fully rewritten
> * Heap sort is invoked on the leftmost part
> * Heap sort is used as a guard against quadratic time
> * Parallel sorting is based on Dual-Pivot Quicksort
>instead of merge sort
>
> SortingAlgorithms class
> ---
> * Merging sort and pair insertion sort were moved from
>DualPivotQuicksort class
> * Pair insertion sort was simplified and optimized
> * New nano insertion sort was introduced for tiny arrays
> * Merging sort was fully rewritten
> * Optimized merging partitioning is used
> * Merging parameters were updated
> * Merging of runs was fully rewritten
> * Fast version of heap sort was introduced
> * Parallel merging sort was also provided
>
> Sorting / ParallelSorting classes
> -
> * New test cases were added
> * Sources of these classes were unified
>
> Arrays class
> 
> * Calls of Dual-Pivot Quicksort were updated
> * Parallel sorting of primitives was switched to parallel
>Dual-Pivot Quicksort
> * Javadoc was modified
>
> ArraysParallelSortHelpers class
> ---
> * Old implementation of parallel sorting for primitives was removed
> * Javadoc was updated
>
> Thank you,
> Vladimir
>
> 
> [1] http://cr.openjdk.java.net/~alanb/DualPivotSortUpdate/webrev.01/
> 
>


Re: Can @Stable (or something similar) be made accessible?

2018-01-12 Thread Laurent Bourgès
Hi,

I am the author of the Marlin renderer, integrated in both java.desktop &
javafx.graphics modules.

I wonder if such core annotations (@Stable, @ForceInline, ...) could be
allowed in these OpenJDK modules in order to improve performance.

Marlin already uses jdk.internal.Unsafe, but @Stable looks promising as
Marlin uses (and recycles) lots of int/double arrays.

I already made my fields (& constants) final, but that is not enough for
HotSpot to fully trust my code, is it?

I think these annotations are confined to java.lang for now.
I will check that ASAP.

Cheers,
Laurent
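For context, @Stable marks a field (or array element) whose value changes at most once, allowing HotSpot to constant-fold reads after the first non-default write. The real annotation lives in jdk.internal.vm.annotation and is not accessible to application code, so this sketch declares a stand-in annotation purely to show the usage pattern:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for jdk.internal.vm.annotation.Stable -- illustration only;
// the JVM gives this declaration no special treatment.
@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
@interface Stable {}

final class PowerTable {
    // With the real @Stable, HotSpot may treat each element as a constant
    // once it has been written a non-default value exactly once.
    @Stable
    private static final int[] POWERS = new int[31];

    static int powerOfTwo(int n) {
        int p = POWERS[n];
        if (p == 0) {               // lazily initialize: write-once per slot
            p = 1 << n;
            POWERS[n] = p;
        }
        return p;
    }
}
```

With only final, the array reference is trusted as a constant but its elements are not; @Stable is what extends that trust to the element values.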

On 12 Jan 2018 at 9:50 AM, "Andrew Haley" wrote:

> On 12/01/18 04:33, Jason Greene wrote:
>
> > The internal @Stable facility provides the desired semantics and
> > precision, but it is heavily locked down, with privileged code
> > checks and export restrictions. Could this be made more accessible
> > (or perhaps a variant restricted to just final fields)? Informing
> > the compiler that a final field is a true lazy initialized constant,
> > with no store-to-final seems a pretty useful construct in general. I
> > can understand that the long term desire is that this shouldn’t be
> > necessary, and should be inferred [3], but at that point the
> > annotation is still useful as documentation and legacy
> > compatibility. If nothing else could it be allowed in non-privileged
> > code via some flag?
>
> I don't know of any way to do that without compromising the integrity
> of the JVM.  All that anybody would have to do to break the VM is to
> define a field as @Stable and then change the field.  I suppose you
> could say that Unsafe is just as bad and people still are allowed to
> use it, but there's understandable reluctance any further to transform
> a safe runtime into an unsafe one.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. 
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>


Re: Faster Math ?

2017-11-17 Thread Laurent Bourgès
>   https://bugs.openjdk.java.net/browse/JDK-8145688
>   (From these you can track related issues.)
>
> Other Math functions are not intrinsic like cbrt (non-native) and acos
> (native). There is ongoing work to turn native implementations into Java
> implementations (i don’t know if there would be any follow up on
> intrinsification).
>
>https://bugs.openjdk.java.net/browse/JDK-8134780
>https://bugs.openjdk.java.net/browse/JDK-8171407
>
> Joe knows more.
>
> —
>
> As part of the Vector API effort we will likely need to investigate the
> support for less accurate but faster math functions. It’s too early to tell
> if something like a FastMath class will pop out of that, but FWIW i am
> sympathetic to that :-)
>
> I liked this tweet:
>
> https://twitter.com/FioraAeterna/status/926150700836405248
>
>life as a gpu compiler dev is basically just fielding repeated
> complaints that
>"fast math" isn't precise and "precise math" isn't fast
>
> as an indication of what we could be getting into :-)
>
> Paul.
>
> On 9 Nov 2017, at 01:00, Laurent Bourgès 
>> wrote:
>>
>> Hi,
>>
>> The Marlin renderer (JEP 265) uses a few Math functions: sqrt, cbrt, acos...
>>
>> Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
>> JNI overhead?) and tell me if such functions are already highly optimized
>> in JDK 9 or 10?
>>
>> Some people have implemented their own fast Math like Apache Commons Math
>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>
>> I wonder if I should implement my own cbrt function (cubics) in pure java
>> as I do not need the highest accuracy but SPEED.
>>
>> Would it sound possible to have a JDK FastMath public API (lots faster but
>> less accurate?)
>>
>> Do you know if recent CPU (intel?) have dedicated instructions for such
>> math operations ?
>> Why not use it instead?
>> Maybe that's part of the new Vectorization API (panama) ?
>>
>> Cheers,
>> Laurent Bourges
>>
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi Paul,

Thank you so much for this complete summary !

I will still perform some benchmarks and could port acos native code into
java code as it is used by marlin.

Anyway I will backport the Cbrt java code into Marlin @ github for JDK8
users (GPL v2).

Thanks,
Laurent

On 9 Nov 2017 at 18:19, "Paul Sandoz" wrote:

> Hi Laurent,
>
> A Java method is a candidate for intrinsification if it is annotated with
> @HotSpotIntrinsicCandidate. When running Java code you can also use the
> HotSpot flags -XX:+PrintCompilation -XX:+PrintInlining to show methods that
> are intrinsic (JIT watch, as mentioned, is also excellent in this regard).
>
> I recommend cloning OpenJDK and browsing the source.
>
> Some of the math functions are intrinsic in the interpreter and all the
> runtime compilers to ensure consistent results across interpretation and
> compilation.
>
> Work was done by Intel to improve many of the math functions. See:
>
>   Update for x86 sin and cos in the math lib
>   https://bugs.openjdk.java.net/browse/JDK-8143353
>
>   Update for x86 pow in the math lib
>   https://bugs.openjdk.java.net/browse/JDK-8145688
>
>   (From these you can track related issues.)
>
> Other Math functions are not intrinsic like cbrt (non-native) and acos
> (native). There is ongoing work to turn native implementations into Java
> implementations (i don’t know if there would be any follow up on
> intrinsification).
>
>   https://bugs.openjdk.java.net/browse/JDK-8134780
>   https://bugs.openjdk.java.net/browse/JDK-8171407
>
> Joe knows more.
>
> —
>
> As part of the Vector API effort we will likely need to investigate the
> support for less accurate but faster math functions. It’s too early to tell
> if something like a FastMath class will pop out of that, but FWIW i am
> sympathetic to that :-)
>
> I liked this tweet:
>
> https://twitter.com/FioraAeterna/status/926150700836405248
>
>   life as a gpu compiler dev is basically just fielding repeated
> complaints that
>   "fast math" isn't precise and "precise math" isn't fast
>
> as an indication of what we could be getting into :-)
>
> Paul.
>
> > On 9 Nov 2017, at 01:00, Laurent Bourgès 
> wrote:
> >
> > Hi,
> >
> > The Marlin renderer (JEP 265) uses a few Math functions: sqrt, cbrt, acos...
> >
> > Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
> > JNI overhead?) and tell me if such functions are already highly optimized
> > in JDK 9 or 10?
> >
> > Some people have implemented their own fast Math like Apache Commons Math
> > or JaFaMa libraries that are 10x faster for acos / cbrt.
> >
> > I wonder if I should implement my own cbrt function (cubics) in pure java
> > as I do not need the highest accuracy but SPEED.
> >
> > Would it sound possible to have a JDK FastMath public API (lots faster
> but
> > less accurate?)
> >
> > Do you know if recent CPU (intel?) have dedicated instructions for such
> > math operations ?
> > Why not use it instead?
> > Maybe that's part of the new Vectorization API (panama) ?
> >
> > Cheers,
> > Laurent Bourges
>
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Thanks, andrew.

I searched on the web and I understand now:
The fdlibm native library has been ported to Java code for jdk9 (like the
jafama library).

Cbrt changeset:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/7dc9726cfa82

I will anyway compare jdk9 with latest jafama 2.2 to have an up-to-date
comparison.

Does somebody else need a faster but less accurate math library ?
JaFaMa has such alternative methods...

Preserving double-precision may be very costly in terms of performance and
sometimes such accuracy is not required.

Last question:
Is there a sin & cos function returning both values for the same angle?
It is very useful to compute an exact Fourier transform...
It is called sinAndCos(double wrappers) in jafama.
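A combined routine of that shape could look like the following minimal sketch. The class, method name, and out-array convention here are mine (not jafama's actual API), and it simply delegates to Math rather than sharing the argument reduction between the two evaluations as a real fast-math version would:

```java
// Hypothetical sketch of a combined sin+cos call; illustrative names only.
public final class SinCos {
    /** Computes sin(a) into out[0] and cos(a) into out[1]. */
    public static void sinAndCos(double a, double[] out) {
        // A fast-math implementation would reduce the argument once and
        // evaluate both polynomials; here we just delegate to Math.
        out[0] = Math.sin(a);
        out[1] = Math.cos(a);
    }

    public static void main(String[] args) {
        double[] sc = new double[2];
        sinAndCos(Math.PI / 6.0, sc);
        System.out.println("sin=" + sc[0] + " cos=" + sc[1]);
    }
}
```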

Cheers,
Laurent

On 9 Nov 2017 17:08, "Andrew Haley" wrote:

> On 09/11/17 15:02, Laurent Bourgès wrote:
> > --- testing cbrt(double) = pow(double, 1/3) ---
> > Loop on Math.pow(double, 1/3), args in [-10.0,10.0], took 0.739 s
> > Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.166 s
> > Loop on Math.pow(double, 1/3), args in [-0.7,0.7], took 0.746 s
> > Loop on FastMath.cbrt(double), args in [-0.7,0.7], took 0.166 s
> > Loop on Math.pow(double, 1/3), args in [-0.1,0.1], took 0.742 s
> > Loop on FastMath.cbrt(double), args in [-0.1,0.1], took 0.165 s
> > Loop on Math.pow(double, 1/3), args in all magnitudes, took 0.753 s
> > Loop on FastMath.cbrt(double), args in all magnitudes, took 0.244 s
> >
> > Conclusion:
> > - acos / asin / atan functions are quite slow: it confirms these are not
> > optimized by hotspot intrinsics.
> >
> > - cbrt() is slower than sqrt() : 1.1s vs 0.1 => 10x slower
> > - cbrt() is slower than pow(1/3) : 1.1s vs 0.7s => 50% slower
>
> No.  cbrt() is faster than pow(1/3) : 0.24 vs 0.75
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi,

Here are very basic benchmark results from (JaFaMa 2 - FastMathPerf) made
on my laptop (i7-6820HQ set @ 2Ghz + JDK8):
--- testing asin(double) ---
Loop on Math.asin(double) took 6.675 s
Loop on FastMath.asin(double) took 0.162 s

--- testing acos(double) ---
Loop on Math.acos(double) took 6.332 s
Loop on FastMath.acos(double) took 0.16 s

--- testing atan(double) ---
Loop on Math.atan(double) took 0.766 s
Loop on FastMath.atan(double) took 0.167 s

--- testing sqrt(double) ---
Loop on Math.sqrt(double), args in [0.0,10.0], took 0.095 s
Loop on FastMath.sqrt(double), args in [0.0,10.0], took 0.097 s
Loop on Math.sqrt(double), args in [0.0,1.0E12], took 0.109 s
Loop on FastMath.sqrt(double), args in [0.0,1.0E12], took 0.093 s
Loop on Math.sqrt(double), args in all magnitudes (>=0), took 0.091 s
Loop on FastMath.sqrt(double), args in all magnitudes (>=0), took 0.092 s

--- testing cbrt(double) ---
Loop on Math.cbrt(double), args in [-10.0,10.0], took 1.152 s
Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.195 s
Loop on Math.cbrt(double), args in [-1.0E12,1.0E12], took 1.153 s
Loop on FastMath.cbrt(double), args in [-1.0E12,1.0E12], took 0.193 s
Loop on Math.cbrt(double), args in all magnitudes, took 1.154 s
Loop on FastMath.cbrt(double), args in all magnitudes, took 0.272 s

--- testing cbrt(double) = pow(double, 1/3) ---
Loop on Math.pow(double, 1/3), args in [-10.0,10.0], took 0.739 s
Loop on FastMath.cbrt(double), args in [-10.0,10.0], took 0.166 s
Loop on Math.pow(double, 1/3), args in [-0.7,0.7], took 0.746 s
Loop on FastMath.cbrt(double), args in [-0.7,0.7], took 0.166 s
Loop on Math.pow(double, 1/3), args in [-0.1,0.1], took 0.742 s
Loop on FastMath.cbrt(double), args in [-0.1,0.1], took 0.165 s
Loop on Math.pow(double, 1/3), args in all magnitudes, took 0.753 s
Loop on FastMath.cbrt(double), args in all magnitudes, took 0.244 s

Conclusion:
- acos / asin / atan functions are quite slow: it confirms these are not
optimized by hotspot intrinsics.

- cbrt() is slower than sqrt() : 1.1s vs 0.1 => 10x slower
- cbrt() is slower than pow(1/3) : 1.1s vs 0.7s => 50% slower
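A self-contained loop benchmark in the same spirit as the numbers above (a sketch, not JMH: class and method names are mine, and the timings are only indicative since there is no proper warmup control). Note that Math.pow(x, 1.0/3.0) returns NaN for negative x, so the pow variant needs a sign fix-up to cover the [-10, 10] range:

```java
// Crude loop benchmark comparing cbrt against pow(x, 1/3); illustrative only.
public final class CbrtBench {
    static double bench(java.util.function.DoubleUnaryOperator f, double[] args) {
        double sink = 0.0; // accumulate so the JIT cannot eliminate the loop
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            for (double x : args) {
                sink += f.applyAsDouble(x);
            }
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("took " + elapsed / 1e9 + " s (sink=" + sink + ")");
        return sink;
    }

    public static void main(String[] args) {
        double[] xs = new double[100_000];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < xs.length; i++) {
            xs[i] = -10.0 + 20.0 * rnd.nextDouble();
        }
        bench(Math::cbrt, xs);
        // pow(x, 1/3) is NaN for x < 0, hence the signum/abs fix-up.
        bench(x -> Math.signum(x) * Math.pow(Math.abs(x), 1.0 / 3.0), xs);
    }
}
```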

Any plan to enhance these specific math operations ?

Laurent


2017-11-09 14:33 GMT+01:00 Laurent Bourgès :

> I checked in the latest jdk master and both cbrt / acos are NOT intrinsics.
>
> However, cbrt(x) = pow(x, 1/3) so it may be optimized...
>
> Could someone tell me how cbrt() is concretely implemented ?
>
> In native fdlibm, there is no e_cbrt.c!
>
> Thanks for your help,
> Laurent
>
On 9 Nov 2017 10:52 AM, "Jonas Konrad" wrote:
>
>> Hey,
>>
>> Most functions in the Math class are intrinsic (
>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/
>> share/vm/classfile/vmSymbols.hpp#l664 ) and will use native instructions
>> where available. You can also test this yourself using jitwatch. There is
>> no native call overhead.
>>
>> The standard library does not currently include less accurate but faster
>> Math functions, maybe someone else can answer if that is something to be
>> considered.
>>
>> - Jonas Konrad
>>
>> On 11/09/2017 10:00 AM, Laurent Bourgès wrote:
>>
>>> Hi,
>>>
>>> The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
>>>
>>> Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
>>> JNI overhead?) and tell me if such functions are already highly optimized
>>> in jdk9 or 10 ?
>>>
>>> Some people have implemented their own fast Math like Apache Commons Math
>>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>>
>>> I wonder if I should implement my own cbrt function (cubics) in pure java
>>> as I do not need the highest accuracy but SPEED.
>>>
>>> Would it sound possible to have a JDK FastMath public API (lots faster
>>> but
>>> less accurate?)
>>>
>>> Do you know if recent CPU (intel?) have dedicated instructions for such
>>> math operations ?
>>> Why not use it instead?
>>> Maybe that's part of the new Vectorization API (panama) ?
>>>
>>> Cheers,
>>> Laurent Bourges
>>>
>>>


-- 
-- 
Laurent Bourgès


Re: Faster Math ?

2017-11-09 Thread Laurent Bourgès
I checked in the latest jdk master and both cbrt / acos are NOT intrinsics.

However, cbrt(x) = pow(x, 1/3) so it may be optimized...

Could someone tell me how cbrt() is concretely implemented?

In native fdlibm, there is no e_cbrt.c!

Thanks for your help,
Laurent

On 9 Nov 2017 10:52 AM, "Jonas Konrad" wrote:

> Hey,
>
> Most functions in the Math class are intrinsic (
> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/
> share/vm/classfile/vmSymbols.hpp#l664 ) and will use native instructions
> where available. You can also test this yourself using jitwatch. There is
> no native call overhead.
>
> The standard library does not currently include less accurate but faster
> Math functions, maybe someone else can answer if that is something to be
> considered.
>
> - Jonas Konrad
>
> On 11/09/2017 10:00 AM, Laurent Bourgès wrote:
>
>> Hi,
>>
>> The Marlin renderer (JEP265) uses few Math functions: sqrt, cbrt, acos...
>>
>> Could you check if the current JDK uses C2 intrinsics or fdlibm (native /
>> JNI overhead?) and tell me if such functions are already highly optimized
>> in jdk9 or 10 ?
>>
>> Some people have implemented their own fast Math like Apache Commons Math
>> or JaFaMa libraries that are 10x faster for acos / cbrt.
>>
>> I wonder if I should implement my own cbrt function (cubics) in pure java
>> as I do not need the highest accuracy but SPEED.
>>
>> Would it sound possible to have a JDK FastMath public API (lots faster but
>> less accurate?)
>>
>> Do you know if recent CPU (intel?) have dedicated instructions for such
>> math operations ?
>> Why not use it instead?
>> Maybe that's part of the new Vectorization API (panama) ?
>>
>> Cheers,
>> Laurent Bourges
>>
>>


Faster Math ?

2017-11-09 Thread Laurent Bourgès
Hi,

The Marlin renderer (JEP 265) uses a few Math functions: sqrt, cbrt, acos...

Could you check whether the current JDK uses C2 intrinsics or fdlibm (native /
JNI overhead?) and tell me if such functions are already highly optimized
in jdk9 or 10?

Some people have implemented their own fast Math, like the Apache Commons Math
or JaFaMa libraries, which are 10x faster for acos / cbrt.

I wonder if I should implement my own cbrt function (cubics) in pure Java,
as I do not need the highest accuracy but SPEED.

Would it be possible to have a JDK FastMath public API (a lot faster but
less accurate)?

Do you know if recent CPUs (Intel?) have dedicated instructions for such
math operations?
Why not use them instead?
Maybe that's part of the new vectorization API (Panama)?

Cheers,
Laurent Bourges


Re: [OpenJDK 2D-Dev] JDK 9 RFR of JDK-8149896: Remove unnecessary values in FloatConsts and DoubleConsts

2016-02-16 Thread Laurent Bourgès
Hello,

Joe, I tested your changes in the Marlin renderer as it works well !


> A quick note on the 2d changes, several constants (and a copy from a
> package-private method from java.lang) were used to initialize a double
> value POWER_2_TO_32; this can be accomplished more directly using a
> hexadecimal floating-point literal.
> >
> > Please review the webrev
> >
> >http://cr.openjdk.java.net/~darcy/8149896.0/
> ...
> My favourite change is this one:
>
> -private static final double POWER_2_TO_32 = FloatMath.powerOfTwoD(32);
> +private static final double POWER_2_TO_32 = 0x1.0p32;
>
> and the removal of the method on FloatMath.
>

I agree it is simpler and exactly what I needed; I wasn't aware of such
literals:
https://blogs.oracle.com/darcy/entry/hexadecimal_floating_point_literals
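For reference, the literal in question can be checked directly; a tiny standalone illustration (the class name is mine):

```java
public final class HexLiteral {
    // 0x1.0p32 reads as: significand 1.0 times 2^32 -- an exact double
    // constant, with no runtime computation and no helper method needed.
    static final double POWER_2_TO_32 = 0x1.0p32;

    public static void main(String[] args) {
        // Math.scalb(1.0, 32) also computes 1.0 * 2^32 exactly.
        System.out.println(POWER_2_TO_32 == Math.scalb(1.0, 32)); // true
        System.out.println(POWER_2_TO_32);                        // 4.294967296E9
    }
}
```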

(I am not a reviewer)

Thanks for the fix,
Laurent


Re: RFR: JDK-8066859 java/lang/ref/OOMEInReferenceHandler.java failed with java.lang.Exception: Reference Handler thread died

2015-05-07 Thread Laurent Bourgès
Peter,

Thanks for long and detailled answer.

I now know better why OOME should not happen. However, any application may
also use phantom references, and the ReferenceHandler thread will call
Cleaner.run(), which could catch OOME from the application code
implementing thunk.run(). Am I right?

>> If this block also throws a new oome2 due to the first oome1 (no memory
left), it will work but I would have preferred a more explicit solution and
check oome1 first ...

I looked back at your patch and it is fine. However, I wonder if it would be
possible to avoid any allocation in the catch(Throwable) block:
- preallocate the PrivilegedAction
- avoid new Error(x) to get its stack trace? Do you know any trick like
the ones in SharedSecrets that could dump the stack without any allocation in
case of urgency?
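The preallocation idea can be sketched as follows. This is purely illustrative: the class and names are mine, and it is not the actual java.lang.ref code:

```java
// Sketch of "preallocate everything the OOME path needs": the catch block
// itself performs no `new`, reusing objects built at class-init time.
public final class PreallocatedHandler {
    // Allocated eagerly, so reporting a secondary failure needs no allocation.
    private static final Error CLEANUP_FAILED =
            new Error("cleanup thunk failed under memory pressure");

    /** Runs the thunk; returns true on success, false if it threw OOME. */
    static boolean runCleanup(Runnable thunk) {
        try {
            thunk.run();
            return true;
        } catch (OutOfMemoryError oome) {
            // Only preallocated objects are touched in this block.
            System.err.println(CLEANUP_FAILED.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runCleanup(() -> {}));                                // true
        System.out.println(runCleanup(() -> { throw new OutOfMemoryError(); })); // false
    }
}
```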

> You have a point and I asked myself the same question. The question is
how to treat OOME thrown from thunk.run(). Current behavior is to exit()
JVM for any exception (Throwable). I maintained that semantics. I only
added a handler for OOME thrown in the handler of the 1st exception. I
might have just exit()-ed the VM if OOME is thrown, but leaving no trace
and just exiting VM would not help anyone diagnose what went wrong. So I
opted for keeping the VM running for a while by delaying the handling of
1st exception to "better times". If better times never come, then the
application is probably stuck anyway.

Seems very good after a 2nd look.
However, could it loop for a while if no more memory is left?
For example: oome1 => oome2 (catch) => throw x => oome2 (catch) => ...

> An alternative would be to catch OOME from thunk.run() and ignore it
(printing it out would be ugly if VM is left to run), but that would
silently ignore OOMEs thrown from thunk.run() and noone would notice that
Cleaner(s) might not have clean-ed up the resources they should.

I am a bit lost, but I like logging such exceptional cases; if no
allocation can happen, though, how can we ensure such a case is logged anyway?
...

> Anyway. If none of the Cleaner.thunk's run() methods can throw any
exception, then my handling of OOME is redundant and a code-path never
taken. But I would still leave it there in case some new Cleaner use comes
along which is not verified yet...

Agreed. It is always better to handle such exceptional cases if you can at
least log them...

Best regards,
Laurent


Re: RFR: JDK-8066859 java/lang/ref/OOMEInReferenceHandler.java failed with java.lang.Exception: Reference Handler thread died

2015-05-07 Thread Laurent Bourgès
Peter,

I looked at Cleaner out of curiosity and it seems not to catch the OOME
from thunk.run!

If oome1 is thrown by thunk.run at line 150 then it is caught at line 157,
but your new try/catch block (oome2) only encapsulates the doPrivileged
block.

If this block also throws a new oome2 due to the first oome1 (no memory
left), it will work but I would have preferred a more explicit solution and
check oome1 first ...

My 2 cents (I am not a reviewer).

Laurent


Fwd: Re: Fwd: Improve large array allocation / gc & intrinsics

2014-02-13 Thread Laurent Bourgès
Comments & opinions from core-libs, please ?

Laurent

-- Forwarded message --
From: "Thomas Schatzl" 
Date: 11 Feb 2014 11:28
Subject: Re: Fwd: Improve large array allocation / gc & intrinsics
To: "Laurent Bourgès" 
Cc: 

> Hi,
>
>   just a few comments...
>
> > From: "Laurent Bourgès" 
> > Date: 10 Feb 2014 10:24
> > Subject: Improve large array allocation / gc & intrinsics
> > To: "core-libs-dev" , "discuss" <
> > disc...@openjdk.java.net>
> > Cc:
> >
> > > Dear all,
> > >
> > > I would like to propose a JDK9 RFE to improve JVM efficiency when
> > > dealing with large arrays (allocation + gc).
> > >
> > > In several scientific applications (and my patched java2d pisces),
> > > many large arrays are needed, created on the fly and it becomes very
> > > painful to recycle them using an efficient array cache (concurrency,
> > > cache size tuning, clear + cache eviction issues).
>
> Why do you think that a one-size-fits-all approach in the JDK will not
> have the same issues to deal with? How do you know that
> a generic library in the JDK (as in your proposal) or hacking things
> into the VM can deal with this problem better than you, who knows the
> application allocation patterns?
>
> Do you have a prototype (library) for that?
>
> > > In this case, the GC overhead leads to a big performance penalty
> > > (several hundred megabytes per second).
>
> I do not understand what the problem is. Typically I would not specify
> performance (throughput?) penalty in megabytes per seconds.
>
> Do the GCs take too long, or do you feel there too much memory wastage
> somewhere?
>
> > > As such array caches are very efficient in an application context, I am
> > > wondering if that approach could be promoted at the JDK level itself:
> > >
> > > - provide a new array allocator for large arrays that can return
> > > larger arrays than expected (size = 4 or 8 multiples) using array
> > > caches (per thread ?) stored in a dedicated JVM large memory area
>
> The behavior you propose seems very particular to your application.
> Others may have completely different requirements. The mentioned
> per-thread caches do not seem to be problematic to do in a library
> either.
>
> > > (GC aware) and providing efficient cache eviction policies
>
> Did you every try one of the other available garbage collectors with
> your application? Both CMS and G1 never move large objects around, i.e.
> there is almost no direct GC overhead associated with them.
>
> Reclaiming them is almost zero cost in these collectors. Keeping them
> alive seems to be best handled by logic in a library.
>
> Can you give examples where the VM has significant advantages over a
> dedicated library, or a particular use case with measurements that show
> this could be the case?
>
> > > - maybe support both clean arrays (zero-filled) and dirty arrays,
> > > because some algorithms do not need zero-filled arrays.
> > >
> > > - improve JVM intrinsics (array clear, fill) to achieve maximum
> > > performance ie take into account the data alignment (4 or 8 multiples)
>
> I think the compiler already uses specialized methods for that, using
> the "best" instructions that are available. It should also already be
> able to detect "fill" loops, and vectorize them.
>
> Objects are always 8 byte aligned - I think you can force higher
> alignments by setting ObjectAlignmentInBytes or so.
>
> Otherwise these changes could be simply added, i.e. seems to not need
> any RFE.
>
> > > - upgrade GC to recycle such 'cached' arrays (clean), update usage
> > > statistics and manage cache eviction policy (avoid wasting memory)
>
> The GCs already automatically recycle the freed space. Everything else
> seems to be more complicated to implement at VM level than at library
> level, with the added drawback that you add a VM dependency.
>
> > > Please give me your feedback & opinion and evaluate if this feature
> > > seems possible to implement into the JDK (hotspot, gc, core-libs)...
>
> Thanks,
> Thomas
>
>


Fwd: Improve large array allocation / gc & intrinsics

2014-02-11 Thread Laurent Bourgès
Please, could you give me your opinions on the following ideas for a JDK9
RFE ?

Is it worth it? Interesting, or totally stupid with the current hotspot / gc?

How hard / long would it be to study and make a prototype?

Any other ideas to improve array performance, like further improving array
bounds-check elimination or the fill / clear intrinsics...?

Who would be interested by this topics ?

Laurent

-- Forwarded message --
From: "Laurent Bourgès" 
Date: 10 Feb 2014 10:24
Subject: Improve large array allocation / gc & intrinsics
To: "core-libs-dev" , "discuss" <
disc...@openjdk.java.net>
Cc:

> Dear all,
>
> I would like to propose a JDK9 RFE to improve JVM efficiency when
> dealing with large arrays (allocation + gc).
>
> In several scientific applications (and my patched java2d pisces),
> many large arrays are needed, created on the fly and it becomes very
> painful to recycle them using an efficient array cache (concurrency,
> cache size tuning, clear + cache eviction issues).
>
> In this case, the GC overhead leads to a big performance penalty
> (several hundred megabytes per second).
>
> As such array caches are very efficient in an application context, I am
> wondering if that approach could be promoted at the JDK level itself:
>
> - provide a new array allocator for large arrays that can return
> larger arrays than expected (size = 4 or 8 multiples) using array
> caches (per thread ?) stored in a dedicated JVM large memory area (GC
> aware) and providing efficient cache eviction policies
>
> - maybe support both clean arrays (zero-filled) and dirty arrays,
> because some algorithms do not need zero-filled arrays.
>
> - improve JVM intrinsics (array clear, fill) to achieve maximum
> performance ie take into account the data alignment (4 or 8 multiples)
>
> - upgrade GC to recycle such 'cached' arrays (clean), update usage
> statistics and manage cache eviction policy (avoid wasting memory)
>
> Please give me your feedback & opinion and evaluate if this feature
> seems possible to implement into the JDK (hotspot, gc, core-libs) ...
>
> What is the procedure to create such JDK9 RFE ?
>
> Best regards,
>
> Laurent Bourgès


Improve large array allocation / gc & intrinsics

2014-02-10 Thread Laurent Bourgès
Dear all,

I would like to propose a JDK9 RFE to improve JVM efficiency when
dealing with large arrays (allocation + gc).

In several scientific applications (and my patched java2d pisces),
many large arrays are needed, created on the fly and it becomes very
painful to recycle them using an efficient array cache (concurrency,
cache size tuning, clear + cache eviction issues).
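Such an application-level array cache typically looks like this minimal sketch (the names are mine; a real cache also needs eviction and footprint limits, which is exactly the pain point described here). Sizes are rounded up to a multiple of 8 so callers with close sizes can share entries:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Minimal per-thread cache for large double[] buffers; illustrative only.
public final class ArrayCache {
    private static final ThreadLocal<Map<Integer, ArrayDeque<double[]>>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static int roundUp(int length) {
        return (length + 7) & ~7; // next multiple of 8
    }

    /** Returns a possibly-dirty array of at least {@code length} elements. */
    static double[] getDirty(int length) {
        int key = roundUp(length);
        ArrayDeque<double[]> pool =
                CACHE.get().computeIfAbsent(key, k -> new ArrayDeque<>());
        double[] a = pool.poll();
        return (a != null) ? a : new double[key];
    }

    /** Returns an array to the calling thread's pool for later reuse. */
    static void recycle(double[] a) {
        CACHE.get().computeIfAbsent(a.length, k -> new ArrayDeque<>()).offer(a);
    }

    public static void main(String[] args) {
        double[] a = getDirty(10);             // rounded up to 16 elements
        recycle(a);
        System.out.println(getDirty(10) == a); // true: cached instance reused
    }
}
```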

In this case, the GC overhead leads to a big performance penalty
(several hundred megabytes per second).

As such array caches are very efficient in an application context, I am
wondering if that approach could be promoted at the JDK level itself:

- provide a new array allocator for large arrays that can return
larger arrays than expected (size = 4 or 8 multiples) using array
caches (per thread ?) stored in a dedicated JVM large memory area (GC
aware) and providing efficient cache eviction policies

- maybe support both clean arrays (zero-filled) and dirty arrays,
because some algorithms do not need zero-filled arrays.

- improve JVM intrinsics (array clear, fill) to achieve maximum
performance ie take into account the data alignment (4 or 8 multiples)

- upgrade GC to recycle such 'cached' arrays (clean), update usage
statistics and manage cache eviction policy (avoid wasting memory)

Please give me your feedback & opinion and evaluate if this feature
seems possible to implement into the JDK (hotspot, gc, core-libs) ...

What is the procedure to create such a JDK9 RFE ?

Best regards,

Laurent Bourgès


Re: Conflicts between JVM application and j.u.l logging shutdown hooks

2013-12-12 Thread Laurent Bourgès
Alan,

Thanks for creating bug 9005822!

As I spotted in my initial email, both shutdown hook problems (JavaWS and
JUL) are due to the concurrent execution of shutdown hooks:

com.sun.imageio.stream.StreamCloser.java
101:  Runtime.getRuntime().addShutdownHook(streamCloser);

java.util.logging.LogManager.java
255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
For example, the JavaWS bug is caused by a closed jar file (unable to load
a class during shutdown) because (I guess) the StreamCloser closed all
opened jar files or JUL streams.

As I said, these 2 important hooks (StreamCloser and jul.Cleaner) should be
executed "later", with the StreamCloser last.
As jul.Handlers can use sockets or files, it is important to flush / close
the handlers first (Cleaner) and then close any remaining open streams...

I think this bug should be converted into a more general shutdown hook
issue:
- execute application hooks first (not the JVM ones)
- fix priorities / order between all JVM hooks (~20 now)

Regards,
Laurent


Conflicts between JVM application and j.u.l logging shutdown hooks

2013-12-09 Thread Laurent Bourgès
Does anybody want to look at this logging issue I reported several months ago?

On 27 June 2013 11:57, "Laurent Bourgès" wrote:
>
> Dear members,
>
> I have a problem within an external library using one JVM Shutdown hook
to perform resource cleanup (socket, lock file deletion ...) that seems
buggy in java web start environment: sometimes the JVM reports an
IllegalStateException happenning during that shutdown hook.
>
> This library uses java.util.logging to log warnings and exceptions but
none is logged (console or java trace files).
>
> The 'lost' log messages is related to the LogManager's shutdown hook:
> // This private class is used as a shutdown hook.
> // It does a "reset" to close all open handlers.
> private class Cleaner extends Thread {
>
> private Cleaner() {
> /* Set context class loader to null in order to avoid
>  * keeping a strong reference to an application classloader.
>  */
> this.setContextClassLoader(null);
> }
>
> public void run() {
> // This is to ensure the LogManager.<clinit> is completed
> // before synchronized block. Otherwise deadlocks are possible.
> LogManager mgr = manager;
>
> // If the global handlers haven't been initialized yet, we
> // don't want to initialize them just so we can close them!
> synchronized (LogManager.this) {
> // Note that death is imminent.
> deathImminent = true;
> initializedGlobalHandlers = true;
> }
>
> // Do a reset to close all active handlers.
> reset();
> }
> }
>
> Without any log handler, the log messages are lost during shutdown !
> I think it is annoying as it prevents me from getting logs and exceptions ...
>
> FYI, the ApplicationShutdownHooks class executes all 'application' hooks
concurrently:
> static void runHooks() {
> Collection<Thread> threads;
> synchronized(ApplicationShutdownHooks.class) {
> threads = hooks.keySet();
> hooks = null;
> }
>
> for (Thread hook : threads) {
> hook.start();
> }
> for (Thread hook : threads) {
> try {
> hook.join();
> } catch (InterruptedException x) { }
> }
> }
> So it is not possible to define hook priorities ...
>
> However, I looked at the JVM Shutdown class which supports hook
priorities and I think that the J.U.L Cleaner hook should run after all
application hooks.
> Here are the current Shutdown priorities:
> // The system shutdown hooks are registered with a predefined slot.
> // The list of shutdown hooks is as follows:
> // (0) Console restore hook
> // (1) Application hooks
> // (2) DeleteOnExit hook
>
> Moreover, as a RFE, it could be useful for applications to be able to
define hook priorities within an application:
> I did that several times registering only 1 shutdown hook and handling
correct ordering on my own ... but could be supported by the JDK itself.
>
> Finally, here are (below) the Open JDK8 usages of Runtime.addShutdownHook
[19 occurrences] which are more system hooks than application hooks:
> com.sun.imageio.stream
> StreamCloser.java
> StreamCloser
> addToQueue
> 
> run
>   101:  Runtime.getRuntime().addShutdownHook(streamCloser);
> java.util.logging
> LogManager.java
> LogManager
> LogManager
>   255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
> sun.font
> SunFontManager.java
> SunFontManager
> createFont2D
> 
> run
> 2538:  Runtime.getRuntime().addShutdownHook(fileCloser);
> sun.rmi.server
> Activation.java
> Activation
> init
>   240:  Runtime.getRuntime().addShutdownHook(shutdownHook);
> sun.tools.jstat
>
> sun.awt.shell
> Win32ShellFolderManager2.java
> Win32ShellFolderManager2
> ComInvoker
> ComInvoker
> >
> run
>   493:  Runtime.getRuntime().addShutdownHook(
> sun.awt.windows
> WToolkit.java
> WToolkit
> registerShutdownHook
> >
> run
>   285:  Runtime.getRuntime().addShutdownHook(shutdown);
> sun.java2d.d3d
> D3DScreenUpdateManager.java
> D3DScreenUpdateManager
> D3DScreenUpdateManager
> 
> run
>   112:  Runtime.getRuntime().addShutdownHook(shutdown);
> java.util.prefs
> FileSystemPreferences.java
> FileSystemPreferences
> >
> run
>   439:  Runtime.getRuntime().addShutdownHook(new Thread() {
> sun.awt.X11
> XToolkit.java
> XToolkit
> init
> >
> run
>   350:  Runtime.getRuntime().addShutdownHook(shutdownThread

Re: JVM application shutdown hooks

2013-08-07 Thread Laurent Bourgès
Dear core-libs members,

I finally succeeded in diagnosing my shutdown hook issue in the Java Web Start
environment: see Bug ID 9005822.

Could you please give ma feedback to my questions related to LogManager and
StreamCloser shutdown hooks ?


This library uses java.util.logging to log warnings and exceptions but none
> is logged (console or java trace files).
>
> The 'lost' log messages is related to the LogManager's shutdown hook:
> // This private class is used as a shutdown hook.
> // It does a "reset" to close all open handlers.
> private class Cleaner extends Thread {
>
> private Cleaner() {
> /* Set context class loader to null in order to avoid
>  * keeping a strong reference to an application classloader.
>  */
> this.setContextClassLoader(null);
> }
>
> public void run() {
> // This is to ensure the LogManager.<clinit> is completed
> // before synchronized block. Otherwise deadlocks are possible.
> LogManager mgr = manager;
>
> // If the global handlers haven't been initialized yet, we
> // don't want to initialize them just so we can close them!
> synchronized (LogManager.this) {
> // Note that death is imminent.
> deathImminent = true;
> initializedGlobalHandlers = true;
> }
>
> // Do a reset to close all active handlers.
> reset();
> }
> }
>
> I think it is annoying as it prevents me from getting logs and exceptions ...
>

Exactly what happened to me in JavaWS env !! I had to patch JSamp library
to add System.out and printStackTrace statements whereas there are plenty
of j.u.l. log statements !!

Please postpone that f*** hook after 'true' application hooks !!!


Moreover, as a RFE, it could be useful for applications to be able to
> define hook priorities within an application:
> I did that several times registering only 1 shutdown hook and handling
> correct ordering on my own ... but could be supported by the JDK itself.
>

Do you want me to submit a patch ? I think that the JVM specs should be
modified to change the addShutdownHook() signature => RFE ?
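The single-hook workaround mentioned above (one registered JVM hook that runs sub-tasks in a chosen order) can be sketched like this; all names are mine:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// One JVM shutdown hook, internal priority ordering; illustrative only.
public final class OrderedShutdown {
    // Lower priority value runs first; tasks at the same priority are chained.
    private static final ConcurrentSkipListMap<Integer, Runnable> TASKS =
            new ConcurrentSkipListMap<>();

    static {
        Runtime.getRuntime().addShutdownHook(new Thread(OrderedShutdown::runAll));
    }

    static void register(int priority, Runnable task) {
        TASKS.merge(priority, task, (a, b) -> () -> { a.run(); b.run(); });
    }

    static void runAll() {
        for (Map.Entry<Integer, Runnable> e : TASKS.entrySet()) {
            try {
                e.getValue().run();
            } catch (Throwable t) {
                t.printStackTrace(); // one failing task must not block the rest
            }
        }
    }

    public static void main(String[] args) {
        register(1, () -> System.out.println("flush log handlers")); // runs second
        register(0, () -> System.out.println("close app resources")); // runs first
        runAll(); // the JVM hook also calls this on exit; invoked here to demo
    }
}
```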


> Finally, here are (below) the Open JDK8 usages of Runtime.addShutdownHook
> [19 occurrences] which are more system hooks than application hooks:
> com.sun.imageio.stream
> StreamCloser.java
> StreamCloser
> addToQueue
> 
> run
>   101:  Runtime.getRuntime().addShutdownHook(streamCloser);
> java.util.logging
> LogManager.java
> LogManager
> LogManager
>   255:  Runtime.getRuntime().addShutdownHook(new Cleaner());
> sun.font
> SunFontManager.java
> SunFontManager
> createFont2D
> 
> run
> 2538:  Runtime.getRuntime().addShutdownHook(fileCloser);
> sun.rmi.server
> Activation.java
> Activation
> init
>   240:  Runtime.getRuntime().addShutdownHook(shutdownHook);
> sun.tools.jstat
>
> sun.awt.shell
> Win32ShellFolderManager2.java
> Win32ShellFolderManager2
> ComInvoker
> ComInvoker
> >
> run
>   493:  Runtime.getRuntime().addShutdownHook(
> sun.awt.windows
> WToolkit.java
> WToolkit
> registerShutdownHook
> >
> run
>   285:  Runtime.getRuntime().addShutdownHook(shutdown);
> sun.java2d.d3d
> D3DScreenUpdateManager.java
> D3DScreenUpdateManager
> D3DScreenUpdateManager
> 
> run
>   112:  Runtime.getRuntime().addShutdownHook(shutdown);
> java.util.prefs
> FileSystemPreferences.java
> FileSystemPreferences
> >
> run
>   439:  Runtime.getRuntime().addShutdownHook(new Thread() {
> sun.awt.X11
> XToolkit.java
> XToolkit
> init
> >
> run
>   350:  Runtime.getRuntime().addShutdownHook(shutdownThread);
> sun.awt
> X11GraphicsDevice.java
> X11GraphicsDevice
> setDisplayMode
> >
> run
>   445:  Runtime.getRuntime().addShutdownHook(t);
> java.util.prefs
> MacOSXPreferencesFile.java
> MacOSXPreferencesFile
> timer
>   356:  Runtime.getRuntime().addShutdownHook(flushThread);
> sun.font
> CFontManager.java
> CFontManager
> createFont2D
> >
> run
>   232:  Runtime.getRuntime().addShutdownHook(fileCloser);
> sun.lwawt
> LWToolkit.java
> LWToolkit
> init
> 83:  Runtime.getRuntime().addShutdownHook(
>
> I hope that these hooks already support concurrent execution (AWT?), but
> someone should check that there are no dependencies between them (like
> systemd or rc scripts do).
>

Could someone check that there are no dependencies between all these shutdown
hooks when they are executed in parallel, in random order?

Best regards,
Laurent


JVM application shutdown hooks

2013-06-27 Thread Laurent Bourgès
ent, it may be the cause of my problems as I do not know if the
hook thread is still within the application class loader ...

Laurent Bourgès


Re: RFR (XS) CR 8014233: java.lang.Thread should be @Contended

2013-05-09 Thread Laurent Bourgès
Peter,

you're absolutely right: I was thinking about the thread-local values (object
instances) and not the ThreadLocal keys!
I think the name ThreadLocal is confusing, as it does not refer to the values!

Several times I have wondered whether false sharing can happen between my
thread-local values (i.e. different per-thread context classes) and any other
object (including other threads' contexts).

Is the GC (old gen) able to place objects in a thread-dedicated area? It would
then avoid any false sharing between the object graphs dedicated to each
thread = thread isolation.

I think the TLAB does so for allocation / short-lived objects, but it is not
the case for the old generation (long-lived objects): maybe G1 could provide
different partitioning and perhaps take thread affinity into account?

Laurent

2013/5/9 Peter Levart 

>
> On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
>
> Hi all,
>
> A stupid question:
> any ThreadLocal subclass should be marked @Contended to be sure that false
> sharing never happens between ThreadLocal instance and any other object on
> the heap ?
>
>
> Hi Laurent,
>
> ThreadLocal object is just a key (into a ThreadLocalMap). It's usually not
> subclassed to add any state but to override initialValue method.
> ThreadLocal contains a single final field 'threadLocalHashCode', which is
> read at every call to ThreadLocal.get() (usually by multiple threads). This
> can contend with a frequent write of a field in some other object, placed
> into it's proximity, yes, but I don't think we should put @Contended on
> every class that has frequently read fields. @Contended should be reserved
> for classes with fields that are frequently written, if I understand the
> concept correctly.
>
> Regards, Peter
>
>
> Laurent
>
>  2013/5/9 Peter Levart 
>
>> Hi Aleksey,
>>
>> Wouldn't it be even better if just threadLocalRandom* fields were
>> annotated with @Contended("ThreadLocal") ?
>> Some fields within the Thread object are accessed from non-local threads.
>> I don't know how frequently, but isolating just threadLocalRandom* fields
>> from all possible false-sharing scenarios would seem even better, no?
>>
>> Regards, Peter
>>
>>
>> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>>
>>> Hi,
>>>
>>> This is from our backlog after JDK-8005926. After ThreadLocalRandom
>>> state was merged into Thread, we now have to deal with the false sharing
>>> induced by heavily-updated fields in Thread. TLR was padded before, and
>>> it should make sense to make Thread bear @Contended annotation to
>>> isolate its fields in the same manner.
>>>
>>> The webrev is here:
>>> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>>
>>> Testing:
>>>   - microbenchmarks (see below)
>>>   - JPRT cycle against jdk8-tl
>>>
>>> The extended rationale for the change follows.
>>>
>>> If we look at the current Thread layout, we can see the TLR state is
>>> buried within the Thread instance. TLR state are by far the mostly
>>> updated fields in Thread now:
>>>
>>>  Running 64-bit HotSpot VM.
>>>> Using compressed references with 3-bit shift.
>>>> Objects are 8 bytes aligned.
>>>>
>>>> java.lang.Thread
>>>>offset  size type description
>>>> 012  (assumed to be the object
>>>> header + first field alignment)
>>>>12 4  int Thread.priority
>>>>16 8 long Thread.eetop
>>>>24 8 long Thread.stackSize
>>>>32 8 long Thread.nativeParkEventPointer
>>>>40 8 long Thread.tid
>>>>48 8 long Thread.threadLocalRandomSeed
>>>>56 4  int Thread.threadStatus
>>>>60 4  int Thread.threadLocalRandomProbe
>>>>64 4  int
>>>> Thread.threadLocalRandomSecondarySeed
>>>>68 1  boolean Thread.single_step
>>>>69 1  boolean Thread.daemon
>>>>70 1  boolean Thread.stillborn
>>>>71 1  (alignment/padding gap)
>>>>72 4   char[] Thread.name
>>>>76 4   Thread Thread.threadQ
>>>&

Re: RFR (XS) CR 8014233: java.lang.Thread should be @Contended

2013-05-09 Thread Laurent Bourgès
Hi all,

A stupid question:
should any ThreadLocal subclass be marked @Contended, to be sure that false
sharing never happens between the ThreadLocal instance and any other object on
the heap?
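For context, a hedged sketch of the manual-padding workaround used before @Contended; note the JVM is free to reorder or even strip such unused fields, which is exactly why the annotation was proposed:

```java
/**
 * Pre-@Contended idiom: surround a heavily written field with enough
 * padding (7 x 8 = 56 bytes on each side) so that it cannot share a
 * 64-byte cache line with a neighbouring object's fields.
 * Illustrative only -- field layout is not guaranteed by the JLS.
 */
public final class PaddedCounter {
    long p0, p1, p2, p3, p4, p5, p6;   // leading padding
    private long value;                // the frequently written field
    long q0, q1, q2, q3, q4, q5, q6;   // trailing padding

    public long increment() {
        return ++value; // not atomic; single-writer use assumed
    }

    public long get() {
        return value;
    }
}
```

@Contended moves this layout decision into the JVM, where it can actually be enforced.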

Laurent

2013/5/9 Peter Levart 

> Hi Aleksey,
>
> Wouldn't it be even better if just threadLocalRandom* fields were
> annotated with @Contended("ThreadLocal") ?
> Some fields within the Thread object are accessed from non-local threads.
> I don't know how frequently, but isolating just threadLocalRandom* fields
> from all possible false-sharing scenarios would seem even better, no?
>
> Regards, Peter
>
>
> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>
>> Hi,
>>
>> This is from our backlog after JDK-8005926. After ThreadLocalRandom
>> state was merged into Thread, we now have to deal with the false sharing
>> induced by heavily-updated fields in Thread. TLR was padded before, and
>> it should make sense to make Thread bear @Contended annotation to
>> isolate its fields in the same manner.
>>
>> The webrev is here:
>> 
> >> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>
>> Testing:
>>   - microbenchmarks (see below)
>>   - JPRT cycle against jdk8-tl
>>
>> The extended rationale for the change follows.
>>
>> If we look at the current Thread layout, we can see the TLR state is
>> buried within the Thread instance. TLR state are by far the mostly
>> updated fields in Thread now:
>>
>>  Running 64-bit HotSpot VM.
>>> Using compressed references with 3-bit shift.
>>> Objects are 8 bytes aligned.
>>>
>>> java.lang.Thread
>>>offset  size type description
>>> 012  (assumed to be the object
>>> header + first field alignment)
>>>12 4  int Thread.priority
>>>16 8 long Thread.eetop
>>>24 8 long Thread.stackSize
>>>32 8 long Thread.nativeParkEventPointer
>>>40 8 long Thread.tid
>>>48 8 long Thread.threadLocalRandomSeed
>>>56 4  int Thread.threadStatus
>>>60 4  int Thread.threadLocalRandomProbe
>>>64 4  int Thread.threadLocalRandomSecondarySeed
>>>68 1  boolean Thread.single_step
>>>69 1  boolean Thread.daemon
>>>70 1  boolean Thread.stillborn
>>>71 1  (alignment/padding gap)
>>>72 4   char[] Thread.name
>>>76 4   Thread Thread.threadQ
>>>80 4 Runnable Thread.target
>>>84 4  ThreadGroup Thread.group
>>>88 4  ClassLoader Thread.contextClassLoader
>>>92 4 AccessControlContext Thread.inheritedAccessControlContext
>>>96 4   ThreadLocalMap Thread.threadLocals
>>>   100 4   ThreadLocalMap Thread.inheritableThreadLocals
>>>   104 4   Object Thread.parkBlocker
>>>   108 4Interruptible Thread.blocker
>>>   112 4   Object Thread.blockerLock
>>>   116 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>>   120(object boundary, size estimate)
>>>   VM reports 120 bytes per instance
>>>
>>
>> Assuming current x86 hardware with 64-byte cache line sizes and current
>> class layout, we can see the trailing fields in Thread are providing
>> enough insulation from the false sharing with an adjacent object. Also,
>> the Thread itself is large enough so that two TLRs belonging to
>> different threads will not collide.
>>
>> However the leading fields are not enough: we have a few words which can
>> occupy the same cache line, but belong to another object. This is where
>> things can get worse in two ways: a) the TLR update can make the field
>> access in adjacent object considerably slower; and much worse b) the
>> update in the adjacent field can disturb the TLR state, which is
>> critical for j.u.concurrent performance relying heavily on fast TLR.
>>
>> To illustrate both points, there is a simple benchmark driven by JMH
>> (http://openjdk.java.net/projects/code-tools/jmh/):
>>
>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> On my 2x2 i5-2520M Linux x86_64 laptop, running latest jdk8-tl and
>> Thread with/without @Contended that microbenchmark yields the following
>> results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>>
>> Accessing ThreadLocalRandom.current().nextInt():
>>baseline:932 +-  4 ops/usec
>>@Contended:  927 +- 10 ops/usec
>>
>> Acc

Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-17 Thread Laurent Bourgès
Jim,

thanks for taking an interest in my efforts!
As I got almost no feedback, I felt quite disappointed and was starting to
think that improving Pisces was not important ...

Here are ductus results and comparison (open office format):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ductus_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/compareRef_Patch.ods

test              threads  ops  Tavg     Tmed     stdDev   rms      Med+Stddev  min      max
boulder_17        1        20   73,92%   69,34%   27,98%   69,34%   69,14%      69,81%   146,89%
boulder_17        2        20   110,86%  110,48%  613,80%  112,01%  125,43%     94,71%   136,35%
boulder_17        4        20   135,28%  135,86%  226,61%  136,46%  141,85%     125,14%  111,32%
shp_alllayers_47  1        20   82,25%   82,49%   47,50%   82,48%   82,30%      82,64%   78,08%
shp_alllayers_47  2        20   115,87%  115,67%  315,45%  115,85%  119,89%     109,33%  128,71%
shp_alllayers_47  4        20   218,59%  218,76%  169,38%  218,65%  216,45%     220,17%  206,17%

Ductus vs Patch:
threads:  1 = 80,74%   2 = 120,69%   4 = 205,92%

Reminder: Ref vs Patch:
threads:  1 = 237,71%  2 = 271,68%  4 = 286,15%

Note: I only have 2 cores + HT on my laptop and did not test with more
threads (64, like Andrea).

2013/4/16 Jim Graham 

> If I'm reading this correctly, your patch is faster even for a single
> thread?  That's great news.
>

Not yet, but Ductus is now only 20% faster than my patch single-threaded, and
20% / 2x slower with 2 and 4 threads.
I still hope to beat it by applying a few more optimizations:
- the Renderer.ScanLine iterator / Renderer.endRendering can be improved
- avoid a few more array zero fills (resets) if possible
- add statistics to better set the initial array sizes ...
- use SunGraphics2D to hold a (temporary) hard reference on the
RendererContext (to avoid many ThreadLocal.get() calls)
- cache eviction (WeakReference or SoftReference)?
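The ThreadLocal point in the list above can be sketched as follows. All names (RendererContext, Renderer, processSpan) are hypothetical, not the actual Pisces code; the idea is simply to do the ThreadLocal lookup once per operation and pass the context down explicitly:

```java
/** Per-thread scratch state, fetched once per rendering operation. */
final class RendererContext {
    final float[] edges = new float[256]; // reusable scratch buffer
}

final class Renderer {

    private static final ThreadLocal<RendererContext> CONTEXTS =
            ThreadLocal.withInitial(RendererContext::new);

    static int render(int spans) {
        final RendererContext ctx = CONTEXTS.get(); // single lookup per op
        int work = 0;
        for (int i = 0; i < spans; i++) {
            work += processSpan(ctx, i); // ctx passed explicitly downstream
        }
        return work;
    }

    private static int processSpan(RendererContext ctx, int i) {
        // Inner loop reuses the cached buffers, never touching the
        // ThreadLocal map again.
        ctx.edges[i % ctx.edges.length] = i;
        return 1;
    }
}
```

Holding the reference in SunGraphics2D for the duration of one operation is the same idea, just with the graphics context as the carrier instead of a local variable.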

Why not use divide and conquer (a thread pool) to boost single-thread
rendering if the machine has more CPU cores?
It would be helpful if the AATileGenerator had access to SunGraphics2D to
get rendering hints or share variables (cache ...).

For the moment I did not modify the algorithms themselves, but I will do so to
enhance clipping (dasher segments / stroker) ...


> One of the problems we've had with replacing Ductus is that it has been
> faster in a single thread situation than the open source versions we've
> created.  One of its drawbacks is that it had been designed to take
> advantage of some AA-accelerating hardware that never came to be.  With the
> accelerator it would have been insanely fast, but hardware went in a
> different direction.  The problem was that this early design goal caused
> the entire library to be built around an abstraction layer that allowed for
> a single "tile producer" internally (because there would be only one -
> insanely fast - hardware chip available) and the software version of the
> abstraction layer thus had a lot of native "static" data structures
> (there's only one of me, right?) that prevented MT access.  It was probably
> solvable, but I'd be happier if someone could come up with a faster
> rasterizer, imagining that there must have been some sort of advancements
> in the nearly 2 decades since the original was written.
>
> If I'm misinterpreting and single thread is still slower than Ductus (or
> if it is still slower on some other platforms), then .
>

Not yet: it is slower than Ductus by 20%, but 2x faster than the current
Pisces!


> Also, this is with the Java version, right?


Yes, my patch is pure Java, given as a webrev:
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/


> We got a decent 2x speedup in FX by porting the version of Open Pisces
> that you started with to C code (all except on Linux where we couldn't find
> the right gcc options to make it faster than Hotspot).  So, we have yet to
> investigate a native version in the JDK which might provide even more
> gains...
>

Personally I prefer working on Java code, as HotSpot performs so many
optimizations for free, there are no pointers to deal with and, more
importantly, there are concurrent primitives (thread locals, collections)!

Laurent


>
> On 4/15/13 3:01 AM, Laurent Bourgès wrote:
>
>> Jim, Andrea,
>>
>> I updated MapBench to provide test statistics (avg, median, stddev, rms,
>> med + stddev, min, max) and CSV output (tab separator):
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/
>>
>>
>>
>> Here are the results (OpenJDK8 Ref vs Patched):
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_det.log
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_det.log
>>
>> testthreads ops TavgTmedstdDev  rms
>> Med+Stddev  min max
>> boulder_17  1   20  180,22% 181,08% 1186,01%
>>181,17% 185,92%
>> 176,35%

Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-15 Thread Laurent Bourgès
Jim, Andrea,

I updated MapBench to provide test statistics (avg, median, stddev, rms,
med + stddev, min, max) and CSV output (tab separator):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/


Here are the results (OpenJDK8 Ref vs Patched):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_det.log

test              threads  ops  Tavg     Tmed     stdDev    rms      Med+Stddev  min      max
boulder_17        1        20   180,22%  181,08%  1186,01%  181,17%  185,92%     176,35%  170,36%
boulder_17        2        20   183,15%  183,80%  162,68%   183,78%  183,17%     174,01%  169,89%
boulder_17        4        20   216,62%  218,03%  349,31%   218,87%  226,68%     172,15%  167,54%
shp_alllayers_47  1        20   243,90%  244,86%  537,92%   244,87%  246,39%     240,64%  231,00%
shp_alllayers_47  2        20   286,42%  287,07%  294,87%   287,07%  287,23%     277,19%  272,23%
shp_alllayers_47  4        20   303,08%  302,15%  168,19%   301,90%  295,90%     462,70%  282,41%

PATCH:
test              threads  ops  Tavg      Tmed      stdDev   rms       Med+Stddev  min       max
boulder_17        1        20   110,196   109,244   0,529    109,246   109,773     108,197   129,327
boulder_17        2        40   127,916   127,363   3,899    127,423   131,262     125,262   151,561
boulder_17        4        80   213,085   212,268   14,988   212,796   227,256     155,512   334,407
shp_alllayers_47  1        20   1139,452  1134,858  5,971    1134,873  1140,829    1125,859  1235,746
shp_alllayers_47  2        40   1306,889  1304,598  28,157   1304,902  1332,755    1280,49   1420,351
shp_alllayers_47  4        80   2296,487  2303,81   112,816  2306,57   2416,626    1390,31   2631,455

REF:
test              threads  ops  Tavg      Tmed      stdDev   rms       Med+Stddev  min       max
boulder_17        1        20   198,591   197,816   6,274    197,916   204,091     190,805   220,319
boulder_17        2        40   234,272   234,09    6,343    234,176   240,433     217,967   257,485
boulder_17        4        80   461,579   462,8     52,354   465,751   515,153     267,712   560,254
shp_alllayers_47  1        20   2779,133  2778,823  32,119   2779,009  2810,943    2709,285  2854,557
shp_alllayers_47  2        40   3743,255  3745,111  83,027   3746,031  3828,138    3549,364  3866,612
shp_alllayers_47  4        80   6960,23   6960,948  189,75   6963,533  7150,698    6432,945  7431,541
Linux 64-bit server VM
JVM: -Xms128m -Xmx128m (low mem)

Laurent

2013/4/14 Andrea Aime 

> On Tue, Apr 9, 2013 at 3:02 PM, Laurent Bourgès wrote:
>
>> Dear Java2D members,
>>
>> Could someone review the following webrev concerning Java2D Pisces to
>> enhance its performance and reduce its memory footprint (RendererContext
>> stored in thread local or concurrent queue):
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/
>>
>> FYI I fixed file headers in this patch and signed my OCA 3 weeks ago.
>>
>> Remaining work:
>> - cleanup (comments ...)
>> - statistics to perform auto-tuning
>> - cache / memory cleanup (SoftReference ?): use hints or System
>> properties to adapt it to use cases
>> - another problem: fix clipping performance in Dasher / Stroker for
>> segments out of bounds
>>
>> Could somebody support me ? ie help me working on these tasks or just to
>> discuss on Pisces algorithm / implementation ?
>>
>
> Hi,
> I would like to express my support for this patch.
> Given that micro-benchmarks have already been run, I took the patch for a
> spin in a large, real world benchmark instead,
> the OSGeo WMS Shootout 2010 benchmark, for which you can see the results
> here:
>
> http://www.slideshare.net/gatewaygeomatics.com/wms-performance-shootout-2010
>
> The presentation is long, but suffice it to say all Java based
> implementations took quite the beating due to the
> poor scalability of Ductus with antialiased rendering of vector data (for
> an executive summary just look
> at slide 27 and slide 66, where GeoServer, Oracle MapViewer and
> Constellation SDI were the
> Java based ones)
>
> I took the same tests and run them again on my machine (different hardware
> than the tests, don't try to compare
> the absolute values), using Oracle JDK 1.7.0_17, OpenJDK 8 (a checkout a
> couple of weeks old) and the
> same, but with Laurent's patches applied.
> Here are the results, throughput (in maps generated per second) with the
> load generator (JMeter) going
> up from one client to 64 concurrent clients:
>
> Threads  JDK 1.7.0_17  OpenJDK 8, vanilla  OpenJDK 8 + Pisces improvements  Pisces gain, %
> 1        13,97         12,43               13,03                            4,75%
> 2        22,08         20,60               20,77                            0,81%
> 4        34,36         33,15               33,36                            0,62%
> 8        39,39         40,51               41,71                            2,96%
> 16       40,61         44,57               46,98                            5,39%
> 32       41,41         44,73               48,16                            7,66%
> 64       37,09         42,19               45,28                            7,32%
>
> Well, first of all, congratulations to the JDK developers, don't know what
> changed in JDK 8, but
> GeoServer seems to like it quite a bit :-).
> That said, Laurent's patch also gives a visible boost, especially when
> several concurrent clients are
> asking for the maps.
>
> Mind, as I said, this is no micro-benchmark, it is a real application
> loading doing a lot of I/O
> (from the operating s

Re: sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Anthony,

Here is the updated webrev:
http://jmmc.fr/~bourgesl/share/webrev-8010297.5/


Laurent

2013/4/11 Mandy Chung 

>  On 4/11/13 8:43 AM, Laurent Bourgès wrote:
>
>
> I don't understand if I should fix it or not ?
>
>   src/solaris/classes/sun/awt/X11/XListPeer.java
>> Nit: line 1906 you remove isLoggable call here.  Was it intentional (as
>> it doesn't call concatenate any string?)?  I think it's better to use the
>> pattern consistently.
>>
>
> it's a mistake (cookie).
>
>
> Please fix it.
>
>
>
>
>>  Approved and no need to regenerate a new webrev if you fix the above nit.
>>
>
> To fix it, I need to send you files as a new webrev ?
>
>
> Anthony is going to sponsor for you and I think he asks for a webrev.  So
> please send the latest webrev with this fix then.
>
> Mandy
>
>
> Laurent
>
>
>


Re: sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Mandy,

I'd really like to see if we can avoid the boilerplate "if
(isLoggable(...)) logger.fine()" to help ease of development, and I
filed an RFE (8012006).

Agreed, but there is no easy way to get both clear code and performance:
- maybe the JDK 8 Supplier can help
- or a shorter syntax: logger.isDebug(), or testing its effective level
directly: if (logger.isDebug)
- an annotation syntax: @logCheck, but it is not straightforward to guess the
correct level of the log statement ...
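For reference, a minimal sketch of the JDK 8 Supplier option from the list above, using java.util.logging.Logger (the `dumps` counter is only there to show the message is never built; PlatformLogger is internal and may differ):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * JDK 8 lets Logger.fine(...) take a Supplier<String>: the message is
 * built only if FINE is actually enabled, so no explicit isLoggable()
 * guard and no wasted string concatenation.
 */
public final class LazyLog {

    private static final Logger LOG = Logger.getLogger("demo");

    static int dumps = 0; // counts how often the expensive message is built

    static String expensiveDump() {
        dumps++;
        return "state=..."; // imagine heavy concatenation here
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);          // FINE is disabled
        LOG.fine(LazyLog::expensiveDump);  // supplier is never invoked
    }
}
```

The supplier form keeps the call site as short as the unguarded version while matching the performance of the `isLoggable()` pattern.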


I don't understand: should I fix it or not?

src/solaris/classes/sun/awt/X11/XListPeer.java
> Nit: line 1906 you remove isLoggable call here.  Was it intentional (as it
> doesn't call concatenate any string?)?  I think it's better to use the
> pattern consistently.
>

it's a mistake (cookie).


> Approved and no need to regenerate a new webrev if you fix the above nit.
>

To fix it, do I need to send you the files as a new webrev?

Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-11 Thread Laurent Bourgès
Last idea: I will enhance Andrea's MapBench benchmark to produce statistics
per thread: number of loops, avg, min, max, stddev.

I guess the total bench time is not so representative, as the thread pool can
distribute the workload differently on each test => statistics will help to
get better timings / comparisons between bench runs.

Regards,
Laurent

2013/4/11 Laurent Bourgès 

> Jim and Sergey,
>
> 1/ Here are few benchmarks (based on mapBench again) running several
> modified versions of AAShapePipe:
> http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/
> - ref:
> 1 threads and 20 loops per thread, time: 3742 ms
> 2 threads and 20 loops per thread, time: 4756 ms
> 4 threads and 20 loops per thread, time: 8528 ms
>
> 1 threads and 20 loops per thread, time: 56264 ms
> 2 threads and 20 loops per thread, time: 75566 ms
> 4 threads and 20 loops per thread, time: 141546 ms
>
> - int4:
> 1 threads and 20 loops per thread, time: 3653 ms
> 2 threads and 20 loops per thread, time: 4684 ms
> 4 threads and 20 loops per thread, time: 8291 ms
>
> 1 threads and 20 loops per thread, time: 55950 ms
> 2 threads and 20 loops per thread, time: 74796 ms
> 4 threads and 20 loops per thread, time: 139924 ms
>
> - byte[]:
> 1 threads and 20 loops per thread, time: 3795 ms
> 2 threads and 20 loops per thread, time: 4605 ms
> 4 threads and 20 loops per thread, time: 8246 ms
>
> 1 threads and 20 loops per thread, time: 54961 ms
> 2 threads and 20 loops per thread, time: 72768 ms
> 4 threads and 20 loops per thread, time: 139430 ms
>
> - int4 / byte[] / rectangle cached in TileState:
> 1 threads and 20 loops per thread, time: 3610 ms
> 2 threads and 20 loops per thread, time: 4481 ms
> 4 threads and 20 loops per thread, time: 8225 ms
>
> 1 threads and 20 loops per thread, time: 54651 ms
> 2 threads and 20 loops per thread, time: 74516 ms
> 4 threads and 20 loops per thread, time: 140153 ms
>
> So you may be right, results are varying depending on the optimizations
> (int4, byte or all) !
> Maybe I should test different versions on optimized pisces renderer ...
>
> Here is an updated patch:
> http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/
>
>
> 2/ Thanks for your comments: actually a refactoring is possible to use a
> (shared) TileState instance replacing int[] bbox, rectangle bbox):
> - RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)
>
> it is very interesting here to propose an extensible tile state: maybe
> created by the renderer engine to cache other data ?
>
> - Rectangle and Rectangle2D are only used as the shape s and device
> rectangle given to CompositePipe.startSequence():
> public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
> int[] abox);
>
> Changing this interface may become difficult:
> AlphaColorPipe.java:
>
> 41:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
> OK, [s, dev, abox] unused
>
> AlphaPaintPipe.java
>
> 81:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
> devR,
> create a paint context:
> PaintContext paintContext =
> sg.paint.createContext(sg.getDeviceColorModel(),
>devR,
>s.getBounds2D(),
>sg.cloneTransform(),
>sg.getRenderingHints());
>
> GeneralCompositePipe.java:
>
> 62:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
> devR,
> abox unused and create a paint context:
> PaintContext paintContext =
> sg.paint.createContext(model, devR, s.getBounds2D(),
>sg.cloneTransform(),
>hints);
>
> SpanClipRenderer.java
>
> 68:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle 
> devR,
> Forward to another composite pipe
> return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));
>
> It could be possible to use TileState into PaintContext interface / fix
> implementations but it may become a tricky change (API change).
>
> What do you think ?
>
> Laurent
>
>
> 2013/4/11 Jim Graham 
>
>> I'm pretty familiar with all of this code and there aren't any places
>> that save the tile array that I remember.  The embedded code that Pisces
>> was taken from had some caching of alpha arrays, but we didn't use or keep
>> that when we converted it for use in the JDK...
>>
>> It occurs to me that since you are collecting the various pieces of
>> information into an object to store in the thread local storage, perhaps we
>> should convert

Fwd: sun.awt.X11 logs still using String + (waste)

2013-04-11 Thread Laurent Bourgès
Anthony, Mandy,

here is the 4th patch:
http://jmmc.fr/~bourgesl/share/webrev-8010297.4/

It only contains awt / swing / net classes that use PlatformLogger (no code
modification).

Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-11 Thread Laurent Bourgès
Jim and Sergey,

1/ Here are few benchmarks (based on mapBench again) running several
modified versions of AAShapePipe:
http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/
- ref:
1 threads and 20 loops per thread, time: 3742 ms
2 threads and 20 loops per thread, time: 4756 ms
4 threads and 20 loops per thread, time: 8528 ms

1 threads and 20 loops per thread, time: 56264 ms
2 threads and 20 loops per thread, time: 75566 ms
4 threads and 20 loops per thread, time: 141546 ms

- int4:
1 threads and 20 loops per thread, time: 3653 ms
2 threads and 20 loops per thread, time: 4684 ms
4 threads and 20 loops per thread, time: 8291 ms

1 threads and 20 loops per thread, time: 55950 ms
2 threads and 20 loops per thread, time: 74796 ms
4 threads and 20 loops per thread, time: 139924 ms

- byte[]:
1 threads and 20 loops per thread, time: 3795 ms
2 threads and 20 loops per thread, time: 4605 ms
4 threads and 20 loops per thread, time: 8246 ms

1 threads and 20 loops per thread, time: 54961 ms
2 threads and 20 loops per thread, time: 72768 ms
4 threads and 20 loops per thread, time: 139430 ms

- int4 / byte[] / rectangle cached in TileState:
1 threads and 20 loops per thread, time: 3610 ms
2 threads and 20 loops per thread, time: 4481 ms
4 threads and 20 loops per thread, time: 8225 ms

1 threads and 20 loops per thread, time: 54651 ms
2 threads and 20 loops per thread, time: 74516 ms
4 threads and 20 loops per thread, time: 140153 ms

So you may be right: results vary depending on the optimizations
(int4, byte or all)!
Maybe I should test the different versions on the optimized Pisces renderer ...

Here is an updated patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/


2/ Thanks for your comments: actually a refactoring is possible, using a
(shared) TileState instance to replace the int[] bbox and the Rectangle bbox:
- RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)

It would be very interesting here to propose an extensible tile state, maybe
created by the rendering engine to cache other data?

- Rectangle and Rectangle2D are only used as the shape s and device
rectangle given to CompositePipe.startSequence():
public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
int[] abox);

Changing this interface may become difficult:
AlphaColorPipe.java:
41:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
OK, [s, dev, abox] unused

AlphaPaintPipe.java
81:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
create a paint context:
PaintContext paintContext =
sg.paint.createContext(sg.getDeviceColorModel(),
   devR,
   s.getBounds2D(),
   sg.cloneTransform(),
   sg.getRenderingHints());

GeneralCompositePipe.java:
62:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
abox unused and create a paint context:
PaintContext paintContext =
sg.paint.createContext(model, devR, s.getBounds2D(),
   sg.cloneTransform(),
   hints);

SpanClipRenderer.java
68:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
Forward to another composite pipe
return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));

It would be possible to use TileState in the PaintContext interface and fix
the implementations, but it may become a tricky change (API change).
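A minimal sketch of such a TileState (illustrative names and sizes only, not the actual JDK API): it gathers into one per-sequence object the int[4] abox filled by the tile generator, the bounds currently passed as a Rectangle, and a reusable alpha tile buffer.

```java
/**
 * Hypothetical container for one tile-generation sequence, replacing the
 * loose int[4] abox, Rectangle and per-call byte[] tile allocations.
 */
final class TileState {

    // Shape bounds, previously the int[4] abox filled by the tile generator.
    int x0, y0, x1, y1;

    // Reusable alpha tile (32x32 is a typical tile size, assumed here).
    final byte[] alphaTile = new byte[32 * 32];

    void setBounds(int x0, int y0, int x1, int y1) {
        this.x0 = x0;
        this.y0 = y0;
        this.x1 = x1;
        this.y1 = y1;
    }

    int width()  { return x1 - x0; }
    int height() { return y1 - y0; }
}
```

Managing the lifetime of these objects (thread-local reuse, eviction) then becomes a small, localized concern instead of scattered allocations across the pipeline.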

What do you think ?

Laurent

2013/4/11 Jim Graham 

> I'm pretty familiar with all of this code and there aren't any places that
> save the tile array that I remember.  The embedded code that Pisces was
> taken from had some caching of alpha arrays, but we didn't use or keep that
> when we converted it for use in the JDK...
>
> It occurs to me that since you are collecting the various pieces of
> information into an object to store in the thread local storage, perhaps we
> should convert to a paradigm where an entire Tile Generation sequence uses
> that object "TileState?" as its main way to communicate info around the
> various stages.  Thus, you don't really need an int[4] to store the 4
> parameters, they could be stored directly in the TileState object. This
> would require more sweeping changes to the pipeline, but it might make the
> code a bit more readable (and make the hits to convert over more moot as
> they would be improving readability and give more focus to the
> relationships between all of the various bits of data).  Then it simply
> becomes a matter of managing the lifetime and allocation of the TileState
> objects which is a minor update to the newly refactored code.
>
> ...jim
>
> On 4/10/13 3:59 PM, Sergey Bylokhov wrote:
>
>  On 4/10/13 11:46 PM, Laurent Bourgès wrote:
>>
>>> I see that some methods which take it as argument doesn't use them. And
 most of the time we pass AATileGenerator and abo

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-10 Thread Laurent Bourgès
Sergey,

I am not an expert here but just my 50 cents.
> This optimization shall take place only if it is really hotspot. But if it
> is a really hotspot - probably it would be better to remove these
> array/object allocation at all and use plane bytes?
>

Java2D calls AAShapePipe for each shape (line, rectangle ...) rendered, so
it is a hotspot for me for big drawings, as it depends on the drawing
complexity (for example, Andrea's MapBench can produce maps having more than
100 000 shapes per image ...)


> I see that some methods which take it as argument doesn't use them. And
> most of the time we pass AATileGenerator and abox[] to the same methods, so
> it could be merged?
>

For now I did not want to modify the AAShapePipe signatures: abox[] is
filled by the AATileGenerator implementations (Ductus, Pisces, others) in
order to get the shape bounds and render only the tiles covering this area.


>
> Also I suggest to use jmh for java microbenchmarks.
> http://openjdk.java.net/projects/code-tools/jmh
> So your test will be:
> http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java
>

Thanks,
I will try it asap

Laurent

>
>
>
> --
> Best regards, Sergey.
>
>


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-10 Thread Laurent Bourgès
Peter,

1/ I modified your TestRunner class to print the total ops and perform an
explicit GC before runTests:
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

2/ I applied your advice but it does not change much:

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
 >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # ContextGetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 13,84 ns/op (σ =   0,23
ns/op), Total ops =   2889056179 [13,93 (717199825), 13,87
(720665624), 13,48 (741390545), 14,09 (709800185)]
 #   4 threads, Tavg = 14,25 ns/op (σ =   0,57
ns/op), Total ops =   2811615084 [15,21 (658351236), 14,18
(706254551), 13,94 (718202949), 13,74 (728806348)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total
ops =   1678357614 [ 5,96 (1678357614)]
 2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total
ops =   2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
 3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total
ops =   2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
 4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total
ops =   2589897733 [17,05 (586353618), 19,23 (519345153), 17,88
(559401974), 10,81 (924796988)]

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:-TieredCompilation -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
 >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # GetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 31,56 ns/op (σ =   0,43
ns/op), Total ops =   1267295706 [31,30 (319512554), 31,02
(32229), 32,12 (311334550), 31,82 (314155269)]
 #   4 threads, Tavg = 30,75 ns/op (σ =   1,81
ns/op), Total ops =   1302123211 [32,21 (310949394), 32,37
(309275124), 27,87 (359125007), 31,01 (322773686)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  8,36 ns/op (σ =   0,00 ns/op), Total
ops =   1196238323 [ 8,36 (1196238323)]
 2 threads, Tavg = 14,95 ns/op (σ =   0,04 ns/op), Total
ops =   1337648720 [15,00 (666813210), 14,91 (670835510)]
 3 threads, Tavg = 20,65 ns/op (σ =   0,99 ns/op), Total
ops =   1453119707 [19,57 (511100480), 21,97 (455262170), 20,54
(486757057)]
 4 threads, Tavg = 30,76 ns/op (σ =   0,54 ns/op), Total
ops =   1301090278 [31,51 (317527231), 30,79 (324921525), 30,78
(325063322), 29,99 (333578200)]
 #
 << JVM END

3/ I tried several heap settings: without Xms/Xmx ... but it has almost no
effect.

*Should I play with the TLAB resize / initial size, or with a different GC
collector (G1 ...)?

Can anybody explain to me what the PrintTLAB output means?*

Laurent

2013/4/10 Peter Levart 

>  Hi Laurent,
>
> Could you disable tiered compilation for performance tests? Tiered
> compilation is usually a source of jitter in the results. Pass
> -XX:-TieredCompilation to the VM.
>
> Regards, Peter
>
>
>
> On 04/10/2013 10:58 AM, Laurent Bourgès wrote:
>
>  Dear Jim,
>
> 2013/4/9 Jim Graham 
>
>>
>> The allocations will always show up on a heap profiler, I don't know of
>> any way of having them not show up if they are stack allocated, but I don't
>> think that stack allocation is the issue here - small allocations come out
>> of a fast generation that costs almost nothing to allocate from and nearly
>> nothing to clean up.  They are actually getting allocated and GC'd, but the
>> process is optimized.
>>
>> The only way to tell is to benchmark and see which changes make a
>> difference and which are in the noise (or, in some odd counter-intuitive
>> cases, counter-productive)...
>>
>> ...jim
>>
>
> I advocate I like GC because it avoids in Java dealing with pointers like
> C/C++ does; however, I prefer GC clean real garbage (application...) than
> wasted memory:
> I prefer not count on GC when I can avoid w

Re: sun.awt.X11 logs still using String + (waste)

2013-04-10 Thread Laurent Bourgès
Anthony or Mandy,
Could you ask the JMX / security groups to review my patch for me?

I am currently not registered to these mailing lists.

Are you asking me to split the patch in two parts: PlatformLogger on one side
and Logger on the other?

Laurent

2013/4/9 Mandy Chung 

>  On 4/9/13 12:37 AM, Laurent Bourgès wrote:
>
> Mandy,
>
> first I would like to have the given patch applied to OpenJDK 8 (+ JDK7u)
> as it fixes many problems:
> - string concatenations
> - object creations (Object[] or other objects given as params)
> - method calls in log statements
>
>  This is the patch you refer to:
> http://jmmc.fr/~bourgesl/share/webrev-8010297.3/
>
> I agree that we should separate the fix to reduce the memory usage from
> the fix to convert JUL to PlatformLogger.  I skimmed on your patch -
> awt/swing uses PlatformLogger and your change looks fine.  You have also
> got Anthony's approval.  Other component security, jmx, snmp, etc are using
> JUL.  I would suggest to fix the use of PlatformLogger in your first patch
> and AWT/Swing is the main area that 8010297 is concerned about so that you
> can resolve 8010297 soon.
>
> If you want to move ahead to fix use of JUL, you can send your patch to
> serviceability-...@openjdk.java.net and security-...@openjdk.java.net for
> the other areas to review and sponsor.
>
> Hope this helps.
> Mandy
>
>
> In a second step, I can help somebody migrating JUL usages to
> PlatformLogger but it requires PlatformLogger API changes (logp to give
> class and method names)...
>
> Other comments below:
>
>>
>> Peter's idea is a good one to add a couple of convenient methods to take
>> Object parameters that will avoid the creation of array instances.  I'd
>> also like to know what variants being used in the jdk to determine the pros
>> and cons of the alternatives (this issue also applies to java.util.logging).
>>
>
> 50% of given object parameters are created via method calls so it will not
> be enough.
>
> Moreover, I prefer having always isLoggable() calls to avoid me looking
> and understanding special cases (more than 2 args or no string concat ...);
> besides, it would be easier to implement code checks testing missing
> isLoggable() calls vs conditional and specific checks (string concat, more
> then 2 args, method calls ...)
>
> Finally, I prefer having shorter and clearer method names like isFine(),
> isFinest() instead of isLoggable(PlatformLogger.FINE) that is quite "ugly"
> ...
>
>
>>  It was intentional to have PlatformLogger define only the useful methods
>> necessary for jdk use.   The goal of PlatformLogger was to provide a
>> light-weight utility for jdk to log debugging messages and eliminate the
>> dependency to java.util.logging and avoid the unnecessary j.u.logging
>> initialization (as disabled by default).   It was not a goal for
>> PlatformLogger to be an exact mirror as java.util.logging.Logger.  My
>> advice is to tune PlatformLogger to provide API that helps the platform
>> implementation while avoid bloating the API.
>>
>
> Agreed.
>
> Maybe JUL Logger.logp(classname, methodname) usages should be rewritten to
> use PlatformLogger.getCallerInfo() instead:
> // Returns the caller's class and method's name; best effort
> // if cannot infer, return the logger's name.
> private String getCallerInfo() {
> String sourceClassName = null;
> String sourceMethodName = null;
>
> JavaLangAccess access = SharedSecrets.getJavaLangAccess();
> Throwable throwable = new Throwable();
> int depth = access.getStackTraceDepth(throwable);
>
> String logClassName = "sun.util.logging.PlatformLogger";
> boolean lookingForLogger = true;
> for (int ix = 0; ix < depth; ix++) {
> // Calling getStackTraceElement directly prevents the VM
> // from paying the cost of building the entire stack frame.
> StackTraceElement frame =
> access.getStackTraceElement(throwable, ix);
> String cname = frame.getClassName();
> if (lookingForLogger) {
> // Skip all frames until we have found the first logger frame.
> if (cname.equals(logClassName)) {
> lookingForLogger = false;
> }
> } else {
> if (!cname.equals(logClassName)) {
> // We've found the relevant frame.
> sourceClassName = cname;
>

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-10 Thread Laurent Bourgès
Dear Jim,

2013/4/9 Jim Graham 

>
> The allocations will always show up on a heap profiler, I don't know of
> any way of having them not show up if they are stack allocated, but I don't
> think that stack allocation is the issue here - small allocations come out
> of a fast generation that costs almost nothing to allocate from and nearly
> nothing to clean up.  They are actually getting allocated and GC'd, but the
> process is optimized.
>
> The only way to tell is to benchmark and see which changes make a
> difference and which are in the noise (or, in some odd counter-intuitive
> cases, counter-productive)...
>
> ...jim
>

I admit I like GC because it means Java avoids dealing with pointers the way
C/C++ does; however, I would rather the GC clean up real application garbage
than wasted memory:
I prefer not to count on the GC when I can avoid wasting memory that gives
the GC more work = reduce useless garbage (save the planet)!

Moreover, GC and / or Thread local allocation (TLAB) seems to have more
overhead than you think = "fast generation that costs almost nothing to
allocate from and nearly nothing to clean up".

Here are my micro-benchmark results related to int[4] allocation where I
mimic the AAShapePipe.fillParallelogram() method:
  Threads   Patch     Ref      Gain
  1          5,96     8,27   138,76%
  2          7,31    14,96   204,65%
  3         10,65    20,40   191,55%
  4         15,44    29,83   193,20%
Test environment:
Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
JVM settings:
-XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m  -Xmx128m

Benchmark code (using Peter Levart microbench classes):
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

My conclusion is:  "nothing" > zero (allocation + cleanup) and it is very
noticeable in multi threading tests.

I admit that I use a dirty int[4] array (no cleanup) even though it is not
strictly needed: maybe part of the performance gain comes from that.


Finally here is the output with  -XX:+PrintTLAB flag:
TLAB: gc thread: 0x7f105813d000 [id: 4053] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 20
waste  1,2% gc: 323712B slow: 600B fast: 0B
TLAB: gc thread: 0x7f105813a800 [id: 4052] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste
7,9% gc: 745568B slow: 176B fast: 0B
TLAB: gc thread: 0x7f1058138800 [id: 4051] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 15
waste  3,1% gc: 618464B slow: 448B fast: 0B
TLAB: gc thread: 0x7f1058136800 [id: 4050] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste
0,0% gc: 0B slow: 232B fast: 0B
TLAB: gc thread: 0x7f1058009000 [id: 4037] desired_size: 1312KB slow
allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 1 waste
27,5% gc: 369088B slow: 0B fast: 0B
TLAB totals: thrds: 5  refills: 50 max: 20 slow allocs: 0 max 0 waste:
3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B

I would have expected that TLAB can recycle all useless int[4] arrays as
fast as possible => waste = 100% ???

*Is there a bug in TLAB (core-libs)?
Should I send this issue to the hotspot team?
*

*Test using ThreadLocal AAShapePipeContext:*
{
AAShapePipeContext ctx = getThreadContext();
int abox[] = ctx.abox;

// use array:
// mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,
dx1, dy1, dx2, dy2, 0, 0, clip, abox);
abox[0] = 7;
abox[1] = 11;
abox[2] = 13;
abox[3] = 17;

// mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
devNull1.yield(abox);

if (!useThreadLocal) {
restoreContext(ctx);
}
}

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
>> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
#   4 threads, Tavg = 13,84 ns/op (σ =   0,23 ns/op), Total ops
=   2889056179 [13,93 (717199825), 13,87 (720665624), 13,48
(741390545), 14,09 (709800185)]
#   4 threads, Tavg = 14,25 ns/op (σ =   0,57 ns/op), Total ops
=   2811615084 [15,21 (658351236), 14,18 (706254551), 13,94
(718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ...
cleanup done.
# Measure:
1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total ops =
1678357614 [ 5,96 (1678357614)]
2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total ops =
2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total ops =
2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total ops =
2589897733 [17,05 (586353618), 19,23 (519

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-10 Thread Laurent Bourgès
Andrea,
I am running benchmarks on my laptop (i7 - 2 core 2.8Ghz + HT => 4 virtual
cpus) on linux 64 (fedora 14).
Note: I always use cpufrequtils to set the cpu governor to performance and
use fixed frequency = 2.8Ghz:
[bourgesl@jmmc-laurent ~]$ uname -a
Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP Wed
Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

2013/4/10 Andrea Aime 

> On Tue, Apr 9, 2013 at 7:34 PM, Laurent Bourgès  > wrote:
>
>> Also, this should be tested on multiple platforms, preferably Linux,
>> Windows and Mac to see how it is affected by differences in the platform
>> runtimes and threading (hopefully minimal).
>>
>> It appears more difficult for me: I can use at work a mac 10.8 and I can
>> run Windows XP within virtual box (but it is not very representative).
>>
>
> I believe I can run MapBench on my Linux 64bit box during the next
> weekend, that would add a platform, and one were the
> server side behavior is enabled by default. And hopefully run the other
> benchmark as well.
>

I also run J2DBench, but I can also try the Java2D demos to perform
regression tests.


>
> Laurent, have you made any changes to MapBench since I've sent it to you?
>

Yes, I fixed it a bit (cached BasicStroke, reused BufferedImage / Graphics)
and added an explicit GC before the tests (same initial conditions):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Look at MapBench-src.zip
(http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip)
for the test changes.

Thanks for your efforts,
Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

2013-04-09 Thread Laurent Bourgès
Dear Jim,

I admit I only looked at the NetBeans memory profiler's output: no more
megabytes allocated!

The main question is: how can I know how the GC / HotSpot deals with such
small allocations? Is there any JVM flag to enable to see the real
allocations, as jmap -histo does?

>
> Quick questions - which benchmarks were run before/after?  I see a lot of
> benchmark running in your Pisces improvement thread, but but none here.
>

Agreed; I can try running J2DBench on this fix only. I generally run
Andrea's MapBench as it appears more complex and uses multiple threads.


> Also, this should be tested on multiple platforms, preferably Linux,
> Windows and Mac to see how it is affected by differences in the platform
> runtimes and threading (hopefully minimal).
>

It appears more difficult for me: I can use at work a mac 10.8 and I can
run Windows XP within virtual box (but it is not very representative).

Don't you have any test platform at Oracle to perform such tests /
benchmarks?


> Finally, Hotspot is supposed to deal very well for small thread-local
> allocations like the int[4] and Rectangle2D that you optimized.  Was it
> necessary to cache those at all?  I'm sure the statistics for the
> allocations show up in a memory profile, but that doesn't mean it is
> costing us anything - ideally such small allocations are as fast as free
> and having to deal with caching them in a context will actually lose
> performance.  It may be that the tile caching saved enough that it might
> have masked unnecessary or detrimental changes for the smaller objects...


I repeat my question: how can I know at runtime how HotSpot optimizes the
AAShapePipe code (allocations ...)? Can HotSpot do stack allocation?
Is this explained somewhere (allocation size threshold)?

Maybe verbose:gc output may help ?

Finally, I spent a lot of time on the Pisces renderer, running MapBench to
show the performance gains.

Thanks for your interesting feedback,

Laurent

On 4/5/2013 5:20 AM, Laurent Bourgès wrote:

> Dear java2d members,
>
> I figured out some troubles in java2d.pipe.AAShapePipe related to both
> concurrency & memory usage:
> - concurrency issue related to static theTile field: only 1 tile is cached
> so a new byte[] is created for other threads at each call to renderTile()
> - excessive memory usage (byte[] for tile, int[] and rectangle): at each
> call to renderPath / renderTiles, several small objects are created (never
> cached) that leads to hundreds megabytes that GC must deal with
>
> Here are profiling screenshots:
> - 4 threads drawing on their own buffered image (MapBench test):
> http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png
>
> - excessive int[] / Rectangle creation:
> http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
> http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png
>
> Here is the proposed patch:
> http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/
>
> I applied a simple solution = use a ThreadLocal or ConcurrentLinkedQueue
> (see useThreadLocal flag) to cache one AAShapePipeContext per thread (2K
> max).
> As its memory footprint is very small, I recommend using ThreadLocal.
>
> Is it necessary to use Soft/Weak reference to avoid excessive memory usage
> for such cache ?
>
> Is there any class dedicated to such a cache (a ThreadLocal with cache
> eviction, or a ConcurrentLinkedQueue using WeakReference)?
> I think it could be very useful at the JDK level to have such a feature (i.e.
> a generic "GC friendly" cache)
>
> Regards,
> Laurent
>
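A minimal sketch of the per-thread, GC-friendly cache discussed in the quoted message, combining ThreadLocal with a SoftReference so the GC can still reclaim an idle context under memory pressure. RenderContext and all the names here are illustrative stand-ins, not the actual AAShapePipeContext:

```java
import java.lang.ref.SoftReference;

// Sketch of a per-thread, GC-friendly context cache.
public final class ContextCache {
    static final class RenderContext {
        final int[] abox = new int[4];   // reused bounds array
        byte[] tile = new byte[32 * 32]; // reused tile buffer
    }

    // The SoftReference lets the GC reclaim an idle context under memory
    // pressure; the next get() then simply re-allocates it.
    private static final ThreadLocal<SoftReference<RenderContext>> CTX =
            new ThreadLocal<>();

    public static RenderContext get() {
        SoftReference<RenderContext> ref = CTX.get();
        RenderContext ctx = (ref != null) ? ref.get() : null;
        if (ctx == null) {
            ctx = new RenderContext();
            CTX.set(new SoftReference<>(ctx));
        }
        return ctx;
    }

    public static void main(String[] args) {
        // The same thread gets the same cached instance back.
        System.out.println(ContextCache.get() == ContextCache.get());
    }
}
```

The trade-off is exactly the one raised above: a plain ThreadLocal keeps the context alive as long as the thread, while the SoftReference wrapper bounds the waste at the cost of occasional re-allocation.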


[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

2013-04-09 Thread Laurent Bourgès
Dear Java2D members,

Could someone review the following webrev concerning Java2D Pisces to
enhance its performance and reduce its memory footprint (RendererContext
stored in thread local or concurrent queue):
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

FYI I fixed file headers in this patch and signed my OCA 3 weeks ago.

Remaining work:
- cleanup (comments ...)
- statistics to perform auto-tuning
- cache / memory cleanup (SoftReference ?): use hints or System properties
to adapt it to use cases
- another problem: fix clipping performance in Dasher / Stroker for
segments out of bounds

Could somebody support me? I.e. help me work on these tasks, or just
discuss the Pisces algorithm / implementation?

Best regards,
Laurent Bourgès


Re: sun.awt.X11 logs still using String + (waste)

2013-04-09 Thread Laurent Bourgès
Mandy,

first I would like to have the given patch applied to OpenJDK 8 (+ JDK7u)
as it fixes many problems:
- string concatenations
- object creations (Object[] or other objects given as params)
- method calls in log statements

In a second step, I can help somebody migrating JUL usages to
PlatformLogger but it requires PlatformLogger API changes (logp to give
class and method names)...

Other comments below:

>
> Peter's idea is a good one to add a couple of convenient methods to take
> Object parameters that will avoid the creation of array instances.  I'd
> also like to know what variants being used in the jdk to determine the pros
> and cons of the alternatives (this issue also applies to java.util.logging).
>

50% of the given object parameters are created via method calls, so it will
not be enough.

Moreover, I prefer always having isLoggable() calls, to avoid having to look
at and understand special cases (more than 2 args, or no string concat ...);
besides, it would be easier to implement code checks testing for missing
isLoggable() calls than conditional and specific checks (string concat, more
than 2 args, method calls ...)

Finally, I prefer having shorter and clearer method names like isFine(),
isFinest() instead of isLoggable(PlatformLogger.FINE) that is quite "ugly"
...
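To illustrate the guarding point with a self-contained sketch (using java.util.logging so it compiles standalone; the expensive argument is simulated): an unguarded call still evaluates its arguments, and builds the varargs Object[], even when the level is disabled, while the isLoggable() guard skips all of that.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public final class GuardedLogging {
    private static final Logger LOG = Logger.getLogger("demo");
    static int computeCalls = 0;

    // Stands in for a costly toString() or method call used as a log param.
    static String expensive() {
        computeCalls++;
        return "details";
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE is disabled

        // Unguarded: the argument is still computed (and the varargs array
        // still allocated) even though nothing is logged.
        LOG.log(Level.FINE, "state = {0}", expensive());

        // Guarded: nothing is computed or allocated when FINE is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.log(Level.FINE, "state = {0}", expensive());
        }

        System.out.println(computeCalls); // only the unguarded call paid the cost
    }
}
```

A convenience name like isFine() would make the guard shorter, but the allocation argument is the same either way.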


> It was intentional to have PlatformLogger define only the useful methods
> necessary for jdk use.   The goal of PlatformLogger was to provide a
> light-weight utility for jdk to log debugging messages and eliminate the
> dependency to java.util.logging and avoid the unnecessary j.u.logging
> initialization (as disabled by default).   It was not a goal for
> PlatformLogger to be an exact mirror as java.util.logging.Logger.  My
> advice is to tune PlatformLogger to provide API that helps the platform
> implementation while avoid bloating the API.
>

Agreed.

Maybe JUL Logger.logp(classname, methodname) usages should be rewritten to
use PlatformLogger.getCallerInfo() instead:
// Returns the caller's class and method's name; best effort
// if cannot infer, return the logger's name.
private String getCallerInfo() {
String sourceClassName = null;
String sourceMethodName = null;

JavaLangAccess access = SharedSecrets.getJavaLangAccess();
Throwable throwable = new Throwable();
int depth = access.getStackTraceDepth(throwable);

String logClassName = "sun.util.logging.PlatformLogger";
boolean lookingForLogger = true;
for (int ix = 0; ix < depth; ix++) {
// Calling getStackTraceElement directly prevents the VM
// from paying the cost of building the entire stack frame.
StackTraceElement frame =
access.getStackTraceElement(throwable, ix);
String cname = frame.getClassName();
if (lookingForLogger) {
// Skip all frames until we have found the first logger frame.
if (cname.equals(logClassName)) {
lookingForLogger = false;
}
} else {
if (!cname.equals(logClassName)) {
// We've found the relevant frame.
sourceClassName = cname;
sourceMethodName = frame.getMethodName();
break;
}
}
}

if (sourceClassName != null) {
return sourceClassName + " " + sourceMethodName;
} else {
return name;
}
}
}


>
> Since you are touching some jdk files that use java.util.logging, would
> you like to contribute to convert them to use PlatformLogger:
>http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7054233
>
> It's perfectly fine to convert only some of them if not all.
>

I want to help, but as you said it should be done in collaboration because
it requires API changes (JMX, RMI ...); I would prefer to do it after this
patch is reviewed and pushed, in the coming weeks.


>
> Jim Gish is the owner of logging and will check with him if he has cycles
> to work with you on this.
>

Great.

Cheers,
Laurent


Re: sun.awt.X11 logs still using String + (waste)

2013-04-08 Thread Laurent Bourgès
Anthony,

here is an updated patch concerning both PlatformLogger and Logger 'bad'
usages:
http://jmmc.fr/~bourgesl/share/webrev-8010297.3/

It now impacts AWT, Swing, JMX, security, networking, and Apache XML.

Thanks for the review,
Laurent

2013/4/8 Laurent Bourgès 

> Peter, Mandy,
>
> I think it would be great to make PlatformLogger very similar to Logger
> API:
> I mean JUL Logger already has only 1 convenient method with 1 param so it
> may be great to have the same API in PlatformLogger too: maybe extend it to
> 2 or 3 params ...
>
> Peter, could you look at the API differences between Logger /
> PlatformLogger to make PlatformLogger API more similar to JUL Logger ?
>
> Laurent
>
>
> 2013/4/8 Peter Levart 
>
>>  On 04/07/2013 07:11 PM, Laurent Bourgès wrote:
>>
>> Hi
>>
>> I started fixing All PlatformlLogger "bad" usages and I am fixing also
>> many jul Logger usages without isLoggable ...
>> That represents a lot of changes and a very boring job.
>>
>> I think sending a webrew tomorrow.
>>
>>
>> Hi Laurent,
>>
>> Since you're already deep in the task, you might have a rough feeling
>> what part of such unguarded log statements are of the following type:
>>
>> logger.[fine,info,...]("a message with {0} and {1} placeholders",
>> someArg1, someArg2);
>>
>> where someArg1, ... are some values that are already present in the
>> context of the logging statement and don't have to be computed just for
>> satisfying the logging (not even boxing). Such usages could be effectively
>> optimized by adding some overloaded methods to PlatformLogger that take 1,
>> 2, 3, ... arguments:
>>
>> class PlatformLogger {
>>
>> ...
>>
>> public void finest(String msg, Object param1) {
>> if (isLoggable(Level.FINEST)) {
>> loggerProxy.doLog(Level.FINEST, msg, param1);
>> }
>> }
>>
>> public void finest(String msg, Object param1, Object param2) {
>> if (isLoggable(Level.FINEST)) {
>> loggerProxy.doLog(Level.FINEST, msg, param1, param2);
>> }
>> }
>>
>> public void finest(String msg, Object param1, Object param2, Object
>> param3) {
>> if (isLoggable(Level.FINEST)) {
>> loggerProxy.doLog(Level.FINEST, msg, param1, param2, param3);
>> }
>> }
>>
>> ...
>>
>> This would effectively guard creation of the arguments array with an
>> isLoggable check for some common usages, eliminating the need to explicitly
>> guard such statements. Of course, the user would have to be aware of when
>> such unguarded logging statement is without overhead and when not...
>>
>> How do you feel about such API extension?
>>
>>
>> Regards, Peter
>>
>>
>>
>>  Laurent
>> On 4 Apr 2013 14:08, "Laurent Bourgès"  wrote:
>>
>>> Ok, I can provide an updated patch after finding / fixing all usages.
>>>
>>> Probably in 1 or 2 days,
>>>
>>> Laurent
>>>
>>> 2013/4/4 Anthony Petrov 
>>>
>>>> Yes, this makes sense. Do you want to update your fix for 8010297 to
>>>> include these changes as well?
>>>>
>>>> --
>>>> best regards,
>>>> Anthony
>>>>
>>>>
>>>> On 04/04/13 15:47, Laurent Bourgès wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>>  I just realized there is another problem with PlatformLogger log
>>>>> statements:
>>>>> XBaseWindow:
>>>>>  public boolean grabInput() {
>>>>>  grabLog.fine("Grab input on {0}", this);
>>>>> ...
>>>>> }
>>>>>
>>>>> This calls the PlatformLogger.fine( varargs):
>>>>>  public void fine(String msg, Object... params) {
>>>>>  logger.doLog(FINE, msg, params);
>>>>>  }
>>>>>
>>>>> Doing so, the JVM creates a new Object[] instance to provide params as
>>>>> varargs.
>>>>>
>>>>> I would recommend using isLoggable() test to avoid such waste if the
>>>>> log
>>>>> is disabled (fine, finer, finest ...).
>>>>>
>>>>> Do you want me to provide a new patch related to this problem ?
>>>>>
>>>>> Does somebody have an idea to automatically analyze the JDK code an

Re: sun.awt.X11 logs still using String + (waste)

2013-04-08 Thread Laurent Bourgès
Peter, Mandy,

I think it would be great to make PlatformLogger very similar to the Logger
API: JUL Logger already has a convenience method taking a single param, so it
would be good to have the same API in PlatformLogger too, maybe extended to
2 or 3 params ...

Peter, could you look at the API differences between Logger /
PlatformLogger to make PlatformLogger API more similar to JUL Logger ?

Laurent

2013/4/8 Peter Levart 

>  On 04/07/2013 07:11 PM, Laurent Bourgès wrote:
>
> Hi
>
> I started fixing All PlatformlLogger "bad" usages and I am fixing also
> many jul Logger usages without isLoggable ...
> That represents a lot of changes and a very boring job.
>
> I think sending a webrew tomorrow.
>
>
> Hi Laurent,
>
> Since you're already deep in the task, you might have a rough feeling what
> part of such unguarded log statements are of the following type:
>
> logger.[fine,info,...]("a message with {0} and {1} placeholders",
> someArg1, someArg2);
>
> where someArg1, ... are some values that are already present in the
> context of the logging statement and don't have to be computed just for
> satisfying the logging (not even boxing). Such usages could be effectively
> optimized by adding some overloaded methods to PlatformLogger that take 1,
> 2, 3, ... arguments:
>
> class PlatformLogger {
>
> ...
>
> public void finest(String msg, Object param1) {
> if (isLoggable(Level.FINEST)) {
> loggerProxy.doLog(Level.FINEST, msg, param1);
> }
> }
>
> public void finest(String msg, Object param1, Object param2) {
> if (isLoggable(Level.FINEST)) {
> loggerProxy.doLog(Level.FINEST, msg, param1, param2);
> }
> }
>
> public void finest(String msg, Object param1, Object param2, Object
> param3) {
> if (isLoggable(Level.FINEST)) {
> loggerProxy.doLog(Level.FINEST, msg, param1, param2, param3);
> }
> }
>
> ...
>
> This would effectively guard creation of the arguments array with an
> isLoggable check for some common usages, eliminating the need to explicitly
> guard such statements. Of course, the user would have to be aware of when
> such unguarded logging statement is without overhead and when not...
>
> How do you feel about such API extension?
>
>
> Regards, Peter
>
>
>
>  Laurent
> On 4 Apr 2013 14:08, "Laurent Bourgès"  wrote:
>
>> Ok, I can provide an updated patch after finding / fixing all usages.
>>
>> Probably in 1 or 2 days,
>>
>> Laurent
>>
>> 2013/4/4 Anthony Petrov 
>>
>>> Yes, this makes sense. Do you want to update your fix for 8010297 to
>>> include these changes as well?
>>>
>>> --
>>> best regards,
>>> Anthony
>>>
>>>
>>> On 04/04/13 15:47, Laurent Bourgès wrote:
>>>
>>>> Dear all,
>>>>
>>>>  I just realized there is another problem with PlatformLogger log
>>>> statements:
>>>> XBaseWindow:
>>>>  public boolean grabInput() {
>>>>  grabLog.fine("Grab input on {0}", this);
>>>> ...
>>>> }
>>>>
>>>> This calls the PlatformLogger.fine( varargs):
>>>>  public void fine(String msg, Object... params) {
>>>>  logger.doLog(FINE, msg, params);
>>>>  }
>>>>
>>>> Doing so, the JVM creates a new Object[] instance to provide params as
>>>> varargs.
>>>>
>>>> I would recommend using isLoggable() test to avoid such waste if the log
>>>> is disabled (fine, finer, finest ...).
>>>>
>>>> Do you want me to provide a new patch related to this problem ?
>>>>
>>>> Does somebody have an idea to automatically analyze the JDK code and
>>>> detect missing isLoggable statements ...
>>>>
>>>> Regards,
>>>> Laurent
>>>>
>>>> 2013/4/3 Laurent Bourgès >>>  <mailto:bourges.laur...@gmail.com>>
>>>>
>>>>
>>>> Anthony,
>>>>
>>>> could you tell me once it is in the OpenJDK8 repository ?
>>>> I would then perform again profiling tests to check if there is no
>>>> more missing isLoggable() statements.
>>>>
>>>> Once JMX and other projects switch to PlatformLogger, I could check
>>>> again.
>>>>
>>>> Maybe I could write a small java code checker (pmd rule) to test if
>>>&

Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-08 Thread Laurent Bourgès
Andrea,

Any feedback from anybody else ?

Here are J2D Bench results:
http://jmmc.fr/~bourgesl/share/java2d-pisces/j2DBench/

Depending on the test, the performance gains vary from 20% to 100%!

It would be nice if you could perform tests (regression and benchmarks,
using MapBench, J2DBench, and the Java2D demos).
I still have to clean up / tune my code (stats, initial array sizes ...) and
the cache eviction (memory cleanup), using weak references or manual cleanup
based on timestamps ...

To test for regressions, I propose enhancing your MapBench class to save the
image results (randomly during the tests) and compare them after the test
run (image diff), even when multiple threads are in use.
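The proposed image-diff regression check could be sketched like this. It is a hypothetical, minimal version: render() stands in for a MapBench drawing pass, and a real harness would compare a reference renderer against the patched one:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Sketch of the proposed regression check: render the same scene twice
// (e.g. reference vs. patched renderer) and diff the resulting pixels.
public final class ImageDiff {
    // Stand-in for a MapBench drawing pass (works headless).
    static BufferedImage render() {
        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.RED);
        g.fillOval(8, 8, 40, 40);
        g.dispose();
        return img;
    }

    // Returns the number of pixels that differ between two same-sized images.
    static int diff(BufferedImage a, BufferedImage b) {
        int count = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Identical render passes must produce a zero-pixel diff.
        System.out.println(diff(render(), render()));
    }
}
```

Saving the failing pair to disk (e.g. via ImageIO.write) would then make visual inspection of any non-zero diff easy.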

If you apply my patch, take care of the following main settings:
useThreadLocal should be disabled in a web container (to avoid too much
memory waste), as a RendererContext represents ~ 2 MB:
rowAARLE = new int[INITIAL_Y_ARRAY][INITIAL_ARRAY]; // ~2 MB (+1 to avoid
recycling such shared arrays)

PiscesConst:
/** use ThreadLocal or ConcurrentLinkedQueue to get the thread
RendererContext */
static final boolean useThreadLocal = true;

Initial RendererContext array capacity:
// +1 to avoid recycling such shared arrays
static final int INITIAL_ARRAY = 256 + 1;
static final int INITIAL_Y_ARRAY = 2048 + 1;
static final int INITIAL_LARGE_ARRAY = 16384 + 1; // large

Laurent

2013/4/7 Andrea Aime 

> On Fri, Apr 5, 2013 at 4:32 PM, Laurent Bourgès  > wrote:
>
>> Dear all,
>>
>> Here is my first pisces (beta) patch (webrev):
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/
>>
>> I succeed in fixing memory usage by having only 1 pisces instance
>> (Renderer, stroker, iterator ...) per RendererContext (GC friendly).
>>
>> Impressive results between small and large drawings:
>>
>
> Good stuff. Is there anything I can do to help? I do have an up to date
> version of JDK 8 sources on my disk, maybe I could
> try to apply the patch and take it for a spin in a real GeoServer and see
> how it goes.
>
> By the way, wondering, is there any other benchmark to try out?
> A possibly interesting one is this, where the author re-coded some
> selected methods of Graphics2D for maximum performance
> (but sometimes lower antialiasing quality) and created a applet to compare
> the results in a number of synthetic test cases:
> http://www.randelshofer.ch/oop/graphics/
>
> Cheers
> Andrea
>
>
> --
> ==
> Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
> information.
> ==
>
> Ing. Andrea Aime
> @geowolf
> Technical Lead
>
> GeoSolutions S.A.S.
> Via Poggio alle Viti 1187
> 55054  Massarosa (LU)
> Italy
> phone: +39 0584 962313
> fax: +39 0584 1660272
> mob: +39  339 8844549
>
> http://www.geo-solutions.it
> http://twitter.com/geosolutions_it
>
> ---
>


Re: sun.awt.X11 logs still using String + (waste)

2013-04-07 Thread Laurent Bourgès
Hi

I started fixing all the "bad" PlatformLogger usages, and I am also fixing
many j.u.l. Logger usages that lack isLoggable() guards ...
That represents a lot of changes and very tedious work.

I expect to send a webrev tomorrow.

Laurent
Le 4 avr. 2013 14:08, "Laurent Bourgès"  a
écrit :

> Ok, I can provide an updated patch after finding / fixing all usages.
>
> Probably in 1 or 2 days,
>
> Laurent
>
> 2013/4/4 Anthony Petrov 
>
>> Yes, this makes sense. Do you want to update your fix for 8010297 to
>> include these changes as well?
>>
>> --
>> best regards,
>> Anthony
>>
>>
>> On 04/04/13 15:47, Laurent Bourgès wrote:
>>
>>> Dear all,
>>>
>>> I just realized there is another problem with PlatformLogger log
>>> statements:
>>> XBaseWindow:
>>>  public boolean grabInput() {
>>>  grabLog.fine("Grab input on {0}", this);
>>> ...
>>> }
>>>
>>> This calls the PlatformLogger.fine( varargs):
>>>  public void fine(String msg, Object... params) {
>>>  logger.doLog(FINE, msg, params);
>>>  }
>>>
>>> Doing so, the JVM creates a new Object[] instance to provide params as
>>> varargs.
>>>
>>> I would recommend using isLoggable() test to avoid such waste if the log
>>> is disabled (fine, finer, finest ...).
>>>
>>> Do you want me to provide a new patch related to this problem ?
>>>
>>> Does somebody have an idea to automatically analyze the JDK code and
>>> detect missing isLoggable statements ...
>>>
>>> Regards,
>>> Laurent
>>>
>>> 2013/4/3 Laurent Bourgès
>>>
>>>
>>> Anthony,
>>>
>>> could you tell me once it is in the OpenJDK8 repository ?
>>> I would then perform again profiling tests to check if there is no
>>> more missing isLoggable() statements.
>>>
>>> Once JMX and other projects switch to PlatformLogger, I could check
>>> again.
>>>
>>> Maybe I could write a small java code checker (pmd rule) to test if
>>> there is missing isLoggable() statements wrapping PlatformLogger log
>>> statements. Any idea about how to reuse java parser to do so ?
>>>
>>> Regards,
>>>
>>> Laurent
>>>
>>> 2013/4/2 Anthony Petrov
>>>
>>>
>>> Looks fine to me as well. Thanks for fixing this, Laurent.
>>>
>>> Let's wait a couple more days in case Net or Swing folks want to
>>> review the fix. After that I'll push it to the repository.
>>>
>>> --
>>> best regards,
>>> Anthony
>>>
>>>
>>> On 4/2/2013 15:35, Laurent Bourgès wrote:
>>>
>>> Here is the updated patch:
>>> 
>>> http://jmmc.fr/~bourgesl/share/webrev-8010297.2/
>>>
>>>
>>> Fixed inconsistencies between FINE / FINER log statements:
>>> - XScrollbarPeer
>>> - XWindowPeer
>>>
>>> Laurent
>>>
>>> 2013/4/2 Anthony Petrov
>>>
>>>
>>>  1. Sergey: I believe this is for purposes of better
>>> formating the
>>>  log output and making it more readable by separating or
>>> highlighting
>>>  some sections. I don't think this should be changed.
>>>
>>>  2. Laurent: can you please address this issue and send
>>> us a new patch?
>>>
>>>  --
>>>  best regards,
>>>  Anthony
>>>
>>>
>>>  On 4/1/2013 16:08, Sergey Bylokhov wrote:
>>>
>>>
>>>  Hi, A

Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-06 Thread Laurent Bourgès
Mandy,

I believe the awt/net code started logging in 1.4 and 1.5 using j.u.logging;
> it was a really early adopter, before the performance overhead was considered.
>
> I filed a bug 8011557: improve the logging documentation to advice on
> performance consideration as we may want to mention this in the tutorial or
> other docs as well.  This should belong to java.util.logging instead of
> sun.util.logging.
>

That's a good idea to update both javadoc (JUL, PlatformLogger) and maybe
the java code convention related to OpenJDK code.

I insist on pointing out this performance issue in the PlatformLogger class
too, because it belongs to the JDK and I expect it to be as fast as possible!

Moreover, it would be great if the OpenJDK build could run code quality
checks and report such incorrect JUL / PlatformLogger usages. Any idea how
to do so?


>
> Having a second thought, Supplier may not address the memory overhead
> concern you raise.  Worth considering any API improvement to address both
> the runtime and memory concern.
>

I looked at the JDK 8 Logger doc, and Supplier may help (though the
underlying implementation may itself be slow or create objects):
"A set of methods alternatively take a "msgSupplier" instead of a "msg"
argument. These methods take a Supplier function which is invoked to
construct the desired log message only when the message actually is to be
logged based on the effective log level, thus eliminating unnecessary
message construction. For example, if the developer wants to log system
health status for diagnosis, with the String-accepting version, the code
would look like:

   class DiagnosisMessages {
 static String systemHealthStatus() {
   // collect system health information
   ...
 }
   }
   ...
   logger.log(Level.FINER, DiagnosisMessages.systemHealthStatus());

*With the above code, the health status is collected unnecessarily even
when the log level FINER is disabled. With the Supplier-accepting version
as below, the status will only be collected when the log level FINER is
enabled. *

*   logger.log(Level.FINER, DiagnosisMessages::systemHealthStatus);*

"


> It would also be helpful if IDE and code analysis tool can hint the
> developers of such pattern.
>

Netbeans has a warning saying "string concatenation in log statements" but
it does not propose to fix it by adding proper isLoggable statement: maybe
we should submit a RFE to netbeans project ?
http://wiki.netbeans.org/Java_Hints#Logging

"*String concatenation in logger* It is not performance efficient to
concatenate strings in logger messages. It is better to use a template
message with placeholders that are replaced by concrete values only when
the message is really going to be logged. *Since NetBeans 6.9* "

Cheers,
Laurent


> Mandy
> P.S. I'll be pushing the changeset today.
>
>
> On 4/5/2013 9:02 AM, Laurent Bourgès wrote:
>
> Mandy,
>
> I agree it should be well known; but I fixed several cases in awt/net code
> where isLoggable() calls were missing.
>
> I think it is quite cheap to remind good practices in the PlatformLogger /
> jul Logger javadoc ...
>
> PS: maybe some quality control tools could check such missing tests (PMD
> can do it)...
>
>   I believe this was mentioned somewhere in j.u.logging.  A better
>> solution may be to take java.util.function.Supplier parameter that
>> constructs the log message lazily (see
>> http://download.java.net/jdk8/docs/api/java/util/logging/Logger.html#fine(java.util.function.Supplier).
>> I can file a RFE to investigate the use of Supplier as in j.u.l.Logger.
>>
>
> Very interesting feature, but I am still not aware of such JDK 8 features
> (lambda) ... it seems nice but maybe the syntax will be more complicated /
> verbose than our usual log statements:
> log.fine(""...)
>
> Laurent
>
>
>>
>> On 4/5/2013 1:55 AM, Laurent Bourgès wrote:
>>
>> Mandy,
>>
>> I would like to add few performance advices in the PlatformLogger Javadoc:
>> "
>> NOTE: For performance reasons, PlatformLogger usages should take care of
>> avoiding useless / wasted object creation and method calls related to *
>> disabled* log statements:
>> Always use isLoggable(level) wrapping logs at levels (FINEST, FINER,
>> FINE, CONFIG):
>>
>> Bad practices:
>> - string concat:
>> log.fine("message" + value); // means
>> StringBuilder(message).append(String.valueOf(value)).toString(): 2 objects
>> created and value.toString() called
>> - varargs:
>> log.fine("message {0}", this); // creates an Object[]
>>
>> Best practices:
>> if (log.isLoggable(PlatformLogger.FINE)) {
>> log.fine("message" + value);
>> }
>>
>> if (log.isLoggable(PlatformLogger.FINE)) {
>> log.fine("message {0}", this);
>> }
>> "
>>
>> What is your opinion ?
>>
>> Thanks for the given explanations and I hope that this patch will be
>> submitted soon to JDK8 repository.
>>
>> Laurent
>>
>>
>
>


Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-05 Thread Laurent Bourgès
Mandy,

I agree it should be well known; but I fixed several cases in awt/net code
where isLoggable() calls were missing.

I think it is quite cheap to remind good practices in the PlatformLogger /
jul Logger javadoc ...

PS: maybe some quality control tools could check such missing tests (PMD
can do it)...

I believe this was mentioned somewhere in j.u.logging.  A better solution
> may be to take java.util.function.Supplier parameter that constructs the
> log message lazily (see
> http://download.java.net/jdk8/docs/api/java/util/logging/Logger.html#fine(java.util.function.Supplier).
> I can file a RFE to investigate the use of Supplier as in j.u.l.Logger.
>

Very interesting feature, but I am not yet familiar with these JDK 8
features (lambdas) ... it seems nice, but the syntax may be more
complicated / verbose than our usual log statements:
log.fine(""...)

Laurent


>
> On 4/5/2013 1:55 AM, Laurent Bourgès wrote:
>
> Mandy,
>
> I would like to add few performance advices in the PlatformLogger Javadoc:
> "
> NOTE: For performance reasons, PlatformLogger usages should take care of
> avoiding useless / wasted object creation and method calls related to *
> disabled* log statements:
> Always use isLoggable(level) wrapping logs at levels (FINEST, FINER, FINE,
> CONFIG):
>
> Bad practices:
> - string concat:
> log.fine("message" + value); // means
> StringBuilder(message).append(String.valueOf(value)).toString(): 2 objects
> created and value.toString() called
> - varargs:
> log.fine("message {0}", this); // creates an Object[]
>
> Best practices:
> if (log.isLoggable(PlatformLogger.FINE)) {
> log.fine("message" + value);
> }
>
> if (log.isLoggable(PlatformLogger.FINE)) {
> log.fine("message {0}", this);
> }
> "
>
> What is your opinion ?
>
> Thanks for the given explanations and I hope that this patch will be
> submitted soon to JDK8 repository.
>
> Laurent
>
>


Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-05 Thread Laurent Bourgès
Dear all,

Here is my first pisces (beta) patch (webrev):
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

I succeeded in fixing memory usage by having only one pisces instance set
(Renderer, Stroker, iterator ...) per RendererContext (GC friendly).

Impressive results between small and large drawings:


                      dc_boulder_2013-13-30-06-13-17.ser   dc_shp_alllayers_2013-00-30-07-00-47.ser
JVM                        1T       2T       4T                 1T       2T       4T
OpenJDK8 Patch Beta      2556     3190     5106              22952    26552    46010
OpenJDK8 Ref             3982     4852     8842              55889    77691   141981
Oracle JDK8 (ductus)     1894     3905     7485              20911    39297   103392
Patch / Ref            155.79%  152.10%  173.17%            243.50%  292.60%  308.59%
Patch / Ductus          74.10%  122.41%  146.59%             91.11%  148.00%  224.72%
Low mem tests: -Xmx128m to maximize GC overhead

http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench_PATCH_V1.log (beta)
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench_REF.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench_OracleDuctus.log

Andrea, I modified MapBench to perform an explicit GC between tests (and
moved graphics / image creation out of the test loop):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Note: these results were performed with the AAShapePipe proposed patch
(small impact).

Is anybody interested in enhancing the Pisces renderer code and supporting
me as a sponsor?

Look at PiscesConst to enable/disable several flags (useThreadLocal,
doChecks to check array cleanup ...); please ask me if you have any
question.

TBD:
- cleanup / statistics / profile cpu performance ...
- enhance Dasher / Stroker to deal with shapes fully or partially outside
the clipping area.

Cheers,
Laurent


AAShapePipe concurrency & memory waste

2013-04-05 Thread Laurent Bourgès
Dear java2d members,

I figured out some troubles in java2d.pipe.AAShapePipe related to both
concurrency & memory usage:
- concurrency issue related to static theTile field: only 1 tile is cached
so a new byte[] is created for other threads at each call to renderTile()
- excessive memory usage (byte[] for tile, int[] and Rectangle): at each
call to renderPath / renderTiles, several small objects are created (never
cached), which leads to hundreds of megabytes that the GC must deal with

Here are profiling screenshots:
- 4 threads drawing on their own buffered image (MapBench test):
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png

- excessive int[] / Rectangle creation:
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png

Here is the proposed patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/

I applied a simple solution = use a ThreadLocal or ConcurrentLinkedQueue
(see useThreadLocal flag) to cache one AAShapePipeContext per thread (2K
max).
As its memory footprint is very small, I recommend using ThreadLocal.

Is it necessary to use soft/weak references to avoid excessive memory usage
for such a cache?

Is there any class dedicated to such caching (a ThreadLocal with cache
eviction, or a ConcurrentLinkedQueue using WeakReference)?
I think it could be very useful at the JDK level to have such a feature
(i.e. a generic "GC friendly" cache).

Regards,
Laurent
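The two caching strategies compared in this thread can be sketched as follows (names like RenderContext, acquire and release are illustrative, not the actual patch API):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ContextCache {
    /** Per-render scratch state; cheap to reuse, wasteful to reallocate. */
    static final class RenderContext {
        final byte[] tile = new byte[32 * 32]; // e.g. one AA tile
    }

    // Strategy 1: one context per thread; fastest lookup, memory held per thread.
    private static final ThreadLocal<RenderContext> TL =
        ThreadLocal.withInitial(RenderContext::new);

    // Strategy 2: shared pool; slightly slower, bounds the number of live contexts.
    private static final ConcurrentLinkedQueue<RenderContext> POOL =
        new ConcurrentLinkedQueue<>();

    static RenderContext acquire(boolean useThreadLocal) {
        if (useThreadLocal) {
            return TL.get();
        }
        RenderContext ctx = POOL.poll();
        return (ctx != null) ? ctx : new RenderContext();
    }

    static void release(RenderContext ctx, boolean useThreadLocal) {
        if (!useThreadLocal) {
            POOL.offer(ctx); // return to the shared pool for reuse
        }
    }

    public static void main(String[] args) {
        RenderContext a = acquire(false);
        release(a, false);
        RenderContext b = acquire(false);
        System.out.println(a == b); // prints true: the pooled context was reused
    }
}
```

Wrapping the pooled entries in soft or weak references would let the GC reclaim idle contexts, at the cost of occasional reallocation.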


Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-05 Thread Laurent Bourgès
Mandy,

I would like to add few performance advices in the PlatformLogger Javadoc:
"
NOTE: For performance reasons, PlatformLogger usages should take care of
avoiding useless / wasted object creation and method calls related to *
disabled* log statements:
Always use isLoggable(level) wrapping logs at levels (FINEST, FINER, FINE,
CONFIG):

Bad practices:
- string concat:
log.fine("message" + value); // means
StringBuilder(message).append(String.valueOf(value)).toString(): 2 objects
created and value.toString() called
- varags:
log.fine("message {0}", this); // create an Object[]

Best practices:
if (log.isLoggable(PlatformLogger.FINE) {
log.fine("message" + value);
}

if (log.isLoggable(PlatformLogger.FINE) {
log.fine("message {0}", this);
}
"

What is your opinion ?

Thanks for the given explanations and I hope that this patch will be
submitted soon to JDK8 repository.

Laurent

2013/4/4 Mandy Chung 

> Alan - can you review this change?
>
> I have changed Level.valueOf(int) to return the nearest Level as Peter
> suggests using binary search:
>
> http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.01/
>
> I want to push the changeset tomorrow since we need this fix before the TL
> integration.
>
> Several questions were brought up and I'm answering them in one shot:
> 1. The PlatformLogger static final fields have to retain the same since
> existing code can call:
> int level = PlatformLogger.FINE;
> logger.setLevel(level);
>
> There is also native code accessing PlatformLogger fields (will need to
> investigate more).  Once we fix this type of incompatibility references, we
> can change them.  Or we could remove these static final constants
> completely but it's less preferable since it touches many of the JDK files.
>  I would keep these static fields as a shortcut.
>
> 2. New convenient methods (isInfoLoggable, isWarningLoggable) to minimize
> the dependency to the constants
>
> It's not a problem to have the dependencies.  This is an issue this time
> since we were not aware of such dependency.  The isLoggable method is
> simple enough.
>
> 3. 3 methods with two versions (one with int and one with Level parameter)
>
> As I said, I'll file a bug to remove the 3 deprecated methods in jdk8. I'm
> certain I can do so but just take some time to synchronize the changes.
>
> 4. It's true that you can set a PlatformLogger with a custom level via
> PlatformLogger API.  But you can access a platform logger using
> java.util.logging API.
>
>Logger.getLogger("sun.awt.someLogger").setLevel(MyLevel.CUSTOM_LEVEL);
>
> PlatformLogger.getLevel() has to return some thing.  Laurent suggests to
> remove (deprecate) PlatformLogger.getLevel() or level() method.  I have to
> understand how the FX code uses getLevel().  We have to keep it due to the
> regression and for testing.  For API perspective, having a method to find
> what level it's at is reasonable.  Since we now have a clean solution to
> deal with custom level, I don't see any issue keeping it.
>
> 5. JavaFX 8 is likely to switch to build with JDK8 in a few weeks (really
> soon).
>
> 6. Level.intValue() public or not
>
> It mirrors java.util.logging.Level but may be able to do without it.
>  Let's leave it public for now until FX converts to use the new code.  I
> can clean this up at the same time.
>
> Mandy
>
>
>
> On 4/3/13 9:52 PM, Mandy Chung wrote:
>
>> Peter, Laurent,
>>
>> History and details are described below.
>>
>> Webrev at:
>> http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.00/
>>
>> I add back the getLevel(int), setLevel(int) and isLoggable(int) methods
>> but marked deprecated and also revert the static final variables to resolve
>> the regression. They can be removed when JavaFX transitions to use the
>> Level enums (I'll file one issue for FX to use PlatformLogger.Level and one
>> for core-libs to remove the deprecated methods).  The performance overhead
>> is likely small since it's direct mapping from int to the Level enum that
>> was used in one of your previous performance measurement.
>>
>> Laurent - you have a patch to add isLoggable calls in the awt/swing code.
>> Can you check if there is any noticeable performance difference?
>>
>> I also take the opportunity to reconsider what JavaLoggerProxy.getLevel()
>> should return when it's a custom Level. Use of logging is expected not to
>> cause any fatal error to the running application.  The previous patch
>> throwing IAE needs to be fixed.  I propose to return the first
>> Level.intValue() >= the custom level value since the level value is used to
>> evaluate if it's loggable.
>>
>> History and details:
>> JavaFX 8 has converted to use sun.util.logging.PlatformLogger (
>> https://javafx-jira.kenai.com/browse/RT-24458).
>> I was invo

Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-04 Thread Laurent Bourgès
Well done: binary search to find the closest matching level!

However, I still do not see any concrete use case for custom Levels in such
a closed API as PlatformLogger: KISS, please!

I mean, as PlatformLogger is only used by (available to) the JDK and
related projects, is there any RFE or desire to use custom JUL levels here?

Laurent
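A hedged sketch of the nearest-level binary search being discussed, using a plain sorted array of the standard j.u.l. level values instead of the real PlatformLogger.Level enum (which is not public API):

```java
import java.util.Arrays;

public class NearestLevel {
    // Sorted j.u.l.-style level values: FINEST..SEVERE.
    private static final int[] LEVEL_VALUES = {300, 400, 500, 700, 800, 900, 1000};
    private static final String[] LEVEL_NAMES =
        {"FINEST", "FINER", "FINE", "CONFIG", "INFO", "WARNING", "SEVERE"};

    /** Returns the first level whose value is >= the given value (SEVERE if above all). */
    static String valueOf(int value) {
        int i = Arrays.binarySearch(LEVEL_VALUES, value);
        if (i < 0) {
            i = -i - 1; // insertion point = index of first element greater than value
        }
        return LEVEL_NAMES[Math.min(i, LEVEL_NAMES.length - 1)];
    }

    public static void main(String[] args) {
        System.out.println(valueOf(500));  // exact match -> FINE
        System.out.println(valueOf(650));  // custom level -> CONFIG (first >= 650)
        System.out.println(valueOf(5000)); // above all -> SEVERE
    }
}
```

This implements the "first Level.intValue() >= the custom level value" rule quoted earlier in the thread.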

2013/4/4 Peter Levart 

> Hi Mandy, Laurent,
>
> What do you think of this variation (changed just PlatformLogger, other
> files exactly equal to your webrev):
>
> http://dl.dropbox.com/u/**101777488/jdk8-tl/**
> PlatformLogger/webrev.enumapi.**02/index.html
>
> Moved the nearest >= intValue search to PlatformLogger.Level.valueOf(int)
> since it is used also by deprecated API. With this change the same logic is
> used for mapping from j.u.l.Level to PlatformLogger.Level in all cases.
>
> I wonder if PlatformLogger.Level.intValue() should be exposed as public
> API. Currently it is used by the test (which is not in the same package).
> But this could be circumvented with reflection. I only see the use of it if
> we also make the PlatformLogger.Level.valueOf(int) public with
> anticipation of enabling PlatformLogger to use custom
> PlatformLogger.Level(s) like j.u.logging. This extension could be performed
> compatibly by transforming Level enum into a plain class like j.u.l.Level.
> But until then, I don't see the use of Level.intValue() as a public API.
> What do you think?
>
> Regards, Peter
>
>
> On 04/04/2013 06:52 AM, Mandy Chung wrote:
>
>> Peter, Laurent,
>>
>> History and details are described below.
>>
>> Webrev at:
>> http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.00/
>>
>> I add back the getLevel(int), setLevel(int) and isLoggable(int) methods
>> but marked deprecated and also revert the static final variables to resolve
>> the regression. They can be removed when JavaFX transitions to use the
>> Level enums (I'll file one issue for FX to use PlatformLogger.Level and one
>> for core-libs to remove the deprecated methods).  The performance overhead
>> is likely small since it's direct mapping from int to the Level enum that
>> was used in one of your previous performance measurement.
>>
>> Laurent - you have a patch to add isLoggable calls in the awt/swing code.
>> Can you check if there is any noticeable performance difference?
>>
>> I also take the opportunity to reconsider what JavaLoggerProxy.getLevel()
>> should return when it's a custom Level. Use of logging is expected not to
>> cause any fatal error to the running application.  The previous patch
>> throwing IAE needs to be fixed.  I propose to return the first
>> Level.intValue() >= the custom level value since the level value is used to
>> evaluate if it's loggable.
>>
>> History and details:
>> JavaFX 8 has converted to use sun.util.logging.PlatformLogger (
>> https://javafx-jira.kenai.com/browse/RT-24458).
>> I was involved in the early discussion but wasn't aware of the decision
>> made. Thanks to Alan for catching this regression out before it's
>> integrated to jdk8.  jfxrt.jar is cobundled with jdk8 during the installer
>> build step.  My build doesn't build installer and thus we didn't see this
>> regression.
>>
>> I ran jdeps on jdk8/lib/ext/jfxrt.jar (windows-i586) that shows 112
>> references to PlatformLogger and on jdk7/lib/jfxrt.jar that shows no
>> reference to sun.util.logging.
>>
>> I have a method finder tool (planning to include it in jdk8) to search
>> for use of PlatformLogger.getLevel and it did find 2 references from
>> jdk8/lib/ext/jfxrt.jar.
>>
>> JavaFX 8 is going to upgrade to use JDK 8 for JavaFX build (
>> https://javafx-jira.kenai.com/browse/RT-27794)
>> soon (currently it's built with JDK 7).  As soon as JavaFX code are changed
>> to reference PlatformLogger.Level enum instead, we can remove the
>> deprecated methods and change the PlatformLogger constants.
>>
>> JavaFX 2.2.x doesn't use sun.util.logging and so this has no impact to
>> it.  In any case, JavaFX 2.2.x only runs either bundled with a
>> corresponding JDK 7u release, or as a standalone library for JDK 6 only.
>>
>> Thanks
>> Mandy
>>
>>
>


Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-04 Thread Laurent Bourgès
I updated both the patched pisces code and the benchmarks:
http://jmmc.fr/~bourgesl/share/java2d-pisces/

A few results comparing ThreadLocal vs ConcurrentLinkedQueue usage:

OpenJDK 8 PATCH ThreadLocal mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2671 ms
2 threads and 20 loops per thread, time: 3239 ms
4 threads and 20 loops per thread, time: 6043 ms

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2779 ms
2 threads and 20 loops per thread, time: 3416 ms
4 threads and 20 loops per thread, time: 6153 ms

Oracle JDK8 Ductus:
Testing file
/home/bourgesl/libs/openjdk/mapbench/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 1894 ms
2 threads and 20 loops per thread, time: 3905 ms
4 threads and 20 loops per thread, time: 7485 ms


OpenJDK 8 PATCH ThreadLocal mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 24211 ms
2 threads and 20 loops per thread, time: 30955 ms
4 threads and 20 loops per thread, time: 67715 ms

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 25984 ms
2 threads and 20 loops per thread, time: 33131 ms
4 threads and 20 loops per thread, time: 75343 ms
Oracle JDK8 Ductus:
Loading drawing commands from file:
/home/bourgesl/libs/openjdk/mapbench/dc_shp_alllayers_2013-00-30-07-00-47.ser
Loaded DrawingCommands: DrawingCommands{width=1400, height=800,
commands=135213}
1 threads and 20 loops per thread, time: 20911 ms
2 threads and 20 loops per thread, time: 39297 ms
4 threads and 20 loops per thread, time: 103392 ms

ConcurrentLinkedQueue adds a small overhead, but not much compared to
ThreadLocal.

Is it possible to test efficiently whether the current thread is the EDT?
Then I could use ThreadLocal at least for the EDT. The check must be very
fast because getThreadContext() is called once per rendering operation, so
it is a performance bottleneck.

For example:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
TL:  4 threads and 20 loops per thread, time: 67715 ms
CLQ: 4 threads and 20 loops per thread, time: 75343 ms
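Checking for the EDT is a single cheap call, java.awt.EventQueue.isDispatchThread(); a tiny demo of the branch point (assuming a headless-capable AWT toolkit; the hybrid-lookup idea itself is only sketched in the comment):

```java
import java.awt.EventQueue;

public class EdtCheckDemo {
    public static void main(String[] args) throws Exception {
        // Cheap branch point: a ThreadLocal fast path could serve the EDT,
        // and a shared queue every other thread.
        System.out.println(EventQueue.isDispatchThread()); // main thread: false

        // On the EDT itself the same call returns true.
        EventQueue.invokeAndWait(
                () -> System.out.println(EventQueue.isDispatchThread()));
    }
}
```

Since the call is a simple thread-identity check, it should be cheap enough for a per-operation getThreadContext().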

Changes:
- use ThreadLocal or ConcurrentLinkedQueue to get a
renderer context (vars / cache)
- use first RendererContext (dirty / clean arrays) members instead of using
IntArrayCache / FloatArrayCache for performance reasons (dedicated to large
dynamic arrays)

TBD:
- recycle pisces objects, i.e. keep only one instance per class (Renderer,
Stroker ...) to avoid GC overhead entirely (several thousand instances per
MapBench test).

Moreover, these are very small, short-lived objects, so they should stay in
the thread-local allocation buffer (TLAB), but when I use -verbose:gc or
jmap -histo they are present and represent megabytes:
[bourgesl@jmmc-laurent ~]$ jmap -histo:live 21628 | grep pisces
   5: 505536470784  sun.java2d.pisces.Renderer
   9: 298203578400  sun.java2d.pisces.Stroker
  11: 497953186880  sun.java2d.pisces.PiscesCache
  12: 497941991760  sun.java2d.pisces.PiscesTileGenerator
  13: 497931991720
sun.java2d.pisces.Renderer$ScanlineIterator
  14: 298201431360
sun.java2d.pisces.PiscesRenderingEngine$NormalizingPathIterator
  52:40   1280  sun.java2d.pisces.IntArrayCache
  94:20640  sun.java2d.pisces.FloatArrayCache
 121: 8320  [Lsun.java2d.pisces.IntArrayCache;
 127: 4320  sun.java2d.pisces.RendererContext
 134: 4256  sun.java2d.pisces.Curve
 154: 4160  [Lsun.java2d.pisces.FloatArrayCache;
 155: 4160
sun.java2d.pisces.RendererContext$RendererData
 156: 4160
sun.java2d.pisces.RendererContext$StrokerData
 157: 4160  sun.java2d.pisces.Stroker$PolyStack
 208: 3 72
sun.java2d.pisces.PiscesRenderingEngine$NormMode
 256: 1 32
[Lsun.java2d.pisces.PiscesRenderingEngine$NormMode;
 375: 1 16  sun.java2d.pisces.PiscesRenderingEngine
 376: 1 16  sun.java2d.pisces.RendererContext$1

Regards,
Laurent

2013/4/3 Laurent Bourgès 

> Thanks for your valueable feedback!
>
> Here is the current status of my patch alpha version:
>>> http://jmmc.fr/~bourgesl/share/java2d-pisces/
>>>
>>> There is still a lot to be done: clean-up, stats, pisces class instance
>>> recycling (renderer, stroker ...) and of course sizing correctly initial
>>> arrays (dirty or clean) in the RendererContext (th

Re: sun.awt.X11 logs still using String + (waste)

2013-04-04 Thread Laurent Bourgès
Ok, I can provide an updated patch after finding / fixing all usages.

Probably in 1 or 2 days,

Laurent

2013/4/4 Anthony Petrov 

> Yes, this makes sense. Do you want to update your fix for 8010297 to
> include these changes as well?
>
> --
> best regards,
> Anthony
>
>
> On 04/04/13 15:47, Laurent Bourgès wrote:
>
>> Dear all,
>>
>> I just realized there is another problem with PlatformLogger log
>> statements:
>> XBaseWindow:
>>  public boolean grabInput() {
>>  grabLog.fine("Grab input on {0}", this);
>> ...
>> }
>>
>> This calls the PlatformLogger.fine( varargs):
>>  public void fine(String msg, Object... params) {
>>  logger.doLog(FINE, msg, params);
>>  }
>>
>> Doing so, the JVM creates a new Object[] instance to provide params as
>> varargs.
>>
>> I would recommend using isLoggable() test to avoid such waste if the log
>> is disabled (fine, finer, finest ...).
>>
>> Do you want me to provide a new patch related to this problem ?
>>
>> Does somebody have an idea to automatically analyze the JDK code and
>> detect missing isLoggable statements ...
>>
>> Regards,
>> Laurent
>>
>> 2013/4/3 Laurent Bourgès
>>
>>
>> Anthony,
>>
>> could you tell me once it is in the OpenJDK8 repository ?
>> I would then perform again profiling tests to check if there is no
>> more missing isLoggable() statements.
>>
>> Once JMX and other projects switch to PlatformLogger, I could check
>> again.
>>
>> Maybe I could write a small java code checker (pmd rule) to test if
>> there is missing isLoggable() statements wrapping PlatformLogger log
>> statements. Any idea about how to reuse java parser to do so ?
>>
>> Regards,
>>
>> Laurent
>>
>> 2013/4/2 Anthony Petrov
>>
>>
>> Looks fine to me as well. Thanks for fixing this, Laurent.
>>
>> Let's wait a couple more days in case Net or Swing folks want to
>> review the fix. After that I'll push it to the repository.
>>
>> --
>> best regards,
>> Anthony
>>
>>
>> On 4/2/2013 15:35, Laurent Bourgès wrote:
>>
>> Here is the updated patch:
>> 
>> http://jmmc.fr/~bourgesl/__**share/webrev-8010297.2/<http://jmmc.fr/%7Ebourgesl/__share/webrev-8010297.2/>
>> 
>> <http://jmmc.fr/%7Ebourgesl/**share/webrev-8010297.2/<http://jmmc.fr/%7Ebourgesl/share/webrev-8010297.2/>
>> >
>>
>>
>> Fixed inconsistencies between FINE / FINER log statements:
>> - XScrollbarPeer
>> - XWindowPeer
>>
>> Laurent
>>
>> 2013/4/2 Anthony Petrov
>>
>>
>>  1. Sergey: I believe this is for purposes of better
>> formating the
>>  log output and making it more readable by separating or
>> highlighting
>>  some sections. I don't think this should be changed.
>>
>>  2. Laurent: can you please address this issue and send
>> us a new patch?
>>
>>  --
>>  best regards,
>>  Anthony
>>
>>
>>  On 4/1/2013 16:08, Sergey Bylokhov wrote:
>>
>>
>>  Hi, Anthony
>>  Only two comments:
>>  1 Why we need some special text in the log output
>> like "***" and
>> "###"
>>  2 XScrollbarPeer.java:
>>
>>  +if
>> (log.isLoggable(**PlatformLogger.FINEST)) {
>>
>>
>>  +log.finer("KeyEvent on scrollbar: " +
>> event);
>>  +}
>>
>>
>>
>>  On 4/1/13 12:20 PM, Anthony Petrov wrote:
>>
>>  Awt, Swing, Net engineers,
>>
>>  Could 

Re: sun.awt.X11 logs still using String + (waste)

2013-04-04 Thread Laurent Bourgès
Dear all,

I just realized there is another problem with PlatformLogger log statements:
XBaseWindow:

    public boolean grabInput() {
        grabLog.fine("Grab input on {0}", this);
        ...
    }

This calls the PlatformLogger.fine(String, Object...) varargs method:

    public void fine(String msg, Object... params) {
        logger.doLog(FINE, msg, params);
    }

Doing so, the JVM creates a new Object[] instance to provide params as
varargs.

I would recommend using an isLoggable() test to avoid such waste when the log
level is disabled (fine, finer, finest ...).
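To make the allocation concrete, here is a minimal, self-contained sketch (the MiniLogger class, its level values, and the counter are stand-ins for illustration, not the real PlatformLogger): the varargs Object[] is built by the caller before the method even checks the level, so only a guard at the call site avoids it.

```java
public final class GuardedLoggingDemo {

    // Stand-in for PlatformLogger with a configurable threshold.
    static final class MiniLogger {
        static final int FINE = 500, INFO = 800;
        private final int level;
        int varargsAllocations; // counts varargs Object[] built by callers

        MiniLogger(int level) { this.level = level; }

        boolean isLoggable(int msgLevel) { return msgLevel >= level; }

        void fine(String msg, Object... params) {
            // By the time we get here the caller has already allocated
            // the params Object[] -- even when FINE is disabled.
            varargsAllocations++;
            if (!isLoggable(FINE)) {
                return; // message dropped, allocation wasted
            }
            // (a real logger would format and publish msg here)
        }
    }

    static int unguarded(MiniLogger log) {
        log.fine("Grab input on {0}", new Object()); // allocates regardless
        return log.varargsAllocations;
    }

    static int guarded(MiniLogger log) {
        if (log.isLoggable(MiniLogger.FINE)) { // guard skips the whole call
            log.fine("Grab input on {0}", new Object());
        }
        return log.varargsAllocations;
    }

    public static void main(String[] args) {
        // FINE is disabled when the threshold is INFO.
        System.out.println(unguarded(new MiniLogger(MiniLogger.INFO))); // 1
        System.out.println(guarded(new MiniLogger(MiniLogger.INFO)));   // 0
    }
}
```

The same reasoning applies to string concatenation in log arguments: the guard prevents both the Object[] and any toString() work.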

Do you want me to provide a new patch related to this problem ?

Does somebody have an idea how to automatically analyze the JDK code and
detect missing isLoggable() statements ?

Regards,
Laurent

2013/4/3 Laurent Bourgès 

> Anthony,
>
> could you tell me once it is in the OpenJDK8 repository ?
> I would then perform again profiling tests to check if there is no more
> missing isLoggable() statements.
>
> Once JMX and other projects switch to PlatformLogger, I could check again.
>
> Maybe I could write a small java code checker (pmd rule) to test if there
> is missing isLoggable() statements wrapping PlatformLogger log statements.
> Any idea about how to reuse java parser to do so ?
>
> Regards,
>
> Laurent
>
> 2013/4/2 Anthony Petrov 
>
>> Looks fine to me as well. Thanks for fixing this, Laurent.
>>
>> Let's wait a couple more days in case Net or Swing folks want to review
>> the fix. After that I'll push it to the repository.
>>
>> --
>> best regards,
>> Anthony
>>
>>
>> On 4/2/2013 15:35, Laurent Bourgès wrote:
>>
>>> Here is the updated patch:
>>> http://jmmc.fr/~bourgesl/share/webrev-8010297.2/
>>>
>>> Fixed inconsistencies between FINE / FINER log statements:
>>> - XScrollbarPeer
>>> - XWindowPeer
>>>
>>> Laurent
>>>
>>> 2013/4/2 Anthony Petrov <anthony.petrov@oracle.com>
>>>
>>>
>>> 1. Sergey: I believe this is for purposes of better formatting the
>>> log output and making it more readable by separating or highlighting
>>> some sections. I don't think this should be changed.
>>>
>>> 2. Laurent: can you please address this issue and send us a new
>>> patch?
>>>
>>> --
>>> best regards,
>>> Anthony
>>>
>>>
>>> On 4/1/2013 16:08, Sergey Bylokhov wrote:
>>>
>>>
>>> Hi, Anthony
>>> Only two comments:
>>> 1 Why we need some special text in the log output like "***" and
>>> "###"
>>> 2 XScrollbarPeer.java:
>>>
>>> +if (log.isLoggable(PlatformLogger.FINEST)) {
>>>
>>> +log.finer("KeyEvent on scrollbar: " + event);
>>> +}
>>>
>>>
>>>
>>> On 4/1/13 12:20 PM, Anthony Petrov wrote:
>>>
>>> Awt, Swing, Net engineers,
>>>
>>> Could anyone review the fix please? For your convenience:
>>>
>>> Bug: http://bugs.sun.com/view_bug.do?bug_id=8010297
>>>
>>> Fix:
>>> http://cr.openjdk.java.net/~anthony/8-55-isLoggable-8010297.0/
>>>
>>> -- best regards,
>>> Anthony
>>>
>>> On 3/22/2013 2:26, Anthony Petrov wrote:
>>>
>>> Hi Laurent,
>>>
>>> The fix looks great to me. Thank you very much.
>>>
>>> We still need at least one review, though. Hopefully
>>> net-dev@ and/or swing-dev@ folks might help us out a
>>> bit.
>>>
>>> -- best regards,
>>> Anthony
>>>
>>> On 3/20/2013 15:10, Anthony Petrov wrote:
>>>
>>> Hi Laurent,

Re: Review request: 8011380: FX dependency on PlatformLogger broken

2013-04-04 Thread Laurent Bourgès
Mandy, Peter,

here are my comments:

On 04/04/2013 06:52 AM, Mandy Chung wrote:
>
> Peter, Laurent,
>
> History and details are described below.
>
> Webrev at:
> http://cr.openjdk.java.net/~mchung/jdk8/webrevs/8011380/webrev.00/
>
> I add back the getLevel(int), setLevel(int) and isLoggable(int) methods
> but marked deprecated and also revert the static final variables to resolve
> the regression. They can be removed when JavaFX transitions to use the
> Level enums (I'll file one issue for FX to use PlatformLogger.Level and one
> for core-libs to remove the deprecated methods).  The performance overhead
> is likely small since it's direct mapping from int to the Level enum that
> was used in one of your previous performance measurement.
>
Looks OK to me; the Level valueOf(int level) method is back and is basically
an efficient switch case that performs well (the performance overhead is
minor). I hope other projects will be rebuilt soon so that such workarounds
can be removed.
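A minimal sketch of such a direct int-to-Level mapping follows; this is not the actual sun.util.logging.PlatformLogger source, only an illustration, with level values following the java.util.logging convention and the fallback behavior for unknown values being an assumption of this sketch.

```java
public final class LevelMappingDemo {

    enum Level {
        ALL(Integer.MIN_VALUE), FINEST(300), FINER(400), FINE(500),
        CONFIG(700), INFO(800), WARNING(900), SEVERE(1000),
        OFF(Integer.MAX_VALUE);

        final int value;
        Level(int value) { this.value = value; }

        // Direct mapping from the legacy int constants: a plain switch,
        // so no allocation and no linear search per call.
        static Level valueOf(int level) {
            switch (level) {
                case 300:  return FINEST;
                case 400:  return FINER;
                case 500:  return FINE;
                case 700:  return CONFIG;
                case 800:  return INFO;
                case 900:  return WARNING;
                case 1000: return SEVERE;
                case Integer.MAX_VALUE: return OFF;
                default:   return ALL; // assumption: unknown ints map to ALL
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(Level.valueOf(800)); // INFO
        System.out.println(Level.valueOf(500)); // FINE
    }
}
```

Because the switch compiles to a lookup over constants, classes still passing the old int levels pay essentially nothing for the translation to the enum.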

>
> I think that it is not strictly necessary to restore the PlatformLogger
> static final fields back to int type. They are compile-time constants and
> so are already "baked" into the classes compiled with older version of
> PlatformLogger. Am I right? The only problem I see if fields are kept as
> Level type is when JFX is compiled with JDK8 and then run on JDK7... But
> changing them temporarily back to int is a conservative approach and I
> support it.
>

I think you're right: static constants are copied into class bytecode;
maybe the old API (int level in method signatures) could be kept and marked
deprecated, but the class would become quite big (a double API: one with int
and one with Level) !!

>
>
> Laurent - you have a patch to add isLoggable calls in the awt/swing code.
> Can you check if there is any noticeable performance difference?
>
The awt patch consists of adding missing isLoggable() statements to avoid
object creation (string concatenations) and method calls
(Component.toString() ...).

Maybe I should also check that calls to log(msg, varargs) are using
isLoggable() tests to avoid Object[] creation due to varargs: what is your
opinion ?


>
> I also take the opportunity to reconsider what JavaLoggerProxy.getLevel()
> should return when it's a custom Level. Use of logging is expected not to
> cause any fatal error to the running application.  The previous patch
> throwing IAE needs to be fixed.  I propose to return the first
> Level.intValue() >= the custom level value since the level value is used to
> evaluate if it's loggable.
>
>
> That's a good compromise.
>

I think custom Levels are ILLEGAL in the PlatformLogger API and must be
rejected. Maybe comments should explain such a design decision, and throwing
an IllegalStateException is OK for me.


>
>
> History and details:
> JavaFX 8 has converted to use sun.util.logging.PlatformLogger (
> https://javafx-jira.kenai.com/browse/RT-24458). I was involved in the
> early discussion but wasn't aware of the decision made.  Thanks to Alan for
> catching this regression out before it's integrated to jdk8.  jfxrt.jar is
> cobundled with jdk8 during the installer build step.  My build doesn't
> build installer and thus we didn't see this regression.
>
> I ran jdeps on jdk8/lib/ext/jfxrt.jar (windows-i586) that shows 112
> references to PlatformLogger and on jdk7/lib/jfxrt.jar that shows no
> reference to sun.util.logging.
>
> I have a method finder tool (planning to include it in jdk8) to search for
> use of PlatformLogger.getLevel and it did find 2 references from
> jdk8/lib/ext/jfxrt.jar.
>
Personally I doubt the getLevel() method is useful: why not use isLoggable()
instead ? Maybe getLevel() should become @Deprecated too.


>
> JavaFX 8 is going to upgrade to use JDK 8 for JavaFX build (
> https://javafx-jira.kenai.com/browse/RT-27794) soon (currently it's built
> with JDK 7).  As soon as JavaFX code are changed to reference
> PlatformLogger.Level enum instead, we can remove the deprecated methods and
> change the PlatformLogger constants.
>
Great. Any idea about their roadmap? Do you think other projects also depend
on PlatformLogger (JMX ...) and may be impacted?

>
> What do you think of deprecating the PlatformLogger constants altogether
> (regardless of their type). The code using them could be migrated to use
> PlatformLogger.Level enum members directly and if "
> PlatformLogger.Level.INFO" looks to verbose, static imports can help or
> there could be some more helper methods added to PlatformLogger that would
> minimize the need to access the constants like:
>
> isInfoLoggable();
> isWarningLoggable();
> ...
>

It starts to mimic the log4j / slf4j / logback syntax I love, but it would
change so many files that I think it is a waste of time for now.

I vote for adding such useful shortcuts in the new API but not changing
other methods.

Cheers,
Laurent


Re: hg: jdk8/tl/jdk: 8010309: Improve PlatformLogger.isLoggable performance by direct mapping from an integer to Level

2013-04-03 Thread Laurent Bourgès
Does JavaFX belong to the OpenJDK projects (openjfx/8)?

How do I build the complete OpenJDK (JavaFX, Java Web Start ...)?

Laurent

2013/4/3 Alan Bateman 

>  On 28/03/2013 20:16, mandy.ch...@oracle.com wrote:
>
> Changeset: e433ed08b733
> Author:mchung
> Date:  2013-03-28 13:14 -0700
> URL:   http://hg.openjdk.java.net/jdk8/tl/jdk/rev/e433ed08b733
>
> 8010309: Improve PlatformLogger.isLoggable performance by direct mapping from 
> an integer to Level
> Reviewed-by: mchung
> Contributed-by: peter.lev...@gmail.com, bourges.laur...@gmail.com
>
> ! src/share/classes/sun/util/logging/PlatformLogger.java
> ! test/sun/util/logging/PlatformLoggerTest.java
>
>  It seems that FX doesn't like this good work.
>
>   Caused by: java.lang.NoSuchMethodError: 
> sun.util.logging.PlatformLogger.getLevel()I
>   at com.sun.javafx.css.parser.CSSParser.(CSSParser.java:164)
>   at 
> com.sun.javafx.css.StyleManager.loadStylesheetUnPrivileged(StyleManager.java:854)
>   at com.sun.javafx.css.StyleManager.loadStylesheet(StyleManager.java:674)
>   at 
> com.sun.javafx.css.StyleManager.setDefaultUserAgentStylesheet(StyleManager.java:1050)
>   at 
> com.sun.javafx.css.StyleManager.setDefaultUserAgentStylesheet(StyleManager.java:1020)
>   at com.sun.javafx.application.PlatformImpl$10.run(PlatformImpl.java:525)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> com.sun.javafx.application.PlatformImpl.setPlatformUserAgentStylesheet(PlatformImpl.java:522)
>   at 
> com.sun.javafx.application.PlatformImpl.setDefaultPlatformUserAgentStylesheet(PlatformImpl.java:474)
>   at javafx.scene.control.Control.(Control.java:82)
>   at helloworld.HelloWorld.start(HelloWorld.java:14)
>   at com.sun.javafx.application.LauncherImpl$5.run(LauncherImpl.java:772)
>   at com.sun.javafx.application.PlatformImpl$6.run(PlatformImpl.java:260)
>   at 
> com.sun.javafx.application.PlatformImpl$5$1.run(PlatformImpl.java:223)
>   at 
> com.sun.javafx.application.PlatformImpl$5$1.run(PlatformImpl.java:220)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at com.sun.javafx.application.PlatformImpl$5.run(PlatformImpl.java:220)
>   at 
> com.sun.glass.ui.InvokeLaterDispatcher$Future.run(InvokeLaterDispatcher.java:94)
>   at com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
>   at 
> com.sun.glass.ui.win.WinApplication.access$300(WinApplication.java:39)
>   at com.sun.glass.ui.win.WinApplication$3$1.run(WinApplication.java:101)
>   ... 1 more
>
>
>
>


Re: [OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

2013-04-03 Thread Laurent Bourgès
Thanks for your valuable feedback!

Here is the current status of my patch alpha version:
>> http://jmmc.fr/~bourgesl/share/java2d-pisces/
>>
>> There is still a lot to be done: clean-up, stats, pisces class instance
>> recycling (renderer, stroker ...) and of course sizing correctly initial
>> arrays (dirty or clean) in the RendererContext (thread local storage).
>> For performance reasons, I am using now RendererContext members first
>> (cache for rowAARLE for example) before using ArrayCaches (dynamic arrays).
>>
>
> Thank you Laurent, those are some nice speedups.
>
I think it can still be improved: I hope to make it as fast as ductus or
maybe faster (I have several ideas for aggressive optimizations), but the
main improvement consists of reusing memory (like C / C++ does) to avoid
wasted memory / GC overhead in concurrent environments.


> About the thread local storage, that is a sensible choice for highly
> concurrent systems, at the same time, web containers normally complain about
> orphaned thread locals created by an application and not cleaned up.
> Not sure if ones created at the core libs level get special treatment, but
> in general, I guess it would be nice to have some way to clean them up.
>

You're right that's why my patch is not ready !

I chose ThreadLocal for simplicity and clarity but I see several issues:
1/ Web container: ThreadLocal must be cleaned up when stopping an application
to avoid memory leaks (the application becomes unloadable due to classloader
leaks)
2/ ThreadLocal access is the fastest way to get the RendererContext as it
does not require any lock (unsynchronized); as I get the RendererContext
once per rendering request, I think the ThreadLocal can be replaced by a
thread-safe ConcurrentLinkedQueue, but it may become a performance
bottleneck
3/ Using a ConcurrentLinkedQueue requires efficient / proper cache eviction
to free memory (Weak or Soft references ?) or using statistics (last usage
timestamp, usage counts)

Any other idea (core-libs) to have an efficient thread context in a web
container ?

I'm not familiar with the API, but is there any way to clean them up when
> the graphics2d gets disposed of?
>

The RenderingEngine is instantiated by the JVM once and I do not see in the
RenderingEngine interface any way to perform callbacks for warmup / cleanup
... nor any access to the Graphics RenderingHints (another RFE for tuning
purposes)


> A web application has no guarantee to see the same thread ever again
> during his life, so thread locals have to be cleaned right away.
>

I advocate that ThreadLocal can lead to wasted memory, as only a few
concurrent threads really use their RendererContext instance while the
others simply answer web requests => let's use a ConcurrentLinkedQueue with
proper cache eviction.


>
> Either that, or see if there is any way to store the array caches in a
> global structure backed by a concurrent collection to reduce/eliminate
> contention.
>

Yes, it is an interesting alternative to benchmark.

Regards,
Laurent


Re: RFR JDK-7143928 : (coll) Optimize for Empty ArrayList and HashMap

2013-04-02 Thread Laurent Bourgès
> ---
>
>  604 Arrays.fill(elementData, newSize, size, null);
>
> In performance-critical code I would avoid Arrays.fill because it adds a
> bit of overhead (unless it's intrinsified, which I don't think it is).
>

Last week, I sent a few benchmarks I did on array cleaning (zero fill)
comparing Arrays.fill, System.arraycopy, Unsafe.setMemory ...
Arrays.fill is the winner (always faster than arraycopy, which uses native
code) by 10 - 20% !
I suspect aggressive hotspot optimizations (native code ?) because I agree
Arrays.fill looks like a stupid for-loop !

Does somebody have clues explaining the Arrays.fill performance ?

Is there an alternative way to clear an array giving the best possible
performance (like C's memset) ?
It could be useful to have a native array fill implementation in the JDK
(System.fill ?) if it gives better performance.
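For reference, a small self-contained illustration of the two zero-fill variants compared above; no timing is attempted here (a fair measurement needs a proper harness such as JMH), this only shows that the approaches are interchangeable in behavior. The class and method names are invented for the example.

```java
import java.util.Arrays;

public final class ZeroFillDemo {

    static final int SIZE = 1 << 16;
    static final int[] ZEROES = new int[SIZE]; // pre-zeroed source block

    // Variant 1: Arrays.fill (the simple loop the discussion suspects
    // HotSpot optimizes aggressively).
    static void fillClear(int[] a) {
        Arrays.fill(a, 0, a.length, 0);
    }

    // Variant 2: copy from a pre-zeroed array, as some caches do.
    static void copyClear(int[] a) {
        System.arraycopy(ZEROES, 0, a, 0, a.length);
    }

    static boolean isZero(int[] a) {
        for (int v : a) if (v != 0) return false;
        return true;
    }

    public static void main(String[] args) {
        int[] a = new int[SIZE];
        int[] b = new int[SIZE];
        Arrays.fill(a, 42); // dirty both arrays first
        Arrays.fill(b, 42);
        fillClear(a);
        copyClear(b);
        System.out.println(isZero(a) && isZero(b)); // true
    }
}
```

Note that variant 2 requires keeping a zeroed source array at least as large as the target, which is itself a memory cost the benchmark numbers above do not capture.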

Laurent

