Hello! Please review and sponsor the following patch: https://bugs.openjdk.java.net/browse/JDK-8154387 http://cr.openjdk.java.net/~tvaleev/webrev/8154387/r1/
The rationale is to speed-up the parallel processing for unordered streams with low limit value. Such problems occur when you want to perform expensive filtering and select at most x elements which pass the filter (order does not matter). Currently unordered limit operation buffers up to 128 elements for each parallel task before it checks whether limit is reached. This is actually harmful when requested limit is lower: much more elements are requested from the upstream than necessary. Here's simple JMH test which illustrates the problem: http://cr.openjdk.java.net/~tvaleev/webrev/8154387/jmh/ It extracts the requested number of probable-primes from the list of 10000 BigInteger numbers. The results with 9ea+111: Benchmark (limit) Mode Cnt Score Error Units LimitTest.parLimit 2 avgt 30 108,971 ± 0,643 us/op LimitTest.parLimit 20 avgt 30 934,176 ± 14,003 us/op LimitTest.parLimit 200 avgt 30 8772,417 ± 190,609 us/op LimitTest.parLimit 2000 avgt 30 41775,463 ± 1800,537 us/op LimitTest.parUnorderedLimit 2 avgt 30 2557,798 ± 13,161 us/op LimitTest.parUnorderedLimit 20 avgt 30 2578,283 ± 23,547 us/op LimitTest.parUnorderedLimit 200 avgt 30 4577,318 ± 40,793 us/op LimitTest.parUnorderedLimit 2000 avgt 30 12279,346 ± 523,823 us/op LimitTest.seqLimit 2 avgt 30 34,831 ± 0,190 us/op LimitTest.seqLimit 20 avgt 30 369,729 ± 1,427 us/op LimitTest.seqLimit 200 avgt 30 3690,544 ± 13,907 us/op LimitTest.seqLimit 2000 avgt 30 36681,637 ± 156,538 us/op When the limit is 2 or 20, parallel unordered version is slower than parallel ordered! Even for limit = 200 it's still slower than sequential operation. The idea of the patch is to tweak the CHUNK_SIZE using the given limit and parallelism level. I used the following formula: this.chunkSize = limit >= 0 ? (int)Math.min(CHUNK_SIZE, (skip + limit) / ForkJoinPool.getCommonPoolParallelism() + 1) : CHUNK_SIZE; This does not affect cases when limit is big or not set at all (in skip mode). However it greatly improves cases when limit is small: Benchmark (limit) Mode Cnt Score Error Units LimitTest.parLimit 2 avgt 30 109,502 ± 0,750 us/op LimitTest.parLimit 20 avgt 30 954,716 ± 39,276 us/op LimitTest.parLimit 200 avgt 30 8706,226 ± 184,330 us/op LimitTest.parLimit 2000 avgt 30 42126,346 ± 3163,444 us/op LimitTest.parUnorderedLimit 2 avgt 30 39,303 ± 0,177 us/op !!! LimitTest.parUnorderedLimit 20 avgt 30 266,107 ± 0,492 us/op !!! LimitTest.parUnorderedLimit 200 avgt 30 2547,177 ± 58,538 us/op !!! LimitTest.parUnorderedLimit 2000 avgt 30 12216,402 ± 430,574 us/op LimitTest.seqLimit 2 avgt 30 34,993 ± 0,704 us/op LimitTest.seqLimit 20 avgt 30 369,497 ± 1,754 us/op LimitTest.seqLimit 200 avgt 30 3716,059 ± 61,054 us/op LimitTest.seqLimit 2000 avgt 30 36814,356 ± 161,531 us/op Here you can see that unordered cases are significantly improved. Now they are always faster than parallel ordered and faster than sequential for limit >= 20. I did not think up how to test this patch as it does not change visible behavior, only speed. However all the existing tests pass. What do you think? With best regards, Tagir Valeev.