On Thu, 10 Feb 2022 03:22:42 GMT, Tagir F. Valeev <[email protected]> wrote:
>> See the bug description for details.
>> 
>> I propose a simple solution: let's allow ArraySpliterator to be non-SIZED and report an artificial estimateSize(), much bigger than the real one. This will allow AbstractSpliterator and IteratorSpliterator to produce a prefix whose reported size is comparable to Long.MAX_VALUE (say, starting with Long.MAX_VALUE/2), and this will enable further splitting of the prefix. This change will drastically improve parallel streaming for affected streams of size <= 1024 and significantly improve it for streams of size 1025..20000. The cost is higher-grained splitting for huge streams of unknown size. This might add a minor overhead for such scenarios which, I believe, is completely tolerable.
>> 
>> No public API changes are necessary, and sequential processing should not be affected, except for an extra field in ArraySpliterator which increases its footprint by 8 bytes.
>> 
>> I added a simple test using an artificial collector to ensure that at least two non-empty parts are created when parallelizing a Stream.iterate source. More testing ideas are welcome.
>
> Tagir F. Valeev has updated the pull request incrementally with one
> additional commit since the last revision:
> 
>   Benchmark to demonstrate the patch usefulness

I added a microbenchmark to demonstrate the speed improvement achievable with this patch. For demonstration, I used a `Pattern.splitAsStream` source (though, as listed in JDK-8280915, many other standard sources are also affected). For the downstream operation I selected `BigInteger.pow(1000)`, which is moderately CPU-intensive (it takes ~10us on my machine for input numbers within 1000..2000).

The benchmarking results are as follows (my machine has 8 cores). Here's the non-patched version:

Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    100.367 ±    0.793  us/op
sumOf1000thPowers             100  avgt   30    963.857 ±    6.252  us/op
sumOf1000thPowers            1000  avgt   30  10012.923 ±   81.091  us/op
sumOf1000thPowers           10000  avgt   30  99546.370 ±  625.105  us/op
sumOf1000thPowersParallel      10  avgt   30    104.561 ±    1.118  us/op
sumOf1000thPowersParallel     100  avgt   30    995.400 ±   26.116  us/op
sumOf1000thPowersParallel    1000  avgt   30   9969.296 ±   85.166  us/op
sumOf1000thPowersParallel   10000  avgt   30  55856.516 ± 2315.530  us/op

We observe that the results depend linearly on the input size for the sequential runs, which is quite expected. Parallelism doesn't help at all for size = 10, 100, and 1000, which validates my claim that these spliterators cannot split at all for sizes <= 1024. For size = 10000, the parallel version starts performing somewhat better (1.78x), though it's nowhere near the available machine parallelism.

Here's the patched version:

Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    101.380 ±    0.961  us/op
sumOf1000thPowers             100  avgt   30    970.843 ±    8.360  us/op
sumOf1000thPowers            1000  avgt   30   9837.397 ±   93.656  us/op
sumOf1000thPowers           10000  avgt   30  97741.823 ± 1276.116  us/op
sumOf1000thPowersParallel      10  avgt   30     41.597 ±    0.910  us/op
sumOf1000thPowersParallel     100  avgt   30    189.735 ±    1.911  us/op
sumOf1000thPowersParallel    1000  avgt   30   1776.337 ±   31.034  us/op
sumOf1000thPowersParallel   10000  avgt   30  17685.723 ±  127.846  us/op

We observe no regression for the sequential runs and a drastic improvement for the parallel ones. Even for size = 10, we see a 2.46x speedup and about 41 us total time. For bigger sizes, we see a 5.11x..5.54x speedup.

Please review my patch. I'll be happy to discuss any concerns you may have about this optimization. Thanks in advance!
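For readers who want a concrete feel for the idea in the quoted summary, here is a minimal, self-contained sketch of a non-SIZED spliterator over a buffered prefix that reports an artificially large estimate so that downstream splitting keeps going. This is an illustration only, not the actual ArraySpliterator change in the patch; the class name and the exact estimate-halving policy are invented for the example.

```java
import java.util.Arrays;
import java.util.Spliterator;
import java.util.function.Consumer;

/**
 * Illustration only (not the JDK patch): a spliterator over an
 * already-buffered prefix that deliberately stays non-SIZED and
 * reports an artificially large estimate, so that splitting logic
 * keeps dividing it instead of treating it as a tiny leaf.
 */
final class InflatedPrefixSpliterator<T> implements Spliterator<T> {
    private final T[] prefix;   // elements buffered so far
    private int index;          // next element to consume
    private long estimate;      // artificial, much larger than prefix.length

    InflatedPrefixSpliterator(T[] prefix, long estimate) {
        this.prefix = prefix;
        this.estimate = estimate;   // e.g. Long.MAX_VALUE / 2 initially
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        if (index < prefix.length) {
            action.accept(prefix[index++]);
            return true;
        }
        return false;
    }

    @Override
    public Spliterator<T> trySplit() {
        int remaining = prefix.length - index;
        if (remaining < 2) {
            return null;
        }
        int mid = index + remaining / 2;
        // Halve the artificial estimate on each split, mirroring how a
        // balanced split of a real sized spliterator would behave.
        estimate /= 2;
        InflatedPrefixSpliterator<T> head = new InflatedPrefixSpliterator<>(
                Arrays.copyOfRange(prefix, index, mid), estimate);
        index = mid;
        return head;
    }

    @Override
    public long estimateSize() {
        return estimate;   // intentionally NOT the exact remaining count
    }

    @Override
    public int characteristics() {
        return ORDERED;    // note: no SIZED / SUBSIZED here
    }
}
```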
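For context, the benchmark setup is roughly along the following lines. This is a sketch reconstructed from the description above (a `Pattern.splitAsStream` source, a `BigInteger.pow(1000)` mapping, inputs in the 1000..2000 range), not necessarily the exact code in the commit; the class name and input generation are assumptions.

```java
import org.openjdk.jmh.annotations.*;

import java.math.BigInteger;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SplitAsStreamBench {
    private static final Pattern COMMA = Pattern.compile(",");

    @Param({"10", "100", "1000", "10000"})
    private int size;

    private String input;

    @Setup
    public void setup() {
        // Comma-separated numbers in the 1000..2000 range, matching the
        // inputs described in the comment above.
        input = IntStream.range(0, size)
                .mapToObj(i -> String.valueOf(1000 + (i % 1000)))
                .collect(Collectors.joining(","));
    }

    @Benchmark
    public BigInteger sumOf1000thPowers() {
        return COMMA.splitAsStream(input)
                .map(BigInteger::new)
                .map(n -> n.pow(1000))
                .reduce(BigInteger.ZERO, BigInteger::add);
    }

    @Benchmark
    public BigInteger sumOf1000thPowersParallel() {
        return COMMA.splitAsStream(input)
                .parallel()
                .map(BigInteger::new)
                .map(n -> n.pow(1000))
                .reduce(BigInteger.ZERO, BigInteger::add);
    }
}
```

Before the patch, the `Pattern.splitAsStream` spliterator cannot split inputs of 1024 elements or fewer at all, so the parallel variant degenerates to sequential execution; with the patch, even the size-10 case splits into several non-empty chunks, which is what the numbers above show.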
------------- PR: https://git.openjdk.java.net/jdk/pull/7279
