On Thu, 10 Feb 2022 03:22:42 GMT, Tagir F. Valeev <[email protected]> wrote:

>> See the bug description for details.
>> 
>> I propose a simple solution. Let's allow ArraySpliterator to be non-SIZED 
>> and report artificial estimatedSize(), much bigger than the real one. This 
>> will allow AbstractSpliterator and IteratorSpliterator to produce prefix 
>> whose size is comparable to Long.MAX_VALUE (say, starting with 
>> Long.MAX_VALUE/2), and this will enable further splitting of the prefix. 
>> This change will drastically improve parallel streaming for affected streams 
>> of size <= 1024 and significantly improve for streams of size 1025..20000. 
>> The cost is higher-grained splitting for huge streams of unknown size. This 
>> might add a minor overhead for such scenarios which, I believe, is 
>> completely tolerable.
>> 
>> No public API changes are necessary, sequential processing should not be 
>> affected, except an extra field in ArraySpliterator which increases a 
>> footprint by 8 bytes.
>> 
>> I added a simple test using an artificial collector to ensure that at least 
>> two non-empty parts are created when parallelizing Stream.iterate source. 
>> More testing ideas are welcome.
>
> Tagir F. Valeev has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Benchmark to demonstrate the patch usefulness

I added a microbenchmark to demonstrate the speed improvement achievable by 
this patch. For demonstration, I used `Pattern.splitAsStream` source (though as 
listed in JDK-8280915, many other standard sources are also affected). For 
downstream operation I selected `BigInteger.pow(1000)` which is moderately 
CPU-intensive (takes ~10us on my machine for input numbers within 1000..2000). 
The benchmarking results are as follows (my machine has 8 cores). Here's 
non-patched version:


Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    100.367 ±    0.793  us/op
sumOf1000thPowers             100  avgt   30    963.857 ±    6.252  us/op
sumOf1000thPowers            1000  avgt   30  10012.923 ±   81.091  us/op
sumOf1000thPowers           10000  avgt   30  99546.370 ±  625.105  us/op
sumOf1000thPowersParallel      10  avgt   30    104.561 ±    1.118  us/op
sumOf1000thPowersParallel     100  avgt   30    995.400 ±   26.116  us/op
sumOf1000thPowersParallel    1000  avgt   30   9969.296 ±   85.166  us/op
sumOf1000thPowersParallel   10000  avgt   30  55856.516 ± 2315.530  us/op


We observe that the results depend on the input size linearly for sequential 
run, which is quite expected. Parallel doesn't help at all for size = 10, 100, 
and 1000, which validates my claim that these spliterators cannot split at all 
for sizes <= 1024. For size = 10000, parallel version starts performing 
somewhat better (1.78x), though it's not nearly close to the available machine 
parallelism.

Here's patched version:


Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    101.380 ±    0.961  us/op
sumOf1000thPowers             100  avgt   30    970.843 ±    8.360  us/op
sumOf1000thPowers            1000  avgt   30   9837.397 ±   93.656  us/op
sumOf1000thPowers           10000  avgt   30  97741.823 ± 1276.116  us/op
sumOf1000thPowersParallel      10  avgt   30     41.597 ±    0.910  us/op
sumOf1000thPowersParallel     100  avgt   30    189.735 ±    1.911  us/op
sumOf1000thPowersParallel    1000  avgt   30   1776.337 ±   31.034  us/op
sumOf1000thPowersParallel   10000  avgt   30  17685.723 ±  127.846  us/op


We observe no regression for sequential run and drastic improvement for 
parallel. Even for size = 10, we see 2.46x speedup and 41 us total time. For 
bigger sizes, we see 5.11x..5.54x speedup.

Please review my patch. I'll be happy to discuss any concerns about this 
optimization you may have. Thanks in advance!

-------------

PR: https://git.openjdk.java.net/jdk/pull/7279

Reply via email to