Re: [performance] quick sort is 4x slower on Harmony

Pavel Ozhdikhin Fri, 11 Jan 2008 00:38:39 -0800

On 1/10/08, Aleksey Shipilev <[EMAIL PROTECTED]> wrote:
>
> And update here. I have confirmed that the main contributor is
> ValueProfiler.
>
> RI measurement (again):
> === /localdisk/jdk1.6.0_02/bin/java -server GenericQuicksort2 ===
> iteration 0: elapsed: 4825ms
> iteration 1: elapsed: 4805ms
> iteration 2: elapsed: 5128ms
> iteration 3: elapsed: 5125ms
> iteration 4: elapsed: 5130ms
>
> Baseline measurement (again):
> === /nfs/pb/home/ashipile/jre-r610377-clean/bin/java -Xem:server GenericQuicksort2 ===
> iteration 0: elapsed: 178898ms
> iteration 1: elapsed: 5663ms
> iteration 2: elapsed: 5666ms
> iteration 3: elapsed: 5660ms
> iteration 4: elapsed: 5672ms
>
> Collapsing the critical section in ValueProfiler::addNewValue to wrap only
> insert_into_tnv_table; this should be an initial proof of concept for
> moving to a CAS-based increment. Note that the first iteration time decreased
> significantly, so we might consider CAS as an option:
> === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:server GenericQuicksort2 ===
> iteration 0: elapsed: 85127ms
> iteration 1: elapsed: 5665ms
> iteration 2: elapsed: 5665ms
> iteration 3: elapsed: 5667ms
> iteration 4: elapsed: 5679ms
>
>
> Removing synchronization from VP altogether (replacing
> lockProfile/unlockProfile with empty stubs rather than hymutex_*).
> Note a further decrease in ramp-up time and a *boost* in the later stages
> (probably no more locking for concurrent SD1_OPT methods profiling?):
> === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:server GenericQuicksort2 ===
> iteration 0: elapsed: 79678ms
> iteration 1: elapsed: 5018ms
> iteration 2: elapsed: 5014ms
> iteration 3: elapsed: 5013ms
> iteration 4: elapsed: 5028ms
>
> The profile of this mode, FIRST iteration, after 30 seconds of run:
> 27% Other32
> 21% libem#addNewValue
> 10% libharmonyvm#helper_get_interface_vtable
> 17% libem#find
> 8% libem#value_profiler_add_value
> 3% libem#getVPC
> 5% libharmonyvm#rth_get_interface_vtable
> 6% libjitrino#add_value_profile_value
>
> The profile of this mode, LAST iteration:
> 99% Other32
> 1% libjitrino#<various>
>
> Note that the locks have disappeared, which points to the VP locks as the
> problem. After ramp-up there seems to be very little JRE activity, with most
> of the time spent executing user code.
>
> I'm going to propose the option that eliminates synchronization from
> VP completely, sacrificing profile accuracy. Egor, Pavel, what do you
> think? Is removing synchronization too dangerous?



Removing synchronization is unlikely to break execution: the addNewValue
method of the value profile does no allocation and moves no objects. But
it may lead to intermittent slowdowns of the code. A race is possible when
one thread adds a value to a slot while another thread assigns the
frequency of a different value to the same slot, which leaves the slot with
a mismatched value and frequency. We need to evaluate this approach on
other workloads. Egor's suggestion not to update the value profile if a
flag is up might work better.
I'm also thinking of packing the 2 fields of the Simple_TNV_Table structure
into a single value that would be written atomically. This would solve
profile mangling in the lock-less solution.
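
To make the packing idea concrete, here is a minimal sketch, not the actual Harmony code. It assumes a slot holds a 32-bit value and a 32-bit frequency, so both fit into one 64-bit word; PackedSlot and its methods are hypothetical names. If the profiled values are pointer-sized on a 64-bit platform, they would not fit this way and a different scheme would be needed.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for one profile slot: a 32-bit value and a 32-bit
// frequency packed into one 64-bit word. This is NOT Harmony's real layout.
struct PackedSlot {
    std::atomic<std::uint64_t> bits{0};

    static std::uint64_t pack(std::uint32_t value, std::uint32_t freq) {
        return (static_cast<std::uint64_t>(value) << 32) | freq;
    }
    static std::uint32_t value_of(std::uint64_t w) {
        return static_cast<std::uint32_t>(w >> 32);
    }
    static std::uint32_t freq_of(std::uint64_t w) {
        return static_cast<std::uint32_t>(w);
    }

    // Replace value and frequency together in a single atomic store, so a
    // reader can never see the value of one update with the frequency of another.
    void store(std::uint32_t value, std::uint32_t freq) {
        bits.store(pack(value, freq), std::memory_order_relaxed);
    }

    // Increment the frequency only while the slot still holds `value`.
    // Returns false if the slot holds a different value.
    bool increment_if(std::uint32_t value) {
        std::uint64_t old = bits.load(std::memory_order_relaxed);
        while (value_of(old) == value) {
            std::uint64_t next = pack(value, freq_of(old) + 1);
            if (bits.compare_exchange_weak(old, next, std::memory_order_relaxed)) {
                return true;
            }
            // On failure `old` has been reloaded; the loop re-checks the value.
        }
        return false;
    }
};
```

With this shape a concurrent writer can still win the race, but the slot always holds a value and frequency that were written together.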

So the next steps might be the following:
- Check how the lock-less solution works with bigger workloads (SPECs, DaCapo,
etc.)
- Check Egor's suggestion with a flag
- If needed, prototype an atomic profile update
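
One possible reading of the flag suggestion, as a rough sketch only: a thread that finds the flag up drops its sample instead of waiting. FlaggedProfile and try_add are hypothetical names, and the single-slot update is a simplification standing in for the real table insert.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical profile guarded by a "busy" flag instead of a mutex.
struct FlaggedProfile {
    std::atomic<bool> busy{false};
    std::uint32_t value = 0;  // simplified: a single slot, not a full table
    std::uint32_t freq = 0;

    // Returns true if the sample was recorded, false if it was skipped
    // because another thread had the flag up.
    bool try_add(std::uint32_t v) {
        if (busy.exchange(true, std::memory_order_acquire)) {
            return false;  // flag is up: drop this sample rather than block
        }
        if (freq != 0 && value == v) {
            ++freq;        // same value seen again
        } else {
            value = v;     // simplified replacement policy (an assumption)
            freq = 1;
        }
        busy.store(false, std::memory_order_release);
        return true;
    }
};
```

The trade-off is that samples are lost under contention, but no thread ever waits and the slot is never updated by two threads at once.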


> Just a thought: the next thing we should consider is making VP stop
> profiling once an optimized version of the code is available, since we don't
> care about profile information after that.


I think this is worth implementing.
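
A rough sketch of what that could look like, assuming a per-method switch that is turned off once the optimized code is installed. MethodProfile, add_value and on_optimized_code_ready are hypothetical names, not existing VP entry points.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-method profile with an on/off switch.
struct MethodProfile {
    std::atomic<bool> active{true};
    std::atomic<std::uint64_t> samples{0};

    // Called from instrumented code; becomes a cheap no-op once profiling
    // has been switched off for this method.
    void add_value(std::uint64_t /*value*/) {
        if (!active.load(std::memory_order_relaxed)) {
            return;
        }
        samples.fetch_add(1, std::memory_order_relaxed);  // stands in for the real update
    }

    // Assumed hook: invoked when the optimized version of the method is ready.
    void on_optimized_code_ready() {
        active.store(false, std::memory_order_relaxed);
    }
};
```

Threads still running the old instrumented code would then only pay for one relaxed load per profiled value.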


> Thanks,
> Aleksey.
>
