On Thu, 26 Nov 2009, Maurilio Longo wrote:
Hi,
> the second one is built with
> hbmk2 -mt -gc3 speedtst
> and not -gc2 as I wrongly wrote before.
> > I've made a couple of tests: in the first one speedtst was built with
> > hbmk2 -mt speedtst
> > in the second with
> > hbmk2 -mt -gc2 speedtst
> >
> > Now, calling
> > speedtst --thread=2 --scale
> >
> > on a PIV HT with OS/2 and a SMP kernel I see that the first speedtst when
> > run has a scaling factor of 1.18 while the second one has a scaling
> > factor of 1.32.
> >
> > Question is: is this simply a result of the more work the vm has to do
> > compared to the -gc3 build or are there serialization points inside the vm
> > which slow the first speedtst compared to the second?
No. Either the proportions between the costs of different operations have
simply changed, or it's the effect of activity by other processes, which
reduces simultaneous access to the two CPUs. Please remember that this
test uses real time, so any other code executed by the OS strongly
reduces the scalability factor.
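To make the factor column concrete: judging from the numbers in the result
tables, the factor looks like the "1 th." time divided by the "2 th." time.
That reading is inferred from the tables, not taken from the speedtst source,
and scaling_factor is a hypothetical helper name. A minimal Python sketch:

```python
def scaling_factor(time_1_thread: float, time_n_threads: float) -> float:
    """Ratio of the '1 th.' column to the '2 th.' column (assumed formula)."""
    return time_1_thread / time_n_threads


# (1 th., 2 th.) pairs copied from the first result table in this message
samples = {
    "T001": (0.12, 0.06),
    "T004": (0.15, 0.24),
    "T044": (0.30, 0.64),
    "TOTAL": (38.62, 23.50),
}

for name, (t1, t2) in samples.items():
    print(f"{name}: {scaling_factor(t1, t2):.2f}")
```

The printed values can differ from the table in the last digit, because the
table seems to be computed from unrounded timings while this sketch only has
the two-decimal values to work with.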
Harbour core code is MT safe in practice without any internal locks.
There is only one exception, bound with memory allocation. Allocating
new complex items like codeblocks, arrays, hashes or GC pointers activates
code which needs a mutex lock to register the new GC memory block in a list
of GC blocks. It's a very short lock, but a test like --scale in speedtst,
where two threads execute exactly the same operations in a loop, can hit it
heavily and reduce the scalability factor. How much depends on the hardware
and OS used. In speedtst.prg a few tests allocate and destroy complex items
in a loop. [ T044: x := {} ], which does nothing else, best illustrates the
maximal overhead on given hardware and OS. The cost of failing locks
strongly depends on hardware, and tests show that sometimes there are huge
differences between very similar computers which use only slightly
different mainboards and/or CPUs. Also important is the cost of suspending
and resuming threads when a lock fails, which depends on the OS used
(sometimes also on the CRTL).
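The pattern described above can be sketched roughly as follows. This is a
Python illustration of the idea only, not Harbour's actual code: GcRegistry,
register, count and worker are hypothetical names, and the loop count is
arbitrary.

```python
import threading


class GcRegistry:
    """Hypothetical stand-in for one global list of GC-managed blocks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = []

    def register(self, block):
        # Very short critical section: only the list update is protected.
        with self._lock:
            self._blocks.append(block)

    def count(self):
        with self._lock:
            return len(self._blocks)


def worker(registry, n_loops):
    # Comparable to running x := {} in a loop: allocate, then register.
    for _ in range(n_loops):
        registry.register([])


registry = GcRegistry()
threads = [threading.Thread(target=worker, args=(registry, 10000))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(registry.count())
```

Both threads compete for the same lock on every iteration, which is the
situation the --scale test provokes. In CPython the interpreter lock already
serializes the threads, so this shows only the structure of the critical
section, not its real cost.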
For good scalability it's very important to eliminate all locks which can
fail often. The best code is fully reentrant and does not use any locks,
but tests show that locks which nearly always succeed do not reduce
performance.
As I said, in Harbour core code only memory allocation for complex items
uses internal locks. But outside Harbour core code, a memory manager
which is not optimized to work well in an MT environment can also use
internal locks and reduce scalability. In such a case HB_FM_DLMT_ALLOC can
be used; it uses 16 separate memory pools to reduce possible conflicts
on multi CPU machines.
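The idea of several pools can be sketched like this. Only the number 16 comes
from the text above; how HB_FM_DLMT_ALLOC really selects a pool is not stated
here, so choosing the pool by a thread index is an assumption, and
PooledRegistry, register, pool_size and total are hypothetical names.

```python
import threading

N_POOLS = 16  # number of separate pools mentioned in the text


class PooledRegistry:
    """Hypothetical registry split into independent pools, one lock each."""

    def __init__(self, n_pools=N_POOLS):
        self._locks = [threading.Lock() for _ in range(n_pools)]
        self._pools = [[] for _ in range(n_pools)]

    def register(self, block, thread_index):
        # Assumed policy: a thread always uses the pool matching its index,
        # so two threads with different indexes never share a lock.
        slot = thread_index % len(self._pools)
        with self._locks[slot]:
            self._pools[slot].append(block)

    def pool_size(self, slot):
        with self._locks[slot]:
            return len(self._pools[slot])

    def total(self):
        return sum(self.pool_size(i) for i in range(len(self._pools)))


def worker(registry, thread_index, n_loops):
    for _ in range(n_loops):
        registry.register([], thread_index)


pooled = PooledRegistry()
threads = [threading.Thread(target=worker, args=(pooled, i, 10000))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(pooled.total(), pooled.pool_size(0), pooled.pool_size(1))
```

With this layout a lock can only fail when two threads map to the same pool,
which is why more pools mean fewer conflicts.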
In the future we can try a similar trick to reduce possible conflicts
in the code which allocates new complex items. It will surely improve
scalability nicely in tests like [ T044: x := {} ], but in real applications,
which also do other things than allocating an empty array in a loop, the
chance of a lock conflict is much smaller, so it's not that important.
Anyhow, I plan to implement it so that it can be enabled, at least
optionally, at Harbour build time for machines which have a really big
number of CPUs and for programs which use many threads and make extensive
use of the existing CPUs.
BTW I've just run a test on my 3 CPU machine to compare scalability for
2 threads. I intentionally used 2 threads instead of 3 to limit the
reduction of the factor caused by other processes running in this system.
I also disabled power saving mode with automatic CPU frequency scaling; it
can strongly reduce scalability in some cases. Detailed results are below.
Then I repeated it with the tests that allocate complex items in a loop
disabled:
./speedtst --thread=2 --scale --exclude=023,025,027,031,041,042,044
Those results are also below. DLMT was used as the memory manager in this
Harbour build. As you can see, we have nearly perfect scalability and the
only factor in:
[ T004: x := S_C ]______________________________ 0.14 0.24 -> 0.59
is not close to 2.0. That is expected behavior, because in this test two
threads operate on exactly the same string item, incrementing and
decrementing its reference counter with hardware atomic inc/dec operations,
which of course needs low level hardware synchronization between CPUs. T004
is also quite a good test for checking the hardware cost of a failing lock,
because it's comparable to a conflict in LOCK INC / LOCK DEC operations.
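As a cross-check of the two runs, the timings of the excluded tests can be
subtracted from the TOTAL line of the first run. A small Python sketch of
that arithmetic, using only numbers copied from the first result table
(the variable names are mine):

```python
# (1 th., 2 th.) timings of the excluded tests, copied from the first run
excluded = {
    "T023": (0.85, 0.96),
    "T025": (0.64, 0.98),
    "T027": (0.74, 1.17),
    "T031": (5.91, 3.40),
    "T041": (5.05, 3.08),
    "T042": (1.56, 1.15),
    "T044": (0.30, 0.64),
}
total_1th, total_2th = 38.62, 23.50  # TOTAL line of the first run

# What remains of the first run once the allocating tests are removed
rest_1th = total_1th - sum(t1 for t1, _ in excluded.values())
rest_2th = total_2th - sum(t2 for _, t2 in excluded.values())
rest_factor = rest_1th / rest_2th

print(f"remaining: {rest_1th:.2f} {rest_2th:.2f} -> {rest_factor:.2f}")
```

The remainder comes out at roughly 23.57 and 12.12, a factor of about 1.94,
which is close to the 23.49 12.07 -> 1.95 TOTAL of the second run. So these
seven tests account for practically the whole drop to 1.64.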
best regards,
Przemek
2009.11.26 21:44:59 Linux 2.6.31.5-0.1-desktop x86_64
Harbour 2.0.0beta3 (Rev. 13026) (MT)+ GNU C 4.4 (64-bit) x86-64
THREADS: 2
N_LOOPS: 1000000
1 th. 2 th. factor
============================================================================
[ T001: x := L_C ]____________________________________ 0.12 0.06 -> 2.00
[ T002: x := L_N ]____________________________________ 0.10 0.05 -> 1.96
[ T003: x := L_D ]____________________________________ 0.10 0.05 -> 2.00
[ T004: x := S_C ]____________________________________ 0.15 0.24 -> 0.62
[ T005: x := S_N ]____________________________________ 0.12 0.07 -> 1.78
[ T006: x := S_D ]____________________________________ 0.12 0.06 -> 1.91
[ T007: x := M->M_C ]_________________________________ 0.16 0.08 -> 1.99
[ T008: x := M->M_N ]_________________________________ 0.14 0.07 -> 1.99
[ T009: x := M->M_D ]_________________________________ 0.14 0.07 -> 2.01
[ T010: x := M->P_C ]_________________________________ 0.16 0.08 -> 1.99
[ T011: x := M->P_N ]_________________________________ 0.14 0.07 -> 1.99
[ T012: x := M->P_D ]_________________________________ 0.14 0.07 -> 2.00
[ T013: x := F_C ]____________________________________ 0.38 0.19 -> 1.97
[ T014: x := F_N ]____________________________________ 0.42 0.21 -> 2.00
[ T015: x := F_D ]____________________________________ 0.24 0.12 -> 1.99
[ T016: x := o:Args ]_________________________________ 0.31 0.15 -> 2.00
[ T017: x := o[2] ]___________________________________ 0.21 0.10 -> 1.97
[ T018: round( i / 1000, 2 ) ]________________________ 0.47 0.24 -> 1.91
[ T019: str( i / 1000 ) ]_____________________________ 0.82 0.41 -> 2.01
[ T020: val( s ) ]____________________________________ 0.45 0.23 -> 2.01
[ T021: val( a [ i % 16 + 1 ] ) ]_____________________ 0.71 0.36 -> 1.99
[ T022: dtos( d - i % 10000 ) ]_______________________ 0.69 0.34 -> 2.01
[ T023: eval( { || i % 16 } ) ]_______________________ 0.85 0.96 -> 0.89
[ T024: eval( bc := { || i % 16 } ) ]_________________ 0.50 0.25 -> 1.98
[ T025: eval( { |x| x % 16 }, i ) ]___________________ 0.64 0.98 -> 0.66
[ T026: eval( bc := { |x| x % 16 }, i ) ]_____________ 0.49 0.24 -> 2.00
[ T027: eval( { |x| f1( x ) }, i ) ]__________________ 0.74 1.17 -> 0.64
[ T028: eval( bc := { |x| f1( x ) }, i ) ]____________ 0.57 0.28 -> 2.00
[ T029: eval( bc := &("{ |x| f1( x ) }"), i ) ]_______ 0.59 0.29 -> 2.02
[ T030: x := &( "f1(" + str(i) + ")" ) ]______________ 4.88 2.58 -> 1.89
[ T031: bc := &( "{|x|f1(x)}" ), eval( bc, i ) ]______ 5.91 3.40 -> 1.74
[ T032: x := valtype( x ) + valtype( i ) ]___________ 0.71 0.36 -> 1.98
[ T033: x := strzero( i % 100, 2 ) $ a[ i % 16 + 1 ] ] 1.19 0.60 -> 1.99
[ T034: x := a[ i % 16 + 1 ] == s ]___________________ 0.49 0.25 -> 2.00
[ T035: x := a[ i % 16 + 1 ] = s ]____________________ 0.51 0.26 -> 2.01
[ T036: x := a[ i % 16 + 1 ] >= s ]___________________ 0.51 0.26 -> 2.00
[ T037: x := a[ i % 16 + 1 ] <= s ]___________________ 0.51 0.26 -> 2.00
[ T038: x := a[ i % 16 + 1 ] < s ]____________________ 0.51 0.26 -> 1.99
[ T039: x := a[ i % 16 + 1 ] > s ]____________________ 0.52 0.26 -> 1.98
[ T040: ascan( a, i % 16 ) ]__________________________ 0.54 0.27 -> 2.04
[ T041: ascan( a, { |x| x == i % 16 } ) ]_____________ 5.05 3.08 -> 1.64
[ T042: iif( i%1000==0, a:={}, ) , aadd(a,{i,1,.T.,s ] 1.56 1.15 -> 1.36
[ T043: x := a ]______________________________________ 0.13 0.06 -> 1.98
[ T044: x := {} ]_____________________________________ 0.30 0.64 -> 0.47
[ T045: f0() ]________________________________________ 0.18 0.09 -> 2.06
[ T046: f1( i ) ]_____________________________________ 0.24 0.12 -> 2.03
[ T047: f2( c[1...8] ) ]______________________________ 0.24 0.12 -> 1.98
[ T048: f2( c[1...40000] ) ]__________________________ 0.24 0.12 -> 2.05
[ T049: f2( @c[1...40000] ) ]_________________________ 0.24 0.12 -> 2.04
[ T050: f2( @c[1...40000] ), c2 := c ]________________ 0.29 0.14 -> 2.05
[ T051: f3( a, a2, s, i, s2, bc, i, n, x ) ]__________ 0.59 0.30 -> 1.95
[ T052: f2( a ) ]_____________________________________ 0.24 0.12 -> 2.04
[ T053: x := f4() ]___________________________________ 0.83 0.44 -> 1.91
[ T054: x := f5() ]___________________________________ 0.48 0.24 -> 1.99
[ T055: x := space(16) ]______________________________ 0.37 0.19 -> 1.97
[ T056: f_prv( c ) ]__________________________________ 0.67 0.33 -> 2.01
============================================================================
[ TOTAL ]_________________________________________ 38.62 23.50 -> 1.64
============================================================================
[ total application time: ]....................................85.03
[ total real time: ]...........................................62.12
2009.11.26 21:46:55 Linux 2.6.31.5-0.1-desktop x86_64
Harbour 2.0.0beta3 (Rev. 13026) (MT)+ GNU C 4.4 (64-bit) x86-64
THREADS: 2
N_LOOPS: 1000000
excluded tests: 023,025,027,031,041,042,044
1 th. 2 th. factor
============================================================================
[ T001: x := L_C ]____________________________________ 0.12 0.06 -> 1.98
[ T002: x := L_N ]____________________________________ 0.10 0.05 -> 2.02
[ T003: x := L_D ]____________________________________ 0.10 0.05 -> 2.00
[ T004: x := S_C ]____________________________________ 0.14 0.24 -> 0.59
[ T005: x := S_N ]____________________________________ 0.12 0.06 -> 1.98
[ T006: x := S_D ]____________________________________ 0.12 0.06 -> 1.98
[ T007: x := M->M_C ]_________________________________ 0.15 0.08 -> 1.96
[ T008: x := M->M_N ]_________________________________ 0.14 0.07 -> 1.99
[ T009: x := M->M_D ]_________________________________ 0.14 0.07 -> 2.00
[ T010: x := M->P_C ]_________________________________ 0.16 0.08 -> 2.01
[ T011: x := M->P_N ]_________________________________ 0.14 0.07 -> 1.97
[ T012: x := M->P_D ]_________________________________ 0.14 0.07 -> 1.99
[ T013: x := F_C ]____________________________________ 0.37 0.19 -> 1.94
[ T014: x := F_N ]____________________________________ 0.41 0.21 -> 1.97
[ T015: x := F_D ]____________________________________ 0.24 0.12 -> 2.02
[ T016: x := o:Args ]_________________________________ 0.32 0.16 -> 1.99
[ T017: x := o[2] ]___________________________________ 0.21 0.11 -> 2.00
[ T018: round( i / 1000, 2 ) ]________________________ 0.45 0.24 -> 1.86
[ T019: str( i / 1000 ) ]_____________________________ 0.75 0.38 -> 1.99
[ T020: val( s ) ]____________________________________ 0.45 0.22 -> 2.02
[ T021: val( a [ i % 16 + 1 ] ) ]_____________________ 0.68 0.34 -> 2.02
[ T022: dtos( d - i % 10000 ) ]_______________________ 0.66 0.33 -> 2.02
[ T024: eval( bc := { || i % 16 } ) ]_________________ 0.49 0.25 -> 1.96
[ T026: eval( bc := { |x| x % 16 }, i ) ]_____________ 0.49 0.24 -> 2.00
[ T028: eval( bc := { |x| f1( x ) }, i ) ]____________ 0.57 0.29 -> 1.98
[ T029: eval( bc := &("{ |x| f1( x ) }"), i ) ]_______ 0.58 0.29 -> 2.02
[ T030: x := &( "f1(" + str(i) + ")" ) ]______________ 4.97 2.63 -> 1.89
[ T032: x := valtype( x ) + valtype( i ) ]___________ 0.73 0.36 -> 2.03
[ T033: x := strzero( i % 100, 2 ) $ a[ i % 16 + 1 ] ] 1.17 0.59 -> 1.98
[ T034: x := a[ i % 16 + 1 ] == s ]___________________ 0.49 0.24 -> 1.99
[ T035: x := a[ i % 16 + 1 ] = s ]____________________ 0.52 0.26 -> 2.03
[ T036: x := a[ i % 16 + 1 ] >= s ]___________________ 0.52 0.26 -> 2.01
[ T037: x := a[ i % 16 + 1 ] <= s ]___________________ 0.51 0.25 -> 2.01
[ T038: x := a[ i % 16 + 1 ] < s ]____________________ 0.52 0.26 -> 2.00
[ T039: x := a[ i % 16 + 1 ] > s ]____________________ 0.52 0.26 -> 1.98
[ T040: ascan( a, i % 16 ) ]__________________________ 0.54 0.26 -> 2.05
[ T043: x := a ]______________________________________ 0.13 0.06 -> 2.00
[ T045: f0() ]________________________________________ 0.19 0.09 -> 2.16
[ T046: f1( i ) ]_____________________________________ 0.24 0.12 -> 2.09
[ T047: f2( c[1...8] ) ]______________________________ 0.25 0.12 -> 2.10
[ T048: f2( c[1...40000] ) ]__________________________ 0.25 0.12 -> 2.11
[ T049: f2( @c[1...40000] ) ]_________________________ 0.24 0.12 -> 1.98
[ T050: f2( @c[1...40000] ), c2 := c ]________________ 0.29 0.15 -> 1.95
[ T051: f3( a, a2, s, i, s2, bc, i, n, x ) ]__________ 0.60 0.30 -> 1.99
[ T052: f2( a ) ]_____________________________________ 0.25 0.12 -> 2.07
[ T053: x := f4() ]___________________________________ 0.81 0.41 -> 1.98
[ T054: x := f5() ]___________________________________ 0.49 0.25 -> 1.99
[ T055: x := space(16) ]______________________________ 0.36 0.19 -> 1.93
[ T056: f_prv( c ) ]__________________________________ 0.69 0.34 -> 2.02
============================================================================
[ TOTAL ]_________________________________________ 23.49 12.07 -> 1.95
============================================================================
[ total application time: ]....................................47.37
[ total real time: ]...........................................35.56