On Thu, 25 Sep 2008, Przemyslaw Czerpak wrote:
Hi All,
> The cost of TLS access is strictly compiler/OS dependent. I've
> just made an interesting experiment to compare the code generated
> when a pointer to a dynamically allocated stack is used instead of
> a static stack address in ST programs.
> I made a very simple modification. In hbstack.c for ST mode I changed:
> extern HB_STACK hb_stack;
> to:
> extern PHB_STACK hb_stack_ptr;
> # define hb_stack ( * hb_stack_ptr )
> and in estack.c:
> # if defined( HB_STACK_MACROS )
> HB_STACK hb_stack;
> # else
> static HB_STACK hb_stack;
> # endif
> to:
> HB_STACK _hb_stack_;
> PHB_STACK hb_stack_ptr = &_hb_stack_;
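The effect of this change can be sketched in a few lines. The HB_STACK layout below is a hypothetical, heavily simplified stand-in (the real structure has more members), and sketch_redirect() is an illustrative helper, not a Harbour function. Only _hb_stack_, hb_stack_ptr and the hb_stack macro come from the modification above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for Harbour's HB_STACK. */
typedef struct
{
   int * pItems;   /* start of the item area */
   int * pPos;     /* current position */
   int * pEnd;     /* one past the allocated area */
} HB_STACK, * PHB_STACK;

/* The single stack object and a pointer to it. Code written against
   "hb_stack" compiles unchanged, but every use now goes through
   hb_stack_ptr. */
HB_STACK  _hb_stack_;
PHB_STACK hb_stack_ptr = &_hb_stack_;
#define hb_stack ( * hb_stack_ptr )

/* Illustrative helper: fills the stack through the macro and returns
   the number of free slots. */
int sketch_redirect( void )
{
   static int items[ 4 ];

   hb_stack.pItems = items;
   hb_stack.pPos   = items + 1;
   hb_stack.pEnd   = items + 4;

   /* both names refer to the same object */
   assert( hb_stack_ptr->pPos == _hb_stack_.pPos );

   return ( int ) ( hb_stack.pEnd - hb_stack.pPos );
}
```

Each hb_stack.member expression now costs one extra load of hb_stack_ptr unless the compiler keeps the pointer in a register, which is exactly what the listings below compare.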
And now I compared the assembler code generated by BCC-5.5 and
GCC-4.3.1 for the HVM modified this way and this simple code:
void func( void )
{
   hb_stackPush();
   hb_stackPop();
}
BCC with -4 -5 -6 -O2 gives:
;
; void func( void )
; {
; hb_stackPush();
;
@3:
mov eax,dword ptr [_hb_stack_ptr]
add dword ptr [eax+4],4
mov edx,dword ptr [eax+4]
mov ecx,dword ptr [_hb_stack_ptr]
cmp edx,dword ptr [ecx+8]
jne short @4
call _hb_stackIncrease
;
; hb_stackPop();
;
@4:
mov eax,dword ptr [_hb_stack_ptr]
sub dword ptr [eax+4],4
mov edx,dword ptr [_hb_stack_ptr]
mov ecx,dword ptr [edx+4]
mov eax,dword ptr [ecx]
test dword ptr [eax],46085
je short @5
push eax
call _hb_itemClear
pop ecx
;
; }
;
@5:
@6:
ret
Please note that _hb_stack_ptr is always accessed 4 times.
With my GCC the final code for -O3 is:
func:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl hb_stack_ptr, %ecx
movl 4(%ecx), %eax
addl $4, %eax
cmpl 8(%ecx), %eax
movl %eax, 4(%ecx)
je .L6
.L2:
movl 4(%ecx), %edx
leal -4(%edx), %eax
movl %eax, 4(%ecx)
movl -4(%edx), %eax
testw $-19451, (%eax)
jne .L7
leave
ret
.L7:
movl %eax, (%esp)
call hb_itemClear
leave
ret
.L6:
call hb_stackIncrease
movl hb_stack_ptr, %ecx
jmp .L2
It accesses hb_stack_ptr only _ONCE_ during normal code execution.
The second hb_stack_ptr access is used only when an external function
like hb_stackIncrease() has to be called (in practice never, or a few
times in the whole application life).
And this explains the speed difference. With such optimization
the overhead in my builds is minimal when native TLS variables
are used. GCC has always been optimized to reduce memory access,
whereas BCC seems to be hardcoded for x86 machines, where the cost
of a memory operation was relatively small in the past. Nowadays
CPU data caches reduce the overhead, but it is still not friendly
code for the CPU optimization logic.
It also shows why TLS costs so much in BCC: four calls instead
of one with my GCC in such a simple example.
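To make the four-versus-one count concrete, here is a small sketch in which every TLS lookup is modelled as a call to a counting function. sketch_tls_stack() is a hypothetical stand-in for whatever mechanism a given compiler/OS uses to reach a thread-local variable. It is not a real API, and the stack layout is again simplified.

```c
#include <assert.h>

/* Hypothetical simplified stack, not Harbour's real HB_STACK. */
typedef struct
{
   int * pPos;   /* current position */
   int * pEnd;   /* one past the allocated area */
} SKETCH_STACK;

static int          s_items[ 8 ];
static SKETCH_STACK s_stack = { s_items, s_items + 8 };
static int          s_tls_calls = 0;

/* Stand-in for one TLS lookup; counts how often it is used. */
static SKETCH_STACK * sketch_tls_stack( void )
{
   ++s_tls_calls;
   return &s_stack;
}

/* Every reference to the stack repeats the lookup, as in the BCC
   listing. Returns the number of lookups performed. */
int sketch_naive( void )
{
   int fFull;

   s_tls_calls = 0;
   ++sketch_tls_stack()->pPos;                                      /* 1 */
   fFull = sketch_tls_stack()->pPos == sketch_tls_stack()->pEnd;    /* 2, 3 */
   assert( ! fFull );   /* the 8 slot sketch never needs to grow */
   ( void ) fFull;
   --sketch_tls_stack()->pPos;                                      /* 4 */

   return s_tls_calls;
}

/* The lookup result is kept in a local variable, as in the GCC
   listing. Returns the number of lookups performed. */
int sketch_cached( void )
{
   SKETCH_STACK * s;

   s_tls_calls = 0;
   s = sketch_tls_stack();                                          /* 1 */
   ++s->pPos;
   assert( s->pPos != s->pEnd );
   --s->pPos;

   return s_tls_calls;
}
```

Here sketch_naive() performs four lookups and sketch_cached() one. Whether a compiler does this caching by itself depends on the compiler and on how TLS is implemented, so the hand-written local variable is only one possible way to get the same effect.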
best regards,
Przemek