Hi Dmitry,

On Mon, October 6, 2014 09:01, Anatol Belski wrote:
> On Sun, October 5, 2014 21:32, Anatol Belski wrote:
>
>> Hi Dmitry,
>>
>> On Wed, October 1, 2014 08:01, Dmitry Stogov wrote:
>>
>>> Hi Anatol,
>>>
>>> I know, TSRM uses TLS APIs internally.
>>>
>>> In my opinion, the simplest (and probably the most efficient) way to
>>> get rid of the TSRMLS_DC arguments and TSRMLS_FETCH calls would be to
>>> introduce a global thread-specific variable:
>>>
>>> __thread void ***tsrm_ls;
>>>
>>> As I understand it, this won't work on Windows anyway, because the
>>> Windows linker is not smart enough to use TLS variables across
>>> different DLLs. Maybe it's possible to have a local thread-specific
>>> copy of tsrm_ls for each DLL, but then we would have to keep them
>>> consistent...
>>>
>>> Sorry, I can't give you any advice, and I can't spend a lot of time on
>>> this topic.
>>>
>>> Maybe the description of TLS internals on ELF systems will give you
>>> some ideas:
>>>
>>> http://www.akkadia.org/drepper/tls.pdf
>>>
>>> Thanks. Dmitry.
>>>
>> I've reworked this patch to use one pointer per shared unit. Please
>> see
>> http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
>> (though this is just the first commit in the series). Afterwards I
>> adapted ext/standard and also converted ext/sockets as an example,
>> because it's usually compiled shared.
>>
>> With this change I see much better performance: the difference from
>> the master TS build is in the 50-100 ms range. Some individual tests
>> in bench.php even show a better result.
>>
>> However, this is not a global __thread variable but one local to
>> every shared unit. That is, tsrm_ls has to be declared in every .so,
>> .dll or .exe and updated on request. For now I've put the update code
>> in MINIT and in the first ctor called (zmm is the one in php7ts.dll).
>> The ctor seems to be the only reliable place (but maybe I'm wrong);
>> although it'll be called for every request instead of once per thread,
>> that shouldn't be too bad.
>>
>>
>> I'd suggest going this way so that we have the same flow everywhere.
>>
>>
>>
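To make the per-unit approach from the quoted mail concrete, here is a minimal sketch in C. It is not the code from the patch: demo_slow_fetch(), module_update_tsrm_ls(), module_get_tsrm_ls() and the demo_* variables are hypothetical names made up for illustration, and the slow path only stands in for a TSRMLS_FETCH-style lookup. The one idea taken from the mail is that each shared unit keeps its own thread-local tsrm_ls and refreshes it from a startup hook.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
# define MODULE_TLS __declspec(thread)
#else
# define MODULE_TLS __thread
#endif

/* Hypothetical stand-ins for the per-thread storage that TSRM manages. */
static MODULE_TLS void *demo_slots[2];
static MODULE_TLS void **demo_table;
static MODULE_TLS int demo_slow_lookups; /* slow-path calls on this thread */

/* Hypothetical slow path, playing the role of a TSRMLS_FETCH-style lookup. */
static void ***demo_slow_fetch(void)
{
    demo_slow_lookups++;
    demo_table = demo_slots;
    return &demo_table;
}

/* One such variable per shared unit (.so, .dll or .exe). It is static,
 * so no thread-local symbol has to cross a module boundary. */
static MODULE_TLS void ***module_tsrm_ls = NULL;

/* To be called from the unit's startup hook (MINIT or the first ctor
 * in the mail above); refreshes the cached pointer for this thread. */
static void module_update_tsrm_ls(void)
{
    module_tsrm_ls = demo_slow_fetch();
}

/* Hot-path accessor: no extra function argument and no lookup call. */
static void ***module_get_tsrm_ls(void)
{
    assert(module_tsrm_ls != NULL); /* the update hook must have run */
    return module_tsrm_ls;
}
```

After one call to module_update_tsrm_ls() on a thread, every later module_get_tsrm_ls() on that thread is a plain thread-local read. The cost is that each unit has to run its own update hook, which is the consistency concern raised in the quoted text.
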
The perf issue is fixed now. Only the core is converted so far, but here
are the Zend/bench.php results on 64-bit (times in seconds):

master ts linux

simple             0.158
simplecall         0.050
simpleucall        0.148
simpleudcall       0.151
mandel             0.310
mandel2            0.337
ackermann(7)       0.088
ary(50000)         0.010
ary2(50000)        0.009
ary3(2000)         0.154
fibo(30)           0.285
hash1(50000)       0.029
hash2(500)         0.023
heapsort(20000)    0.072
matrix(20)         0.082
nestedloop(12)     0.204
sieve(30)          0.062
strcat(200000)     0.014
------------------------
Total              2.185


native-tls linux

simple             0.072
simplecall         0.036
simpleucall        0.163
simpleudcall       0.169
mandel             0.297
mandel2            0.354
ackermann(7)       0.123
ary(50000)         0.010
ary2(50000)        0.009
ary3(2000)         0.158
fibo(30)           0.396
hash1(50000)       0.030
hash2(500)         0.024
heapsort(20000)    0.072
matrix(20)         0.069
nestedloop(12)     0.130
sieve(30)          0.054
strcat(200000)     0.011
------------------------
Total              2.178


master ts windows

simple             0.100
simplecall         0.048
simpleucall        0.146
simpleudcall       0.120
mandel             0.292
mandel2            0.364
ackermann(7)       0.091
ary(50000)         0.009
ary2(50000)        0.008
ary3(2000)         0.133
fibo(30)           0.238
hash1(50000)       0.025
hash2(500)         0.020
heapsort(20000)    0.076
matrix(20)         0.069
nestedloop(12)     0.168
sieve(30)          0.048
strcat(200000)     0.011
------------------------
Total              1.965


native-tls windows

simple             0.100
simplecall         0.050
simpleucall        0.108
simpleudcall       0.110
mandel             0.292
mandel2            0.347
ackermann(7)       0.097
ary(50000)         0.009
ary2(50000)        0.008
ary3(2000)         0.140
fibo(30)           0.280
hash1(50000)       0.025
hash2(500)         0.021
heapsort(20000)    0.075
matrix(20)         0.072
nestedloop(12)     0.176
sieve(30)          0.048
strcat(200000)     0.010
------------------------
Total              1.969
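
For a quick comparison of the tables above, the relative change can be computed as (candidate - base) / base. The following is only a small helper sketch; the function names are made up for illustration, and the numbers are copied from the tables.

```c
#include <stdio.h>

/* Relative change of candidate against base, in percent.
 * A negative value means the candidate run was faster. */
static double relative_change_percent(double base, double candidate)
{
    return (candidate - base) / base * 100.0;
}

/* Prints one comparison line with a signed percentage. */
static void print_change(const char *label, double base, double candidate)
{
    printf("%-14s %+.2f%%\n", label, relative_change_percent(base, candidate));
}

/* Compares master ts (base) with native-tls (candidate). */
static void print_summary(void)
{
    print_change("linux total", 2.185, 2.178);
    print_change("windows total", 1.965, 1.969);
    print_change("linux fibo", 0.285, 0.396);
    print_change("windows fibo", 0.238, 0.280);
}
```

By this measure the totals differ by roughly -0.3% on Linux and +0.2% on Windows, while fibo(30) is roughly 39% slower on Linux and 18% slower on Windows with native-tls.
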


There is still some room for improvement (for instance the fibo results),
but overall the results show at least the same perf now. What do you
think, guys?

Regards

Anatol


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
