New issue 2901: Disable inlining globally hurts performance
https://bitbucket.org/pypy/pypy/issues/2901/disable-inlining-globally-hurts

Tinho Lee:

`call_assembler_r` is found to be a heavy function when I profile my project 
with the jit log, because of the parameters and result packing/unpacking. It 
seems that inlining would help to solve the problem, however, inlining is not 
work as expected sometimes.

I have construct two test cases, one for recursive call, another for 
overwriting __getattr__. All the tests are run with environment variable 
PYPY_GC_MIN=20G, in order to eliminate the influence of gc. 

Here is the basic functions defined for the recursive call test case

```
#!python

def rec(n):
        return 1 if n <= 1 else rec(n - 1) + n


def a():
        rec(1)
        rec(1)
        rec(1)
        rec(1)
        rec(1)


def b():
        rec(100)
        rec(100)
        rec(100)


def test_a():
        start = time.time()
        for n in xrange(10000000):
                a()
        print 'a', time.time() - start


def test_b():
        start = time.time()
        for n in xrange(100000):
                b()
        print 'b', time.time() - start
```

and the test code

```
#!python

print 'warm up'
test_a()
test_b()

print 'test'
test_a()
test_b()
```

The order of warm up is important in this case. Here is the result if `test_a` 
is called before `test_b`:

```
warm up
a 0.0164198875427
b 0.196079015732
test
a 0.0130050182343
b 0.189415931702
```

And the result if `test_b` is called before `test_a`:

```
warm up
b 0.198947906494
a 1.29781603813
test
a 1.27277779579
b 0.186721801758
```

The performance of function `a` is greatly hurts by the order of warm up. I 
have found some clues from the jit log. If `test_a` is warmed up first, `rec` 
would be inlined in the function `a`, everything works fine. However, when 
`test_b` is warmed up first, `rec` would be detected as recursive function call 
and inline is disabled, which also affect the performance of `a`. 

The implementation lies in `_opimpl_recursive_call` of pyjitpl.py. 

```
#!python

if count >= memmgr.max_unroll_recursion:
    # This function is recursive and has exceeded the
    # maximum number of unrollings we allow. We want to stop
    # inlining it further and to make sure that, if it
    # hasn't happened already, the function is traced
    # separately as soon as possible.
    if have_debug_prints():
        loc = targetjitdriver_sd.warmstate.get_location_str(greenboxes)
        debug_print("recursive function (not inlined):", loc)
    warmrunnerstate.dont_trace_here(greenboxes)
```

And the same behavior is found for the non-recursive function call. `--jit 
trace_limit=300` is set as the `pypy` startup parameter to make the test 
easily. Here is the basic definitions

```
#!python

class X(object):
        def __getattr__(self, name):
                return len(name)


def a(x):
        return x.a + x.b + x.c


def b(x):
        a = 1
        a += 1
        a += 2
        a += 3
        a += 4
        a += 5
        a += 6
        a += 7
        a += 8
        a += 9
        return x.a + x.b + x.c


def test_a(x):
        start = time.time()
        for n in xrange(1000000):
                a(x)
        print 'a', time.time() - start


def test_b(x):
        start = time.time()
        for n in xrange(1000000):
                b(x)
        print 'b', time.time() - start
```

and the test case

```
#!python

x = X()

print 'warm up'
test_a(x)
test_b(x)

print 'test'
test_a(x)
test_b(x)
```

The order of warm-up also have significant influence on the result. 

When `test_a` is called before `test_b`, `__getattr__` is inlined for better 
performance.

```
warm up
a 0.0307860374451
b 0.114766836166
test
a 0.0264270305634
b 0.110279083252
```

When `test_b` is called before `test_a`, `__getattr__` is disabled for 
inlining, which result in worse performance for `a`. 

```
warm up
b 0.115270137787
a 0.111267089844
test
a 0.10728096962
b 0.108850002289
```

I came up with some explanations for the behavior from the code 
`blackhole_if_trace_too_long` in pyjitpl.py. If `test_b` is called first, trace 
too long would be detected since a small trace limit is set. Then the biggest 
function in `b`, which is `__getattr__`, is set to be noninlineable. 

```
#!python

def blackhole_if_trace_too_long(self):
    warmrunnerstate = self.jitdriver_sd.warmstate
    if (self.history.length() > warmrunnerstate.trace_limit or
            self.history.trace_tag_overflow()):
        jd_sd, greenkey_of_huge_function = self.find_biggest_function()
        self.staticdata.stats.record_aborted(greenkey_of_huge_function)
        self.portal_trace_positions = None
        if greenkey_of_huge_function is not None:
            jd_sd.warmstate.disable_noninlinable_function(
                greenkey_of_huge_function)
            self.aborted_tracing_jitdriver = jd_sd
            self.aborted_tracing_greenkey = greenkey_of_huge_function
            if self.current_merge_points:
                jd_sd = self.jitdriver_sd
                greenkey = 
self.current_merge_points[0][0][:jd_sd.num_green_args]
                warmrunnerstate.JitCell.trace_next_iteration(greenkey)
        raise SwitchToBlackhole(Counters.ABORT_TOO_LONG)
```

It seems that the disable inlining have a global influence on all the tracing 
path. For the cases above, inlining should be enabled in `a` and be disabled in 
`b`, but the truth is inlining is disabled in both `a` and `b`. 

Is there any proposal could help to improve the performance?


_______________________________________________
pypy-issue mailing list
pypy-issue@python.org
https://mail.python.org/mailman/listinfo/pypy-issue

Reply via email to