New issue 2901: Disable inlining globally hurts performance https://bitbucket.org/pypy/pypy/issues/2901/disable-inlining-globally-hurts
Tinho Lee: `call_assembler_r` is found to be a heavy function when I profile my project with the jit log, because of the parameters and result packing/unpacking. It seems that inlining would help to solve the problem, however, inlining is not work as expected sometimes. I have construct two test cases, one for recursive call, another for overwriting __getattr__. All the tests are run with environment variable PYPY_GC_MIN=20G, in order to eliminate the influence of gc. Here is the basic functions defined for the recursive call test case ``` #!python def rec(n): return 1 if n <= 1 else rec(n - 1) + n def a(): rec(1) rec(1) rec(1) rec(1) rec(1) def b(): rec(100) rec(100) rec(100) def test_a(): start = time.time() for n in xrange(10000000): a() print 'a', time.time() - start def test_b(): start = time.time() for n in xrange(100000): b() print 'b', time.time() - start ``` and the test code ``` #!python print 'warm up' test_a() test_b() print 'test' test_a() test_b() ``` The order of warm up is important in this case. Here is the result if `test_a` is called before `test_b`: ``` warm up a 0.0164198875427 b 0.196079015732 test a 0.0130050182343 b 0.189415931702 ``` And the result if `test_b` is called before `test_a`: ``` warm up b 0.198947906494 a 1.29781603813 test a 1.27277779579 b 0.186721801758 ``` The performance of function `a` is greatly hurts by the order of warm up. I have found some clues from the jit log. If `test_a` is warmed up first, `rec` would be inlined in the function `a`, everything works fine. However, when `test_b` is warmed up first, `rec` would be detected as recursive function call and inline is disabled, which also affect the performance of `a`. The implementation lies in `_opimpl_recursive_call` of pyjitpl.py. ``` #!python if count >= memmgr.max_unroll_recursion: # This function is recursive and has exceeded the # maximum number of unrollings we allow. We want to stop # inlining it further and to make sure that, if it # hasn't happened already, the function is traced # separately as soon as possible. if have_debug_prints(): loc = targetjitdriver_sd.warmstate.get_location_str(greenboxes) debug_print("recursive function (not inlined):", loc) warmrunnerstate.dont_trace_here(greenboxes) ``` And the same behavior is found for the non-recursive function call. `--jit trace_limit=300` is set as the `pypy` startup parameter to make the test easily. Here is the basic definitions ``` #!python class X(object): def __getattr__(self, name): return len(name) def a(x): return x.a + x.b + x.c def b(x): a = 1 a += 1 a += 2 a += 3 a += 4 a += 5 a += 6 a += 7 a += 8 a += 9 return x.a + x.b + x.c def test_a(x): start = time.time() for n in xrange(1000000): a(x) print 'a', time.time() - start def test_b(x): start = time.time() for n in xrange(1000000): b(x) print 'b', time.time() - start ``` and the test case ``` #!python x = X() print 'warm up' test_a(x) test_b(x) print 'test' test_a(x) test_b(x) ``` The order of warm-up also have significant influence on the result. When `test_a` is called before `test_b`, `__getattr__` is inlined for better performance. ``` warm up a 0.0307860374451 b 0.114766836166 test a 0.0264270305634 b 0.110279083252 ``` When `test_b` is called before `test_a`, `__getattr__` is disabled for inlining, which result in worse performance for `a`. ``` warm up b 0.115270137787 a 0.111267089844 test a 0.10728096962 b 0.108850002289 ``` I came up with some explanations for the behavior from the code `blackhole_if_trace_too_long` in pyjitpl.py. If `test_b` is called first, trace too long would be detected since a small trace limit is set. Then the biggest function in `b`, which is `__getattr__`, is set to be noninlineable. ``` #!python def blackhole_if_trace_too_long(self): warmrunnerstate = self.jitdriver_sd.warmstate if (self.history.length() > warmrunnerstate.trace_limit or self.history.trace_tag_overflow()): jd_sd, greenkey_of_huge_function = self.find_biggest_function() self.staticdata.stats.record_aborted(greenkey_of_huge_function) self.portal_trace_positions = None if greenkey_of_huge_function is not None: jd_sd.warmstate.disable_noninlinable_function( greenkey_of_huge_function) self.aborted_tracing_jitdriver = jd_sd self.aborted_tracing_greenkey = greenkey_of_huge_function if self.current_merge_points: jd_sd = self.jitdriver_sd greenkey = self.current_merge_points[0][0][:jd_sd.num_green_args] warmrunnerstate.JitCell.trace_next_iteration(greenkey) raise SwitchToBlackhole(Counters.ABORT_TOO_LONG) ``` It seems that the disable inlining have a global influence on all the tracing path. For the cases above, inlining should be enabled in `a` and be disabled in `b`, but the truth is inlining is disabled in both `a` and `b`. Is there any proposal could help to improve the performance? _______________________________________________ pypy-issue mailing list pypy-issue@python.org https://mail.python.org/mailman/listinfo/pypy-issue