Feature Requests item #1087418 was opened at 2004-12-18 00:22
Message generated for change (Comment added) made by gregsmith
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1087418&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 4
Submitted By: Gregory Smith (gregsmith)
Assigned to: Nobody/Anonymous (nobody)
Summary: long int bitwise ops speedup (patch included)

Initial Comment:
The 'inner loop' for applying bitwise ops to longs is quite inefficient. The improvements in the attached diff are:
- 'a' is never shorter than 'b' (result: only one loop index condition to test instead of 3)
- each operation ( & | ^ ) has its own loop, instead of a switch inside the loop
- I found that, when this is done, a lot of things can be simplified, resulting in further speedup, and the resulting code is not very much longer than before (my libpython2.4.dll .text got 140 bytes longer).

Operations on longs of a few thousand bits appear to be 2 to 2.5 times faster with this patch. I'm not 100% sure the code is right, but it passes test_long.py, anyway.

----------------------------------------------------------------------

>Comment By: Gregory Smith (gregsmith)
Date: 2005-02-10 22:45

Message:
Logged In: YES user_id=292741

I started by just factoring out the inner switch loop. But then it becomes evident that when op == '^', you always have maska == maskb, so there's no point in doing the ^mask at all. And when op == '|', then maska == maskb == 0, so likewise. And if you put a check in so that len(a) >= len(b), then the calculation of len_z can be simplified. It also becomes easy to break the end off the loops, so that, say, or'ing a small number with a really long one becomes mostly a copy, etc. It was just a series of small, simple changes following from the refactoring of the loop/switch.

I see a repeatable 1.5x speedup at 300 bits, which I think is significant (I wasn't using negative numbers, which of course have their own extra overhead). The difference should be even higher on CPUs that don't have several hundred milliwatts of branch-prediction circuitry.

One use case is that you can simulate an array of hundreds or thousands of simple 1-bit processors in pure Python using long operations, and get very good performance, even better with this fix. This application is almost entirely logical ops, with the occasional shift.

IMHO, I don't think the changed code is more complex; it's a little longer, but it's more explicit about what is really being done, and it doesn't roll together three cases, which don't really have that much in common, for the sake of brevity. It wasn't obvious to me that the masks were redundant until after I did the factoring, and this is my point: rolling it together hides that. The original author may not have noticed the redundancy. I see a lot of effort being expended on very complex multiply operations; why should the logical ops be left behind for the sake of a few lines?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-07 01:54

Message:
Logged In: YES user_id=80475

Patch Review
------------
On Windows using MSC 6.0, I could only reproduce a small speedup at around 300 bits. While the patch is short, it adds quite a bit of complexity to the routine. Its correctness is not self-evident or certain. Even if correct, it is likely to encumber future maintenance. Unless you have important use cases and feel strongly about it, I think this one should probably not go in. An alternative is to submit a patch that limits its scope to factoring out the innermost switch/case. I tried that and found that the speedup is microscopic. I suspect that that one unpredictable branch is not much of a bottleneck.
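
(For readers without the patch at hand, here is a minimal sketch of the two loop shapes being discussed: the original switch-inside-the-loop versus one tight loop per operator. The digit type and function names are illustrative assumptions, not the actual Objects/longobject.c code, and sign handling is left out.)

    typedef unsigned short digit;   /* stand-in for CPython's digit type */

    /* Before: one loop, with the operator re-tested for every digit. */
    static void
    bitwise_switch(digit *z, const digit *a, const digit *b, int n, char op)
    {
        int i;
        for (i = 0; i < n; ++i) {
            switch (op) {
            case '&': z[i] = a[i] & b[i]; break;
            case '|': z[i] = a[i] | b[i]; break;
            case '^': z[i] = a[i] ^ b[i]; break;
            }
        }
    }

    /* After: the switch runs once and each operator gets its own tight loop. */
    static void
    bitwise_split(digit *z, const digit *a, const digit *b, int n, char op)
    {
        int i;
        switch (op) {
        case '&': for (i = 0; i < n; ++i) z[i] = a[i] & b[i]; break;
        case '|': for (i = 0; i < n; ++i) z[i] = a[i] | b[i]; break;
        case '^': for (i = 0; i < n; ++i) z[i] = a[i] ^ b[i]; break;
        }
    }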
More time is likely spent on creating z.

----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2005-01-03 14:54

Message:
Logged In: YES user_id=292741

I originally timed this on a Cygwin system; I've since found that Cygwin timings tend to be strange and possibly misleading. On a RH8 system, I'm seeing a speedup of 3.5x with longs of ~1500 bits and larger, and a 1.5x speedup with only about 300 bits. Times were measured with timeit.Timer('a|b', 'a=...; b=...'). The increase in .text size is likewise about 120 bytes.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1087418&group_id=5470

_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
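
(A second minimal sketch, this time of the other half of the change: once the caller guarantees that 'a' is never shorter than 'b', each loop only has to test a single index, and the tail of an OR or XOR degenerates into a plain copy. As above, the digit type and names are illustrative assumptions, and the masks needed for negative operands are deliberately omitted.)

    typedef unsigned short digit;   /* stand-in for CPython's digit type */

    /* z = a OP b, assuming size_a >= size_b and non-negative operands;
       the caller swaps the operands beforehand if necessary. */
    static void
    bitwise_nonneg(digit *z,
                   const digit *a, int size_a,
                   const digit *b, int size_b,
                   char op)
    {
        int i;
        switch (op) {
        case '&':
            /* high digits of a are ANDed with implicit zeros: nothing to store */
            for (i = 0; i < size_b; ++i)
                z[i] = a[i] & b[i];
            break;
        case '|':
            for (i = 0; i < size_b; ++i)
                z[i] = a[i] | b[i];
            for (; i < size_a; ++i)   /* tail is a plain copy of a */
                z[i] = a[i];
            break;
        case '^':
            for (i = 0; i < size_b; ++i)
                z[i] = a[i] ^ b[i];
            for (; i < size_a; ++i)   /* tail is a plain copy of a */
                z[i] = a[i];
            break;
        }
    }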