I tried to find prior research on this subject but couldn't find any,
so I'll be daring and vulnerable and just put the idea out here to see
what your thoughts are.
I single-stepped a simple loop in Python to see where the efficiency
bottlenecks are. I was impressed by the optimizations already in there, but
I still dare to suggest one more that, by my estimates, might shave off a
few cycles and speed up Python by about 5%. The idea is simple: change the
bytecode argument from two bytes to one.
Implications are:
- code changes are relatively simple, see below
- fewer memory reads, which are becoming more and more expensive
- saves three instructions for every opcode with an argument (i.e. most of them); see the sketch below
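
To make the layout concrete, here is a tiny self-contained sketch I wrote
(an illustration only, not CPython source; 100 is LOAD_CONST) comparing how
an argument is fetched in the current three-byte encoding and in the
proposed two-byte one:

#include <stdio.h>

/* Illustration only.  Current encoding: opcode byte plus a 16-bit
   little-endian argument (3 bytes per instruction with argument).
   Proposed encoding: opcode byte plus one argument byte (2 bytes). */

static int arg_current(const unsigned char *p)
{
    return p[1] + (p[2] << 8);   /* two memory reads, a shift and an add */
}

static int arg_proposed(const unsigned char *p)
{
    return p[1];                 /* one memory read */
}

int main(void)
{
    const unsigned char cur[]  = { 100, 1, 0 };  /* LOAD_CONST 1, 3 bytes */
    const unsigned char prop[] = { 100, 1 };     /* LOAD_CONST 1, 2 bytes */
    printf("current:  arg=%d\n", arg_current(cur));
    printf("proposed: arg=%d\n", arg_proposed(prop));
    return 0;
}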


The required code changes are, as far as I could find:
compile.c:
assemble_emit must emit EXTENDED_ARG for every argument
    wider than 8 bits instead of 16
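
To show what I mean, here is a rough sketch of the emit step (simplified
and untested, not the real assemble_emit: buffer growth and the lnotab are
left out, and EXTENDED_ARG is assumed to become the four-byte instruction
mentioned under ceval.c below):

#include <stdio.h>

#define EXTENDED_ARG 143   /* value from opcode.h */

/* Sketch only: append one instruction at offset i and return the new
   offset.  An argument wider than 8 bits gets an EXTENDED_ARG prefix
   carrying bits 8..31, so the following instruction only needs one
   argument byte. */
static int emit_instr(unsigned char *codestr, int i, int opcode, int arg)
{
    if (arg > 0xff) {
        codestr[i++] = EXTENDED_ARG;
        codestr[i++] = (arg >>  8) & 0xff;
        codestr[i++] = (arg >> 16) & 0xff;
        codestr[i++] = (arg >> 24) & 0xff;
    }
    codestr[i++] = (unsigned char)opcode;
    codestr[i++] = arg & 0xff;
    return i;
}

int main(void)
{
    unsigned char codestr[16];
    int i, n;
    n = emit_instr(codestr, 0, 100, 300);   /* LOAD_CONST 300 */
    for (i = 0; i < n; i++)
        printf("%d ", codestr[i]);
    printf("\n");                           /* prints: 143 1 0 0 100 44 */
    return 0;
}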

ceval.c:
NEXTARG and PEEKARG need adjustment
EXTENDED_ARG needs adjustment
    (it will become a four-byte instruction, which is ugly, I agree)
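
A sketch of how the macros could end up looking, wrapped in a toy main so
it compiles on its own (next_instr is just a local pointer here, and the
current 16-bit definitions are quoted from memory in the comment):

#include <stdio.h>

/* Current 16-bit versions, for comparison:
   #define NEXTARG()  (next_instr += 2, (next_instr[-1]<<8) + next_instr[-2])
   #define PEEKARG()  ((next_instr[2]<<8) + next_instr[1])
   Proposed one-byte versions -- one memory read, no shift, no add: */
#define NEXTOP()   (*next_instr++)
#define NEXTARG()  (next_instr += 1, next_instr[-1])
#define PEEKARG()  (next_instr[1])

int main(void)
{
    /* LOAD_FAST 5 followed by LOAD_CONST 1, in the proposed 2-byte layout */
    unsigned char code[] = { 124, 5, 100, 1 };
    unsigned char *next_instr = code;

    int opcode = NEXTOP();
    int oparg  = NEXTARG();
    printf("opcode=%d oparg=%d next arg=%d\n", opcode, oparg, PEEKARG());
    return 0;
}

The EXTENDED_ARG case would then read three extra bytes for the high bits
of oparg before dispatching the next opcode, which is where the four-byte
instruction comes from.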

peephole.c:
GETARG and SETARG need adjustment,
as do GETJUMPTGT and CODESIZE
the routines tuple_of_constants, fold_binops_on_constants and PyCode_Optimize
    depend on the instruction length, which will be 2 instead of 3
(searching for the digit 3 finds all the cases, as far as I checked)
you will probably have to write a macro for codestr[i+3]
there is a check for code length >32700; I think that one can stay,
maybe with a few extra checks added.
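
Roughly, the macros at the top of peephole.c would change like this (a
sketch only; the current definitions in the comment are quoted from memory,
and ABSOLUTE_JUMP and HAS_ARG are the helpers already used there):

/* Current (3-byte instructions), roughly:
   #define GETARG(arr, i)       ((int)((arr[i+2]<<8) + arr[i+1]))
   #define SETARG(arr, i, val)  arr[i+2] = val>>8; arr[i+1] = val & 255
   #define GETJUMPTGT(arr, i)   (GETARG(arr,i) + (ABSOLUTE_JUMP(arr[i]) ? 0 : i+3))
   #define CODESIZE(op)         (HAS_ARG(op) ? 3 : 1)

   Proposed (2-byte instructions): */
#define GETARG(arr, i)       ((int)(arr[i+1]))
#define SETARG(arr, i, val)  (arr[i+1] = (val) & 255)
#define GETJUMPTGT(arr, i)   (GETARG(arr,i) + (ABSOLUTE_JUMP(arr[i]) ? 0 : i+2))
#define CODESIZE(op)         (HAS_ARG(op) ? 2 : 1)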

dis:
minor adjustments


Estimate of the speed impact:
about 80% of the instructions seem to have an argument, and I never saw an
argument >255 while looking at bytecode, so those are probably not frequent.

The NEXTARG macro expands on my MacBook to:

mov    -408(%ebp),%edx        (next_instr)
movzbl 2(%edx),%eax           (*second byte)
shl    $0x8,%eax              (*shift)
movzbl 1(%edx),%edx           (first byte)
add    %edx,%eax              (*combine)

and the starred instructions will vanish.
The main loop is approximately 40 instructions, so a saving of three instructions is significant. I don't dare to claim 3/40 = 7.5% savings,
but I think 5% may be realistic.

Has anyone tried this already? If not, I might take up the gauntlet
and try it myself, but I have never done this before...


- Jurjen

PS I also noticed that some scratch variables, mainly v and x, are carefully
stored back to memory by the compiler at the end of the big interpreter loop,
even though their values are no longer used at that point, of course. A few
carefully placed braces might tell the compiler how useless this is and
save another few percent.
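
To illustrate, one way the braces could be placed is to give the temporaries
block scope inside a case, so the compiler can see their values are dead at
the closing brace (a heavily simplified sketch, not the actual BINARY_ADD
code; POP/TOP/SET_TOP and Py_DECREF are the existing ceval.c macros):

case BINARY_ADD:
{
    /* With the braces, w, v and x die at the closing brace, so the
       compiler no longer has to store their final values back into
       memory slots that it keeps live across the whole loop. */
    PyObject *w = POP();
    PyObject *v = TOP();
    PyObject *x = PyNumber_Add(v, w);
    Py_DECREF(v);
    Py_DECREF(w);
    SET_TOP(x);
    if (x != NULL) continue;
    break;
}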
