In general I'm a fan of submitting patches via the issue/bug tracker. I'm not sure what Kenton prefers.
Regardless, sign a contributor license agreement as mentioned here:
http://groups.google.com/group/protobuf/msg/3b72718909d26b0a

Your patch looks good, but there is one improvement to make: don't use the (1<<14) syntax; use the raw number 0x4000. Python 2.4 doesn't do constant expression evaluation, so the speedup isn't as great using the << syntax. In 2.5 and later this doesn't matter.

After that, consider this a +1 from me on accepting this patch.

FWIW, I played around with various other versions of this function, using a comparison tree similar to the C++ code and such, but nothing beat your simple unrolled loop in Python 2.x. Calling dis.dis() on each function to see the Python bytecode disassembly is revealing.

Side note: interestingly, a few approaches were better when the uint64 parameter was a long() rather than an int(), so they may be more useful when we get this working on Python 3.x, where all integers use the 2.x long() bignum type internally. In particular, an approach using bisect.bisect_left() to get the return value by doing a table lookup was faster whenever the input was a long().

-gps

On Sat, Jan 17, 2009 at 9:32 PM, Will Pierce <[email protected]> wrote:
> I unrolled the loop and reordered the bounds checking on the varint
> "_VarUInt64ByteSizeNoTag" function, which gets called quite a bit within
> wire_format.py, and got some interesting speedups.
>
> What surprised me is that my first attempt to speed this function up was
> by using the actual mathematical formula for how large a varint will be
> when encoded, using log base 2 and knowing that it gets spread out across
> 7-bit pieces. Something like this:
>
> if uint64 > UINT64_MAX:
>     raise ValueError
> if uint64 < 1:
>     uint64 = 1
> uint64 += 1
> size = math.ceil(math.log(uint64, 2) / 7)
> return size
>
> This code has a consistent running time for every value sent in to it. On
> my simple hardware, for encoded values under 2^28 this code is SLOWER than
> the looping version (the loop only goes around 4 times).
> But for values larger than 2^28, this logarithm method was up to 2x
> faster! But that's the opposite of the performance we want, since varints
> should be used less frequently than fixed(32|64) for values greater than
> 2^28.
>
> So instead, I unrolled the loop and moved the range bounds checking to a
> more sensible place. I then compared the runtimes using the timeit module
> and found it's 2x-3x faster than the old method. I didn't really expect
> that kind of speedup from just unrolling the loop, so I would appreciate
> some feedback, given how often this code is called.
>
> I've only tested this with Python 2.5.2 on Linux (Fedora 10), with -O and
> without -O on the command line. I also ran the "python/setup.py test"
> tests and got an "OK" for the 145 tests. I haven't tested yet on a 64-bit
> box, or on other versions of Python, so perhaps others could try it out.
> I've attached the test code (named size_calc.py, which is standalone
> code), and also the patch (named varintsizenotag.patch, taken from svn
> diff at trunk/../), which has the unrolled loop (ATTACHED).

Please submit patches for enhancements in the issue tracker. For some reason it's not accepting bugs.

BTW, for a further potential optimization, try a less linear search on the number's magnitude rather than a linear series of ifs, so that large values are also sped up. See the C++ code for an example:
http://code.google.com/p/protobuf/source/browse/trunk/src/google/protobuf/io/coded_stream.cc#741

I haven't benchmarked that approach in Python.

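The bisect.bisect_left() table-lookup approach mentioned earlier in the thread could be sketched like this. The function name and details here are hypothetical reconstructions, not code from the thread; the observation above was only that something of this shape beat the alternatives when the input was a long().

```python
import bisect

UINT64_MAX = 0xFFFFFFFF_FFFFFFFF if False else 0xFFFFFFFFFFFFFFFF  # 2**64 - 1

# _LIMITS[i] is the largest value whose varint encoding fits in i + 1
# bytes (7 payload bits per byte), so the table is sorted ascending.
_LIMITS = [(1 << (7 * n)) - 1 for n in range(1, 11)]

def varint_size_bisect(uint64):
    """Byte size of uint64's varint encoding, via a sorted-table lookup.

    Hypothetical sketch of the bisect-based approach; bisect_left finds
    the index of the first limit >= uint64, i.e. the smallest byte count
    that can hold the value.
    """
    if not 0 <= uint64 <= UINT64_MAX:
        raise ValueError("value out of range for uint64: %d" % uint64)
    return bisect.bisect_left(_LIMITS, uint64) + 1
```

The table lookup replaces the comparison chain with a single O(log n) binary search, which is why it favors large inputs; the trick is that bisect_left on the limits table lands exactly on the byte count minus one.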