Interesting - is it possible that the same technique could be used to hide differences in character width? Specifically, if I concatenate an ascii string with a UTF-32 string, can the up-conversion to UTF-32 also be done lazily? If that could be done efficiently, it would resolve some outstanding issues that have come up on the Python-3000 list with regards to str/unicode convergence.
Larry Hastings wrote: > > I've significantly enhanced my string-concatenation patch, to the point > where that name is no longer accurate. So I've redubbed it the "lazy > strings" patch. > > The major new feature is that string *slices* are also represented with > a lazy-evaluation placeholder for the actual string, just as > concatenated strings were in my original patch. The lazy slice object > stores a reference to the original PyStringObject * it is sliced from, > and the desired start and stop slice markers. (It only supports step = > 1.) Its ob_sval is NULL until the string is rendered--but that rarely > happens! Not only does this mean string slices are faster, but I bet > this generally reduces overall memory usage for slices too. > > Now, one rule of the Python programming API is that "all strings are > zero-terminated". That part of makes the life of a Python extension > author sane--they don't have to deal with some exotic Python string > class, they can just assume C-style strings everywhere. Ordinarily, > this means a string slice couldn't simply point into the original > string; if it did, and you executed > x = "abcde" > y = x[1:4] > internally y->ob_sval[3] would not be 0, it would be 'e', breaking the > API's rule about strings. > > However! When a PyStringObject lives out its life purely within the > Python VM, the only code that strenuously examines its internals is > stringobject.c. And that code almost never needs the trailing zero*. > So I've added a new static method in stringobject.c: > char * PyString_AsUnterminatedString(PyStringObject *) > If you call it on a lazy-evaluation slice object, it gives you back a > pointer into the original string's ob_sval. The s->ob_size'th element > of this *might not* be zero, but if you call this function you're saying > that's a-okay, you promise not to look at it. (If the PyStringObject * > is any other variety, it calls into PyString_AsString, which renders > string concatenation objects then returns ob_sval.) > > Again: this behavior is *never* observed by anyone outside of > stringobject.c. External users of PyStringObjects call > PyString_AS_STRING(), which renders all lazy concatenation and lazy > slices so they look just like normal zero-terminated PyStringObjects. > With my patch applied, trunk still passes all expected tests. > > Of course, lazy slice objects aren't just for literal slices created > with [x:y]. There are lots of string methods that return what are > effectively string slices, like lstrip() and split(). > > With this code in place, string slices that aren't examined by modules > are very rarely rendered. I ran "pybench -n 2" (two rounds, warp 10 > (whatever that means)) while collecting some statistics. When it > finished, the interpreter had created a total of 640,041 lazy slices, of > which only *19* were ever rendered. > > > Apart from lazy slices, there's only one more enhancement when compared > with v1: string prepending now reuses lazy concatenation objects much > more often. There was an optimization in string_concatenate > (Python/ceval.c) that said: "if the left-side string has two references, > and we're about to overwrite the second reference by storing this > concatenation to an object, tell that object to drop its reference". > That often meant the reference on the string dropped to 1, which meant > PyString_Resize could just resize the left-side string in place and > append the right-side. I modified it so it drops the reference to the > right-hand operand too. With this change, even with a reduction in the > allowable stack depth for right-hand recursion (so it's less likely to > blow the stack), I was able to prepend over 86k strings before it forced > a render. (Oh, for the record: I ensure depth limits are enforced when > combining lazy slices and lazy concatenations, so you still won't blow > your stack when you mix them together.) > > > Here are the highlights of a single apples-to-apples pybench run, 2.6 > trunk revision 52413 ("this") versus that same revision with my patch > applied ("other"): > > Test minimum run-time average run-time > this other diff this other > diff > ------------------------------------------------------------------------------- > > > ConcatStrings: 204ms 76ms +168.4% 213ms 77ms > +177.7% > CreateStringsWithConcat: 159ms 138ms +15.7% 163ms 142ms > +15.1% > StringSlicing: 142ms 86ms +65.5% 145ms 88ms > +64.6% > ------------------------------------------------------------------------------- > > > Totals: 7976ms 7713ms +3.4% 8257ms > 7975ms +3.5% > > I also ran this totally unfair benchmark: > x = "abcde" * (20000) # 100k characters > for i in xrange(10000000): > y = x[1:-1] > and found my patched version to be 9759% faster. (You heard that right, > 98x faster.) > > > I'm ready to post the patch. However, as a result of this work, the > description on the original patch page is really no longer accurate: > > http://sourceforge.net/tracker/index.php?func=detail&aid=1569040&group_id=5470&atid=305470 > > > Shall I close/delete that patch and submit a new patch with a more > modern description? After all, there's not a lot of activity on the old > patch page... > > > Cheers, > > > /larry/ > > * As I recall, stringobject.c needs the trailing zero in exactly *one* > place: when comparing two zero-length strings. My patch ensures that > zero-length slices and concatenations still return nullstring, so this > still works as expected. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Python-Dev mailing list > Python-Dev@python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/talin%40acm.org _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com