On Sun, 2010-10-10, Stefan Fuhrmann wrote: > On 07.10.2010 16:07, Julian Foad wrote: > >> New Revision: 997203 > >> > >> URL: http://svn.apache.org/viewvc?rev=997203&view=rev > >> Log: > >> Merge r985037, r985046, r995507 and r995603 from the performance branch. > >> > >> These changes introduce the svn_stringbuf_appendbyte() function, which has > >> significantly less overhead than svn_stringbuf_appendbytes(), and can be > >> used in a number of places within our codebase. > > Hi Stefan2. > > > > I have been wondering about the actual benefit of such tightly > > hand-optimized code. Today I got around to giving it a quick spin. > > > > In my first test, it made no difference whatsoever, because the > > optimized code is never executed. The recent merge r1002898 of r1001413 > > from the performance branch introduced a bug: > > > > - if (str->blocksize> old_len + 1) > > + if (str->blocksize< old_len + 1) > WTF how did that happen? > Well, that's what reviews are for ;) > > which totally disables the optimized code path. > > > > Fixed in r1005437. > Thanks. > > That solved, a loop doing a million simple appendbyte calls runs more > > than twice as fast as calling appendbytes(..., 1). That's fantastic. > > > > But what about that hand-optimization? I wrote a simple version of the > > function, and called it 'appendbyte0': > > > > svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte) > > { > > if (str->blocksize> str->len + 1) > > { > > str->data[str->len++] = byte; > > str->data[str->len] = '\0'; > > } > > else > > svn_stringbuf_appendbytes(str,&byte, 1); > > } > > > > Here are the results (see full patch attached): > > > > Times: appendbytes appendbyte0 appendbyte (in ms) > > run: 89 31 34 > > run: 88 30 35 > > run: 88 31 34 > > run: 88 30 34 > > run: 88 31 34 > > min: 88 30 34 > > > > This tells me that the hand-optimization is actually harmful and the > > compiler does a 10% better job by itself. > > > > Have I made a mistake? > My guess is that you might have made two mistakes actually.
Heh, OK. > First, the number of operations was relatively low - everything > in the low ms range could be due to secondary effects (and > still be repeatable). I can't think of any reason why. I ran the whole benchmark lots of times. Occasionally some of the times were a big chunk longer due to other system activity. Normally they were stable. I also ran it several times with 10 million ops instead of 1 million, and the results were exactly 10x longer with the same degree of variability. > The most important problem would be compiler optimizations > or lack thereof. Since the numbers are very large (50..100 > ticks per byte, depending on your CPU clock), I would assume > that you used a normal debug build for testing. In that case, You're right. I'll try again, with "--disable-debug -O2". > the actual number of C statements has a large impact on > the execution time. See my results below for details. > > What are the results for your system? > > > > (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.) > > > Test code used (10^10 calls, newer re-allocate): > > int i; > unsigned char c; > > svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool); OK so you're avoiding any calls to the "re-alloc". My tests included re-allocs, but were long enough strings that this wasn't much overhead. Nevertheless I'll avoid re-allocs so we can compare results fairly. > for (i = 0; i < 50000000; ++i) > { > s->len = 0; > for (c = 0; c < 200; ++c) > svn_stringbuf_appendbyte (s, c); > } > > > XEON 55xx / Core i7, hyper-threading on, 3GHz peak > 64 bits Linux, GCC 4.3.3; ./configure --disable-debug > > (1) 10,7s; IPC = 2.1 > (2) 8,11s; IPC = 1.4 > (3) 2,64s; IPC = 2.4 > (4) 2,43s; IPC = 2.3 > (1) use appendbytes gives 9 instructions in main, 59 in appbytes > (2) handle count==1 directly in in appendbytes; 9 inst. in main, 26 in > appbytes > (3) appendbyte0 (same compiler output as the the non-handtuned code); > 13 inst. in appbyte, 6 in main > (4) tr...@head appendbyte; 11 inst. in appbyte, 6 in main > > Core2 2.4GHz, Win32, VS2010 > (not using valgrind to count instructions here; also not testing (2)) > > (1) 17,0s release, 20,2s debug > (3) 10,6s release, 12,2s debug > (4) 9,7s release, 13,0s debug With a test harness more similar to yours, and built with "--disable-debug -O2", here are my relative numbers (but only 1 million outer loops, compared to your 50 million): $ svn-c-test subr string 24 Times for 1000000 x 200 bytes, in seconds (1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro run: 7.03 2.06 1.37 0.53 run: 6.62 1.76 1.55 0.53 run: 6.53 1.67 1.44 0.53 run: 6.54 1.60 1.44 0.53 run: 6.52 1.84 1.37 0.53 min: 6.52 1.60 1.37 0.53 This agrees with what you found: the hand-tuned code gives a 10% improvement. This also shows another variant, writing the function as a macro, reported in the column "(5)macro". This gives a further two-and-a-half times speed-up in this test. [...] > * your numbers seem very large in comparison > -> probably something else going on Can you use my test code or show me yours? Mine has the harness in "string-test.c" and the code under test in libsvn_subr - see attachment - and the default build on my platform loads libsvn_subr as a shared library. Maybe yours is completely different. Other observations: - Some callers that use very tight loops are in fact simply copying a string, up as far as some particular end marker. They may benefit from being re-written to first determine how much they want to copy and then call svn_stringbuf_appendbytyes(). - The stringbuf type initially allocates a 1-byte space, and doubles this every time it re-allocates. Could we not could allocate at least a few bytes (4? 8?) as a minimum, without actually consuming more memory in practice? That would avoid two or three re-allocs in some usage patterns. - Julian
A comparative performance test of svn_stringbuf_appendbyte(). * subversion/libsvn_subr/svn_string.c (svn_stringbuf_appendbyte0): New function, for comparative testing. * subversion/tests/libsvn_subr/string-test.c (svn_stringbuf_appendbyte_macro): New macro, for comparative testing. (test24): New test, measuring and printing a speed comparison. (test_funcs): Add it. --This line, and those below, will be ignored-- Index: subversion/libsvn_subr/svn_string.c =================================================================== --- subversion/libsvn_subr/svn_string.c (revision 1006000) +++ subversion/libsvn_subr/svn_string.c (working copy) @@ -392,6 +392,29 @@ svn_stringbuf_ensure(svn_stringbuf_t *st } +/* Unoptimized version of svn_stringbuf_appendbyte(), for comparison. */ +void +svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte) +{ + /* In most cases, there will be pre-allocated memory left + * to just write the new byte at the end of the used section + * and terminate the string properly. + */ + if (str->blocksize > str->len + 1) + { + str->data[str->len++] = byte; + str->data[str->len] = '\0'; + } + else + { + /* we need to re-allocate the string buffer + * -> let the more generic implementation take care of that part + */ + svn_stringbuf_appendbytes(str, &byte, 1); + } +} + + /* WARNING - Optimized code ahead! * This function has been hand-tuned for performance. Please read * the comments below before modifying the code. Index: subversion/tests/libsvn_subr/string-test.c =================================================================== --- subversion/tests/libsvn_subr/string-test.c (revision 1006000) +++ subversion/tests/libsvn_subr/string-test.c (working copy) @@ -511,6 +511,97 @@ test23(apr_pool_t *pool) return test_stringbuf_unequal("abc", "abb", pool); } +void +svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte); + +#define svn_stringbuf_appendbyte_macro(str, byte) \ + do { \ + svn_stringbuf_t *svn__s = (str); \ + if (svn__s->blocksize > svn__s->len + 1) { \ + svn__s->data[svn__s->len++] = (byte); \ + svn__s->data[svn__s->len] = '\0'; \ + } else { \ + char svn__byte = (byte); \ + svn_stringbuf_appendbytes(svn__s, &svn__byte, 1); \ + } \ + } while (0) + +static svn_error_t * +test24(apr_pool_t *pool) +{ + int i, ITERS = 1000000; + int f, FUNCTIONS = 4; + int r, RUNS = 5; + apr_time_t t[FUNCTIONS], min_t[FUNCTIONS]; + + for (f = 0; f < FUNCTIONS; f++) + min_t[f] = 1.0e12; + + printf("Times for %d x 200 bytes, in seconds\n", ITERS); + printf(" (1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro\n"); + + for (r = 0; r < RUNS; r++) + { + apr_time_t start; + svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool); + + start = apr_time_now(); + for (i = 0; i < ITERS; ++i) + { + unsigned char c; + s->len = 0; + for (c = 0; c < 200; ++c) + svn_stringbuf_appendbytes(s, &c, 1); + } + t[0] = apr_time_now() - start; + + start = apr_time_now(); + for (i = 0; i < ITERS; ++i) + { + unsigned char c; + s->len = 0; + for (c = 0; c < 200; ++c) + svn_stringbuf_appendbyte0(s, c); + } + t[1] = apr_time_now() - start; + + start = apr_time_now(); + for (i = 0; i < ITERS; ++i) + { + unsigned char c; + s->len = 0; + for (c = 0; c < 200; ++c) + svn_stringbuf_appendbyte(s, c); + } + t[2] = apr_time_now() - start; + + start = apr_time_now(); + for (i = 0; i < ITERS; ++i) + { + unsigned char c; + s->len = 0; + for (c = 0; c < 200; ++c) + svn_stringbuf_appendbyte_macro(s, c); + } + t[3] = apr_time_now() - start; + + printf("run:"); + for (f = 0; f < FUNCTIONS; f++) + printf(" %10.2f", (float)t[f] / 1000000); + printf("\n"); + + for (f = 0; f < FUNCTIONS; f++) + if (t[f] < min_t[f]) min_t[f] = t[f]; + } + + printf("min:"); + for (f = 0; f < FUNCTIONS; f++) + printf(" %10.2f", (float)min_t[f] / 1000000); + printf("\n"); + + return SVN_NO_ERROR; +} + /* ==================================================================== If you add a new test to this file, update this array. @@ -568,5 +659,7 @@ struct svn_test_descriptor_t test_funcs[ "compare stringbufs; different lengths"), SVN_TEST_PASS2(test23, "compare stringbufs; same length, different content"), + SVN_TEST_PASS2(test24, + "benchmark svn_stringbuf_appendbyte"), SVN_TEST_NULL };