Re: Performance optimization - svn_stringbuf_appendbyte()

Julian Foad Mon, 11 Oct 2010 08:07:44 -0700

On Sun, 2010-10-10, Stefan Fuhrmann wrote:
> On 07.10.2010 16:07, Julian Foad wrote:
> >> New Revision: 997203
> >>
> >> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
> >> Log:
> >> Merge r985037, r985046, r995507 and r995603 from the performance branch.
> >>
> >> These changes introduce the svn_stringbuf_appendbyte() function, which has
> >> significantly less overhead than svn_stringbuf_appendbytes(), and can be
> >> used in a number of places within our codebase.
> > Hi Stefan2.
> >
> > I have been wondering about the actual benefit of such tightly
> > hand-optimized code.  Today I got around to giving it a quick spin.
> >
> > In my first test, it made no difference whatsoever, because the
> > optimized code is never executed.  The recent merge r1002898 of r1001413
> > from the performance branch introduced a bug:
> >
> > -  if (str->blocksize>  old_len + 1)
> > +  if (str->blocksize<  old_len + 1)
> WTF how did that happen?
> Well, that's what reviews are for ;)
> > which totally disables the optimized code path.
> >
> > Fixed in r1005437.
> Thanks.
> > That solved, a loop doing a million simple appendbyte calls runs more
> > than twice as fast as calling appendbytes(..., 1).  That's fantastic.
> >
> > But what about that hand-optimization?  I wrote a simple version of the
> > function, and called it 'appendbyte0':
> >
> > svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
> > {
> >    if (str->blocksize>  str->len + 1)
> >      {
> >        str->data[str->len++] = byte;
> >        str->data[str->len] = '\0';
> >      }
> >    else
> >      svn_stringbuf_appendbytes(str,&byte, 1);
> > }
> >
> > Here are the results (see full patch attached):
> >
> > Times:  appendbytes appendbyte0 appendbyte  (in ms)
> > run:          89          31          34
> > run:          88          30          35
> > run:          88          31          34
> > run:          88          30          34
> > run:          88          31          34
> > min:          88          30          34
> >
> > This tells me that the hand-optimization is actually harmful and the
> > compiler does a 10% better job by itself.
> >
> > Have I made a mistake?
> My guess is that you might have made two mistakes actually.


Heh, OK.

> First, the number of operations was relatively low - everything
> in the low ms range could be due to secondary effects (and
> still be repeatable).

I can't think of any reason why.  I ran the whole benchmark lots of
times.  Occasionally some of the times were a big chunk longer due to
other system activity.  Normally they were stable.  I also ran it
several times with 10 million ops instead of 1 million, and the results
were exactly 10x longer with the same degree of variability.

> The most important problem would be compiler optimizations
> or lack thereof. Since the numbers are very large (50..100
> ticks per byte, depending on your CPU clock), I would assume
> that you used a normal debug build for testing. In that case,

You're right.  I'll try again, with "--disable-debug -O2".

> the actual number of C statements has a large impact on
> the execution time. See my results below for details.
> > What are the results for your system?
> >
> > (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
> >
> Test code used (10^10 calls, newer re-allocate):
> 
>          int i;
>          unsigned char c;
> 
>          svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);

OK so you're avoiding any calls to the "re-alloc".  My tests included
re-allocs, but were long enough strings that this wasn't much overhead.
Nevertheless I'll avoid re-allocs so we can compare results fairly.

>          for (i = 0; i < 50000000; ++i)
>          {
>              s->len = 0;
>              for (c = 0; c < 200; ++c)
>                  svn_stringbuf_appendbyte (s, c);
>          }
> 
> 
> XEON 55xx / Core i7, hyper-threading on, 3GHz peak
> 64 bits Linux, GCC 4.3.3; ./configure --disable-debug
> 
> (1)  10,7s; IPC = 2.1
> (2)  8,11s; IPC = 1.4
> (3)  2,64s; IPC = 2.4
> (4)  2,43s; IPC = 2.3

> (1) use appendbytes gives 9 instructions in main, 59 in appbytes
> (2) handle count==1 directly in in appendbytes; 9 inst. in main, 26 in 
> appbytes
> (3) appendbyte0 (same compiler output as the the non-handtuned code);
>      13 inst. in appbyte, 6 in main
> (4) tr...@head appendbyte; 11 inst. in appbyte, 6 in main
> 
> Core2 2.4GHz, Win32, VS2010
> (not using valgrind to count instructions here; also not testing (2))
> 
> (1)  17,0s release, 20,2s debug
> (3)  10,6s release, 12,2s debug
> (4)  9,7s release, 13,0s debug

With a test harness more similar to yours, and built with
"--disable-debug -O2", here are my relative numbers (but only 1 million
outer loops, compared to your 50 million):

$ svn-c-test subr string 24
Times for 1000000 x 200 bytes, in seconds
  (1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro
run:        7.03        2.06        1.37        0.53
run:        6.62        1.76        1.55        0.53
run:        6.53        1.67        1.44        0.53
run:        6.54        1.60        1.44        0.53
run:        6.52        1.84        1.37        0.53
min:        6.52        1.60        1.37        0.53

This agrees with what you found: the hand-tuned code gives a 10%
improvement.

This also shows another variant, writing the function as a macro,
reported in the column "(5)macro".  This gives a further two-and-a-half
times speed-up in this test.


[...]
> * your numbers seem very large in comparison
>    -> probably something else going on

Can you use my test code or show me yours?  Mine has the harness in
"string-test.c" and the code under test in libsvn_subr - see attachment
- and the default build on my platform loads libsvn_subr as a shared
library.  Maybe yours is completely different.


Other observations:

- Some callers that use very tight loops are in fact simply copying a
string, up as far as some particular end marker.  They may benefit from
being re-written to first determine how much they want to copy and then
call svn_stringbuf_appendbytyes().

- The stringbuf type initially allocates a 1-byte space, and doubles
this every time it re-allocates.  Could we not could allocate at least a
few bytes (4? 8?) as a minimum, without actually consuming more memory
in practice?  That would avoid two or three re-allocs in some usage
patterns.
 
- Julian

A comparative performance test of svn_stringbuf_appendbyte().

* subversion/libsvn_subr/svn_string.c
  (svn_stringbuf_appendbyte0): New function, for comparative testing.

* subversion/tests/libsvn_subr/string-test.c
  (svn_stringbuf_appendbyte_macro): New macro, for comparative testing.
  (test24): New test, measuring and printing a speed comparison.
  (test_funcs): Add it.

--This line, and those below, will be ignored--

Index: subversion/libsvn_subr/svn_string.c
===================================================================
--- subversion/libsvn_subr/svn_string.c	(revision 1006000)
+++ subversion/libsvn_subr/svn_string.c	(working copy)
@@ -392,6 +392,29 @@ svn_stringbuf_ensure(svn_stringbuf_t *st
 }
 
 
+/* Unoptimized version of svn_stringbuf_appendbyte(), for comparison. */
+void
+svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
+{
+  /* In most cases, there will be pre-allocated memory left
+   * to just write the new byte at the end of the used section
+   * and terminate the string properly.
+   */
+  if (str->blocksize > str->len + 1)
+    {
+      str->data[str->len++] = byte;
+      str->data[str->len] = '\0';
+    }
+  else
+    {
+      /* we need to re-allocate the string buffer
+       * -> let the more generic implementation take care of that part
+       */
+      svn_stringbuf_appendbytes(str, &byte, 1);
+    }
+}
+
+
 /* WARNING - Optimized code ahead! 
  * This function has been hand-tuned for performance. Please read 
  * the comments below before modifying the code.
Index: subversion/tests/libsvn_subr/string-test.c
===================================================================
--- subversion/tests/libsvn_subr/string-test.c	(revision 1006000)
+++ subversion/tests/libsvn_subr/string-test.c	(working copy)
@@ -511,6 +511,97 @@ test23(apr_pool_t *pool)
   return test_stringbuf_unequal("abc", "abb", pool);
 }
 
+void
+svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte);
+
+#define svn_stringbuf_appendbyte_macro(str, byte) \
+  do {                                                                      \
+    svn_stringbuf_t *svn__s = (str);                                        \
+    if (svn__s->blocksize > svn__s->len + 1) {                              \
+      svn__s->data[svn__s->len++] = (byte);                                 \
+      svn__s->data[svn__s->len] = '\0';                                     \
+    } else {                                                                \
+      char svn__byte = (byte);                                              \
+      svn_stringbuf_appendbytes(svn__s, &svn__byte, 1);                     \
+    }                                                                       \
+  } while (0)
+
+static svn_error_t *
+test24(apr_pool_t *pool)
+{
+  int i, ITERS = 1000000;
+  int f, FUNCTIONS = 4;
+  int r, RUNS = 5;
+  apr_time_t t[FUNCTIONS], min_t[FUNCTIONS];
+
+  for (f = 0; f < FUNCTIONS; f++)
+    min_t[f] = 1.0e12;
+
+  printf("Times for %d x 200 bytes, in seconds\n", ITERS);
+  printf("  (1)appendbytes (3)appendbyte0 (4)tr...@head (5)macro\n");
+
+  for (r = 0; r < RUNS; r++)
+    {
+      apr_time_t start;
+      svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
+
+      start = apr_time_now();
+      for (i = 0; i < ITERS; ++i)
+        {
+          unsigned char c;
+          s->len = 0;
+          for (c = 0; c < 200; ++c)
+            svn_stringbuf_appendbytes(s, &c, 1);
+        }
+      t[0] = apr_time_now() - start;
+
+      start = apr_time_now();
+      for (i = 0; i < ITERS; ++i)
+        {
+          unsigned char c;
+          s->len = 0;
+          for (c = 0; c < 200; ++c)
+            svn_stringbuf_appendbyte0(s, c);
+        }
+      t[1] = apr_time_now() - start;
+
+      start = apr_time_now();
+      for (i = 0; i < ITERS; ++i)
+        {
+          unsigned char c;
+          s->len = 0;
+          for (c = 0; c < 200; ++c)
+            svn_stringbuf_appendbyte(s, c);
+        }
+      t[2] = apr_time_now() - start;
+
+      start = apr_time_now();
+      for (i = 0; i < ITERS; ++i)
+        {
+          unsigned char c;
+          s->len = 0;
+          for (c = 0; c < 200; ++c)
+            svn_stringbuf_appendbyte_macro(s, c);
+        }
+      t[3] = apr_time_now() - start;
+
+      printf("run:");
+      for (f = 0; f < FUNCTIONS; f++)
+        printf("  %10.2f", (float)t[f] / 1000000);
+      printf("\n");
+
+      for (f = 0; f < FUNCTIONS; f++)
+        if (t[f] < min_t[f]) min_t[f] = t[f];
+    }
+
+  printf("min:");
+  for (f = 0; f < FUNCTIONS; f++)
+    printf("  %10.2f", (float)min_t[f] / 1000000);
+  printf("\n");
+
+  return SVN_NO_ERROR;
+}
+
 /*
    ====================================================================
    If you add a new test to this file, update this array.
@@ -568,5 +659,7 @@ struct svn_test_descriptor_t test_funcs[
                    "compare stringbufs; different lengths"),
     SVN_TEST_PASS2(test23,
                    "compare stringbufs; same length, different content"),
+    SVN_TEST_PASS2(test24,
+                   "benchmark svn_stringbuf_appendbyte"),
     SVN_TEST_NULL
   };

Re: Performance optimization - svn_stringbuf_appendbyte()

Reply via email to