This patch adds a new function buffer_append_uescape(), a streamlined
replacement for uescape(). The optimizations are as follows:
1. As described in an earlier post, the new function appends the uescaped
text to an existing growing_buffer, rather than into an internal one that
it must create and destroy. It thereby saves two mallocs and two frees.
In addition, the calling code doesn't have to do a buffer_add().
2. Because it doesn't create an internal growing_buffer, it doesn't need
to do a strlen() to determine how big a buffer to allocate.
3. Since the new version doesn't have a boolean parameter, it doesn't have
to test the boolean.
4. For characters < 128 (i.e. ASCII characters), I rearranged the order of
the tests so that non-control characters are recognized immediately. In
uescape() we first go through a switch/case looking for several specific
control characters like '\b' and '\n'. In practice most characters are
not control characters, so this rearrangement saves a few CPU cycles.
5. For control characters, uescape() uses buffer_fadd() to format the
character into hex. Now, buffer_fadd is slow because it uses vsnprintf()
twice, once to determine a buffer size and once to do the formatting. In
turn vsnprintf() is slow because it has to parse the format string. In
this case we don't need vsnprintf() because we already know exactly how
big the buffer needs to be, and the formatting is simple. I eliminated
buffer_fadd() and formatted the hex manually with a little bit-twiddling.
Some of these optimizations could be applied to uescape(), but I haven't
bothered, partly because I wanted a clean comparison for benchmarking
purposes and partly because I expect uescape() to disappear from use
(though I am leaving it available).
--------------
Benchmark results:
For a six-character text string, the new function is faster than the old
one by a factor of about 2.7. This speedup is attributable mostly to the
elimination of mallocs and frees.
For a long text string (276 characters) the new function is faster by a
factor of about 1.7. Here the saving in mallocs and frees is diluted by
the extra work of processing more bytes. I haven't tried to determine
how much of the speedup is attributable to the rearrangement of
character tests.
For a string consisting entirely of 31 control characters, the new function
is faster by a factor of about 6.2, attributable in part to the elimination
of buffer_fadd(). This figure probably overstates the speedup, due to an
artifact of the benchmark. Because the control characters expand to six times
their original size, uescape() needs to expand its internal
growing_buffer, incurring a malloc and free on every call. The new
function reuses the same growing_buffer, which once expanded doesn't need
to be expanded again. I haven't tried to compensate for this artifact to
reflect a more realistic input string.
In case you want to play around with benchmarking, I attach my test
program tuencode.c.
--------------
buffer_append_uescape() returns:
0 if successful;
-1 if the pointer to growing_buffer is NULL;
A positive integer if the function finds a character that is not valid
UTF-8. In this case the return value is the position of the invalid
character in the input string, counting from 1.
The first byte of a valid UTF-8 character should begin with a binary
110, 1110, or 1111. Bytes starting with binary 10 are reserved for the
second, third, or fourth byte. If a byte starting with binary 10 occurs
where we expect the first byte of a multibyte sequence, that's an error.
Both uescape() and buffer_append_uescape() will detect it.
uescape() responds by returning NULL, thus discarding all of the input
processed so far, even though it was valid. buffer_append_uescape()
responds a little differently -- it just stops, leaving the previously
translated characters in place. I could nake it behave more like the
original by truncating the growing_buffer back to the starting point, but
it hardly seems worth the trouble.
Probably the best response to invalid UTF-8 is to skip over the invalid
bytes and resume the translation with the first valid character. At least
in this round I haven't been that ambitious.
There other kinds of invalid UTF-8 that neither function will catch.
For example, a first byte of 11000000 or 11110111 is invalid, according
to Wikipedia.
Please note that I haven't tried to test the error handling. In fact I
haven't tested the UTF-8 handling at all, It should work the same as
before, because I didn't change any of it except for the response to an
error. You probably have some UTF-8 text readily to hand for testing.
I would have to tediously build test cases by hand, and it would be hard
to be sure that I had covered all the possibilities. However I will do so
if you ask.
-------------
Scott McKellar
http://home.swbell.net/mck9/ct/
The following applies not only to these patches but to the patches I
submitted over the last few days (when I forgot to include the DCO).
Developer's Certificate of Origin 1.1 By making a contribution to
this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license indicated
in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source license
and I have the right under that license to submit that work with
modifications, whether created in whole or in part by me, under the
same open source license (unless I am permitted to submit under a
different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person
who certified (a), (b) or (c) and I have not modified it; and
(d) In the case of each of (a), (b), or (c), I understand and agree
that this project and the contribution are public and that a record
of the contribution (including all personal information I submit
with it, including my sign-off) is maintained indefinitely and may
be redistributed consistent with this project or the open source
license indicated in the file.
*** ./trunk/src/libopensrf/utils.c 2008-11-17 16:12:14.000000000 -0600
--- ./trunk-mod/src/libopensrf/utils.c 2008-11-17 16:06:23.000000000 -0600
***************
*** 16,21 ****
--- 16,24 ----
#include <opensrf/log.h>
#include <errno.h>
+ static const char hex_chars[] = "0123456789abcdef";
+ static unsigned char hex_code[7] = "\\u00";
+
inline void* safe_malloc( int size ) {
void* ptr = (void*) malloc( size );
if( ptr == NULL ) {
***************
*** 424,429 ****
--- 418,527 ----
return buffer_release(buf);
}
+ int buffer_append_uescape( growing_buffer* buf, const char* string ) {
+
+ if(NULL == string)
+ return 0; // Nothing to add? Nothing to do
+
+ if( NULL == buf )
+ return -1; // Nothing to add to
+
+ int idx = 0;
+ unsigned long int c;
+
+ while (string[idx]) {
+
+ c = 0x0;
+
+ if ((unsigned char)string[idx] >= 0x80) { // not ASCII
+
+ if ((unsigned char)string[idx] >= 0xC0 && (unsigned char)string[idx] <= 0xF4) { // starts a UTF8 string
+
+ int clen = 1;
+ if (((unsigned char)string[idx] & 0xF0) == 0xF0) {
+ clen = 3;
+ c = (unsigned char)string[idx] ^ 0xF0;
+
+ } else if (((unsigned char)string[idx] & 0xE0) == 0xE0) {
+ clen = 2;
+ c = (unsigned char)string[idx] ^ 0xE0;
+
+ } else if (((unsigned char)string[idx] & 0xC0) == 0xC0) {
+ clen = 1;
+ c = (unsigned char)string[idx] ^ 0xC0;
+ }
+
+ for (;clen;clen--) {
+
+ idx++; // look at the next byte
+ c = (c << 6) | ((unsigned char)string[idx] & 0x3F); // add this byte worth
+ }
+
+ buffer_fadd(buf, "\\u%04x", c);
+
+ } else {
+ return idx + 1;
+ }
+
+ } else if(string[idx] >= ' ' ) { // printable ASCII character
+ OSRF_BUFFER_ADD_CHAR(buf, string[idx]);
+ } else {
+ c = string[idx];
+
+ /* escape the usual suspects */
+ switch(c) {
+ case '"':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, '"');
+ break;
+
+ case '\b':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, 'b');
+ break;
+
+ case '\f':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, 'f');
+ break;
+
+ case '\t':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, 't');
+ break;
+
+ case '\n':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, 'n');
+ break;
+
+ case '\r':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, 'r');
+ break;
+
+ case '\\':
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ OSRF_BUFFER_ADD_CHAR(buf, '\\');
+ break;
+
+ default:
+ {
+ // Represent as \u followed by four hex characters
+ hex_code[ 4 ] = hex_chars[ c >> 4 ]; // high nybble
+ hex_code[ 5 ] = hex_chars[ c & 0x0F ]; // low nybble
+ hex_code[ 6 ] = '\0';
+ OSRF_BUFFER_ADD(buf, (char *) hex_code);
+ break;
+ }
+ }
+ }
+
+ idx++;
+ }
+
+ return 0;
+ }
// A function to turn a process into a daemon
int daemonize( void ) {
*** ./trunk/include/opensrf/utils.h 2008-07-06 09:54:16.000000000 -0500
--- ./trunk-mod/include/opensrf/utils.h 2008-11-17 16:06:47.000000000 -0600
***************
*** 218,223 ****
--- 220,226 ----
*/
char* uescape( const char* string, int size, int full_escape );
+ int buffer_append_uescape( growing_buffer* buf, const char* string );
/* utility methods */
int set_fl( int fd, int flags );
/* Test harness for new versions of uencode() */
#include <stdlib.h>
#include <stdio.h>
#include "utils.h"
static char text[] = "Goober";
static char long_text[] =
"[EMAIL PROTECTED]&*(),./<>?;'[]:{}-=_+\\|"
"[EMAIL PROTECTED]&*(),./<>?;'[]:{}-=_+\\|"
"[EMAIL PROTECTED]&*(),./<>?;'[]:{}-=_+\\|";
static char control_chars[ ' ' ];
static void init_control_chars();
static void buffer_compare_output( const char * s );
static void buffer_compare_speed( const char * s );
int main()
{
init_control_chars(); // Fill array with control characters
// Verify correctness
buffer_compare_output( text );
buffer_compare_output( control_chars );
// Compare performance
printf( "\nPerformance test for short text:\n" );
buffer_compare_speed( text );
printf( "Performance test for long text:\n" );
buffer_compare_speed( long_text );
printf( "Performance test for control characters:\n" );
buffer_compare_speed( control_chars );
return 0;
}
// Fill control_chars[] with control characters 1 through 31
static void init_control_chars()
{
int i;
for( i = 1; i < ' '; ++i )
control_chars[ i - 1 ] = i;
control_chars[ ' ' - 1 ] = '\0';
}
static void buffer_compare_output( const char * s )
{
char * old_escaped = uescape( s, strlen( text ), 1 );
printf( "Old result: %s\n", old_escaped );
growing_buffer * new_buf = buffer_init( 512 );
buffer_append_uescape( new_buf, s );
char * new_escaped = buffer_release( new_buf );
new_buf = NULL;
printf( "New result: %s\n", new_escaped );
if( strcmp( old_escaped, new_escaped ) )
printf( "Different results!\n" );
free( old_escaped );
free( new_escaped );
}
static void buffer_compare_speed( const char * s )
{
int i;
static const int iterations = 51200;
char * result;
int size;
double start_time;
double old_elapsed;
double new_elapsed;
growing_buffer * buf = buffer_init( 512 );
start_time = get_timestamp_millis();
for( i = iterations; i; --i )
{
size = strlen( s );
result = uescape( s, size, 1 );
OSRF_BUFFER_ADD( buf, result );
free( result );
buffer_reset( buf );
}
old_elapsed = get_timestamp_millis() - start_time;
printf( "Old elapsed time: %f\n", old_elapsed );
start_time = get_timestamp_millis();
for( i = iterations; i; --i )
{
buffer_append_uescape( buf, s );
buffer_reset( buf );
}
new_elapsed = get_timestamp_millis() - start_time;
printf( "New elapsed time: %f\n", new_elapsed );
if( old_elapsed > new_elapsed )
printf( "Speedup: %.2f %%\n\n", ( old_elapsed / new_elapsed - 1.0 ) * 100 );
buffer_free( buf );
}