[OPEN-ILS-DEV] Another new function for uescaping UTF-8 strings

Scott McKellar Sat, 22 Nov 2008 21:31:13 -0800

The attached files contain a drop-in replacement for the 
buffer_append_uescape function that I submitted a few days ago.  I regard
this new one as experimental, at least for now.


They also offer some byte-testing functions, and the equivalent macros, 
that may be useful in other code that deals with UTF-8 strings.

The new function buffer_append_utf8() differs from buffer_append_uescape()
in the following ways:

1. It treats 0x7F as a control character, which it is.

2. It is more finicky about recognizing the header byte of multibyte
characters.  For example 0xF6 is not a valid UTF-8 header byte.

3. When it sees a nul byte in the middle of a multibyte character, it 
stops.  In the same situation, the older buffer_append_uescape() and
uescape() functions accumulate the nul byte into the hex codes they
build and then keep going, risking not only misbehavior but undefined
behavior.

4. When it finds invalid UTF-8 characters in the input string, it skips
over the invalid UTF-8 until it finds a valid character, and then
continues to translate the rest.  In other words it excises the garbage
and translates the rest intact.
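
As an illustration of points 3 and 4, here is a minimal standalone sketch
of the "excise the garbage" idea.  It is not the attached code: it copies
raw bytes instead of writing \u escapes, it uses plain range tests rather
than the lookup table, and the names seq_len() and excise_invalid_utf8()
are hypothetical.  Like the attached function, it checks only the
structure of each sequence (header byte plus the right number of
continuation bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes in a sequence that starts with b, or 0 if b
   cannot start one (continuation byte, 0xC0, 0xC1, 0xF5..0xFF). */
static int seq_len( unsigned char b ) {
    if( b < 0x80 )               return 1;
    if( b >= 0xC2 && b <= 0xDF ) return 2;
    if( b >= 0xE0 && b <= 0xEF ) return 3;
    if( b >= 0xF0 && b <= 0xF4 ) return 4;
    return 0;
}

/* Copy in to out, leaving out malformed bytes.  out must have room
   for strlen(in) + 1 bytes.  Returns the number of bytes dropped. */
size_t excise_invalid_utf8( const char* in, char* out ) {
    const unsigned char* s = (const unsigned char*) in;
    size_t i = 0, o = 0, dropped = 0;

    assert( in != NULL && out != NULL );

    while( s[i] != '\0' ) {
        int len = seq_len( s[i] );
        int k = 1;

        /* Count the continuation bytes actually present.  A nul is
           not a continuation byte, so this never reads past the end. */
        while( k < len && ( s[i + k] & 0xC0 ) == 0x80 )
            ++k;

        if( len != 0 && k == len ) {       /* complete sequence: keep it */
            memcpy( out + o, in + i, (size_t) len );
            o += (size_t) len;
            i += (size_t) len;
        } else {                           /* garbage: drop it and resync */
            size_t skip = ( len == 0 ) ? 1 : (size_t) k;
            dropped += skip;
            i += skip;
        }
    }
    out[o] = '\0';
    return dropped;
}
```

Given the bytes 'a', 0xF6, 'b', 0xC3, 0xA9, 0xE2, 0x82, 'c', this sketch
keeps "ab", the two-byte character C3 A9, and "c"; it drops the invalid
header 0xF6 and the truncated three-byte sequence E2 82.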

---------

The file osrf_utf8.c includes an array of bitmasks that it uses to look
up the characteristics of each byte.  Not trusting myself to do that
much tedious typing by hand, I wrote a program to write the list of
bitmasks.  The macros in osrf_utf8.h test individual bits of those
masks; they are broadly similar to the standard C functions isprint(),
isalpha(), and so forth.
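
The generator program itself is not among the attachments.  The following
is only a sketch of how such a table could be produced; the function names
mask_for_byte() and write_mask_table() and the F_* enumerators are
hypothetical, while the flag values and the byte ranges are taken from the
attached osrf_utf8.h and osrf_utf8.c.

```c
#include <assert.h>
#include <stdio.h>

/* Flag values as defined in the attached osrf_utf8.h */
enum {
    F_CONTROL = 0x01, F_PRINT  = 0x02, F_CONTINUE = 0x04,
    F_2_BYTE  = 0x08, F_3_BYTE = 0x10, F_4_BYTE   = 0x20,
    F_SYNC    = 0x40, F_VALID  = 0x80
};

/* Compute the mask for one byte value, and report the text
   that goes into the comment beside it. */
unsigned mask_for_byte( unsigned b, const char** desc ) {
    assert( b <= 0xFF && desc != NULL );

    if( b < 0x20 || b == 0x7F ) {
        *desc = "Control character";
        return F_VALID | F_SYNC | F_CONTROL;
    } else if( b < 0x7F ) {
        *desc = "Printable ASCII";
        return F_VALID | F_SYNC | F_PRINT;
    } else if( b < 0xC0 ) {
        *desc = "UTF-8 continuation";
        return F_VALID | F_CONTINUE;
    } else if( b >= 0xC2 && b <= 0xDF ) {
        *desc = "Header of 2-byte character";
        return F_VALID | F_SYNC | F_2_BYTE;
    } else if( b >= 0xE0 && b <= 0xEF ) {
        *desc = "Header of 3-byte character";
        return F_VALID | F_SYNC | F_3_BYTE;
    } else if( b >= 0xF0 && b <= 0xF4 ) {
        *desc = "Header of 4-byte character";
        return F_VALID | F_SYNC | F_4_BYTE;
    }
    *desc = "Invalid UTF-8";      /* 0xC0, 0xC1, 0xF5..0xFF */
    return 0;
}

/* Write all 256 initializers, one per line, in the style of the table. */
void write_mask_table( FILE* out ) {
    unsigned b;
    for( b = 0; b < 256; ++b ) {
        const char* desc;
        unsigned mask = mask_for_byte( b, &desc );
        char bits[9];
        int k;

        for( k = 0; k < 8; ++k )  /* spell out the byte in binary */
            bits[k] = ( b & ( 0x80u >> k ) ) ? '1' : '0';
        bits[8] = '\0';

        fprintf( out, "\t%u%s\t/* %s\t%s */\n",
                 mask, b < 255 ? "," : "", bits, desc );
    }
}
```

Calling write_mask_table( stdout ) would emit the body of the array, ready
to paste between the braces of the initializer.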

There is also a collection of functions, equivalent to the macros, with
the same names except using double underscores.  These may never find a
use, but they're there in case anyone ever needs a function pointer for
some reason.
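
One place a function pointer might come in handy is a generic routine that
takes the byte test as a parameter.  The sketch below is standalone and
only illustrative: count_bytes_if() and is_continuation_byte() are
hypothetical names, and is_continuation_byte() is a stand-in with the same
int(int) shape as is__utf8__continue().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in with the same signature as the is__utf8__*() functions */
int is_continuation_byte( int c ) {
    return ( c & 0xC0 ) == 0x80;
}

/* Count the bytes of s for which pred returns nonzero.  Any
   int(int) predicate will do, including those from <ctype.h>. */
size_t count_bytes_if( const char* s, int (*pred)( int ) ) {
    size_t n = 0;

    assert( s != NULL && pred != NULL );

    for( ; *s != '\0'; ++s ) {
        if( pred( (unsigned char) *s ) )   /* avoid negative char values */
            ++n;
    }
    return n;
}
```

With the attached files one would presumably pass is__utf8__continue or
one of its siblings in place of the stand-in.  A macro cannot be passed
this way, which is the reason for having the functions at all.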

The logic uses a finite state machine (FSM) to examine and dispatch each
byte in the input stream.  Because it needs to branch on the current
state as well as the type of each character, this logic is a little
slower than buffer_append_uescape().  However, pretty much any
implementation of the same behavior would probably incur some such extra
overhead in some form.
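
To show the general shape of such a state machine without the escaping
details, here is a reduced sketch that decodes a string into code points.
It is not the attached function: it has fewer states, it uses range tests
instead of the lookup table, and the names decode_utf8(), state_t and
ST_* are hypothetical.  The bit arithmetic is the same idea as in the
attached code: keep the payload bits of the header byte, then shift in six
bits from each continuation byte.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ST_START,    /* expecting nothing in particular    */
    ST_NEED_1,   /* expecting 1 more continuation byte */
    ST_NEED_2,   /* expecting 2 more                   */
    ST_NEED_3    /* expecting 3 more                   */
} state_t;

/* Decode str into cps[], which has room for max entries.  Malformed
   bytes are skipped.  Returns the number of code points stored. */
size_t decode_utf8( const char* str, unsigned long* cps, size_t max ) {
    const unsigned char* s = (const unsigned char*) str;
    state_t state = ST_START;
    unsigned long cp = 0;
    size_t i = 0, n = 0;

    assert( str != NULL && cps != NULL );

    while( s[i] != '\0' && n < max ) {
        unsigned char b = s[i];

        if( state == ST_START ) {
            if( b < 0x80 ) {
                cps[n++] = b;                          /* plain ASCII */
            } else if( b >= 0xC2 && b <= 0xDF ) {
                cp = b & 0x1F;  state = ST_NEED_1;
            } else if( b >= 0xE0 && b <= 0xEF ) {
                cp = b & 0x0F;  state = ST_NEED_2;
            } else if( b >= 0xF0 && b <= 0xF4 ) {
                cp = b & 0x07;  state = ST_NEED_3;
            }
            /* anything else is garbage: stay in ST_START and skip it */
            ++i;
        } else if( ( b & 0xC0 ) == 0x80 ) {
            cp = ( cp << 6 ) | ( b & 0x3F );           /* append 6 bits */
            if( state == ST_NEED_1 ) {
                cps[n++] = cp;                         /* character done */
                state = ST_START;
            } else {
                state = ( state == ST_NEED_3 ) ? ST_NEED_2 : ST_NEED_1;
            }
            ++i;
        } else {
            /* Wanted a continuation byte and got something else: drop
               the partial character and re-examine this byte afresh. */
            state = ST_START;
        }
    }
    return n;   /* a character cut short by the nul is dropped */
}
```

For example, the three bytes E2 82 AC give 0x02 from the header, then
(0x02 << 6) | 0x02 = 0x82, then (0x82 << 6) | 0x2C = 0x20AC, the euro sign.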

-------------

Please note that this new function does *not* address the concerns I wrote
about in my previous post -- namely the fact that both uescape() and
buffer_append_uescape() create ambiguities that no decoding scheme can
resolve.  However, because the FSM logic systematically recognizes
every possible situation, it should be a straightforward matter to 
implement a new set of rules, once we decide what those rules should be.

Scott McKellar
http://home.swbell.net/mck9/ct/

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license indicated
in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source license
and I have the right under that license to submit that work with
modifications, whether created in whole or in part by me, under the
same open source license (unless I am permitted to submit under a
different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person
who certified (a), (b) or (c) and I have not modified it; and

(d) In the case of each of (a), (b), or (c), I understand and agree
that this project and the contribution are public and that a record
of the contribution (including all personal information I submit
with it, including my sign-off) is maintained indefinitely and may
be redistributed consistent with this project or the open source
license indicated in the file.

/*----------------------------------------------------
 Desc    : functions and macros for processing UTF-8
 Author  : Scott McKellar
 Notes   : 

 Copyright 2008 Scott McKellar
 All Rights reserved
 
 Date       Change
 ---------- -----------------------------------------
 2008/11/20 Initial creation
 ---------------------------------------------------*/
#include <opensrf/utils.h>
#include <opensrf/osrf_utf8.h>

unsigned char osrf_utf8_mask_[] =
{
	193,	/* 00000000	Control character */
	193,	/* 00000001	Control character */
	193,	/* 00000010	Control character */
	193,	/* 00000011	Control character */
	193,	/* 00000100	Control character */
	193,	/* 00000101	Control character */
	193,	/* 00000110	Control character */
	193,	/* 00000111	Control character */
	193,	/* 00001000	Control character */
	193,	/* 00001001	Control character */
	193,	/* 00001010	Control character */
	193,	/* 00001011	Control character */
	193,	/* 00001100	Control character */
	193,	/* 00001101	Control character */
	193,	/* 00001110	Control character */
	193,	/* 00001111	Control character */
	193,	/* 00010000	Control character */
	193,	/* 00010001	Control character */
	193,	/* 00010010	Control character */
	193,	/* 00010011	Control character */
	193,	/* 00010100	Control character */
	193,	/* 00010101	Control character */
	193,	/* 00010110	Control character */
	193,	/* 00010111	Control character */
	193,	/* 00011000	Control character */
	193,	/* 00011001	Control character */
	193,	/* 00011010	Control character */
	193,	/* 00011011	Control character */
	193,	/* 00011100	Control character */
	193,	/* 00011101	Control character */
	193,	/* 00011110	Control character */
	193,	/* 00011111	Control character */
	194,	/* 00100000	Printable ASCII */
	194,	/* 00100001	Printable ASCII */
	194,	/* 00100010	Printable ASCII */
	194,	/* 00100011	Printable ASCII */
	194,	/* 00100100	Printable ASCII */
	194,	/* 00100101	Printable ASCII */
	194,	/* 00100110	Printable ASCII */
	194,	/* 00100111	Printable ASCII */
	194,	/* 00101000	Printable ASCII */
	194,	/* 00101001	Printable ASCII */
	194,	/* 00101010	Printable ASCII */
	194,	/* 00101011	Printable ASCII */
	194,	/* 00101100	Printable ASCII */
	194,	/* 00101101	Printable ASCII */
	194,	/* 00101110	Printable ASCII */
	194,	/* 00101111	Printable ASCII */
	194,	/* 00110000	Printable ASCII */
	194,	/* 00110001	Printable ASCII */
	194,	/* 00110010	Printable ASCII */
	194,	/* 00110011	Printable ASCII */
	194,	/* 00110100	Printable ASCII */
	194,	/* 00110101	Printable ASCII */
	194,	/* 00110110	Printable ASCII */
	194,	/* 00110111	Printable ASCII */
	194,	/* 00111000	Printable ASCII */
	194,	/* 00111001	Printable ASCII */
	194,	/* 00111010	Printable ASCII */
	194,	/* 00111011	Printable ASCII */
	194,	/* 00111100	Printable ASCII */
	194,	/* 00111101	Printable ASCII */
	194,	/* 00111110	Printable ASCII */
	194,	/* 00111111	Printable ASCII */
	194,	/* 01000000	Printable ASCII */
	194,	/* 01000001	Printable ASCII */
	194,	/* 01000010	Printable ASCII */
	194,	/* 01000011	Printable ASCII */
	194,	/* 01000100	Printable ASCII */
	194,	/* 01000101	Printable ASCII */
	194,	/* 01000110	Printable ASCII */
	194,	/* 01000111	Printable ASCII */
	194,	/* 01001000	Printable ASCII */
	194,	/* 01001001	Printable ASCII */
	194,	/* 01001010	Printable ASCII */
	194,	/* 01001011	Printable ASCII */
	194,	/* 01001100	Printable ASCII */
	194,	/* 01001101	Printable ASCII */
	194,	/* 01001110	Printable ASCII */
	194,	/* 01001111	Printable ASCII */
	194,	/* 01010000	Printable ASCII */
	194,	/* 01010001	Printable ASCII */
	194,	/* 01010010	Printable ASCII */
	194,	/* 01010011	Printable ASCII */
	194,	/* 01010100	Printable ASCII */
	194,	/* 01010101	Printable ASCII */
	194,	/* 01010110	Printable ASCII */
	194,	/* 01010111	Printable ASCII */
	194,	/* 01011000	Printable ASCII */
	194,	/* 01011001	Printable ASCII */
	194,	/* 01011010	Printable ASCII */
	194,	/* 01011011	Printable ASCII */
	194,	/* 01011100	Printable ASCII */
	194,	/* 01011101	Printable ASCII */
	194,	/* 01011110	Printable ASCII */
	194,	/* 01011111	Printable ASCII */
	194,	/* 01100000	Printable ASCII */
	194,	/* 01100001	Printable ASCII */
	194,	/* 01100010	Printable ASCII */
	194,	/* 01100011	Printable ASCII */
	194,	/* 01100100	Printable ASCII */
	194,	/* 01100101	Printable ASCII */
	194,	/* 01100110	Printable ASCII */
	194,	/* 01100111	Printable ASCII */
	194,	/* 01101000	Printable ASCII */
	194,	/* 01101001	Printable ASCII */
	194,	/* 01101010	Printable ASCII */
	194,	/* 01101011	Printable ASCII */
	194,	/* 01101100	Printable ASCII */
	194,	/* 01101101	Printable ASCII */
	194,	/* 01101110	Printable ASCII */
	194,	/* 01101111	Printable ASCII */
	194,	/* 01110000	Printable ASCII */
	194,	/* 01110001	Printable ASCII */
	194,	/* 01110010	Printable ASCII */
	194,	/* 01110011	Printable ASCII */
	194,	/* 01110100	Printable ASCII */
	194,	/* 01110101	Printable ASCII */
	194,	/* 01110110	Printable ASCII */
	194,	/* 01110111	Printable ASCII */
	194,	/* 01111000	Printable ASCII */
	194,	/* 01111001	Printable ASCII */
	194,	/* 01111010	Printable ASCII */
	194,	/* 01111011	Printable ASCII */
	194,	/* 01111100	Printable ASCII */
	194,	/* 01111101	Printable ASCII */
	194,	/* 01111110	Printable ASCII */
	193,	/* 01111111	Control character */
	132,	/* 10000000	UTF-8 continuation */
	132,	/* 10000001	UTF-8 continuation */
	132,	/* 10000010	UTF-8 continuation */
	132,	/* 10000011	UTF-8 continuation */
	132,	/* 10000100	UTF-8 continuation */
	132,	/* 10000101	UTF-8 continuation */
	132,	/* 10000110	UTF-8 continuation */
	132,	/* 10000111	UTF-8 continuation */
	132,	/* 10001000	UTF-8 continuation */
	132,	/* 10001001	UTF-8 continuation */
	132,	/* 10001010	UTF-8 continuation */
	132,	/* 10001011	UTF-8 continuation */
	132,	/* 10001100	UTF-8 continuation */
	132,	/* 10001101	UTF-8 continuation */
	132,	/* 10001110	UTF-8 continuation */
	132,	/* 10001111	UTF-8 continuation */
	132,	/* 10010000	UTF-8 continuation */
	132,	/* 10010001	UTF-8 continuation */
	132,	/* 10010010	UTF-8 continuation */
	132,	/* 10010011	UTF-8 continuation */
	132,	/* 10010100	UTF-8 continuation */
	132,	/* 10010101	UTF-8 continuation */
	132,	/* 10010110	UTF-8 continuation */
	132,	/* 10010111	UTF-8 continuation */
	132,	/* 10011000	UTF-8 continuation */
	132,	/* 10011001	UTF-8 continuation */
	132,	/* 10011010	UTF-8 continuation */
	132,	/* 10011011	UTF-8 continuation */
	132,	/* 10011100	UTF-8 continuation */
	132,	/* 10011101	UTF-8 continuation */
	132,	/* 10011110	UTF-8 continuation */
	132,	/* 10011111	UTF-8 continuation */
	132,	/* 10100000	UTF-8 continuation */
	132,	/* 10100001	UTF-8 continuation */
	132,	/* 10100010	UTF-8 continuation */
	132,	/* 10100011	UTF-8 continuation */
	132,	/* 10100100	UTF-8 continuation */
	132,	/* 10100101	UTF-8 continuation */
	132,	/* 10100110	UTF-8 continuation */
	132,	/* 10100111	UTF-8 continuation */
	132,	/* 10101000	UTF-8 continuation */
	132,	/* 10101001	UTF-8 continuation */
	132,	/* 10101010	UTF-8 continuation */
	132,	/* 10101011	UTF-8 continuation */
	132,	/* 10101100	UTF-8 continuation */
	132,	/* 10101101	UTF-8 continuation */
	132,	/* 10101110	UTF-8 continuation */
	132,	/* 10101111	UTF-8 continuation */
	132,	/* 10110000	UTF-8 continuation */
	132,	/* 10110001	UTF-8 continuation */
	132,	/* 10110010	UTF-8 continuation */
	132,	/* 10110011	UTF-8 continuation */
	132,	/* 10110100	UTF-8 continuation */
	132,	/* 10110101	UTF-8 continuation */
	132,	/* 10110110	UTF-8 continuation */
	132,	/* 10110111	UTF-8 continuation */
	132,	/* 10111000	UTF-8 continuation */
	132,	/* 10111001	UTF-8 continuation */
	132,	/* 10111010	UTF-8 continuation */
	132,	/* 10111011	UTF-8 continuation */
	132,	/* 10111100	UTF-8 continuation */
	132,	/* 10111101	UTF-8 continuation */
	132,	/* 10111110	UTF-8 continuation */
	132,	/* 10111111	UTF-8 continuation */
	0,	/* 11000000	Invalid UTF-8 */
	0,	/* 11000001	Invalid UTF-8 */
	200,	/* 11000010	Header of 2-byte character */
	200,	/* 11000011	Header of 2-byte character */
	200,	/* 11000100	Header of 2-byte character */
	200,	/* 11000101	Header of 2-byte character */
	200,	/* 11000110	Header of 2-byte character */
	200,	/* 11000111	Header of 2-byte character */
	200,	/* 11001000	Header of 2-byte character */
	200,	/* 11001001	Header of 2-byte character */
	200,	/* 11001010	Header of 2-byte character */
	200,	/* 11001011	Header of 2-byte character */
	200,	/* 11001100	Header of 2-byte character */
	200,	/* 11001101	Header of 2-byte character */
	200,	/* 11001110	Header of 2-byte character */
	200,	/* 11001111	Header of 2-byte character */
	200,	/* 11010000	Header of 2-byte character */
	200,	/* 11010001	Header of 2-byte character */
	200,	/* 11010010	Header of 2-byte character */
	200,	/* 11010011	Header of 2-byte character */
	200,	/* 11010100	Header of 2-byte character */
	200,	/* 11010101	Header of 2-byte character */
	200,	/* 11010110	Header of 2-byte character */
	200,	/* 11010111	Header of 2-byte character */
	200,	/* 11011000	Header of 2-byte character */
	200,	/* 11011001	Header of 2-byte character */
	200,	/* 11011010	Header of 2-byte character */
	200,	/* 11011011	Header of 2-byte character */
	200,	/* 11011100	Header of 2-byte character */
	200,	/* 11011101	Header of 2-byte character */
	200,	/* 11011110	Header of 2-byte character */
	200,	/* 11011111	Header of 2-byte character */
	208,	/* 11100000	Header of 3-byte character */
	208,	/* 11100001	Header of 3-byte character */
	208,	/* 11100010	Header of 3-byte character */
	208,	/* 11100011	Header of 3-byte character */
	208,	/* 11100100	Header of 3-byte character */
	208,	/* 11100101	Header of 3-byte character */
	208,	/* 11100110	Header of 3-byte character */
	208,	/* 11100111	Header of 3-byte character */
	208,	/* 11101000	Header of 3-byte character */
	208,	/* 11101001	Header of 3-byte character */
	208,	/* 11101010	Header of 3-byte character */
	208,	/* 11101011	Header of 3-byte character */
	208,	/* 11101100	Header of 3-byte character */
	208,	/* 11101101	Header of 3-byte character */
	208,	/* 11101110	Header of 3-byte character */
	208,	/* 11101111	Header of 3-byte character */
	224,	/* 11110000	Header of 4-byte character */
	224,	/* 11110001	Header of 4-byte character */
	224,	/* 11110010	Header of 4-byte character */
	224,	/* 11110011	Header of 4-byte character */
	224,	/* 11110100	Header of 4-byte character */
	0,	/* 11110101	Invalid UTF-8 */
	0,	/* 11110110	Invalid UTF-8 */
	0,	/* 11110111	Invalid UTF-8 */
	0,	/* 11111000	Invalid UTF-8 */
	0,	/* 11111001	Invalid UTF-8 */
	0,	/* 11111010	Invalid UTF-8 */
	0,	/* 11111011	Invalid UTF-8 */
	0,	/* 11111100	Invalid UTF-8 */
	0,	/* 11111101	Invalid UTF-8 */
	0,	/* 11111110	Invalid UTF-8 */
	0	/* 11111111	Invalid UTF-8 */
};

// Functions equivalent to the corresponding macros, for cases
// where you need a function pointer

int is__utf8__control( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_CONTROL;
}

int is__utf8__print( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_PRINT;
}

int is__utf8__continue( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_CONTINUE;
}

int is__utf8__2_byte( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_2_BYTE;
}

int is__utf8__3_byte( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_3_BYTE;
}

int is__utf8__4_byte( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_4_BYTE;
}

int is__utf8__sync( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_SYNC;
}

int is__utf8( int c ) {
	return osrf_utf8_mask_[ c & 0xFF ] & UTF8_VALID;
}

typedef enum
{
	S_BEGIN,   // Expecting nothing in particular
	S_2_OF_2,  // Expecting second of 2-byte character
	S_2_OF_3,  // Expecting second of 3-byte character
	S_3_OF_3,  // Expecting third of 3-byte character
	S_2_OF_4,  // Expecting second of 4-byte character
	S_3_OF_4,  // Expecting third of 4-byte character
	S_4_OF_4,  // Expecting fourth of 4-byte character
	S_ERROR,   // Looking for a valid byte to resync with
	S_END      // Found a terminal nul
} utf8_state;

int buffer_append_utf8( growing_buffer* buf, const char* string ) {
	utf8_state state = S_BEGIN;
	unsigned long utf8_char;
	const unsigned char* s = (const unsigned char *) string;
	int i = 0;
	int rc = 0;

	do
	{
		switch( state )
		{
			case S_BEGIN :

				while( s[i] && (s[i] < 0x80) ) {    // Handle ASCII
					if( is_utf8_print( s[i] ) ) {   // Printable
						switch( s[i] )
						{
							case '"' :
							case '\\' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								// Deliberate fall-through: emit the character itself
							default :
								OSRF_BUFFER_ADD_CHAR( buf, s[i] );
								break;
						}
					} else if( s[i] ) {   // Control character

						switch( s[i] )    // Escape some
						{
							case '\n' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								OSRF_BUFFER_ADD_CHAR( buf, 'n' );
								break;
							case '\t' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								OSRF_BUFFER_ADD_CHAR( buf, 't' );
								break;
							case '\r' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								OSRF_BUFFER_ADD_CHAR( buf, 'r' );
								break;
							case '\f' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								OSRF_BUFFER_ADD_CHAR( buf, 'f' );
								break;
							case '\b' :
								OSRF_BUFFER_ADD_CHAR( buf, '\\' );
								OSRF_BUFFER_ADD_CHAR( buf, 'b' );
								break;
							default : {   // Format the rest in hex
								static const char hex_chars[] = "0123456789abcdef";
								static char hex_code[7] = "\\u00";

								hex_code[ 4 ] = hex_chars[ s[i] >> 4 ];    // high nybble
								hex_code[ 5 ] = hex_chars[ s[i] & 0x0F ];  // low nybble
								hex_code[ 6 ] = '\0';
								OSRF_BUFFER_ADD( buf, hex_code );
								break;
							}
						}
					}
					++i;
				}

				// If the next byte is the first of a multibyte sequence, we zero out
				// the length bits and store the rest.
				
				if( '\0' == s[i] )
					state = S_END;
				else if( 128 > s[i] )
					state = S_BEGIN;
				else if( is_utf8_2_byte( s[i] ) ) {
					utf8_char = s[i] ^ 0xC0;
					state = S_2_OF_2;   // Expect 1 continuation byte
				} else if( is_utf8_3_byte( s[i] ) ) {
					utf8_char = s[i] ^ 0xE0;
					state = S_2_OF_3;   // Expect 2 continuation bytes
				} else if( is_utf8_4_byte( s[i] ) ) {
					utf8_char = s[i] ^ 0xF0;
					state = S_2_OF_4;   // Expect 3 continuation bytes
				} else {
					if( 0 == rc )
						rc = i;
					state = S_ERROR;
				}
				
				++i;
				break;
			case S_2_OF_2 :  // Expect second byte of 2-byte character
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					buffer_fadd(buf, "\\u%04lx", utf8_char);  // Finish UTF-8 character
					state = S_BEGIN;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_2_OF_3 :
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					state = S_3_OF_3;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_3_OF_3 :
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					buffer_fadd(buf, "\\u%04lx", utf8_char);  // Finish UTF-8 character
					state = S_BEGIN;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_2_OF_4 :
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					state = S_3_OF_4;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_3_OF_4 :
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					state = S_4_OF_4;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_4_OF_4 :
				if( is_utf8_continue( s[i] ) ) {  // Append lower 6 bits
					utf8_char = (utf8_char << 6) | (s[i] & 0x3F);
					buffer_fadd(buf, "\\u%04lx", utf8_char);  // Finish UTF-8 character
					state = S_BEGIN;
					++i;
				} else if( '\0' == s[i] ) {  // Unexpected end of string
					if( 0 == rc )
						rc = i;
					state = S_END;
				} else {   // Non-continuation character
					if( 0 == rc )
						rc = i;
					state = S_BEGIN;
				}
				break;
			case S_ERROR :
				if( '\0' == s[i] )
					state = S_END;
				else if( is_utf8_sync( s[i] ) )
					state = S_BEGIN;  // Resume translation
				else
					++i;

				break;
			default :
				state = S_END;
				break;
		}
	} while ( state != S_END );
	
	return rc;
}

/*----------------------------------------------------
 Desc    : functions and macros for processing UTF-8
 Author  : Scott McKellar
 Notes   : 

 Copyright 2008 Scott McKellar
 All Rights reserved
 
 Date       Change
 ---------- -----------------------------------------
 2008/11/20 Initial creation
 ---------------------------------------------------*/

#ifndef OSRF_UTF8_H
#define OSRF_UTF8_H

extern unsigned char osrf_utf8_mask_[];  // Lookup table of bitmasks

// Meanings of the various bit switches:

#define UTF8_CONTROL  0x01
#define UTF8_PRINT    0x02
#define UTF8_CONTINUE 0x04
#define UTF8_2_BYTE   0x08
#define UTF8_3_BYTE   0x10
#define UTF8_4_BYTE   0x20
#define UTF8_SYNC     0x40
#define UTF8_VALID    0x80

// macros:

#define is_utf8_control( x )  ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_CONTROL )
#define is_utf8_print( x )    ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_PRINT )
#define is_utf8_continue( x ) ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_CONTINUE )
#define is_utf8_2_byte( x )   ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_2_BYTE )
#define is_utf8_3_byte( x )   ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_3_BYTE )
#define is_utf8_4_byte( x )   ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_4_BYTE )
#define is_utf8_sync( x )     ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_SYNC )
#define is_utf8( x )          ( osrf_utf8_mask_[ (x) & 0xFF ] & UTF8_VALID )

// Equivalent functions, for when you need a function pointer

int is__utf8__control( int c );
int is__utf8__print( int c );
int is__utf8__continue( int c );
int is__utf8__2_byte( int c );
int is__utf8__3_byte( int c );
int is__utf8__4_byte( int c );
int is__utf8__sync( int c );
int is__utf8( int c );

// Translate a string, escaping as needed, and append the
// result to a growing_buffer

int buffer_append_utf8( growing_buffer* buf, const char* string );

#endif
