> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of [email protected]
> Sent: Friday, January 22, 2010 7:19 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Unicode & rsyslog - was: RE: PostgreSQL:
> Problems with character encoding
> 
> On Fri, 22 Jan 2010, Rainer Gerhards wrote:
> 
> > However, even then I need to have a build time switch to turn this on/off,
> > because rsyslog in Unicode mode will take not only considerably more space
> > (especially with larger in-memory queues), it will also considerably affect
> > its performance (in terms of bytes, the memory transfer rate is effectively
> > cut in half, as most data in syslog is character-based - also think about
> > the effects on cache performance).

David,

we need to make a distinction here between UTF, a transformation (and
transfer) format, and UCS, the actual native encoding format. I think you are
mixing these two things up. Unicode has two (primary) flavors, which are
usually encoded in UCS-2 and UCS-4, being 2 and 4 bytes per character
respectively. UCS-2 is what is implemented for example in Windows. It covers
many of this world's scripts, but has proven not to cover all of them, which
led to additional code tables and the UCS-4 representation (at least as far
as I know, I am not a Unicode expert ;)).

UTF-8 is an encoding of the Unicode code tables. You can think of it as a
traditional multi-byte character set, which means each character takes up a
varying number of bytes. Usually, UTF representations are converted into UCS
and then UCS is used to do the processing. While UCS requires more bytes, UTF
requires parsing of the message *each time* it is processed (e.g. to check
for a string match, count characters, obtain a substring). So using UTF may
use up fewer bytes, but can considerably increase the processing time needed
and the program complexity. For US-ASCII, of course, this is no problem. But
for other encodings, the performance hit can be very severe, much more than
the hit from doubled memory consumption (UCS-2 is still being considered
"sufficient" for almost all cases, even in the future).
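
To illustrate the parsing cost I mean, here is a minimal sketch in Python
(not rsyslog code; all function names are made up for the example, and
UTF-32 stands in for a fixed-width UCS-4 buffer). Getting a substring by
character position from UTF-8 means scanning from the start of the buffer,
while with a fixed-width encoding it is plain offset arithmetic:

```python
def utf8_char_offset(buf: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th character, found by scanning."""
    seen = 0
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; any other byte starts
        # a new character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return pos
            seen += 1
    if seen == char_index:
        return len(buf)  # position just past the last character
    raise IndexError("character index out of range")


def utf8_substring(buf: bytes, start: int, length: int) -> bytes:
    """Substring by character position: two scans over the data."""
    begin = utf8_char_offset(buf, start)
    end = begin + utf8_char_offset(buf[begin:], length)
    return buf[begin:end]


def ucs4_substring(buf: bytes, start: int, length: int) -> bytes:
    """Same operation on 4-byte characters: no scan needed."""
    return buf[4 * start:4 * (start + length)]


msg = "Grüße aus Köln"
utf8 = msg.encode("utf-8")       # 17 bytes for 14 characters
ucs4 = msg.encode("utf-32-le")   # 56 bytes for 14 characters

print(utf8_substring(utf8, 6, 3).decode("utf-8"))      # aus
print(ucs4_substring(ucs4, 6, 3).decode("utf-32-le"))  # aus
```

The sketch assumes well-formed UTF-8 input and does no validation; a real
implementation would have to handle malformed sequences as well.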

So I don't think it would serve the non-US-ASCII world well to process the
transformation formats. I guess that's a good option if you have a US-ASCII
based system that only very occasionally needs to process a foreign language
string (and even then, you need to parse the message *each* time you access
it, specifically when obtaining substrings...).

My conclusion is that rsyslog needs to do a UTF to UCS conversion on entry to
the system and then use UCS internally (and convert back when messages are
output). Many software systems do so, and, as I said, IMHO do so for good
reasons.
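
As a rough sketch of that flow (again Python, not rsyslog code; the function
names are hypothetical, UTF-32 stands in for UCS-4, and UTF-16 is used as an
approximation of UCS-2 since Python has no UCS-2 codec), the conversion
happens once on input and once on output, and the storage cost of the
internal format can be measured directly:

```python
def to_internal(wire: bytes) -> bytes:
    """Convert once on entry: UTF-8 from the wire to 4 bytes per character."""
    return wire.decode("utf-8").encode("utf-32-le")


def to_wire(internal: bytes) -> bytes:
    """Convert back once on output: fixed-width buffer to UTF-8."""
    return internal.decode("utf-32-le").encode("utf-8")


def storage_cost(text: str) -> dict:
    """Bytes needed to hold the same text in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


line = "sshd: session opened"
assert to_wire(to_internal(line.encode("utf-8"))) == line.encode("utf-8")

print(storage_cost(line))
# {'utf-8': 20, 'utf-16-le': 40, 'utf-32-le': 80}
```

For a pure US-ASCII message this shows the doubling with 2-byte characters
that I mentioned for the in-memory queues, and a fourfold cost with 4-byte
characters.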

Rainer

> 
> if the code uses UTF-8 throughout this doesn't make sense. assuming the
> input is plain ascii, UTF-8 strings and ASCII strings should be the same
> size (there is some additional cpu cycles involved to figure out the
> length in characters for any output routines that grab substrings, but
> that should be all)
> 
> the only way things would take double the space (and therefore halve the
> memory transfer rate) is if it converts everything to UTF-16 strings
> internally. This is a bad idea to start with as UTF-16 does not handle all
> characters (which is why there is UTF-32 as well), but also because UTF-16
> is significantly more expensive to store/copy/etc than UTF-8 for the
> common case where most of the characters are ASCII.
> 
> It may be that you have picked the wrong string library to use. prior to
> UTF-8 being defined 'unicode' and UTF-16 were basically synonymous and a
> _lot_ of string libraries have been written with this assumption
> (converting everything to UTF-16 on input and to whatever on output). If
> you can find one that can handle the strings as UTF-8 internally it should
> be able to just about eliminate the overhead.
> 
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
