On Mon, 25 Jan 2010, Rainer Gerhards wrote:

> David,
>
> we need to make a distinction between UTF, a transformation (and transfer)
> format and UCS, the actual native encoding format here. I think you mix these
> two things up. Unicode has two (primary) flavors, which are usually encoded
> in UCS-16 and UCS-32 (or was it named UCS-2 and UCS-4 - guess so), being 2 and
> 4 bytes respectively. UCS-16 is what is implemented for example in Windows.
> It covers many of this world's scripts, but has proven to not cover all, which
> caused additional code tables and UCS-32 presentation (at least as far as I
> know, I am not a Unicode expert ;)).
>
> UTF-8 is an encoding of Unicode code tables. You can think of it as
> traditional multi-byte character set which means each character takes up a
> varying number of bytes. Usually, UTF representations are converted into UCS
> and then UCS is used to do the processing. While UCS requires more bytes, UTF
> requires parsing of the message *each time* it is processed (e.g. to check
> for a string match, count character sizes, obtain a substring). So using UTF
> may use up fewer bytes, but can very considerably increase the processing time
> needed and program complexity. For US-ASCII, of course, this is no problem. But
> for other encodings, the performance hit can be very severe, much more than
> the hit by double memory consumption (UCS-2 is still being considered as
> "sufficient" for almost all cases, even in the future).

Thanks for the clarification on terms. I had the basic understanding, but 
not the exact terminology.

> So I don't think it would serve the non-US-ASCII world well to process the
> transformation formats. I guess that's a good option if you have a US-ASCII
> based system that only very occasionally needs to process a foreign language
> string (and even then, you need to parse the message *each* time you access
> it, specifically when obtaining substrings...).
>
> My conclusion is that rsyslog needs to do a UTF to UCS conversion on entry to
> the system and then uses UCS internally (and converts back when messages are
> output). Many software systems do so, and, as I said, IMHO do so for good
> reasons.

The question is how many places (and how often) we parse the data as 
strings, versus how many places we just move the data around as 
essentially opaque blobs.

When we receive and parse the message we have to deal with the data as 
strings of characters, but this is generally done in one pass through the 
input data, so processing the data as-is costs about the same as 
converting it to UCS-2 (let alone then processing it as UCS-2). This same 
pass can calculate the number of characters in the string (i.e. 'length') 
and store it.
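As a rough illustration of counting characters in that one pass (a 
minimal Python sketch, not rsyslog code; the function name 
utf8_char_count is made up for the example, and it assumes the input is 
well-formed UTF-8):

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a
    # new character, so counting the non-continuation bytes gives the
    # character count.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# "naive cafe" with two accented letters: 10 characters, 12 bytes.
msg = "na\u00efve caf\u00e9".encode("utf-8")
char_len = utf8_char_count(msg)
byte_len = len(msg)
```

The count could be stored next to the byte length when the message is 
parsed, so later stages never have to rescan the data to learn it.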

Then these parsed chunks of data get copied around (in complex 
configurations with many queues, they get copied around a LOT).

At some point (or points) comparisons are made, but in most cases these 
comparisons can be done byte-by-byte; you don't actually have to parse the 
data. (For regex matches you do, and for 'contains' you would have to check 
the byte prior to the start of the match to make sure that the first 
matching byte isn't the tail end of a prior character, but I think that's 
it.)
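A byte-level 'contains' with that kind of boundary check might look like 
the following (a Python sketch only; utf8_contains and is_continuation 
are hypothetical names). It tests the first matched byte itself, which 
in UTF-8 is enough to tell whether the match starts in the middle of a 
character. Note that when the search string is itself well-formed UTF-8 
the guard never fires, because lead bytes and continuation bytes use 
disjoint bit patterns; it only matters for arbitrary byte patterns.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def utf8_contains(haystack: bytes, needle: bytes) -> bool:
    """Byte-wise substring test that rejects matches starting mid-character."""
    if not needle:
        return True
    start = 0
    while (i := haystack.find(needle, start)) != -1:
        # Accept the match only if it begins on a character boundary.
        if not is_continuation(haystack[i]):
            return True
        start = i + 1
    return False


line = "gr\u00f6\u00dfe=42".encode("utf-8")    # o-umlaut and sharp s, 2 bytes each
whole_chars = "\u00dfe".encode("utf-8")        # well-formed search string
mid_char = b"\x9fe"                            # starts with the tail byte of sharp s
```

Here utf8_contains(line, whole_chars) is true, while the mid_char 
pattern is rejected even though a plain byte search would find it.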

Eventually we create the output string. At that point we are 
assembling the string from the various substrings that we have stored 
(which still can be treated as a series of bytes). It's only when the 
property replacer is invoked with either character positions or options 
that the data needs to be treated as a UTF-8 string instead of a series of 
bytes again. Yes, there are a lot of things that it can do, but how much 
are they used in real life (other than setting a max length, which could 
be special cased to not be checked if the number of bytes is less than 
the length you are checking against)?
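The max-length special case could be sketched like this (Python, for 
illustration only; truncate_chars is a made-up name and the input is 
assumed to be well-formed UTF-8). Every character takes at least one 
byte, so a value with no more bytes than the character limit cannot 
exceed it and can be passed through without being examined:

```python
def truncate_chars(data: bytes, max_chars: int) -> bytes:
    """Limit UTF-8 data to max_chars characters, scanning only when needed."""
    # Fast path: byte count is an upper bound on the character count.
    if len(data) <= max_chars:
        return data
    # Slow path: walk the bytes and cut at the start of the first
    # character past the limit.
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # this byte starts a character
            if seen == max_chars:
                return data[:i]
            seen += 1
    return data                          # fewer than max_chars characters


short = truncate_chars(b"hello", 10)                               # fast path
cut = truncate_chars("h\u00e9llo w\u00f6rld".encode("utf-8"), 5)   # slow path
```

For mostly-ASCII logs with a generous limit, the fast path would be 
taken nearly every time.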

Remember that this is not general-purpose input and output that we are 
dealing with; it's logs. And like it or not, most logs really are in 
ASCII, simply because for so many years there was no option.

Also consider that the input and output stages can be split into multiple 
worker threads, while the queue manipulation (and copying) is done inside 
locks.

It may be best to leave the data as UTF-8 unless the property replacer has 
been given options, and then let the property replacer convert the data, 
work on it, and convert it back (converting only once if more than one 
option is being invoked).
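To make the idea concrete, here is a small Python sketch (not the real 
property replacer; render_property and its first/last/upper parameters 
are hypothetical stand-ins for character-position and option handling). 
The bytes are returned untouched when no options are given; otherwise 
they are decoded once, all options are applied, and the result is 
encoded once:

```python
from typing import Optional


def render_property(value: bytes,
                    first: Optional[int] = None,
                    last: Optional[int] = None,
                    upper: bool = False) -> bytes:
    """Return value as-is unless an option forces character-level work."""
    if first is None and last is None and not upper:
        return value                      # common case: opaque bytes, no decode
    # Decode once, apply every requested option, encode once.
    text = value.decode("utf-8", errors="replace")
    text = text[first:last]               # character positions, not byte offsets
    if upper:
        text = text.upper()
    return text.encode("utf-8")


raw = "\u00fcbergr\u00f6\u00dfe".encode("utf-8")
untouched = render_property(raw)
prefix = render_property(raw, 0, 4)
shout = render_property(raw, 0, 4, upper=True)
```

The no-option path hands back the very same bytes object, which is the 
point: the conversion cost is only paid by configurations that ask for 
character-level processing.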

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com