On Wed, Mar 28, 2007 at 04:03:23PM +0200, Egmont Koblinger wrote:
> On Tue, Mar 27, 2007 at 01:51:59PM -0400, SrinTuar wrote:
>
> > I'm not quite sure how "thinking in characters" helps an application,
> > in general. I'd be interested if you had a concrete example...
>
> dealing with. For example it's impossible to implement a regexp matching
> routine if you have no idea what encoding is being used.
>
> > It's probably advisable to use a library regex engine than to re-write
> > custom regex engines all the time.
>
> Sure.
I think SrinTuar has made it clear that he agrees that a
regular expression engine needs to be able to interpret characters.
His point is that the calling code does not have to know anything
about characters, only strings.
> > Once you have a regex library that handles codepoints, the code that uses
> > it doesn't have to care about them in particular.
>
> It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
> 66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
> UTF-8). Suppose you want to test whether it matches to the regexp 65 46 66
> ("A.B"). Does it match? It depends. If the byte sequence really denotes AÁB
> (i.e. it is encoded in UTF-8) then it does. If it has different semantics (a
> different character sequence encoded in some other 8-bit encoding) then it
> doesn't. How do you think perl is supposed to overcome this problem if it
> didn't have Unicode support?
>
> You have to make sure that the string to test and the regexp itself are
> encoded in the same charset, and in turn this also matches the charset the
> regexp library routine expects.
When interpreting bytes as characters, you do so according to the
system's character encoding, as exposed by the C multibyte character
handling functions. On systems which allow the user to choose an
encoding, the user then selects it via the LC_CTYPE category. On my
system, it's always UTF-8 and not a runtime option.
If you want to process foreign encodings (not the system/locale native
encoding) then you should convert them to your native encoding first
(via iconv or a similar library). If your native encoding is not able
to represent all the characters in the foreign encoding then you're
out of luck and you should give up your legacy codepage and switch to
UTF-8 if you want multilingual support.
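
As a rough sketch of that conversion step, assuming Python's built-in
codecs in place of iconv and a made-up helper name to_native: the first
call converts ISO-8859-2 input to a UTF-8 native encoding, and the second
shows the "out of luck" case where the native encoding (ISO-8859-1 here,
chosen only as an example) cannot represent the character.

```python
def to_native(raw: bytes, foreign: str, native: str = "utf-8") -> bytes:
    """Re-encode bytes from a foreign encoding into the native one.

    Hypothetical helper; it plays the role iconv would play in C.
    """
    return raw.decode(foreign).encode(native)

# "AÁB" is three bytes in ISO-8859-2 and four bytes in UTF-8.
legacy = bytes([65, 193, 66])
converted = to_native(legacy, "iso-8859-2")

# U+0151 (o with double acute) exists in ISO-8859-2 but not in ISO-8859-1,
# so a Latin-1 native encoding cannot hold it.
try:
    to_native("\u0151".encode("utf-8"), "utf-8", native="iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

print(list(converted), representable)
```
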
> Otherwise things will go plain wrong sooner
> or later. In some languages regexp matching is done via functions, and
> probably you may have an 8-bit match() and a Unicode-aware mbmatch() as
> well.
I don't know which languages do this, but it's wrong. mbmatch() would
cover both cases just fine (i.e. it would work even if the native
encoding is 8bit). If you want a BYTE-based regex engine, that's
another matter, and AFAIK few languages provide such a thing
explicitly. (If they do, it's by misappropriating an
8bit-codepage-targeted matcher.) But treating bytes and 8bit codepage
encodings as the same thing is wrong. Bytes represent numbers in the
range 0-255. 8bit codepages represent 256-character subsets of
Unicode. These are not the same.
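
The distinction can be shown with a short Python sketch. The encodings
(ISO-8859-1 and ISO-8859-5) are picked only as examples, and Python's
bytes patterns stand in for a byte-based regex engine. One byte value is
a single number but a different character under each codepage, and a byte
matcher counts bytes rather than characters.

```python
import re

# One byte is just a number in the range 0-255 ...
byte = bytes([193])
as_number = byte[0]

# ... while each 8-bit codepage maps that number to its own character.
as_latin1 = byte.decode("iso-8859-1")    # U+00C1, Latin capital A with acute
as_cyrillic = byte.decode("iso-8859-5")  # U+0421, Cyrillic capital Es

# A byte-based matcher: "." means one byte, not one character.
data = bytes([65, 195, 129, 66])         # "AÁB" encoded in UTF-8
one_byte_dot = re.fullmatch(rb"A.B", data) is not None
two_byte_dots = re.fullmatch(rb"A..B", data) is not None

print(as_number, hex(ord(as_latin1)), hex(ord(as_cyrillic)))
print(one_byte_dot, two_byte_dots)
```
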
> > The problem is that as soon as you use a library routine that is utf-8
> > aware, it sets the utf-8 flag on a string and problems start to result.
> > If there was no utf-8 flag on the scalar strings to be set, then you
> > could stay in byte world all the time, while still using unicode
> > functionality where you needed it.
>
> As I've already said, there's absolutely nothing preventing you from _not_
> using the Unicode features of Perl at all. But then I'm just curious how you
> would match accented characters to regexps for example.
Regex would always match characters...
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/