Re: perl unicode support

ＳｒｉｎＴｕａｒ Tue, 27 Mar 2007 19:54:42 -0800

2007/3/27, Daniel B. <[EMAIL PROTECTED]>:

What about when it breaks a string into substrings at some delimiter,
say, using a regular expression?  It has to break the underlying byte
string at a character boundary.



Unless you pass invalid UTF-8 sequences to your regular expression library, that
should be impossible. Breaking strings works great as long as you
pattern match for boundaries.

The only time it fails is if you break it at arbitrary byte
indexes. Note that breaking UTF-32 strings at arbitrary indices also
destroys the text.
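
To make that concrete, here is a minimal Python sketch (the sample string and the is_valid_utf8 helper are made up for illustration, not taken from any library). It splits raw UTF-8 bytes on an ASCII delimiter without decoding anything, then shows what a cut at an arbitrary byte index does. The last two lines show one reading of the UTF-32 remark: cutting between a base letter and its combining mark, even at a clean code point index, still damages the text.

```python
import re

# Hypothetical sample data: three fields separated by an ASCII comma.
text = "na\u00efve,\u65e5\u672c\u8a9e,caf\u00e9"
data = text.encode("utf-8")

# Split the raw bytes on the delimiter; nothing is decoded here.
parts = re.split(rb",", data)

# Each piece is still valid UTF-8, because the byte 0x2C can never
# occur inside a multi-byte sequence.
decoded = [p.decode("utf-8") for p in parts]


def is_valid_utf8(b: bytes) -> bool:
    """Illustrative helper: True if b decodes cleanly as UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# An arbitrary byte index can land inside a sequence:
# bytes 0..2 are "n", "a", and only the lead byte of U+00EF.
broken = data[:3]

# Code point indexing has its own hazard: "e" + combining acute accent
# is two code points, and cutting after the first drops the accent.
combined = "e\u0301"
head = combined[:1]
```
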

In fact, what about interpreting an underlying string of bytes as
the right individual characters in that regular expression?


The regular expression engine should be UTF-8 aware. The code that
uses and calls it has no need to be.
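
A rough illustration of where that knowledge lives, using Python's re module as a stand-in (an assumption: it matches code points on str and single bytes on bytes, so it is only an analogy for a UTF-8-aware engine). The calling code is identical in both cases; only the engine's idea of what "." means differs.

```python
import re

text = "caf\u00e9"
data = text.encode("utf-8")  # 5 bytes: U+00E9 takes two

# Byte-oriented matching: "." consumes one byte, so the two bytes of
# U+00E9 come back as separate matches.
byte_matches = re.findall(rb".", data, re.DOTALL)

# Character-oriented matching: "." consumes one whole character.
char_matches = re.findall(r".", text, re.DOTALL)
```
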

Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.


Only the most fancy interpretations require any knowledge of Unicode
code points. Any substring match on valid sequences will produce valid
boundaries in UTF-8, and that's the whole point.
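
A small sketch of that property, assuming both the haystack and the needle are valid UTF-8 (is_boundary is a hypothetical helper, not a library function). A plain byte-level search never decodes anything, yet the match starts and ends on character boundaries, because continuation bytes (0x80 to 0xBF) can never be mistaken for the start of a sequence.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """Illustrative helper: True if byte offset i starts a character
    or is the end of data, i.e. it is not a continuation byte."""
    return i == len(data) or (data[i] & 0xC0) != 0x80


# Eight characters of three bytes each, and a two-character needle.
haystack = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8".encode("utf-8")
needle = "\u30c6\u30ad".encode("utf-8")

# Byte-level substring search with no decoding.
start = haystack.find(needle)
end = start + len(needle)

# Both ends of the match are character boundaries, so the pieces on
# either side of the match remain valid UTF-8.
before = haystack[:start].decode("utf-8")
after = haystack[end:].decode("utf-8")
```
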
