Re: MultiByte Character Sets and False Matches

Tim Bunce Mon, 28 Jun 2004 02:10:25 -0700

On Mon, Jun 28, 2004 at 04:32:31PM +0900, ブライアン wrote:
> Is there any support in the DBD package (or any workarounds) for
> handling searches in MultiByte Character Set data?
> 
> The problem below has occurred with DBD::CSV and DBD::InterBase.
> I have not had the resources to test other packages (e.g. DBD::Oracle).
> 
> Through DBD I use a SELECT statement to find matches for Japanese
> EUC strings in record fields that contain Japanese text.
> 
> For the most part it works, but there are also false matches.
> 
> In Perl (5.6) itself, regular expressions match byte by byte
> rather than character by character.


Unless you're using Unicode. (Though Unicode handling in 5.6 is buggy.
Best to use a recent 5.8.x.)
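
To make the byte-versus-character difference concrete, here is a minimal
sketch using the byte values quoted further down in this message. It
assumes Perl 5.8 or later with the core Encode module and that the data
really is EUC-JP; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte values taken from the example in the quoted message.
my $field_bytes  = "\xA5\xB9\xA5\xC8";   # two 2-byte EUC-JP katakana
my $search_bytes = "\xB9\xA5";           # one 2-byte EUC-JP kanji

# Byte semantics: the search string is "found" straddling the two katakana.
my $byte_pos = index($field_bytes, $search_bytes);

# Character semantics: decode both sides first, then compare characters.
my $field_chars  = decode('euc-jp', $field_bytes);
my $search_chars = decode('euc-jp', $search_bytes);
my $char_pos     = index($field_chars, $search_chars);

print "byte-level index: $byte_pos\n";   # 1  (false match)
print "char-level index: $char_pos\n";   # -1 (no match)
```

Once both strings are decoded, index() and regular expressions work on
whole characters, so the straddling match can no longer happen.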

> The same also seems to be true of the DBD packages mentioned above.
> 
> A little explanation:
> Japanese has three character types:
> (1) Hiragana (around 80 characters)
> (2) Katakana (around 80 characters, used for foreign-derived words)
> (3) Kanji (thousands of characters, like pictograms)
> 
> Say, in an EUC-encoded table "SAMPLE_TABLE", a record field
> (for example field "TEXT_FIELD_1") has a string which contains two
> sequential 2-byte Katakana characters
> (Katakana Character 1 = \xA5\xB9, Katakana Character 2 = \xA5\xC8)
> 
> If I use a SELECT statement to find matches for the 2-byte Kanji
> character "\xB9\xA5", it will match the above record, i.e.
> 
> $search_str = "\xB9\xA5";
> 
> $sql_str = "SELECT REC_ID FROM SAMPLE_TABLE WHERE TEXT_FIELD_1 LIKE '%$search_str%'";
>
> It will find a Kanji match in the middle 2 bytes of the above Katakana
> character string, i.e. \xA5(\xB9\xA5)\xC8

You need to be able to tell the database what the character set of the
quoted string is. Different databases have different ways of doing that
(though some can't, including DBD::CSV).
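
Where the driver can't be told the character set, one possible workaround
is to let the database over-match and then re-check each returned row in
Perl with character semantics. The following is only a sketch under
assumptions: true_matches() is a hypothetical helper, the rows are assumed
to be [ REC_ID, TEXT_FIELD_1 ] pairs of raw EUC-JP bytes, and Encode
(Perl 5.8+) is assumed to be available.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: keep only the rows whose text contains the search
# string when both are compared as EUC-JP characters, not as raw bytes.
sub true_matches {
    my ($search_bytes, @rows) = @_;       # each row: [ REC_ID, text bytes ]
    my $search = decode('euc-jp', $search_bytes);
    return grep { index(decode('euc-jp', $_->[1]), $search) >= 0 } @rows;
}

# Stand-in for rows that a byte-level LIKE would return.
my @candidates = (
    [ 1, "\xA5\xB9\xA5\xC8" ],   # two katakana: byte-level false match
    [ 2, "\xA5\xB9\xB9\xA5" ],   # katakana followed by the kanji: real match
);

my @hits = true_matches("\xB9\xA5", @candidates);
print join(",", map { $_->[0] } @hits), "\n";   # 2
```

In real use the candidate rows might come from something like
$dbh->selectall_arrayref($sql, undef, "%$search_str%") with a LIKE ?
placeholder, assuming the driver in question supports that. The database
still does the coarse filtering; Perl only discards the false positives.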

> The same problem occurs in Perl regular expressions when
> using EUC strings.
> 
> The problems with false matching in multibyte character sets
> are explained further (and probably much more clearly than in my
> explanation) in English at:
> http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch06_19.htm
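
For Perls without usable Unicode support, one byte-level approach is to
anchor the pattern at the start of the string and consume whole EUC-JP
characters before trying the search string, so that a match can only begin
on a character boundary. This is a sketch, not a definitive implementation:
euc_contains() and $euc_char are illustrative names, and the byte ranges
assume well-formed EUC-JP input.

```perl
use strict;
use warnings;

# One complete EUC-JP character, described at the byte level.
my $euc_char = qr/
      [\x00-\x7F]                      # ASCII
    | \x8E [\xA1-\xDF]                 # half-width katakana (SS2)
    | \x8F [\xA1-\xFE] [\xA1-\xFE]     # JIS X 0212 (SS3)
    | [\xA1-\xFE] [\xA1-\xFE]          # JIS X 0208 two-byte character
/x;

# Hypothetical helper: true only if $search starts on a character boundary.
sub euc_contains {
    my ($text, $search) = @_;
    return $text =~ /\A(?:$euc_char)*?\Q$search\E/ ? 1 : 0;
}

print euc_contains("\xA5\xB9\xA5\xC8", "\xB9\xA5"), "\n";   # 0
print euc_contains("\xA5\xB9\xB9\xA5", "\xB9\xA5"), "\n";   # 1
```

The non-greedy (?:$euc_char)*? steps through the text one character at a
time, which is why the match that straddles two katakana is rejected.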

I suspect you'll find a world of pain and problems unless you can use Unicode.
(Even then it may not be smooth sailing, but you'd find more people willing to help.)

Tim.
