Is there any support in the DBD package (or any workarounds) for
handling searches in MultiByte Character Set data?

The problem below has occurred with DBD::CSV and DBD::InterBase.
I have not had the resources to test other drivers (e.g. DBD::Oracle).

Using DBD, I am running a SELECT statement to find matches for Japanese
EUC strings in record fields that contain Japanese text.

For the most part it works OK, but there are also false matches.

In Perl (5.6) itself, regular expressions match byte by byte rather
than character by character.
The same seems to be true of the DBD packages mentioned above.

A little explanation:
Japanese has three character types
(1) Hiragana (around 80 characters)
(2) Katakana (around 80 characters, used for words of foreign origin)
(3) Kanji (thousands of characters, like pictograms)

Say that, in an EUC-encoded table "SAMPLE_TABLE", a record field
(for example field "TEXT_FIELD_1") holds a string which contains two
consecutive 2-byte Katakana characters
(Katakana Character 1 = \xA5\xB9, Katakana Character 2 = \xA5\xC8)

If I use a SELECT statement to find matches for the 2-byte Kanji
character "\xB9\xA5", it will match the above record,
i.e.

$search_str = "\xB9\xA5";

$sql_str = "SELECT REC_ID FROM SAMPLE_TABLE "
         . "WHERE TEXT_FIELD_1 LIKE '%$search_str%'";

It will find a Kanji match in the middle 2 bytes of the above Katakana
character string, i.e. \xA5(\xB9\xA5)\xC8
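
The false match can be reproduced in plain Perl without any database.
The following is only a minimal sketch: it uses the byte values from
the example above, and the variable names and the $euc_char pattern
(the usual EUC-JP byte ranges) are assumptions made for illustration.
It compares a byte-wise substring search with a comparison done on
whole characters.

```perl
use strict;
use warnings;

my $text   = "\xA5\xB9\xA5\xC8";   # the two Katakana characters above
my $search = "\xB9\xA5";           # the Kanji being searched for

# Byte-wise search, which is what a plain substring match does.
# The hit starts at byte offset 1, inside the first Katakana character.
my $byte_pos = index($text, $search);

# Split the string into whole EUC characters (assumed EUC-JP ranges),
# then compare character by character.
my $euc_char = qr/[\x00-\x7F]|[\x8E\xA1-\xFE][\xA1-\xFE]|\x8F[\xA1-\xFE][\xA1-\xFE]/;
my @chars     = $text =~ /\G($euc_char)/g;
my $char_hits = grep { $_ eq $search } @chars;

printf "byte offset of match: %d\n", $byte_pos;
printf "whole characters: %d, character-level hits: %d\n",
    scalar(@chars), $char_hits;
```

The byte-wise search reports a hit at offset 1, while the
character-level comparison sees two characters and no hit.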

The same problem occurs in Perl regular expressions when
using EUC strings.

The problems with false matching in multibyte character sets
are explained further (probably much more clearly than in my
explanation) in English at:
http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch06_19.htm

Below is Perl code for applying regular expressions to EUC strings

$ascii = '[\x00-\x7F]';
$twoBytes = '[\x8E\xA1-\xFE][\xA1-\xFE]';
$threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';

if ($str =~ /^(?:$ascii|$twoBytes|$threeBytes)*?(?:$pattern)/) {
  print "Found\n";
}
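
One possible workaround, offered only as a sketch, is to let the
database's LIKE return candidate rows and then discard the false
matches in Perl with the character-aligned pattern above. The helper
names euc_contains and filter_rows are hypothetical, and the rows here
are hard-coded stand-ins for what a call such as DBI's
selectall_arrayref would return as [REC_ID, TEXT_FIELD_1] pairs.

```perl
use strict;
use warnings;

my $ascii      = '[\x00-\x7F]';
my $twoBytes   = '[\x8E\xA1-\xFE][\xA1-\xFE]';
my $threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';

# True only if $needle occurs in $str starting on an EUC character
# boundary. \Q...\E treats the search string as literal bytes.
sub euc_contains {
    my ($str, $needle) = @_;
    return $str =~ /^(?:$ascii|$twoBytes|$threeBytes)*?\Q$needle\E/ ? 1 : 0;
}

# Hypothetical helper: keep only the rows whose text field really
# contains $needle. Each row is [REC_ID, TEXT_FIELD_1].
sub filter_rows {
    my ($rows, $needle) = @_;
    return [ grep { euc_contains($_->[1], $needle) } @$rows ];
}

# Stand-in data for the candidate rows returned by the LIKE query.
my $rows = [
    [1, "\xA5\xB9\xA5\xC8"],          # two Katakana: a false match
    [2, "\xA5\xB9\xB9\xA5\xA5\xC8"],  # Katakana, the Kanji, Katakana
];

my $kept = filter_rows($rows, "\xB9\xA5");
print "kept REC_ID: $_->[0]\n" for @$kept;
```

This only removes false positives after the fact. The database still
does the byte-wise search, so every candidate row is transferred to
the client before it is filtered.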


--------------------------------------------
Brian Sweeney
mail:brian<AT>ssl<DOT>fujitsu<DOT>com
