RE: MultiByte Character Sets and False Matches

Tim Johnson Mon, 28 Jun 2004 01:02:20 -0700

Try doing a search of the perldocs using
 
perldoc -q multibyte
 
and I think you will find what you need.


        -----Original Message----- 
        From: [EMAIL PROTECTED]
        Sent: Mon 6/28/2004 12:32 AM 
        To: [EMAIL PROTECTED] 
        Cc: 
        Subject: MultiByte Character Sets and False Matches
        
        

        Is there any support in the DBD package (or any workarounds) for
        handling searches in MultiByte Character Set data?
        
        The problem below has occurred with DBD::CSV and DBD::InterBase.
        I have not had the resources to test other packages (e.g. DBD::Oracle).
        
        Through DBD I am using a SELECT statement to find matches for Japanese
        EUC strings in Japanese record fields.
        
        For the most part it works OK, but there are also false matches.
        
        In Perl (5.6) itself, regular expressions match byte by byte
        rather than character by character.
        The same seems to be true of the DBD packages mentioned above.
        
        A little explanation:
        Japanese has three character types:
        (1) Hiragana (around 80 characters)
        (2) Katakana (around 80 characters, used for foreign-based words)
        (3) Kanji (thousands of characters, like pictograms)
        
        Say that in an EUC-encoded table "SAMPLE_TABLE", a record field
        (for example field "TEXT_FIELD_1") holds a string which contains two
        consecutive 2-byte Katakana characters
        (Katakana character 1 = \xA5\xB9, Katakana character 2 = \xA5\xC8).
        
        If I use a SELECT statement to find matches for the 2-byte Kanji
        character "\xB9\xA5", it will match the above record,
        i.e.
        
        $search_str = "\xB9\xA5";
        
        $sql_str = "SELECT REC_ID FROM SAMPLE_TABLE WHERE TEXT_FIELD_1 LIKE '%$search_str%'";
        
        It will find a Kanji match in the middle 2 bytes of the above Katakana
        character string, i.e. \xA5(\xB9\xA5)\xC8
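
The false match described above can be reproduced at the byte level in plain Perl, without a database. This is only a minimal sketch: the variable names are made up for illustration, and the even-offset check is valid only because this sample string consists entirely of 2-byte characters.

```perl
use strict;
use warnings;

# Raw EUC bytes for the two Katakana characters from the example above.
my $text   = "\xA5\xB9\xA5\xC8";
# The 2-byte sequence being searched for.
my $needle = "\xB9\xA5";

# index() compares bytes, so it finds the needle starting at the
# second byte of the first 2-byte character.
my $pos = index($text, $needle);
print "byte offset of match: $pos\n";

# In a string made only of 2-byte characters, a genuine character
# match can only start at an even byte offset.
print "false match (starts mid-character)\n" if $pos >= 0 && $pos % 2;
```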
        
        The same problem occurs in Perl regular expressions when
        using EUC strings.
        
        The problems with false matching in MultiByte Character Sets
        are explained further (probably much more clearly than in my
        explanation) in English at:
        http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch06_19.htm
        
        Below is Perl code for handling regular expressions on EUC strings:
        
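        # One EUC character is ASCII, a 2-byte sequence, or a 3-byte sequence.
        $ascii = '[\x00-\x7F]';
        $twoBytes = '[\x8E\xA1-\xFE][\xA1-\xFE]';
        $threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';
        
        # Skip whole characters from the start of $str, so that $pattern
        # (the search regex) can only match on a character boundary.
        if ($str =~ /^(?:$ascii|$twoBytes|$threeBytes)*?(?:$pattern)/) {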
          print "Found\n";
        }
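
To see the anchored pattern above reject the false match, it can be wrapped in a small function and run against the Katakana example from earlier in this message. This is a sketch under stated assumptions: the helper name euc_contains is hypothetical, the search string is treated as literal bytes (hence quotemeta), and the input is assumed to be valid EUC.

```perl
use strict;
use warnings;

# One EUC character: ASCII, a 2-byte sequence, or a 3-byte sequence
# (the same classes as in the snippet above).
my $ascii      = '[\x00-\x7F]';
my $twoBytes   = '[\x8E\xA1-\xFE][\xA1-\xFE]';
my $threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';

# Hypothetical helper: returns 1 if the literal bytes in $literal occur
# in $str starting on a character boundary, otherwise 0.
sub euc_contains {
    my ($str, $literal) = @_;
    my $pattern = quotemeta $literal;   # treat the search bytes literally
    return $str =~ /^(?:$ascii|$twoBytes|$threeBytes)*?(?:$pattern)/ ? 1 : 0;
}

my $katakana = "\xA5\xB9\xA5\xC8";      # the two Katakana characters

# Bytes that straddle the two characters.
print euc_contains($katakana, "\xB9\xA5") ? "Found\n" : "Not found\n";
# The second Katakana character itself.
print euc_contains($katakana, "\xA5\xC8") ? "Found\n" : "Not found\n";
```

One possible workaround on the database side, offered only as a suggestion, is to keep the LIKE query as a coarse filter, select TEXT_FIELD_1 along with REC_ID, and apply a check like this to each fetched row in Perl.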
        
        
        --------------------------------------------
        Brian Sweeney
        mail:brian<AT>ssl<DOT>fujitsu<DOT>com
