Re: Converting string to UTF-16LE

Sebastian Lehmann Mon, 01 Mar 2004 06:29:02 -0800

Hello Nick,

thanks a lot for your answer. When I ran your script (with the 'Ñ' in
$sLine), the script works great. Motivated by this "victory" I modified my
search script. The results were very strange.


Using lc instead of uc works. Using uc only works if I place the uc call
directly before the comparison. If the uc call is located before or shortly
after the first open command, it returns the wrong result. I have no idea
what the reason for this is.

Now I use lc and hope that this method will produce no bad results...

Confused but happy,

Sebastian.


"Nick Ing-Simmons" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
Sebastian Lehmann <[EMAIL PROTECTED]> writes:
>Hello,
>
>I use a perl script to search different files. The search values are given
>from an HTML page, and the results are displayed on this page, too. The
>files are saved in the UTF-16LE format, therefore I open them with the
>following open command:
>
>    open(F, "<:raw:encoding(UTF-16LE)", $file) || die "Cannot read $file: $!\n";
>
>This works fine and the data is read correctly after opening.
>
>The search value is specified in the HTML page, the URL with the value will
>look like the following:
>
>    http://10.0.5.62/search.pl?value=73,98,97,241,101,122
>
>The numbers are the charcodes of the search value and will be formed back
>to a string variable in the perl script:
>
>    sub decodeString {
>        my $sInput = shift;
>        my $sOutput = "";
>        # Each comma-separated number is one character code.
>        my @arrChars = split(/,/, $sInput);
>        foreach ( @arrChars )
>        {
>            my $iCharCode = ($_)*1;    # force numeric context
>            $sOutput .= chr($iCharCode);
>        }
>        return $sOutput;
>    }
>
>For this example the search value will be "Ibañez". Because the search
>isn't case-sensitive, all letters should be uppercased using uc. But uc
>returns different strings for the search value and for the line read from
>the UTF-16LE file:
>
>    $sValue = uc($sValue);        # $sValue is IBAñEZ after uc
>    $sLine = uc($sLine);          # $sLine is IBAÑEZ after uc
>
>So the search will not find the search value, although it should!
So (as mail tends to mangle this stuff too) the issue is that

uc(chr(241)) ne 'Ñ' ?  (Upper case N with ~)?

This would seem to be a problem with the uc function.
Which perl version are you using?
Which locale are you in?

>
>I played a lot with the decode and encode method, but with no success.

You should not really need that with perl5.8 - get the UTF16-LE into
perl's internal form then just work on characters.

Does pattern match style work? ($sLine =~ /$sValue/i)

>Either the return string isn't valid or the uc method's result is the same.
>
>Can anybody tell me how to work with UTF8 and UTF16 in the same script?

The way this is meant to work is that everything gets converted into perl's
internal form (which happens to be UTF-8 in perl5, but that is none of the
user's business) and then you work in characters.
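
For illustration, a minimal sketch of that flow: decode the UTF-16LE file
on input, then compare characters. The helper names write_sample and
file_contains are made up for this example, and the temporary file only
stands in for the real data files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary UTF-16LE file and
# return its name (a stand-in for the real data files).
sub write_sample {
    my ($text) = @_;
    my ($fh, $file) = tempfile(UNLINK => 1);
    binmode($fh, ":raw:encoding(UTF-16LE)");
    print {$fh} $text;
    close($fh) or die "Cannot close $file: $!\n";
    return $file;
}

# Hypothetical helper: true if any line of $file contains $needle,
# ignoring case. Both sides are lowercased as characters.
sub file_contains {
    my ($file, $needle) = @_;
    utf8::upgrade($needle);    # so lc() treats it as characters, not bytes
    my $want = lc($needle);
    open(my $in, "<:raw:encoding(UTF-16LE)", $file)
        or die "Cannot read $file: $!\n";
    my $found = 0;
    while (my $line = <$in>) {
        # Lines arrive already decoded into perl's internal form.
        if (index(lc($line), $want) >= 0) { $found = 1; last; }
    }
    close($in);
    return $found;
}

my $sample = write_sample("Guitars: IBA\x{D1}EZ\n");
print file_contains($sample, "Iba\x{F1}ez") ? "found\n" : "not found\n";
```

Nothing here calls encode or decode by hand; the I/O layer does the
conversion and the comparison only ever sees characters.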

So what you have _should_ work - but doesn't.
(Attached is above converted to a script which fails.)

In the old latin1 world it was better to lc both things - and
that does seem to work here too.
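
One likely reason, stated as an assumption rather than a confirmed
diagnosis: chr(241) produces a string that is not stored in the internal
UTF-8 form, and on such strings perl 5.8 applies only ASCII case rules
unless a locale is in effect, so uc leaves the ñ alone. lc still works
because the ñ in the search value is already lower case. A minimal sketch
of a workaround using utf8::upgrade follows; uc_chars is a hypothetical
helper name, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase a string using character semantics,
# regardless of how perl happens to store it internally.
sub uc_chars {
    my ($s) = @_;
    utf8::upgrade($s);    # change the storage to UTF-8; the characters stay the same
    return uc($s);
}

# "Ibañez" built from character codes, as decodeString does.
my $sValue = join '', map { chr($_) } 73, 98, 97, 241, 101, 122;

# Stands in for a line read through :encoding(UTF-16LE).
my $sLine = "IBA\x{D1}EZ";
utf8::upgrade($sLine);

binmode(STDOUT, ":utf8");
print uc_chars($sValue) eq uc_chars($sLine) ? "match\n" : "no match\n";
```

With both sides upgraded before uc, the comparison no longer depends on
where in the script the uc call happens to sit.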

lib/unicore/CaseFolding.txt has

00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE

UnicodeData.txt has

00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;
00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1

But what does that mean?
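
Reading those records field by field may help. In UnicodeData.txt the
fields are separated by semicolons; counting from zero, fields 12, 13 and
14 hold the simple uppercase, lowercase and titlecase mappings. So U+00D1
lowercases to U+00F1 and U+00F1 uppercases to U+00D1, and the C line in
CaseFolding.txt says U+00D1 folds to U+00F1. In other words the tables look
right, which suggests the problem lies in how uc is applied rather than in
the data. A small sketch that pulls the mappings out of the two records
(case_mappings is a hypothetical helper name):

```perl
use strict;
use warnings;

# Hypothetical helper: extract the simple case mappings from one
# UnicodeData.txt record. Field numbers count from zero.
sub case_mappings {
    my ($record) = @_;
    my @f = split /;/, $record, -1;    # -1 keeps trailing empty fields
    return {
        code  => $f[0],
        name  => $f[1],
        upper => $f[12],
        lower => $f[13],
        title => $f[14],
    };
}

my @records = (
    "00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;",
    "00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1",
);

for my $record (@records) {
    my $m = case_mappings($record);
    # An empty field means the character has no mapping of that kind.
    my @shown = map { length($_) ? "U+$_" : "(none)" } @{$m}{qw(upper lower title)};
    printf "U+%s  upper=%s  lower=%s  title=%s\n", $m->{code}, @shown;
}
```

For U+00D1 this reports only a lowercase mapping (U+00F1), and for U+00F1
an uppercase and titlecase mapping (both U+00D1).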



>Any
>help would be greatly appreciated.
>
>Thanks in advance,
>
>Sebastian




--------------------------------------------------------------------------------





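sub decodeString {
    my $sInput = shift;
    my $sOutput = "";
    my @arrChars = split(/,/, $sInput);
    foreach ( @arrChars )
    {
        my $iCharCode = ($_)*1;
        $sOutput .= chr($iCharCode);
    }
    return $sOutput;
}

# "IBAÑEZ", written with an escape so that mail cannot mangle it.
my $sLine = "IBA\x{D1}EZ";
# Appending a character above 0xFF and chopping it off again leaves the
# text unchanged but stored in perl's internal UTF-8 form, like a line
# read through :encoding(UTF-16LE).
$sLine .= chr(0x100);
chop($sLine);

my $sValue = decodeString("73,98,97,241,101,122");

binmode(STDOUT,":utf8");

# 1. Case-insensitive pattern match on the unmodified strings.
my $match = ($sLine =~ /$sValue/i) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";

# 2. Uppercase both sides, then match exactly.
$sLine = uc($sLine);
$sValue = uc($sValue);
$match = ($sLine =~ /$sValue/) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";

# 3. Lowercase both sides, then match exactly.
$sLine = lc($sLine);
$sValue = lc($sValue);
$match = ($sLine =~ /$sValue/) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";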
