Sebastian Lehmann <[EMAIL PROTECTED]> writes:
>Hello,
>
>I use a perl script to search different files. The search values are given
>from a HTML page, the results are displayed on this page, too. The files are
>saved in the UTF16LE format, therefore I open them with the following
>open command:
>
>  open(F, "<:raw:encoding(UTF-16LE)", $file) || die "Cannot read $file: $!\n";
>
>This works fine and the data is read correctly after opening.
>
>The search value is specified in the HTML page, the URL with the value will
>look like the following:
>
>  http://10.0.5.62/search.pl?value=73,98,97,241,101,122
>
>The numbers are the charcodes of the search value and will be formed back to
>a string var in the perl script:
>
>  sub decodeString {
>    my $sInput = shift;
>    my $sOutput = "";
>    my @arrChars = split(/,/, $sInput);
>    foreach ( @arrChars )
>    {
>      $iCharCode = ($_)*1;
>      $sOutput .= chr($iCharCode);
>    }
>    return $sOutput;
>  }
>
>For this example the search value will be "Ibañez". Because the search
>isn't case-sensitive, all letters should be uppercased, using the uc method.
>But uc will return different strings for the search value and for the line
>read from the UTF16-LE file:
>
>  $sValue = uc($sValue); # $sValue is IBAñEZ after uc
>  $sLine = uc($sLine);   # $sLine is IBAÑEZ after uc
>
>So the search will not find the search value although it should do so!

So (as mail tends to mangle this stuff too) the issue is that
uc(chr(241)) ne 'Ñ' ? (Upper case N with ~)?

This would seem to be a problem with the uc function.
Which perl version are you using?
Which locale are you in?

>
>I played a lot with the decode and encode method, but with no success.

You should not really need that with perl5.8 - get the UTF16-LE into perl's
internal form then just work on characters.

Does pattern match style work?

   ($sLine =~ /$sValue/i)

>Either the return string isn't valid or the uc method's result is the same.
>
>Can anybody tell me how to work with UTF8 and UTF16 in the same script?

The way this is meant to work is that everything gets converted into perl's
internal form (which happens to be UTF-8 in perl5, but that is none of the
user's business), and you then work in characters.

So what you have _should_ work - but doesn't.
(Attached is the above converted to a script which fails.)

In the old latin1 world it was better to lc both things - and that does seem
to work here too.

lib/unicore/CaseFolding.txt has

00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE

UnicodeData.txt has

00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;
00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1

But what does that mean?

>Any
>help would be greatly appreciated.
>
>Thanks in advance,
>
>Sebastian
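On the "what does that mean?" question: as far as I know, the last three fields of a UnicodeData.txt record (indexes 12, 13 and 14, counting from 0) are the simple uppercase, lowercase and titlecase mappings. Here is a small sketch that pulls those fields out of the two records quoted above. The helper name parse_record and the hash keys are made up for illustration and are not part of perl or of the Unicode data files.

```perl
use strict;
use warnings;

# The two UnicodeData.txt records quoted in the message, verbatim.
my @records = (
    '00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;',
    '00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1',
);

# Hypothetical helper: return the fields relevant to case mapping.
sub parse_record {
    my ($line) = @_;
    # The -1 limit keeps trailing empty fields, so the indexes stay aligned.
    my @f = split(/;/, $line, -1);
    return {
        code  => $f[0],
        name  => $f[1],
        upper => $f[12],    # simple uppercase mapping
        lower => $f[13],    # simple lowercase mapping
        title => $f[14],    # simple titlecase mapping
    };
}

# Show an empty mapping field as "(none)".
sub show {
    my ($hex) = @_;
    return length($hex) ? "U+$hex" : '(none)';
}

my %by_code;
for my $line (@records) {
    my $r = parse_record($line);
    $by_code{ $r->{code} } = $r;
}

for my $code (sort keys %by_code) {
    my $r = $by_code{$code};
    printf("U+%s %s: upper=%s lower=%s title=%s\n",
        $code, $r->{name},
        show($r->{upper}), show($r->{lower}), show($r->{title}));
}
```

Read that way, the data itself says U+00F1 uppercases to U+00D1 and U+00D1 lowercases to U+00F1, so the tables look right and the problem is more likely in how uc treats the string it is given.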
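sub decodeString {
    my $sInput   = shift;
    my $sOutput  = "";
    my @arrChars = split(/,/, $sInput);
    foreach ( @arrChars ) {
        my $iCharCode = ($_)*1;        # force numeric context
        $sOutput .= chr($iCharCode);
    }
    return $sOutput;
}

# "IBAÑEZ", written with an escape so the encoding of this file does not matter
my $sLine = "IBA\x{D1}EZ";
# Appending and then removing a character above 0xFF forces perl's internal
# UTF-8 form, mimicking a line read through :encoding(UTF-16LE)
$sLine .= chr(0x100);
chop($sLine);
my $sValue = decodeString("73,98,97,241,101,122");   # "Ibañez"

binmode(STDOUT,":utf8");

# 1. Case-insensitive pattern match on the untouched strings
my $match = ($sLine =~ /$sValue/i) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";

# 2. uc both sides, then match case-sensitively
$sLine  = uc($sLine);
$sValue = uc($sValue);
$match  = ($sLine =~ /$sValue/) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";

# 3. lc both sides, then match case-sensitively
$sLine  = lc($sLine);
$sValue = lc($sValue);
$match  = ($sLine =~ /$sValue/) ? 'Yes' : 'No';
print "$sLine/$sValue $match\n";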
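One thing that might be worth trying, as a sketch rather than a confirmed fix: the script above gets $sLine into perl's internal UTF-8 form, while $sValue is built from chr() calls below 256 and so stays in the byte form. utf8::upgrade() (a core function, available in perl5.8 without loading any module) switches a string to the internal UTF-8 form without changing its characters, and uc should then apply the Unicode case mappings. This assumes neither "use locale" nor "use bytes" is in effect. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Build "Ibañez" from character codes, the same way decodeString() does.
my $value = '';
$value .= chr($_) for (73, 98, 97, 241, 101, 122);

# uc on the string as built; kept only for comparison with the result below.
my $before = uc($value);

# Switch the internal representation to UTF-8. The characters are unchanged.
utf8::upgrade($value);

# uc on the upgraded string.
my $upper = uc($value);

binmode(STDOUT, ':utf8');
print "before upgrade: $before\n";
print "after upgrade:  $upper\n";
```

If the second line shows the capital N with tilde and the first does not, that would point at uc taking a different code path for byte-form strings, not at the case tables.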