Re: [OSM-dev] perl and special utf-8 characters

Robert Joop Sat, 19 Mar 2011 01:00:38 -0700

On 11-03-11 15:11:50 CET, Gary68 wrote:
> hi,
> 
> i want to find out if certain characters (german umlaute) are contained
> in a string that i work char by char.
> 
>       my $text = "abc äöü" ;
>       my $out = "" ;
>       @chars = split //, $text ;
> 
>       foreach my $c (@chars) {
>               # here a condition is needed ! 
>               if ( $c eq <umlaut> ) {
>                       $out .= $c ;
>               }
>       }


You have to tell perl the encoding of your script.
(That's because you use non-ASCII string literals in your script.)
If your script is encoded in UTF-8, write "use utf8;".
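
A rough sketch of what the pragma changes, assuming the file is saved as
UTF-8 (the variable names are only for illustration). "use utf8" is
lexically scoped, so both cases fit in one script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
# Without the pragma the literal is two separate bytes (0xC3 0xA4).
my $bytes = do { no utf8; "ä" };
# With the pragma the same literal is one character (U+00E4).
my $chars = do { use utf8; "ä" };

printf "without utf8: length %d\n", length $bytes;
printf "with utf8:    length %d\n", length $chars;
```

This should report a length of 2 without the pragma and 1 with it, which
is why a char-by-char comparison against an umlaut fails without it.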

> unfortunately the umlaute are represented as two bytes - or whatever is
> the correct term here.

That's an indication that you've got UTF-8: perl is treating the string
as bytes, and each umlaut takes two bytes in UTF-8.

> is there someone who could spend 3 lines of code. probably some encode
> and decode is needed...

You need to decode when you need to turn bytes into characters, e.g.
when you read bytes from a CGI parameter or from a file.
You need to encode when you need to turn characters into bytes, e.g.
when you print or write to a file, as one can never really know what
perl's current internal representation is.
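
A minimal sketch of that round trip, using decode and encode from the
Encode module. The input bytes are hard-coded here as a stand-in for
whatever would really come from a file or a CGI parameter, and the
variable names are made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for external input: "abc " followed by the two UTF-8 bytes
# of a-umlaut (0xC3 0xA4).
my $input_bytes = "abc \xC3\xA4";

# Bytes -> characters on the way in.
my $text = decode('UTF-8', $input_bytes);
printf "bytes: %d, characters: %d\n", length $input_bytes, length $text;

# Characters -> bytes on the way out.
my $output_bytes = encode('UTF-8', $text);
print $output_bytes, "\n";
```

The byte string is 6 long and the decoded string is 5 long, since the
umlaut becomes a single character once decoded.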

:r /tmp/g
#!/usr/bin/perl

use utf8;
use Encode;

my $text = "abc äöü" ;
my $out = "" ;
my @chars = split //, $text ;

foreach my $c (@chars) {
        # keep only the character we are looking for
        if ($c eq 'ä') {
                $out .= $c ;
        }
}
print "out='$out'\n";
print encode ('UTF-8', "out='$out'\n");
__END__

:r !perl /tmp/g
out='?'
out='ä'

The first line shows the characters written out in whatever internal
representation perl happened to use (latin1 here); the second is the
UTF-8 bytes of the external representation.
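
To match all the umlauts rather than just 'ä', a character class is one
way to go. The following is only a sketch: keep_umlauts is a made-up
name, and it assumes the script is saved as UTF-8 and that output should
be UTF-8 as well. It uses an encoding layer on STDOUT instead of calling
encode by hand:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the script itself is UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output

# Hypothetical helper: walk the string char by char and keep only the
# German umlauts.
sub keep_umlauts {
    my ($text) = @_;
    my $out = '';
    foreach my $c (split //, $text) {
        $out .= $c if $c =~ /[äöüÄÖÜ]/;
    }
    return $out;
}

print "out='", keep_umlauts("abc äöü"), "'\n";
```

With the layer set on STDOUT, a plain print is enough and the line comes
out as out='äöü'.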

rj

_______________________________________________
dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/dev
