Re: UTF-8 to ASCII

Chas. Owens Sat, 11 Apr 2009 12:46:55 -0700

On Sat, Apr 11, 2009 at 14:11, Kelly Jones <kelly.terry.jo...@gmail.com> wrote:
> I'm trying to convert UTF-8 to ASCII in Perl. Is there an easy way to
> do this?
>
> I tried Unicode::UTF8simple, but ended up w/ many ctrl-a's, which
> can't be right.
>
> I'm going for an extremely complete transliteration, so I want ETH
> (for example) to be converted to both "d" and "dh". In other words, my
> input is ONE string, but my return value is a LIST of strings.
>
> My goal: create an ASCII version of geonames' alternateNames table.


Hmm, I don't know of any functions off the top of my head that do that
sort of thing.  You might try searching CPAN[1].  If you don't find
anything you like, I would start by building a table like

my %utf8_to_ascii = (
    "\N{LATIN SMALL LETTER ETH}" => [ qw/ d dh / ],
);

Note that to write "\N{LATIN SMALL LETTER ETH}" instead of "\x{F0}" you
will need to load the charnames[2] pragma (use charnames ':full';).  You
could then break the string into individual characters and create a
list of all possible outcomes:

#!/usr/bin/perl

use strict;
use warnings;

my %map = (
        a => [ qw/ aa ab / ],
        e => ['y'],
);

for my $word (qw/ bad bed base /) {
        print "$word =>\n",
                map "\t$_\n", expand(romanize($word, \%map));
}


#produce a compact representation of the possible strings
sub romanize {
        my ($word, $map) = @_;
        my @string;

        for my $char (split //, $word) {
                my @chars = $map->{$char} ? @{$map->{$char}} : ($char);
                push @string, \@chars;
        }
        return @string;
}

#expand the compact representation into all possible strings
sub expand {
        my @string = @_;
        my @result;

        return @{$string[0]} if @string == 1;

        for my $char (@{$string[0]}) {
                for my $string (expand(@string[1 .. $#string])) {
                        push @result, "$char$string";
                }
        }
        
        return @result;
}
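
To tie the two pieces together, here is a rough sketch that feeds a
UTF-8 byte string through a named-character table like the one above.
Treat it as an illustration under assumptions rather than a finished
solution: the table entries (including the THORN row) and the name
all_spellings are made up for the example, and it builds the list
iteratively instead of using the recursive expand shown above.  It also
assumes the input arrives as raw UTF-8 bytes, so it decodes them with
Encode first.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use charnames ':full';    # lets us write \N{...} names in strings
use Encode qw(decode);

# Illustrative table only; a real one would need many more entries.
my %utf8_to_ascii = (
    "\N{LATIN SMALL LETTER ETH}"   => [ qw/ d dh / ],
    "\N{LATIN SMALL LETTER THORN}" => [ qw/ th / ],
);

# Hypothetical helper: return every ASCII spelling of a decoded string.
# Characters missing from the table are passed through unchanged.
sub all_spellings {
    my ($word, $map) = @_;
    my @results = ('');

    for my $char (split //, $word) {
        my @alts = $map->{$char} ? @{ $map->{$char} } : ($char);
        my @next;
        # extend every spelling built so far with each alternative
        for my $prefix (@results) {
            push @next, map { $prefix . $_ } @alts;
        }
        @results = @next;
    }
    return @results;
}

# "bor" followed by ETH, given as raw UTF-8 bytes (ETH is C3 B0)
my $bytes = "bor\xC3\xB0";
my $word  = decode('UTF-8', $bytes);

# prints "bord" and then "bordh"
print "$_\n" for all_spellings($word, \%utf8_to_ascii);
```

The number of results is the product of the alternatives for each
character, so names with many mapped characters can produce long
lists.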


1. http://search.cpan.org/
2. http://perldoc.perl.org/charnames.html

-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.
