On 08/16/2012 10:18 AM, Pawel Krol wrote:
Hello!
I would like to ask you for a little assistance with the following issue...
There are websites that contain special German characters (known as
"umlauts") in their URLs, for example:
http://www.pbb-planungsbüro-bartsch.de
I have been trying, without success, to retrieve the contents of such
websites using Perl (the purpose is basically to check whether the URL
is valid or invalid; maybe there's a simpler way to do it?).
Here's a code snippet that you can try out immediately:
#!/opt/local/bin/perl
use Data::Dumper;
use LWP::UserAgent;
my $url = q{http://www.pbb-planungsbüro-bartsch.de};
# my $url = q{http://www.pbb-planungsb%C3%BCro-bartsch.de/};
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
warn Dumper $response;
__END__
The critical piece of information you need is "IRI" (Internationalized
Resource Identifier). HTTP requires that the host part of the URI be a
valid Domain Name System host name, which is limited to the characters
A-Z, a-z, 0-9, and hyphen (see RFC 1035, Sec 2.3.1).
(Curiously, RFC 2181 states rather emphatically that any octet stream
must be acceptable as a resource label ...)
Anyway, there is a well-accepted technique for mapping strings like
güero to a name that fits the "LDH" (letters, digits, hyphen) rule:
it's called "internationalized domain name" (IDN), and it encodes each
non-ASCII label in Punycode with an "xn--" prefix.
Read all about it in
http://search.cpan.org/~cfaerber/Net-IDN-Encode-2.003/lib/Net/IDN/Standards.pod
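As a rough sketch of how that could look in your script (assuming Net::IDN::Encode is installed from CPAN; idn_url is a hypothetical helper name of mine, and its regex only handles simple http/https URLs without user info), you could convert the host to its ASCII form before handing the URL to LWP:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains a literal "ü"
use Net::IDN::Encode qw(domain_to_ascii);
use LWP::UserAgent;

# Hypothetical helper: rewrite the host part of a simple http(s) URL
# into its ASCII (IDNA) form, leaving scheme, port, and path untouched.
sub idn_url {
    my ($url) = @_;
    my ($scheme, $host, $rest) = $url =~ m{^(https?://)([^/:?#]+)(.*)$}s
        or die "Unsupported URL: $url\n";
    return $scheme . domain_to_ascii($host) . $rest;
}

my $url = idn_url('http://www.pbb-planungsbüro-bartsch.de');
print "Requesting $url\n";

# A HEAD request is enough to see whether the URL answers at all.
my $ua       = LWP::UserAgent->new(timeout => 10);
my $response = $ua->head($url);
print $response->is_success
    ? "OK\n"
    : "Failed: " . $response->status_line . "\n";
```

Since you only want to know whether the URL is valid, I used head() instead of get() here so the body is not downloaded; some servers do not answer HEAD properly, so falling back to get() may still be needed.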