On 08/16/2012 10:18 AM, Pawel Krol wrote:
Hello!

I would like to ask you for a little assistance with the following issue...

There are websites, which contain special German characters in their
URLs, for example: http://www.pbb-planungsbüro-bartsch.de (known as
"umlauts").

I have been unsuccessfully trying to retrieve contents of such
websites using Perl (basically the purpose of it is to check, whether
the URL is valid/invalid - maybe there's a simpler way to do it?).

Here's a code snippet, which you can try out immediately:

#!/opt/local/bin/perl

use Data::Dumper;
use LWP::UserAgent;

my $url = q{http://www.pbb-planungsbüro-bartsch.de};
# my $url = q{http://www.pbb-planungsb%C3%BCro-bartsch.de/};

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);

warn Dumper $response;

__END__

The critical term you need is "IRI" (Internationalized Resource Identifier). HTTP requires that the host part of the URI be a valid Domain Name System identifier, which is limited to the characters A-Z, a-z, 0-9, and hyphen (see RFC 1035, Sec. 2.3.1). (Curiously, RFC 2181 states rather emphatically that any octet string must be acceptable as a label ...)

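Anyway, there is a well-accepted technique for mapping strings like güero to a name that fits the "LDH" (letters, digits, hyphen) rule: it's called "internationalized domain name" (IDN). The encoded form is plain ASCII and each converted label starts with "xn--". Percent-encoding the host, as in your commented-out line, does not produce such a name.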

Read all about it in http://search.cpan.org/~cfaerber/Net-IDN-Encode-2.003/lib/Net/IDN/Standards.pod
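
As a rough sketch of how that could look in your script: the snippet below uses domain_to_ascii from Net::IDN::Encode (the distribution linked above) to convert only the host part of the URL before it is handed to LWP. The helper names idn_url and url_is_reachable are made up for this example, and the regex that splits the URL is a simplification that assumes a plain scheme://host/... form without userinfo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                   # the source contains a literal "ü"
use Net::IDN::Encode qw(domain_to_ascii);

# Hypothetical helper: return the URL with its host converted to the
# ASCII ("xn--...") form. Assumes a simple scheme://host[:port][/path] URL.
sub idn_url {
    my ($url) = @_;
    my ($scheme, $host, $rest) =
        $url =~ m{^([a-z][a-z0-9+.-]*://)([^/:?#]+)(.*)$}is
        or return $url;                     # not a form we understand: leave as-is
    return $scheme . domain_to_ascii($host) . $rest;
}

# Hypothetical helper: true if a GET on the converted URL succeeds.
sub url_is_reachable {
    my ($url) = @_;
    require LWP::UserAgent;                 # loaded only when a request is made
    my $response = LWP::UserAgent->new->get(idn_url($url));
    return $response->is_success;
}

my $url = q{http://www.pbb-planungsbüro-bartsch.de};
print idn_url($url), "\n";                  # host is now letters, digits, hyphens
```

Calling url_is_reachable($url) would then do the actual check. If you only care whether the URL is valid, a HEAD request via $ua->head(...) might be enough instead of fetching the whole page.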
