Re: Requesting URLs containing German umlauts

Lawrence Statton Thu, 16 Aug 2012 08:34:15 -0700

On 08/16/2012 10:18 AM, Pawel Krol wrote:

Hello!


I would like to ask you for a little assistance with the following issue...

There are websites, which contain special German characters in their
URLs, for example: http://www.pbb-planungsbüro-bartsch.de (known as
"umlauts").

I have been unsuccessfully trying to retrieve contents of such
websites using Perl (basically the purpose of it is to check, whether
the URL is valid/invalid - maybe there's a simpler way to do it?).

Here's a code snippet, which you can try out immediately:

#!/opt/local/bin/perl

use Data::Dumper;
use LWP::UserAgent;

my $url = q{http://www.pbb-planungsbüro-bartsch.de};
# my $url = q{http://www.pbb-planungsb%C3%BCro-bartsch.de/};

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);

warn Dumper $response;

__END__

The critical piece of data you need is "IRI" - Internationalized Resource Identifier. HTTP the protocol requires that the HOST part of the URI be a valid Domain Name System identifier, which is limited to the characters A-Z, a-z, 0-9, and hyphen (see RFC 1035, Sec 2.3.1). (Curiously, RFC 2181 states rather emphatically that any octet stream must be acceptable as a resource label ...)

Anyway -- there is a well accepted technique for mapping strings like güero to a name that fits into the "LDH" (letters, digits, hyphen) rule - and it's called "internationalized domain name".

Read all about it in http://search.cpan.org/~cfaerber/Net-IDN-Encode-2.003/lib/Net/IDN/Standards.pod
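As a rough sketch of that mapping, assuming the Net::IDN::Encode module from CPAN is installed (it is not a core module) and the script is saved as UTF-8: its domain_to_ascii function turns the host name into its LDH form, and that is what you would then put into the URL. The variable names are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the script itself contains a UTF-8 literal (assumes the file is saved as UTF-8)

use Net::IDN::Encode qw(domain_to_ascii);    # CPAN module, assumed to be installed

# Host name exactly as a person would type it, umlaut included.
my $host = 'www.pbb-planungsbüro-bartsch.de';

# Convert each label to its ASCII-compatible ("xn--...") form.
# Labels that are already plain LDH, such as "www" and "de", pass through unchanged.
my $ascii_host = domain_to_ascii($host);

# Build the URL from the converted host; it now contains only ASCII.
my $url = "http://$ascii_host/";
print "$url\n";
```

The resulting $url should then be something you can hand to $ua->get($url) in the original snippet. Note that percent-encoding the umlaut (the commented-out %C3%BC variant) is a different mechanism meant for the path and query parts, not for the host.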
