On Wed, Oct 21, 2009 at 19:35, Hildegard Schedthelm <hilde.sch...@yahoo.de> wrote:
> Hello folks
>
> I'm having some trouble with the Perl script that you can see below.
> The problem is that some German special characters (umlauts) are not
> displayed as they should be. This seems to be an encoding issue. Either
> the internal Perl variables have the wrong encoding, or the LWP module
> does when grabbing the HTML?
> Additionally, the output that writes the data into the MS Access DB could
> also have the incorrect encoding. How can we resolve this uncertainty?
> What can I do to ensure the right encoding at all levels?
A good start would be to use the $response->decoded_content method
instead of $response->content. That gives you characters to work with
instead of bytes.

--Gisle

> Here comes the script:
>
> #!C:\Program Files\Perl\bin\perl.exe -w
>
> use strict;
> use LWP::UserAgent;
> use Win32::ODBC;
>
>
> my $db = new Win32::ODBC('PerlRes') ;
>
> my($inhalt, $detail, @compInfo, $datum, $headline, $company, $message,
>    $content, $ua, $request, $response, $ua2, $request2, $response2);
>
> for(my $i = 1; $i < 2; $i++) {
>
>     $ua = LWP::UserAgent->new();
>     $request = HTTP::Request->new('GET',
>         "http://www.dgap.de/dgap/static/News/?newsType=ADHOC&page=" . $i . "&limit=20");
>     $request->header('Content-Type' => 'text/html; charset=iso-8859-1');
>     $response = $ua->request($request);
>     $inhalt = $response->content;
>
>     while($inhalt =~ /alt="DGAP-Ad-hoc" \/>\s+?<\/td>\s+?<td class="content_text">\s+?<a href="(.+)">\s+?<strong>/g) {
>
>         $ua2 = LWP::UserAgent->new();
>         $request2 = HTTP::Request->new('GET', $1);
>         $request2->header('Content-Type' => 'text/html; charset=iso-8859-1');
>         $response2 = $ua->request($request2);
>         $detail = $response2->content;
>
>         if($detail =~ /news_content ">\s+?<h2 class="darkblue">\s+?(.+)\s+?<\/h2>/) {
>             $datum = $1;
>             $datum =~ s/\s*//;
>             $datum =~ s/\s+?$//g;
>         }
>
>         if($detail =~ /<h2 class="darkblue">\s+?.+?\s+?<\/h2>\s+?<div>\s+?<h1>(.+)<\/h1>/) {
>             $headline = $1;
>             $headline =~ s/;/|/g;
>             $headline =~ s/\n//g;
>         }
>
>         if($detail =~ /<div class="newsDetail_body_pre"><pre>\s+?<b>(.+)<\/b>/) {
>             @compInfo = split("/",$1);
>             $company = $compInfo[0];
>             $company =~ s/\n//g;
>             $message = $compInfo[1];
>             $message =~ s/\s//g;
>
>         }
>
>         if($detail =~ /<pre>(.+)<\/pre>/s) {
>             $content = $1;
>             $content =~ s/;/|/g;
>             $content =~ s/<\/?.+?>//g;
>         }
>
>
>         $db->Sql("INSERT into results VALUES('" . $datum . "','" . $headline . "','" . $company . "','" . $message . "','" . $content . "')");
>
>     }
>
>     $db->Close();
>
> }
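
To make the bytes-versus-characters point concrete, here is a minimal sketch that runs without touching the network. The hand-built HTTP::Response, the sample string "München" and the cp1252 target encoding are assumptions chosen for illustration; none of them come from the script above. It only shows that content returns the raw bytes, that decoded_content returns characters decoded according to the charset in the Content-Type header, and that the characters can be encoded explicitly again before they leave the program.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hypothetical stand-in for $ua->request($request): a response whose body
# is the UTF-8 byte sequence for "München" (the umlaut is 0xC3 0xBC).
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    "M\xC3\xBCnchen",
);

my $bytes = $response->content;          # raw bytes as received
my $text  = $response->decoded_content;  # characters, decoded per the charset

printf "content:         %d bytes\n",      length $bytes;  # 8
printf "decoded_content: %d characters\n", length $text;   # 7

# Assumption: the ODBC driver behind the Access database expects
# Windows-1252 bytes. Check your own setup before relying on this.
my $for_db = encode('cp1252', $text);
printf "cp1252 for DB:   %d bytes\n",      length $for_db; # 7
```

With characters in hand, the regular expressions and substitutions operate on whole umlauts rather than on fragments of a multi-byte sequence. Which encoding to apply on the way out depends on what the database driver expects, so treat the cp1252 step as a guess to verify, not a rule.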