I'm new to using libwww (and to this list) and only moderately experienced with Perl. I'm currently working on a set of scripts to bulk load some data into a MediaWiki wiki that I'm working with several other people to set up. For the curious, the wiki focus is the writings of Patrick O'Brian. We have a number of resources that people have developed over the years, including a glossary of all non-English words used in O'Brian's books. The glossary includes terms from a number of languages, some of which use accented characters, esp. French & Irish.

I've written a bot, using LWP, to upload articles extracted from the glossary, and it works fine for articles that don't contain accented characters. Unfortunately, in articles that do contain them, those characters are corrupted on upload. I've been able to deal with these characters when they end up in the URL as the page title in the wiki (by converting them to '%xx' myself, since URI::Escape doesn't seem to work for this), but I can't figure out how to get article content containing these characters up to the wiki without corruption.
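To make the '%xx' conversion concrete, here is a minimal sketch of the kind of escaping I mean. It assumes the title is held as a Perl character string and that the wiki expects UTF-8 in its URLs; escape_title is a hypothetical helper name for illustration, not something taken from my actual script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (illustration only): convert a title to UTF-8
# octets, then percent-encode every octet that is not an unreserved
# URL character.
sub escape_title {
    my ($title) = @_;
    my $octets = encode('UTF-8', $title);
    $octets =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

print escape_title("caf\x{e9}"), "\n";   # prints "caf%C3%A9"
```

The point of encoding first is that each accented character becomes two octets in UTF-8, so it produces two '%xx' groups rather than one.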

OS = Mac OS X 10.3, perl version = 5.8.1, LWP = latest from CPAN

Here's my function to POST the articles:

sub SubmitArticle
{
    # get params
    my ($refArticle) = @_;

    # retrieve an Edit page for the new article and get the edit token
    my $url = $gWikiURL . $refArticle->{title} . $gActionEdit;
    my $response = $gBot->request(GET $url);
    my ($editToken) =
        ($response->content =~ m/.*value="(.*?)".*name="wpEditToken"/s);

    # create & send the submission request
    $url = $gWikiURL . $refArticle->{title} . $gActionSubmit;
    $response = $gBot->request(POST $url,
        Content_Type => 'form-data',
        Content      => [wpSave      => "Save page",
                         wpSection   => "",
                         wpEdittime  => "",
                         wpEditToken => $editToken,
                         wpSummary   => $gSubmissionComment,
                         wpTextbox1  => $refArticle->{wikitext}]);

    # return the outcome based on the response status
    return $response->is_error ? $FALSE : $TRUE;
}

The text causing the problems is in $refArticle->{wikitext} and I get the following message from perl when running the script:

   "Parsing of undecoded UTF-8 will give garbage when decoding
    entities at /Library/Perl/5.8.1/LWP/Protocol.pm line 114."

But, of course, I don't know how to correct this problem.
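One thing I have not yet tried is explicitly encoding the text to UTF-8 octets before handing it to POST. The sketch below is only a guess at that approach: it assumes the wikitext is a Perl character string and that the wiki expects UTF-8. I have not confirmed that this is what the warning is about. to_utf8_octets is a hypothetical helper name and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (illustration only): turn a Perl character
# string into the UTF-8 octets that would be sent in the form field.
sub to_utf8_octets {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $wikitext = "d\x{e9}j\x{e0} vu";          # made-up sample text
my $octets   = to_utf8_octets($wikitext);

# Each accented character occupies two octets in UTF-8.
printf "%d characters, %d octets\n", length($wikitext), length($octets);
# prints "7 characters, 9 octets"
```

In the function above, the result would be passed as the wpTextbox1 value in place of $refArticle->{wikitext}.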

Any help would be greatly appreciated. I'd also like to figure out why URI::Escape isn't escaping these same characters in the URLs -- they are in

   $refArticle->{title}

which also appears in the code above.


John Blumel


