XZise added a comment.

Okay, thank you, that helps a lot. Here are all the steps to understand what is 
happening: The page août 
<https://pt.wiktionary.org/w/index.php?title=ao%C3%BBt&oldid=1930450> on the 
Portuguese Wiktionary is using `{{urlencode:ao%FBt}}`. Our code searches 
through the text for templates to make sure that the page is not protected 
against bot edits, and it picks up `{{urlencode:ao%FBt}}` as a template. With 
that it tries to create a `Link` instance, and in doing so it tries to decode 
the percent encoding, which is why `urlencode:ao%FBt` is the text you got when 
you printed it.

And the rest is straightforward: it encodes that text using the site's 
encoding, handles the percent encoding and then decodes the resulting bytes 
again with the site's encoding. So `u'urlencode:ao%FBt'` first becomes 
`b'urlencode:ao%FBt'` using UTF-8 (as all characters are ASCII characters), 
then percent decoding turns it into `b'urlencode:ao\xFBt'`, and finally it 
tries to decode that using UTF-8, which does not work as `0xFB` alone is not a 
valid UTF-8 byte sequence.
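
The three steps above can be reproduced outside of pywikibot. This is only a 
minimal sketch using the standard library: `percent_decode` is a hypothetical 
helper name, it assumes the site's encoding is UTF-8, and it is not the actual 
`Link` code.

```python
from urllib.parse import unquote_to_bytes


def percent_decode(title, encoding="utf-8"):
    """Encode, undo the percent encoding, then decode again (illustration)."""
    raw = title.encode(encoding)       # all-ASCII text, so the bytes look the same
    unquoted = unquote_to_bytes(raw)   # b'%FB' becomes the single byte 0xFB
    return unquoted.decode(encoding)   # raises if the bytes are not valid


try:
    percent_decode("urlencode:ao%FBt")
except UnicodeDecodeError as error:
    # 0xFB on its own is not valid UTF-8, so decoding fails here.
    print("decode failed:", error)
```

A correctly encoded title such as `ao%C3%BBt` passes through the same function 
without an error, because `C3 BB` is the UTF-8 sequence for `û`.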

Now to fix this particular case (as you've already done 
<https://pt.wiktionary.org/w/index.php?title=ao%C3%BBt&diff=1994294&oldid=1930450>)
 it's possible to just fix the usage in the page, as it doesn't make sense to 
percent-encode an already percent-encoded string.

But while the fault lies with whoever wrote that text and not really with 
pywikibot, I think we need to mitigate that. I don't think it's possible to get 
percent-encoded text from the API, as it will use `\u00FB` instead, so I think 
we could skip that decoding step. It is probably only there because previous 
versions screen scraped an HTML page, which might use %-encoded text.

Alternatively, we should provide more sensible output that includes the 
original values, which would make it more obvious what went wrong in case some 
page has the same problem in the future.
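
As a rough sketch of that second option, the decode failure could be wrapped in 
an error that carries the original values. Both `TitleDecodeError` and 
`percent_decode_verbose` are hypothetical names made up for this example, and 
UTF-8 is assumed as the site's encoding; none of this is existing pywikibot API.

```python
from urllib.parse import unquote_to_bytes


class TitleDecodeError(ValueError):
    """Hypothetical error type for this sketch, not part of pywikibot."""


def percent_decode_verbose(title, encoding="utf-8"):
    """Percent-decode a title and report the original values on failure."""
    unquoted = unquote_to_bytes(title.encode(encoding))
    try:
        return unquoted.decode(encoding)
    except UnicodeDecodeError as error:
        # Include the original text, the encoding and the offending bytes.
        raise TitleDecodeError(
            "cannot decode %r as %s after percent decoding (bytes: %r)"
            % (title, encoding, unquoted)
        ) from error
```

With the title from this task the message would name `urlencode:ao%FBt` 
directly, so the offending template is easy to find in the page text.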


TASK DETAIL
  https://phabricator.wikimedia.org/T111116

To: XZise
Cc: XZise, pywikibot-bugs-list, Malafaya, Aklapper, jayvdb, Malyacko


