[Pywikipedia-bugs] [Maniphest] [Changed Subscribers] T103253: UnicodeDecodeError occured

XZise Sun, 21 Jun 2015 06:14:58 -0700

XZise added a subscriber: jayvdb.
XZise added a comment.

Okay, looking at it, there may be several factors. For one, `unicode_literals` 
(https://phabricator.wikimedia.org/rPWBC1e54a7d6886d56a21101900025038e25bab5ad03)
 makes plain quoted string literals unicode. That does not directly affect the 
line in question, because the string there was already unicode, so nothing 
changed. In addition, 
https://phabricator.wikimedia.org/rPWBCb44e59ae60a65bcba0e5fbc6d1941f1edcdc640c 
added support for `Page` instances in `Request`. So instead of a title with 
the section appended manually, it now always uses the `Page` instance and 
`Request` itself extracts the title. That extraction does not happen in the 
parameters, though, only on submit: submit creates a new list of the parameters 
normalized to str/bytes, while the original parameters (`Request._params`) are 
left untouched.
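
To illustrate that behaviour, here is a minimal sketch. `FakePage` and 
`FakeRequest` are made-up stand-ins for this example and are not the actual 
pywikibot classes; the only point is that normalizing into a copy leaves the 
stored parameters holding the page object:

```python
class FakePage(object):
    """Hypothetical stand-in for Page (not pywikibot's class)."""

    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title


class FakeRequest(object):
    """Hypothetical stand-in for Request (not pywikibot's class)."""

    def __init__(self, **params):
        # Raw values are stored as given and may contain FakePage objects.
        self._params = params

    def submit(self):
        # Build a new, normalized dict; self._params is not modified.
        normalized = {}
        for key, value in self._params.items():
            if isinstance(value, FakePage):
                value = value.title()
            normalized[key] = value
        return normalized


request = FakeRequest(action='query',
                      titles=FakePage(u'\xdcnic\xf6de t\xebxt'))
submitted = request.submit()
```

Anything that later formats `request._params` directly still sees the 
`FakePage` object rather than the extracted title.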


And now we are getting somewhere, because the line raising the error uses 
`self._params` and thus might get a `Page` instance. In Python 2, 
`Page.__repr__` currently returns bytes encoded in the console encoding. Now 
when you have something like `u'%s' % (u'Ünicöde tëxt'.encode('latin1'))`, it 
tries to put bytes into unicode (it doesn't matter whether `unicode_literals` 
is used; with it you could simply drop the u-prefixes). Whenever Python 2 has 
to put `bytes` into `unicode`, it decodes them as ASCII, which fails here 
because several of the characters are outside the ASCII range.
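
The failing step can be reproduced in isolation. This is only a sketch: it 
spells out the decode explicitly, on the assumption that the default encoding 
is ASCII (which it is in Python 2 unless someone changed it), so it behaves 
the same on Python 2 and 3:

```python
text = u'\xdcnic\xf6de t\xebxt'  # u'Ünicöde tëxt', written with escapes
raw = text.encode('latin1')      # bytes, like a console-encoded repr

# Python 2 performs the equivalent of this decode implicitly when bytes
# are interpolated into a unicode format string.
try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the encoding that was actually used round-trips fine.
roundtrip = raw.decode('latin1')
```

Here `failed` ends up `True`, while `roundtrip` equals the original text, so 
the bytes themselves are fine; only the implicit ASCII assumption breaks.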

So the underlying issue is simply that `Page.__repr__` does not return an 
ASCII-compatible string, which also leads to the failures seen in 
https://phabricator.wikimedia.org/T95809. @jayvdb actually reverted the changes 
related to `Page.__repr__` from the `unicode_literals` patch in 
https://phabricator.wikimedia.org/rPWBC853e6b0bdce3e4fe60920c1c04f64d3a8eecde5e.
 That said, the `unicode_literals` variant was not appropriate either: it 
returned a `unicode` in Python 2, but at least it wouldn't try to decode it.

I guess the most appropriate way would be to just call `repr` on the title and 
site, and then wrap that in something so the result looks like it does today. 
The obvious disadvantage is that the representation is less readable 
(especially on wikis not using the Latin alphabet), but it would conform to 
the standard. The implementation in the `unicode_literals` patch wasn't that 
far off. On Python 2.7 I get the following:

  >>> u'Ünicöde tëxt'.encode('latin1').decode('unicode-escape')
  u'\xdcnic\xf6de t\xebxt'
  >>> repr(u'Ünicöde tëxt')
  "u'\\xdcnic\\xf6de t\\xebxt'"

And the result in Python 3, where `__repr__` returns a `str` (`unicode` in 
Python 2 terms; afaik the `ascii()` builtin is the `repr` variant that returns 
an ASCII-compatible string), also looks quite similar:

  >>> u'Ünicöde tëxt'.encode('latin1').decode('unicode-escape')
  'Ünicöde tëxt'
  >>> repr(u'Ünicöde tëxt')
  "'Ünicöde tëxt'"
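
A rough sketch of what an ASCII-only `__repr__` built from the `repr` of site 
and title could look like. `FakeSite` and `SafeReprPage` are hypothetical 
names for this example and the output format is an assumption, not what 
pywikibot does or did:

```python
class FakeSite(object):
    """Hypothetical stand-in for a site object."""

    def __init__(self, code):
        self.code = code

    def __repr__(self):
        return 'FakeSite(%r)' % (self.code,)


class SafeReprPage(object):
    """Hypothetical page whose repr never contains non-ASCII characters."""

    def __init__(self, site, title):
        self.site = site
        self._title = title

    def __repr__(self):
        # Combine the reprs of site and title, then escape whatever is
        # still outside ASCII (relevant on Python 3, where repr() keeps
        # printable non-ASCII characters as they are).
        text = '%s(%r, %r)' % (type(self).__name__, self.site, self._title)
        return text.encode('ascii', 'backslashreplace').decode('ascii')


page = SafeReprPage(FakeSite('en'), u'\xdcnic\xf6de t\xebxt')
formatted = u'%s' % (page,)  # no decoding error, the repr is pure ASCII
```

The trade-off is the one mentioned above: the title shows up as escape 
sequences rather than readable text, but it is safe to interpolate anywhere.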


TASK DETAIL
  https://phabricator.wikimedia.org/T103253

To: Xqt, XZise
Cc: jayvdb, gerritbot, XZise, Aklapper, Xqt, pywikibot-bugs-list, Malyacko, 
P.Copp


