Xqt has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/110674

Change subject: convert html text to unicode, read charset, use utf-8 by default
......................................................................

convert html text to unicode, read charset, use utf-8 by default

Change-Id: I208500f399a309665ecbac082db72de72c354a5f
---
M pywikibot/comms/http.py
1 file changed, 10 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/pywikibot/core refs/changes/74/110674/1

diff --git a/pywikibot/comms/http.py b/pywikibot/comms/http.py
index 6a4c287..e9bc57f 100644
--- a/pywikibot/comms/http.py
+++ b/pywikibot/comms/http.py
@@ -13,7 +13,7 @@
 """
 
 #
-# (C) Pywikipedia bot team, 2007
+# (C) Pywikipedia bot team, 2008-2014
 #
 # Distributed under the terms of the MIT license.
 #
@@ -24,6 +24,7 @@
 import urllib
 import logging
 import atexit
+import re
 
 try:
     from httplib2 import SSLHandshakeError
@@ -146,5 +147,11 @@
     if request.data[0].status != 200:
         pywikibot.warning(u"Http response status %(status)s"
                           % {'status': request.data[0].status})
-
-    return request.data[1]
+    text = request.data[1]
+    # Convert text to Unicode
+    try:
+        charset = re.findall('charset=([^\'\";]+)', text)[0]
+    except IndexError:
+        charset = 'utf-8'  # default
+    text = unicode(text, charset, errors='strict')
+    return text
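
The hunk above looks for a charset declaration in the response text and falls back to utf-8 when none is found. A minimal standalone sketch of the same idea follows. It is written for Python 3, whereas the patch targets Python 2's unicode(). The names CHARSET_RE and decode_body are hypothetical and are not part of pywikibot, and the extra check that the declared name is a codec Python knows is an assumption added here, not something the patch does.

```python
import codecs
import re

# Hypothetical pattern (not from pywikibot): finds charset=NAME in a
# Content-Type style declaration or an HTML <meta> tag within raw bytes.
CHARSET_RE = re.compile(br'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                        re.IGNORECASE)


def decode_body(raw, default='utf-8'):
    """Decode raw bytes with the declared charset, else with `default`."""
    charset = default
    match = CHARSET_RE.search(raw)
    if match:
        candidate = match.group(1).decode('ascii')
        try:
            codecs.lookup(candidate)  # assumption: ignore unknown codec names
            charset = candidate
        except LookupError:
            pass
    # As in the patch, decoding is strict, so bad bytes raise an error.
    return raw.decode(charset, errors='strict')
```

For example, decode_body(b'<meta charset="iso-8859-1">caf\xe9') decodes the trailing byte as Latin-1, while input without any declaration is decoded as utf-8.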

-- 
To view, visit https://gerrit.wikimedia.org/r/110674
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I208500f399a309665ecbac082db72de72c354a5f
Gerrit-PatchSet: 1
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: Xqt <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
