jenkins-bot has submitted this change and it was merged.

Change subject: Decoding text: catch exception
......................................................................


Decoding text: catch exception

Wrapped the decode instruction derived from
change Ia2051a2a80851b15b1a04a135763291bd633d4e3
in a "try: except:" block, as suggested in comment 9 of bug 67410
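
As a minimal sketch of the pattern described above (not the code in
reflinks.py itself): the hypothetical helper try_decode below catches the same
two exception types as the patch. UnicodeDecodeError covers bytes that are
invalid for the chosen codec; LookupError covers a codec name that Python does
not recognise. The helper name and the use of print are assumptions made for
illustration only.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if decoding is not possible.

    Hypothetical helper, for illustration only.
    """
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError) as e:
        # UnicodeDecodeError: the bytes are not valid in this encoding.
        # LookupError: the encoding name itself is unknown to Python.
        print('Decoding error - %s' % e)
        return None


ok = try_decode(b'caf\xc3\xa9', 'utf-8')          # valid UTF-8 bytes
bad_bytes = try_decode(b'\xff\xfe\xfa', 'utf-8')  # invalid UTF-8 bytes
bad_codec = try_decode(b'abc', 'no-such-codec')   # unknown codec name
```

In the patch, the failure branch reports the error and uses "continue" to skip
to the next reference rather than returning None.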

Also: added a comma to the characters excluded by the "self.CHARSET" regex,
in case "contentType" contains a list of values from the HTML meta tag
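
The effect of the added comma can be sketched by running both versions of the
pattern against the same input. The two patterns are copied from the diff; the
sample content_type string is a made-up value standing in for a content-type
that carries a comma-separated list.

```python
import re

# Pattern before the change: a comma is allowed inside the captured name.
OLD_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
# Pattern after the change: a comma ends the captured name.
NEW_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')

# Hypothetical content-type value containing a comma-separated list.
content_type = 'text/html; charset=utf-8, text/html'

old_enc = OLD_CHARSET.search(content_type).group('enc')
new_enc = NEW_CHARSET.search(content_type).group('enc')

print(repr(old_enc))  # capture runs past the comma, up to the next '/'
print(repr(new_enc))  # capture stops at the comma
```

With the old pattern the captured text includes part of the following list
item, which is unlikely to be a usable codec name; with the new pattern only
the charset name is captured.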

Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
---
M scripts/reflinks.py
1 file changed, 6 insertions(+), 2 deletions(-)

Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 65dcb5b..a7f50f0 100644
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -434,7 +434,7 @@
         # Regex to grasp content-type meta HTML tag in HTML source
         self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content\-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
+        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
@@ -683,7 +683,11 @@
 
                 if 'utf-8' not in enc:
                     enc.append('utf-8')
-                u = linkedpagetext.decode(enc[0])   # Bug 67410
+                try:
+                    u = linkedpagetext.decode(enc[0])   # Bug 67410
+                except (UnicodeDecodeError, LookupError) as e:
+                    pywikibot.output(u'%s : Decoding error - %s' % (ref.link, e))
+                    continue
 
                 # Retrieves the first non empty string inside <title> tags
                 for m in self.TITLE.finditer(u):

-- 
To view, visit https://gerrit.wikimedia.org/r/155226
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: Beta16 <[email protected]>
Gerrit-Reviewer: John Vandenberg <[email protected]>
Gerrit-Reviewer: Ladsgroup <[email protected]>
Gerrit-Reviewer: Merlijn van Deen <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
Pywikibot-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot-commits
