jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/475613 )

Change subject: proofreadpage.py: OCR needs BeautifulSoup
......................................................................

proofreadpage.py: OCR needs BeautifulSoup

In proofreadpage.py, OCR needs BeautifulSoup in:
- url_image()
- _do_hocr()

Soup() is defined at import time only if bs4 is available.
Define it also when bs4 is not available and make it raise
ImportError when called.
Rename Soup() to _bs4_soup() to comply with function naming rules.

OCR tests are already skipped when bs4 is not available:
- see Iaeabb046660b294fa19025282a344356f756c5bf

Bug: T210335
Change-Id: I5e3d235cdb1cba9b4ed52ba2442a9bfb1802d9bf
---
M pywikibot/proofreadpage.py
1 file changed, 18 insertions(+), 7 deletions(-)

Approvals:
  Xqt: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/pywikibot/proofreadpage.py b/pywikibot/proofreadpage.py
index e22c2e6..5432d70 100644
--- a/pywikibot/proofreadpage.py
+++ b/pywikibot/proofreadpage.py
@@ -38,20 +38,29 @@
     from bs4 import BeautifulSoup, FeatureNotFound
 except ImportError as e:
     BeautifulSoup = e
+
+    def _bs4_soup(*args, **kwargs):
+        """Raise the ImportError saved in BeautifulSoup when called."""
+        raise BeautifulSoup
 else:
     try:
         BeautifulSoup('', 'lxml')
     except FeatureNotFound:
-        Soup = partial(BeautifulSoup, features='html.parser')
+        _bs4_soup = partial(BeautifulSoup, features='html.parser')
     else:
-        Soup = partial(BeautifulSoup, features='lxml')
+        _bs4_soup = partial(BeautifulSoup, features='lxml')

 import pywikibot
 from pywikibot.comms import http
 from pywikibot.data.api import Request
+from pywikibot.tools import ModuleDeprecationWrapper

 _logger = 'proofreadpage'

+wrapper = ModuleDeprecationWrapper(__name__)
+wrapper._add_deprecated_attr('Soup', _bs4_soup, replacement_name='_bs4_soup',
+                             since='20181128')
+

 class FullHeader(object):

@@ -524,9 +533,10 @@
         @rtype: str/unicode

         @raises Exception: in case of http errors
+        @raises ImportError: if bs4 is not installed, _bs4_soup() will raise
         @raises ValueError: in case of no prp_page_image src found for scan
         """
-        # wrong link fail with various possible Exceptions.
+        # wrong link fails with various possible Exceptions.
         if not hasattr(self, '_url_image'):

             if self.exists():
@@ -541,7 +551,7 @@
                 pywikibot.error('Error fetching HTML for %s.' % self)
                 raise

-            soup = Soup(response.text)
+            soup = _bs4_soup(response.text)

             try:
                 self._url_image = soup.find(class_='prp-page-image')
@@ -623,10 +633,11 @@
         This is the main method for 'phetools'.
         Fallback method is ocr.

+        @raises ImportError: if bs4 is not installed, _bs4_soup() will raise
         """
         def parse_hocr_text(txt):
             """Parse hocr text."""
-            soup = Soup(txt)
+            soup = _bs4_soup(txt)

             res = []
             for ocr_page in soup.find_all(class_='ocr_page'):
@@ -823,7 +834,7 @@
             del self._parsed_text

         self._parsed_text = self._get_parsed_page()
-        self._soup = Soup(self._parsed_text)
+        self._soup = _bs4_soup(self._parsed_text)
         # Do not search for "new" here, to avoid to skip purging if links
         # to non-existing pages are present.
         attrs = {'class': re.compile('prp-pagequality')}
@@ -845,7 +856,7 @@
             self.purge()
             del self._parsed_text
             self._parsed_text = self._get_parsed_page()
-            self._soup = Soup(self._parsed_text)
+            self._soup = _bs4_soup(self._parsed_text)
             if not self._soup.find_all('a', attrs=attrs):
                 raise ValueError(
                     'Missing class="qualityN prp-pagequality-N" or '

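The patch keeps the old `Soup` name importable via `ModuleDeprecationWrapper`, which serves the renamed attribute with a deprecation warning. A comparable module-wrapper mechanism can be sketched as below; this is an illustrative minimal version, not pywikibot's actual `ModuleDeprecationWrapper` implementation.

```python
import types
import warnings


class _DeprecationWrapper(types.ModuleType):
    """Minimal sketch of a module wrapper serving deprecated names.

    Old attribute names resolve to their replacements, emitting a
    DeprecationWarning; all other lookups fall through to the module.
    """

    def __init__(self, module):
        super().__init__(module.__name__)
        self._module = module
        self._deprecated = {}

    def add_deprecated_attr(self, old_name, replacement, replacement_name):
        self._deprecated[old_name] = (replacement, replacement_name)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self._deprecated:
            replacement, new_name = self._deprecated[name]
            warnings.warn('%s is deprecated; use %s instead'
                          % (name, new_name),
                          DeprecationWarning, stacklevel=2)
            return replacement
        return getattr(self._module, name)


# Usage with a stand-in module (names illustrative):
mod = types.ModuleType('demo')
mod._bs4_soup = lambda text: '<parsed:%s>' % text
wrapper = _DeprecationWrapper(mod)
wrapper.add_deprecated_attr('Soup', mod._bs4_soup, '_bs4_soup')
# wrapper.Soup('x') warns, then behaves exactly like mod._bs4_soup('x')
```

This way old code importing `Soup` from the module keeps working through the deprecation period, while new code is steered toward `_bs4_soup`.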
--
To view, visit https://gerrit.wikimedia.org/r/475613
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings

Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I5e3d235cdb1cba9b4ed52ba2442a9bfb1802d9bf
Gerrit-Change-Number: 475613
Gerrit-PatchSet: 6
Gerrit-Owner: Mpaa <[email protected]>
Gerrit-Reviewer: John Vandenberg <[email protected]>
Gerrit-Reviewer: Mpaa <[email protected]>
Gerrit-Reviewer: Xqt <[email protected]>
Gerrit-Reviewer: jenkins-bot (75)