Hello community,

here is the log from the commit of package python-beautifulsoup4 for openSUSE:Factory checked in at 2020-06-05 20:00:44
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
 and      /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.3606 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-beautifulsoup4"

Fri Jun  5 20:00:44 2020 rev:33 rq:811097 version:4.9.1

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes     2020-04-15 19:52:40.685549843 +0200
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.3606/python-beautifulsoup4.changes   2020-06-05 20:00:49.956062319 +0200
@@ -1,0 +2,26 @@
+Wed Jun  3 11:10:03 UTC 2020 - Dirk Mueller <dmuel...@suse.com>
+
+- update to 4.9.1:
+  * Added a keyword argument 'on_duplicate_attribute' to the
+    BeautifulSoupHTMLParser constructor (used by the html.parser tree
+    builder) which lets you customize the handling of markup that
+    contains the same attribute more than once, as in:
+    <a href="url1" href="url2"> [bug=1878209]
+  * Added a distinct subclass, GuessedAtParserWarning, for the warning
+    issued when BeautifulSoup is instantiated without a parser being
+    specified. [bug=1873787]
+  * Added a distinct subclass, MarkupResemblesLocatorWarning, for the
+    warning issued when BeautifulSoup is instantiated with 'markup' that
+    actually seems to be a URL or the path to a file on
+    disk. [bug=1873787]
+  * The new NavigableString subclasses (Stylesheet, Script, and
+    TemplateString) can now be imported directly from the bs4 package.
+  * If you encode a document with a Python-specific encoding like
+    'unicode_escape', that encoding is no longer mentioned in the final
+    XML or HTML document. Instead, encoding information is omitted or
+    left blank. [bug=1874955]
+  * Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
+    Chvátal. [bug=1872279]
+- remove soupsieve2-tests.patch: upstreamed
+
+-------------------------------------------------------------------

Old:
----
  beautifulsoup4-4.9.0.tar.gz
  soupsieve2-tests.patch

New:
----
  beautifulsoup4-4.9.1.tar.gz

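The headline change in 4.9.1 is the duplicate-attribute handling noted in the changelog above. A minimal sketch of the new keyword argument in action (assuming bs4 4.9.1 with the html.parser tree builder, the only builder that accepts it):

  from bs4 import BeautifulSoup

  markup = '<a href="url1" href="url2">'

  # Default behavior ('replace'): the last value wins.
  BeautifulSoup(markup, 'html.parser').a['href']
  # 'url2'

  # 'ignore': keep the first value encountered.
  soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
  soup.a['href']
  # 'url1'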
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.Lq4WuK/_old  2020-06-05 20:00:51.244066544 +0200
+++ /var/tmp/diff_new_pack.Lq4WuK/_new  2020-06-05 20:00:51.248066558 +0200
@@ -18,13 +18,12 @@
 
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 Name:           python-beautifulsoup4
-Version:        4.9.0
+Version:        4.9.1
 Release:        0
 Summary:        HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
 License:        MIT
 URL:            https://www.crummy.com/software/BeautifulSoup/
 Source:         https://files.pythonhosted.org/packages/source/b/beautifulsoup4/beautifulsoup4-%{version}.tar.gz
-Patch0:         soupsieve2-tests.patch
 BuildRequires:  %{python_module pytest}
 BuildRequires:  %{python_module setuptools}
 BuildRequires:  %{python_module soupsieve >= 1.2}
@@ -75,7 +74,6 @@
 
 %prep
 %setup -q -n beautifulsoup4-%{version}
-%patch0 -p1
 
 %build
 %python_build

++++++ beautifulsoup4-4.9.0.tar.gz -> beautifulsoup4-4.9.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/NEWS.txt new/beautifulsoup4-4.9.1/NEWS.txt
--- old/beautifulsoup4-4.9.0/NEWS.txt   2020-04-05 21:42:09.000000000 +0200
+++ new/beautifulsoup4-4.9.1/NEWS.txt   2020-05-17 20:06:06.000000000 +0200
@@ -1,3 +1,31 @@
+= 4.9.1 (20200517)
+
+* Added a keyword argument 'on_duplicate_attribute' to the
+  BeautifulSoupHTMLParser constructor (used by the html.parser tree
+  builder) which lets you customize the handling of markup that
+  contains the same attribute more than once, as in:
+  <a href="url1" href="url2"> [bug=1878209]
+
+* Added a distinct subclass, GuessedAtParserWarning, for the warning
+  issued when BeautifulSoup is instantiated without a parser being
+  specified. [bug=1873787]
+
+* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
+  warning issued when BeautifulSoup is instantiated with 'markup' that
+  actually seems to be a URL or the path to a file on
+  disk. [bug=1873787]
+
+* The new NavigableString subclasses (Stylesheet, Script, and
+  TemplateString) can now be imported directly from the bs4 package.
+
+* If you encode a document with a Python-specific encoding like
+  'unicode_escape', that encoding is no longer mentioned in the final
+  XML or HTML document. Instead, encoding information is omitted or
+  left blank. [bug=1874955]
+
+* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
+  Chvátal. [bug=1872279]
+
 = 4.9.0 (20200405)
 
 * Added PageElement.decomposed, a new property which lets you
@@ -5,7 +33,8 @@
   NavigableString.
 
 * Embedded CSS and Javascript is now stored in distinct Stylesheet and
-  Script tags, which are ignored by methods like get_text(). This
+  Script tags, which are ignored by methods like get_text() since most
+  people don't consider this sort of content to be 'text'. This
   feature is not supported by the html5lib treebuilder. [bug=1868861]
 
 * Added a Russian translation by 'authoress' to the repository.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/PKG-INFO new/beautifulsoup4-4.9.1/PKG-INFO
--- old/beautifulsoup4-4.9.0/PKG-INFO   2020-04-05 21:55:13.000000000 +0200
+++ new/beautifulsoup4-4.9.1/PKG-INFO   2020-05-17 20:10:59.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.9.0
+Version: 4.9.1
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/beautifulsoup4.egg-info/PKG-INFO new/beautifulsoup4-4.9.1/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.9.0/beautifulsoup4.egg-info/PKG-INFO   2020-04-05 21:55:13.000000000 +0200
+++ new/beautifulsoup4-4.9.1/beautifulsoup4.egg-info/PKG-INFO   2020-05-17 20:10:59.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.9.0
+Version: 4.9.1
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/beautifulsoup4.egg-info/requires.txt new/beautifulsoup4-4.9.1/beautifulsoup4.egg-info/requires.txt
--- old/beautifulsoup4-4.9.0/beautifulsoup4.egg-info/requires.txt       2020-04-05 21:55:13.000000000 +0200
+++ new/beautifulsoup4-4.9.1/beautifulsoup4.egg-info/requires.txt       2020-05-17 20:10:59.000000000 +0200
@@ -1,4 +1,5 @@
 soupsieve>1.2
+soupsieve<2.0
 
 [html5lib]
 html5lib
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/__init__.py new/beautifulsoup4-4.9.1/bs4/__init__.py
--- old/beautifulsoup4-4.9.0/bs4/__init__.py    2020-04-05 20:51:49.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/__init__.py    2020-05-17 19:53:28.000000000 +0200
@@ -15,7 +15,7 @@
 """
 
 __author__ = "Leonard Richardson (leona...@segfault.org)"
-__version__ = "4.9.0"
+__version__ = "4.9.1"
 __copyright__ = "Copyright (c) 2004-2020 Leonard Richardson"
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -39,15 +39,32 @@
     NavigableString,
     PageElement,
     ProcessingInstruction,
+    PYTHON_SPECIFIC_ENCODINGS,
     ResultSet,
+    Script,
+    Stylesheet,
     SoupStrainer,
     Tag,
+    TemplateString,
     )
 
 # The very first thing we do is give a useful error if someone is
 # running this code under Python 3 without converting it.
 'You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work.'<>'You need to convert the code, either by installing it (`python setup.py install`) or by running 2to3 (`2to3 -w bs4`).'
 
+# Define some custom warnings.
+class GuessedAtParserWarning(UserWarning):
+    """The warning issued when BeautifulSoup has to guess what parser to
+    use -- probably because no parser was specified in the constructor.
+    """
+
+class MarkupResemblesLocatorWarning(UserWarning):
+    """The warning issued when BeautifulSoup is given 'markup' that
+    actually looks like a resource locator -- a URL or a path to a file
+    on disk.
+    """
+
+
 class BeautifulSoup(Tag):
     """A data structure representing a parsed HTML or XML document.
 
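An aside on the two warning classes added in the hunk above: both are plain UserWarning subclasses, so they can be filtered or escalated with the standard warnings machinery. A minimal sketch (assuming bs4 4.9.1):

  import warnings
  from bs4 import BeautifulSoup, GuessedAtParserWarning

  # Turn the parser-guessing warning into a hard error, e.g. in a test suite.
  with warnings.catch_warnings():
      warnings.simplefilter('error', GuessedAtParserWarning)
      BeautifulSoup('<a></a>')  # raises GuessedAtParserWarning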
@@ -93,7 +110,7 @@
     ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
 
     NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
-
+    
     def __init__(self, markup="", features=None, builder=None,
                  parse_only=None, from_encoding=None, exclude_encodings=None,
                  element_classes=None, **kwargs):
@@ -269,7 +286,10 @@
                         parser=builder.NAME,
                         markup_type=markup_type
                     )
-                    warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2)
+                    warnings.warn(
+                        self.NO_PARSER_SPECIFIED_WARNING % values,
+                        GuessedAtParserWarning, stacklevel=2
+                    )
         else:
             if kwargs:
                 warnings.warn("Keyword arguments to the BeautifulSoup constructor will be ignored. These would normally be passed into the TreeBuilder constructor, but a TreeBuilder instance was passed in as `builder`.")
@@ -309,7 +329,8 @@
                 warnings.warn(
                     '"%s" looks like a filename, not markup. You should'
                     ' probably open this file and pass the filehandle into'
-                    ' Beautiful Soup.' % self._decode_markup(markup)
+                    ' Beautiful Soup.' % self._decode_markup(markup),
+                    MarkupResemblesLocatorWarning
                 )
             self._check_markup_is_url(markup)
 
@@ -396,7 +417,8 @@
                     ' requests to get the document behind the URL, and feed'
                     ' that document to Beautiful Soup.' % cls._decode_markup(
                         markup
-                    )
+                    ),
+                    MarkupResemblesLocatorWarning
                 )
 
     def _feed(self):
@@ -428,7 +450,21 @@
 
     def new_tag(self, name, namespace=None, nsprefix=None, attrs={},
                 sourceline=None, sourcepos=None, **kwattrs):
-        """Create a new Tag associated with this BeautifulSoup object."""
+        """Create a new Tag associated with this BeautifulSoup object.
+
+        :param name: The name of the new Tag.
+        :param namespace: The URI of the new Tag's XML namespace, if any.
+        :param prefix: The prefix for the new Tag's XML namespace, if any.
+        :param attrs: A dictionary of this Tag's attribute values; can
+            be used instead of `kwattrs` for attributes like 'class'
+            that are reserved words in Python.
+        :param sourceline: The line number where this tag was
+            (purportedly) found in its source document.
+        :param sourcepos: The character position within `sourceline` where this
+            tag was (purportedly) found.
+        :param kwattrs: Keyword arguments for the new Tag's attribute values.
+
+        """
         kwattrs.update(attrs)
         return self.element_classes.get(Tag, Tag)(
             None, self.builder, name, namespace, nsprefix, kwattrs,
             sourceline=sourceline, sourcepos=sourcepos
         )
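The expanded new_tag() docstring above documents behavior that already existed; a short usage sketch (the tag name and attribute values here are illustrative):

  from bs4 import BeautifulSoup

  soup = BeautifulSoup('', 'html.parser')
  # 'attrs' covers names like 'class' that are reserved words in Python;
  # anything else can go in as a keyword argument.
  link = soup.new_tag('a', href='http://example.com/', attrs={'class': 'external'})
  link.string = 'example'
  str(link)
  # '<a class="external" href="http://example.com/">example</a>'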
@@ -477,14 +513,14 @@
             self.preserve_whitespace_tag_stack.pop()
         if self.string_container_stack and tag == self.string_container_stack[-1]:
             self.string_container_stack.pop()
-        #print "Pop", tag.name
+        #print("Pop", tag.name)
         if self.tagStack:
             self.currentTag = self.tagStack[-1]
         return self.currentTag
 
     def pushTag(self, tag):
         """Internal method called by handle_starttag when a tag is opened."""
-        #print "Push", tag.name
+        #print("Push", tag.name)
         if self.currentTag is not None:
             self.currentTag.contents.append(tag)
         self.tagStack.append(tag)
@@ -607,7 +643,7 @@
           to but *not* including the most recent instance of the
           given tag.
         """
-        #print "Popping to %s" % name
+        #print("Popping to %s" % name)
         if name == self.ROOT_TAG_NAME:
             # The BeautifulSoup object itself can never be popped.
             return
@@ -642,7 +678,7 @@
         in the document. For instance, if this was a self-closing tag,
         don't call handle_endtag.
         """
-        # print "Start tag %s: %s" % (name, attrs)
+        # print("Start tag %s: %s" % (name, attrs))
         self.endData()
 
         if (self.parse_only and len(self.tagStack) <= 1
@@ -669,14 +705,14 @@
         :param name: Name of the tag.
         :param nsprefix: Namespace prefix for the tag.
         """
-        #print "End tag: " + name
+        #print("End tag: " + name)
         self.endData()
         self._popToTag(name, nsprefix)
 
     def handle_data(self, data):
         """Called by the tree builder when a chunk of textual data is encountered."""
         self.current_data.append(data)
-
+       
     def decode(self, pretty_print=False,
                eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                formatter="minimal"):
@@ -691,6 +727,11 @@
         if self.is_xml:
             # Print the XML declaration
             encoding_part = ''
+            if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
+                # This is a special Python encoding; it can't actually
+                # go into an XML document because it means nothing
+                # outside of Python.
+                eventual_encoding = None
             if eventual_encoding != None:
                 encoding_part = ' encoding="%s"' % eventual_encoding
             prefix = u'<?xml version="1.0"%s?>\n' % encoding_part
@@ -733,4 +774,4 @@
 if __name__ == '__main__':
     import sys
     soup = BeautifulSoup(sys.stdin)
-    print soup.prettify()
+    print(soup.prettify())
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/builder/__init__.py new/beautifulsoup4-4.9.1/bs4/builder/__init__.py
--- old/beautifulsoup4-4.9.0/bs4/builder/__init__.py    2020-04-05 21:41:03.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/builder/__init__.py    2020-05-17 19:57:41.000000000 +0200
@@ -334,11 +334,11 @@
 
     def startElement(self, name, attrs):
         attrs = dict((key[1], value) for key, value in list(attrs.items()))
-        #print "Start %s, %r" % (name, attrs)
+        #print("Start %s, %r" % (name, attrs))
         self.soup.handle_starttag(name, attrs)
 
     def endElement(self, name):
-        #print "End %s" % name
+        #print("End %s" % name)
         self.soup.handle_endtag(name)
 
     def startElementNS(self, nsTuple, nodeName, attrs):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/builder/_html5lib.py new/beautifulsoup4-4.9.1/bs4/builder/_html5lib.py
--- old/beautifulsoup4-4.9.0/bs4/builder/_html5lib.py   2020-04-05 21:35:38.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/builder/_html5lib.py   2020-05-17 19:56:27.000000000 +0200
@@ -375,9 +375,9 @@
 
     def reparentChildren(self, new_parent):
         """Move all of this tag's children into another tag."""
-        # print "MOVE", self.element.contents
-        # print "FROM", self.element
-        # print "TO", new_parent.element
+        # print("MOVE", self.element.contents)
+        # print("FROM", self.element)
+        # print("TO", new_parent.element)
 
         element = self.element
         new_parent_element = new_parent.element
@@ -435,9 +435,9 @@
         element.contents = []
         element.next_element = final_next_element
 
-        # print "DONE WITH MOVE"
-        # print "FROM", self.element
-        # print "TO", new_parent_element
+        # print("DONE WITH MOVE")
+        # print("FROM", self.element)
+        # print("TO", new_parent_element)
 
     def cloneNode(self):
         tag = self.soup.new_tag(self.element.name, self.namespace)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/builder/_htmlparser.py new/beautifulsoup4-4.9.1/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.9.0/bs4/builder/_htmlparser.py 2019-12-24 16:33:36.000000000 +0100
+++ new/beautifulsoup4-4.9.1/bs4/builder/_htmlparser.py 2020-05-17 19:56:43.000000000 +0200
@@ -57,8 +57,26 @@
     listens for HTMLParser events and translates them into calls
     to Beautiful Soup's tree construction API.
     """
+
+    # Strategies for handling duplicate attributes
+    IGNORE = 'ignore'
+    REPLACE = 'replace'
     
     def __init__(self, *args, **kwargs):
+        """Constructor.
+
+        :param on_duplicate_attribute: A strategy for what to do if a
+            tag includes the same attribute more than once. Accepted
+            values are: REPLACE (replace earlier values with later
+            ones, the default), IGNORE (keep the earliest value
+            encountered), or a callable. A callable must take three
+            arguments: the dictionary of attributes already processed,
+            the name of the duplicate attribute, and the most recent value
+            encountered.           
+        """
+        self.on_duplicate_attribute = kwargs.pop(
+            'on_duplicate_attribute', self.REPLACE
+        )
         HTMLParser.__init__(self, *args, **kwargs)
 
         # Keep a list of empty-element tags that were encountered
@@ -114,9 +132,21 @@
             # for consistency with the other tree builders.
             if value is None:
                 value = ''
-            attr_dict[key] = value
+            if key in attr_dict:
+                # A single attribute shows up multiple times in this
+                # tag. How to handle it depends on the
+                # on_duplicate_attribute setting.
+                on_dupe = self.on_duplicate_attribute
+                if on_dupe == self.IGNORE:
+                    pass
+                elif on_dupe in (None, self.REPLACE):
+                    attr_dict[key] = value
+                else:
+                    on_dupe(attr_dict, key, value)
+            else:
+                attr_dict[key] = value
             attrvalue = '""'
-        #print "START", name
+        #print("START", name)
         sourceline, sourcepos = self.getpos()
         tag = self.soup.handle_starttag(
             name, None, None, attr_dict, sourceline=sourceline,
@@ -146,12 +176,12 @@
            be the closing portion of an empty-element tag,
            e.g. '<tag></tag>'.
         """
-        #print "END", name
+        #print("END", name)
         if check_already_closed and name in self.already_closed_empty_element:
             # This is a redundant end tag for an empty-element tag.
             # We've already called handle_endtag() for it, so just
             # check it off the list.
-            # print "ALREADY CLOSED", name
+            # print("ALREADY CLOSED", name)
             self.already_closed_empty_element.remove(name)
         else:
             self.soup.handle_endtag(name)
@@ -273,7 +303,7 @@
     # The html.parser knows which line number and position in the
     # original file is the source of an element.
     TRACKS_LINE_NUMBERS = True
-    
+
     def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):
         """Constructor.
 
@@ -285,15 +315,23 @@
             invoked.
         :param kwargs: Keyword arguments for the superclass constructor.
         """
+        # Some keyword arguments will be pulled out of kwargs and placed
+        # into parser_kwargs.
+        extra_parser_kwargs = dict()
+        for arg in ('on_duplicate_attribute',):
+            if arg in kwargs:
+                value = kwargs.pop(arg)
+                extra_parser_kwargs[arg] = value
         super(HTMLParserTreeBuilder, self).__init__(**kwargs)
         parser_args = parser_args or []
         parser_kwargs = parser_kwargs or {}
+        parser_kwargs.update(extra_parser_kwargs)
         if CONSTRUCTOR_TAKES_STRICT and not CONSTRUCTOR_STRICT_IS_DEPRECATED:
             parser_kwargs['strict'] = False
         if CONSTRUCTOR_TAKES_CONVERT_CHARREFS:
             parser_kwargs['convert_charrefs'] = False
         self.parser_args = (parser_args, parser_kwargs)
-
+        
     def prepare_markup(self, markup, user_specified_encoding=None,
                        document_declared_encoding=None, exclude_encodings=None):
 
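For the callable strategy handled in the on_dupe branch above, the callable receives the attribute dict built so far, the duplicated name, and the newest value. A sketch of a collect-everything handler (the same pattern the new tests and docs use further down):

  from bs4 import BeautifulSoup

  def accumulate(attrs, key, value):
      # Gather every value seen for a duplicated attribute into a list.
      if not isinstance(attrs[key], list):
          attrs[key] = [attrs[key]]
      attrs[key].append(value)

  soup = BeautifulSoup('<a href="url1" href="url2">', 'html.parser',
                       on_duplicate_attribute=accumulate)
  soup.a['href']
  # ['url1', 'url2']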
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/dammit.py new/beautifulsoup4-4.9.1/bs4/dammit.py
--- old/beautifulsoup4-4.9.0/bs4/dammit.py      2019-12-24 15:05:38.000000000 +0100
+++ new/beautifulsoup4-4.9.1/bs4/dammit.py      2020-05-17 19:56:04.000000000 +0200
@@ -506,16 +506,16 @@
             markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
 
         try:
-            #print "Trying to convert document to %s (errors=%s)" % (
-            #    proposed, errors)
+            #print("Trying to convert document to %s (errors=%s)" % (
+            #    proposed, errors))
             u = self._to_unicode(markup, proposed, errors)
             self.markup = u
             self.original_encoding = proposed
         except Exception as e:
-            #print "That didn't work!"
-            #print e
+            #print("That didn't work!")
+            #print(e)
             return None
-        #print "Correct encoding: %s" % proposed
+        #print("Correct encoding: %s" % proposed)
         return self.markup
 
     def _to_unicode(self, data, encoding, errors="strict"):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/diagnose.py new/beautifulsoup4-4.9.1/bs4/diagnose.py
--- old/beautifulsoup4-4.9.0/bs4/diagnose.py    2019-12-24 15:59:17.000000000 +0100
+++ new/beautifulsoup4-4.9.1/bs4/diagnose.py    2020-05-17 19:55:43.000000000 +0200
@@ -25,8 +25,8 @@
     :param data: A string containing markup that needs to be explained.
     :return: None; diagnostics are printed to standard output.
     """
-    print "Diagnostic running on Beautiful Soup %s" % __version__
-    print "Python version %s" % sys.version
+    print("Diagnostic running on Beautiful Soup %s" % __version__)
+    print("Python version %s" % sys.version)
 
     basic_parsers = ["html.parser", "html5lib", "lxml"]
     for name in basic_parsers:
@@ -35,7 +35,7 @@
                 break
         else:
             basic_parsers.remove(name)
-            print (
+            print(
                 "I noticed that %s is not installed. Installing it may help." %
                 name)
 
@@ -43,52 +43,52 @@
         basic_parsers.append("lxml-xml")
         try:
             from lxml import etree
-            print "Found lxml version %s" % ".".join(map(str,etree.LXML_VERSION))
+            print("Found lxml version %s" % ".".join(map(str,etree.LXML_VERSION)))
         except ImportError, e:
-            print (
+            print(
                 "lxml is not installed or couldn't be imported.")
 
 
     if 'html5lib' in basic_parsers:
         try:
             import html5lib
-            print "Found html5lib version %s" % html5lib.__version__
+            print("Found html5lib version %s" % html5lib.__version__)
         except ImportError, e:
-            print (
+            print(
                 "html5lib is not installed or couldn't be imported.")
 
     if hasattr(data, 'read'):
         data = data.read()
     elif data.startswith("http:") or data.startswith("https:"):
-        print '"%s" looks like a URL. Beautiful Soup is not an HTTP client.' % data
-        print "You need to use some other library to get the document behind the URL, and feed that document to Beautiful Soup."
+        print('"%s" looks like a URL. Beautiful Soup is not an HTTP client.' % data)
+        print("You need to use some other library to get the document behind the URL, and feed that document to Beautiful Soup.")
         return
     else:
         try:
             if os.path.exists(data):
-                print '"%s" looks like a filename. Reading data from the file.' % data
+                print('"%s" looks like a filename. Reading data from the file.' % data)
                 with open(data) as fp:
                     data = fp.read()
         except ValueError:
             # This can happen on some platforms when the 'filename' is
             # too long. Assume it's data and not a filename.
             pass
-        print
+        print("")
 
     for parser in basic_parsers:
-        print "Trying to parse your markup with %s" % parser
+        print("Trying to parse your markup with %s" % parser)
         success = False
         try:
             soup = BeautifulSoup(data, features=parser)
             success = True
         except Exception, e:
-            print "%s could not parse the markup." % parser
+            print("%s could not parse the markup." % parser)
             traceback.print_exc()
         if success:
-            print "Here's what %s did with the markup:" % parser
-            print soup.prettify()
+            print("Here's what %s did with the markup:" % parser)
+            print(soup.prettify())
 
-        print "-" * 80
+        print("-" * 80)
 
 def lxml_trace(data, html=True, **kwargs):
     """Print out the lxml events that occur during parsing.
@@ -193,9 +193,9 @@
 
 def benchmark_parsers(num_elements=100000):
     """Very basic head-to-head performance benchmark."""
-    print "Comparative parser benchmark on Beautiful Soup %s" % __version__
+    print("Comparative parser benchmark on Beautiful Soup %s" % __version__)
     data = rdoc(num_elements)
-    print "Generated a large invalid HTML document (%d bytes)." % len(data)
+    print("Generated a large invalid HTML document (%d bytes)." % len(data))
     
     for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
         success = False
@@ -205,23 +205,23 @@
             b = time.time()
             success = True
         except Exception, e:
-            print "%s could not parse the markup." % parser
+            print("%s could not parse the markup." % parser)
             traceback.print_exc()
         if success:
-            print "BS4+%s parsed the markup in %.2fs." % (parser, b-a)
+            print("BS4+%s parsed the markup in %.2fs." % (parser, b-a))
 
     from lxml import etree
     a = time.time()
     etree.HTML(data)
     b = time.time()
-    print "Raw lxml parsed the markup in %.2fs." % (b-a)
+    print("Raw lxml parsed the markup in %.2fs." % (b-a))
 
     import html5lib
     parser = html5lib.HTMLParser()
     a = time.time()
     parser.parse(data)
     b = time.time()
-    print "Raw html5lib parsed the markup in %.2fs." % (b-a)
+    print("Raw html5lib parsed the markup in %.2fs." % (b-a))
 
 def profile(num_elements=100000, parser="lxml"):
     """Use Python's profiler on a randomly generated document."""
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/element.py new/beautifulsoup4-4.9.1/bs4/element.py
--- old/beautifulsoup4-4.9.0/bs4/element.py     2020-04-05 21:15:36.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/element.py     2020-05-17 19:55:29.000000000 +0200
@@ -43,6 +43,35 @@
     return alias
 
 
+# These encodings are recognized by Python (so PageElement.encode
+# could theoretically support them) but XML and HTML don't recognize
+# them (so they should not show up in an XML or HTML document as that
+# document's encoding).
+#
+# If an XML document is encoded in one of these encodings, no encoding
+# will be mentioned in the XML declaration. If an HTML document is
+# encoded in one of these encodings, and the HTML document has a
+# <meta> tag that mentions an encoding, the encoding will be given as
+# the empty string.
+#
+# Source:
+# https://docs.python.org/3/library/codecs.html#python-specific-encodings
+PYTHON_SPECIFIC_ENCODINGS = set([
+    u"idna",
+    u"mbcs",
+    u"oem",
+    u"palmos",
+    u"punycode",
+    u"raw_unicode_escape",
+    u"undefined",
+    u"unicode_escape",
+    u"raw-unicode-escape",
+    u"unicode-escape",
+    u"string-escape",
+    u"string_escape",
+])
+    
+
 class NamespacedAttribute(unicode):
     """A namespaced string (e.g. 'xml:lang') that remembers the namespace
     ('xml') and the name ('lang') that were used to create it.
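A quick sketch of the behavior the comment above describes (assuming bs4 4.9.1; the 'xml' builder requires lxml):

  from bs4 import BeautifulSoup

  soup = BeautifulSoup('<foo/>', 'xml')
  out = soup.encode('unicode_escape')
  # The Python-only codec still performs the encoding, but it is not
  # advertised in the XML declaration:
  b'<?xml version="1.0"?>' in out   # True
  b'unicode_escape' in out          # False

  soup.encode('utf-8')
  # b'<?xml version="1.0" encoding="utf-8"?>\n<foo/>'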
@@ -85,6 +114,8 @@
         """When an HTML document is being encoded to a given encoding, the
         value of a meta tag's 'charset' is the name of the encoding.
         """
+        if encoding in PYTHON_SPECIFIC_ENCODINGS:
+            return ''
         return encoding
 
 
@@ -110,6 +141,8 @@
         return obj
 
     def encode(self, encoding):
+        if encoding in PYTHON_SPECIFIC_ENCODINGS:
+            return ''
         def rewrite(match):
             return match.group(1) + encoding
         return self.CHARSET_RE.sub(rewrite, self.original_value)
@@ -1399,7 +1432,7 @@
 
     def __getattr__(self, tag):
         """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
-        #print "Getattr %s.%s" % (self.__class__, tag)
+        #print("Getattr %s.%s" % (self.__class__, tag))
         if len(tag) > 3 and tag.endswith('Tag'):
             # BS3: soup.aTag -> "soup.find("a")
             tag_name = tag[:-3]
@@ -1724,7 +1757,7 @@
         if l:
             r = l[0]
         return r
-    findChild = find
+    findChild = find #BS2
 
     def find_all(self, name=None, attrs={}, recursive=True, text=None,
                  limit=None, **kwargs):
@@ -2002,7 +2035,7 @@
 
         :param markup: A PageElement or a list of them.
         """
-        # print 'looking for %s in %s' % (self, markup)
+        # print('looking for %s in %s' % (self, markup))
         found = None
         # If given a list of items, scan it for a text element that
         # matches.
@@ -2028,7 +2061,7 @@
         return found
 
     def _matches(self, markup, match_against, already_tried=None):
-        # print u"Matching %s against %s" % (markup, match_against)
+        # print(u"Matching %s against %s" % (markup, match_against))
         result = False
         if isinstance(markup, list) or isinstance(markup, tuple):
             # This should only happen when searching a multi-valued attribute
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/testing.py new/beautifulsoup4-4.9.1/bs4/testing.py
--- old/beautifulsoup4-4.9.0/bs4/testing.py     2020-04-05 20:07:32.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/testing.py     2020-04-25 03:40:19.000000000 +0200
@@ -15,6 +15,7 @@
     Comment,
     ContentMetaAttributeValue,
     Doctype,
+    PYTHON_SPECIFIC_ENCODINGS,
     SoupStrainer,
     Script,
     Stylesheet,
@@ -821,6 +822,29 @@
         # encoding.
         self.assertEqual('utf8', charset.encode("utf8"))
 
+    def test_python_specific_encodings_not_used_in_charset(self):
+        # You can encode an HTML document using a Python-specific
+        # encoding, but that encoding won't be mentioned _inside_ the
+        # resulting document. Instead, the document will appear to
+        # have no encoding.
+        for markup in [
+            b'<meta charset="utf8"></head>'
+            b'<meta id="encoding" charset="utf-8" />'
+        ]:
+            soup = self.soup(markup)
+            for encoding in PYTHON_SPECIFIC_ENCODINGS:
+                if encoding in (
+                    u'idna', u'mbcs', u'oem', u'undefined',
+                    u'string_escape', u'string-escape'
+                ):
+                    # For one reason or another, these will raise an
+                    # exception if we actually try to use them, so don't
+                    # bother.
+                    continue
+                encoded = soup.encode(encoding)
+                assert b'meta charset=""' in encoded
+                assert encoding.encode("ascii") not in encoded
+        
     def test_tag_with_no_attributes_can_have_attributes_added(self):
         data = self.soup("<a>text</a>")
         data.a['foo'] = 'bar'
@@ -854,6 +878,25 @@
         soup = self.soup(markup)
         self.assertEqual(markup, soup.encode("utf8"))
 
+    def test_python_specific_encodings_not_used_in_xml_declaration(self):
+        # You can encode an XML document using a Python-specific
+        # encoding, but that encoding won't be mentioned _inside_ the
+        # resulting document.
+        markup = b"""<?xml version="1.0"?>\n<foo/>"""
+        soup = self.soup(markup)
+        for encoding in PYTHON_SPECIFIC_ENCODINGS:
+            if encoding in (
+                u'idna', u'mbcs', u'oem', u'undefined',
+                u'string_escape', u'string-escape'
+            ):
+                # For one reason or another, these will raise an
+                # exception if we actually try to use them, so don't
+                # bother.
+                continue
+            encoded = soup.encode(encoding)
+            assert b'<?xml version="1.0"?>' in encoded
+            assert encoding.encode("ascii") not in encoded
+
     def test_processing_instruction(self):
         markup = b"""<?xml version="1.0" encoding="utf8"?>\n<?PITarget PIContent?>"""
         soup = self.soup(markup)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/tests/test_htmlparser.py new/beautifulsoup4-4.9.1/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.9.0/bs4/tests/test_htmlparser.py       2020-04-05 21:54:12.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/tests/test_htmlparser.py       2020-05-17 18:18:11.000000000 +0200
@@ -51,7 +51,43 @@
         self.assertEqual("sourceline", soup.p.sourceline.name)
         self.assertEqual("sourcepos", soup.p.sourcepos.name)
 
+    def test_on_duplicate_attribute(self):
+        # The html.parser tree builder has a variety of ways of
+        # handling a tag that contains the same attribute multiple times.
+
+        markup = '<a class="cls" href="url1" href="url2" href="url3" id="id">'
+
+        # If you don't provide any particular value for
+        # on_duplicate_attribute, later values replace earlier values.
+        soup = self.soup(markup)
+        self.assertEquals("url3", soup.a['href'])
+        self.assertEquals(["cls"], soup.a['class'])
+        self.assertEquals("id", soup.a['id'])
         
+        # You can also get this behavior explicitly.
+        def assert_attribute(on_duplicate_attribute, expected):
+            soup = self.soup(
+                markup, on_duplicate_attribute=on_duplicate_attribute
+            )
+            self.assertEquals(expected, soup.a['href'])
+
+            # Verify that non-duplicate attributes are treated normally.
+            self.assertEquals(["cls"], soup.a['class'])
+            self.assertEquals("id", soup.a['id'])
+        assert_attribute(None, "url3")
+        assert_attribute(BeautifulSoupHTMLParser.REPLACE, "url3")
+
+        # You can ignore subsequent values in favor of the first.
+        assert_attribute(BeautifulSoupHTMLParser.IGNORE, "url1")
+
+        # And you can pass in a callable that does whatever you want.
+        def accumulate(attrs, key, value):
+            if not isinstance(attrs[key], list):
+                attrs[key] = [attrs[key]]
+            attrs[key].append(value)
+        assert_attribute(accumulate, ["url1", "url2", "url3"])            
+
+
 class TestHTMLParserSubclass(SoupTest):
     def test_error(self):
         """Verify that our HTMLParser subclass implements error() in a way
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/tests/test_soup.py new/beautifulsoup4-4.9.1/bs4/tests/test_soup.py
--- old/beautifulsoup4-4.9.0/bs4/tests/test_soup.py     2020-04-05 21:54:12.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/tests/test_soup.py     2020-04-21 14:12:33.000000000 +0200
@@ -10,6 +10,8 @@
 from bs4 import (
     BeautifulSoup,
     BeautifulStoneSoup,
+    GuessedAtParserWarning,
+    MarkupResemblesLocatorWarning,
 )
 from bs4.builder import (
     TreeBuilder,
@@ -224,25 +226,32 @@
 
 class TestWarnings(SoupTest):
 
-    def _no_parser_specified(self, s, is_there=True):
-        v = s.startswith(BeautifulSoup.NO_PARSER_SPECIFIED_WARNING[:80])
-        self.assertTrue(v)
+    def _assert_warning(self, warnings, cls):
+        for w in warnings:
+            if isinstance(w.message, cls):
+                return w
+        raise Exception("%s warning not found in %r" % cls, warnings)
+    
+    def _assert_no_parser_specified(self, w):
+        warning = self._assert_warning(w, GuessedAtParserWarning)
+        message = str(warning.message)
+        self.assertTrue(
+            message.startswith(BeautifulSoup.NO_PARSER_SPECIFIED_WARNING[:60])
+        )
 
     def test_warning_if_no_parser_specified(self):
         with warnings.catch_warnings(record=True) as w:
-            soup = self.soup("<a><b></b></a>")
-        msg = str(w[0].message)
-        self._assert_no_parser_specified(msg)
+            soup = BeautifulSoup("<a><b></b></a>")
+        self._assert_no_parser_specified(w)
 
     def test_warning_if_parser_specified_too_vague(self):
         with warnings.catch_warnings(record=True) as w:
-            soup = self.soup("<a><b></b></a>", "html")
-        msg = str(w[0].message)
-        self._assert_no_parser_specified(msg)
+            soup = BeautifulSoup("<a><b></b></a>", "html")
+        self._assert_no_parser_specified(w)
 
     def test_no_warning_if_explicit_parser_specified(self):
         with warnings.catch_warnings(record=True) as w:
-            soup = self.soup("<a><b></b></a>", "html.parser")
+            soup = BeautifulSoup("<a><b></b></a>", "html.parser")
         self.assertEqual([], w)
 
     def test_parseOnlyThese_renamed_to_parse_only(self):
@@ -266,41 +275,43 @@
         self.assertRaises(
             TypeError, self.soup, "<a>", no_such_argument=True)
 
-class TestWarnings(SoupTest):
-
     def test_disk_file_warning(self):
         filehandle = tempfile.NamedTemporaryFile()
         filename = filehandle.name
         try:
             with warnings.catch_warnings(record=True) as w:
                 soup = self.soup(filename)
-            msg = str(w[0].message)
-            self.assertTrue("looks like a filename" in msg)
+            warning = self._assert_warning(w, MarkupResemblesLocatorWarning)
+            self.assertTrue("looks like a filename" in str(warning.message))
         finally:
             filehandle.close()
 
         # The file no longer exists, so Beautiful Soup will no longer issue the warning.
         with warnings.catch_warnings(record=True) as w:
             soup = self.soup(filename)
-        self.assertEqual(0, len(w))
+        self.assertEqual([], w)
 
     def test_url_warning_with_bytes_url(self):
         with warnings.catch_warnings(record=True) as warning_list:
             soup = self.soup(b"http://www.crummybytes.com/")
-        # Be aware this isn't the only warning that can be raised during
-        # execution..
-        self.assertTrue(any("looks like a URL" in str(w.message) 
-            for w in warning_list))
+        warning = self._assert_warning(
+            warning_list, MarkupResemblesLocatorWarning
+        )
+        self.assertTrue("looks like a URL" in str(warning.message))
 
     def test_url_warning_with_unicode_url(self):
         with warnings.catch_warnings(record=True) as warning_list:
             # note - this url must differ from the bytes one otherwise
             # python's warnings system swallows the second warning
             soup = self.soup(u"http://www.crummyunicode.com/")
-        self.assertTrue(any("looks like a URL" in str(w.message) 
-            for w in warning_list))
+        warning = self._assert_warning(
+            warning_list, MarkupResemblesLocatorWarning
+        )
+        self.assertTrue("looks like a URL" in str(warning.message))
 
     def test_url_warning_with_bytes_and_space(self):
+        # Here the markup contains something besides a URL, so no warning
+        # is issued.
         with warnings.catch_warnings(record=True) as warning_list:
             soup = self.soup(b"http://www.crummybytes.com/ is great")
         self.assertFalse(any("looks like a URL" in str(w.message)
             for w in warning_list))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/bs4/tests/test_tree.py new/beautifulsoup4-4.9.1/bs4/tests/test_tree.py
--- old/beautifulsoup4-4.9.0/bs4/tests/test_tree.py     2020-04-05 21:54:12.000000000 +0200
+++ new/beautifulsoup4-4.9.1/bs4/tests/test_tree.py     2020-04-12 20:20:16.000000000 +0200
@@ -37,6 +37,7 @@
     SoupTest,
     skipIf,
 )
+from soupsieve import SelectorSyntaxError
 
 XML_BUILDER_PRESENT = (builder_registry.lookup("xml") is not None)
 LXML_PRESENT = (builder_registry.lookup("lxml") is not None)
@@ -2018,7 +2019,7 @@
         self.assertEqual(len(self.soup.select('del')), 0)
 
     def test_invalid_tag(self):
-        self.assertRaises(SyntaxError, self.soup.select, 'tag%t')
+        self.assertRaises(SelectorSyntaxError, self.soup.select, 'tag%t')
 
     def test_select_dashed_tag_ids(self):
         self.assertSelects('custom-dashed-tag', ['dash1', 'dash2'])
@@ -2209,7 +2210,7 @@
             NotImplementedError, self.soup.select, "a:no-such-pseudoclass")
 
         self.assertRaises(
-            SyntaxError, self.soup.select, "a:nth-of-type(a)")
+            SelectorSyntaxError, self.soup.select, "a:nth-of-type(a)")
 
     def test_nth_of_type(self):
         # Try to select first paragraph
@@ -2265,7 +2266,7 @@
         self.assertEqual([], self.soup.select('#inner ~ h2'))
 
     def test_dangling_combinator(self):
-        self.assertRaises(SyntaxError, self.soup.select, 'h1 >')
+        self.assertRaises(SelectorSyntaxError, self.soup.select, 'h1 >')
 
     def test_sibling_combinator_wont_select_same_tag_twice(self):
         self.assertSelects('p[lang] ~ p', ['lang-en-gb', 'lang-en-us', 'lang-fr'])
@@ -2296,8 +2297,8 @@
         self.assertSelects('div x,y,  z', ['xid', 'yid', 'zida', 'zidb', 'zidab', 'zidac'])
 
     def test_invalid_multiple_select(self):
-        self.assertRaises(SyntaxError, self.soup.select, ',x, y')
-        self.assertRaises(SyntaxError, self.soup.select, 'x,,y')
+        self.assertRaises(SelectorSyntaxError, self.soup.select, ',x, y')
+        self.assertRaises(SelectorSyntaxError, self.soup.select, 'x,,y')
 
     def test_multiple_select_attrs(self):
         self.assertSelects('p[lang=en], p[lang=en-gb]', ['lang-en', 'lang-en-gb'])
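The exception-type changes in this hunk track soupsieve 2.0, where selector parse errors are no longer plain SyntaxError. Downstream code that catches selector errors wants something like this sketch (assuming a soupsieve recent enough to export SelectorSyntaxError):

  from bs4 import BeautifulSoup
  from soupsieve import SelectorSyntaxError

  soup = BeautifulSoup('<p>hi</p>', 'html.parser')
  try:
      soup.select('h1 >')  # dangling combinator: invalid selector
  except SelectorSyntaxError as e:
      print('bad selector:', e)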
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/doc/source/index.rst new/beautifulsoup4-4.9.1/doc/source/index.rst
--- old/beautifulsoup4-4.9.0/doc/source/index.rst       2020-04-05 21:30:06.000000000 +0200
+++ new/beautifulsoup4-4.9.1/doc/source/index.rst       2020-05-17 19:48:43.000000000 +0200
@@ -18,7 +18,7 @@
 how to use it, how to make it do what you want, and what to do when it
 violates your expectations.
 
-This document covers Beautiful Soup version 4.8.1. The examples in
+This document covers Beautiful Soup version 4.9.0. The examples in
 this documentation should work the same way in Python 2.7 and Python
 3.2.
 
@@ -290,10 +290,9 @@
 
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
 
 If you can, I recommend you install and use lxml for speed. If you're
-using a version of Python 2 earlier than 2.7.3, or a version of Python
-3 earlier than 3.2.2, it's `essential` that you install lxml or
-html5lib--Python's built-in HTML parser is just not very good in older
-versions.
+using a very old version of Python -- earlier than 2.7.3 or 3.2.2 --
+it's `essential` that you install lxml or html5lib. Python's built-in
+HTML parser is just not very good in those old versions.
 
 Note that if a document is invalid, different parsers will generate
 different Beautiful Soup trees for it. See `Differences
@@ -310,13 +309,13 @@
  with open("index.html") as fp:
      soup = BeautifulSoup(fp)
 
- soup = BeautifulSoup("<html>data</html>")
+ soup = BeautifulSoup("<html>a web page</html>")
 
 First, the document is converted to Unicode, and HTML entities are
 converted to Unicode characters::
 
- BeautifulSoup("Sacr&eacute; bleu!")
- <html><head></head><body>Sacré bleu!</body></html>
+ print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>"))
+ # <html><head></head><body>Sacré bleu!</body></html>
 
 Beautiful Soup then parses the document using the best available
 parser. It will use an HTML parser unless you specifically tell it to
@@ -569,8 +568,8 @@
 ``<template>`` tag). These classes work exactly the same way as
 ``NavigableString``; their only purpose is to make it easier to pick
 out the main body of the page, by ignoring strings that represent
-something else. (These classes are new in Beautiful Soup 4.9.0, and
-the html5lib parser doesn't use them.)
+something else. `(These classes are new in Beautiful Soup 4.9.0, and
+the html5lib parser doesn't use them.)`
  
 Beautiful Soup defines classes for anything else that might show up in
 an XML document: ``CData``, ``ProcessingInstruction``,
@@ -1708,18 +1707,17 @@
 CSS selectors
 -------------
 
-As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via
-the `SoupSieve <https://facelessuser.github.io/soupsieve/>`_
-project. If you installed Beautiful Soup through ``pip``, SoupSieve
-was installed at the same time, so you don't have to do anything extra.
-
-``BeautifulSoup`` has a ``.select()`` method which uses SoupSieve to
-run a CSS selector against a parsed document and return all the
-matching elements. ``Tag`` has a similar method which runs a CSS
-selector against the contents of a single tag.
-
-(Earlier versions of Beautiful Soup also have the ``.select()``
-method, but only the most commonly-used CSS selectors are supported.)
+``BeautifulSoup`` has a ``.select()`` method which uses the `SoupSieve
+<https://facelessuser.github.io/soupsieve/>`_ package to run a CSS
+selector against a parsed document and return all the matching
+elements. ``Tag`` has a similar method which runs a CSS selector
+against the contents of a single tag.
+
+(The SoupSieve integration was added in Beautiful Soup 4.7.0. Earlier
+versions also have the ``.select()`` method, but only the most
+commonly-used CSS selectors are supported. If you installed Beautiful
+Soup through ``pip``, SoupSieve was installed at the same time, so you
+don't have to do anything extra.)
 
 The SoupSieve `documentation
 <https://facelessuser.github.io/soupsieve/>`_ lists all the currently
@@ -1959,7 +1957,7 @@
    tag.contents
    # [u'Hello', u' there', u'Nice to see you.']
 
-(This is a new feature in Beautiful Soup 4.4.0.)
+`(This is a new feature in Beautiful Soup 4.4.0.)`
 
 What if you need to create a whole new tag?  The best solution is to
 call the factory method ``BeautifulSoup.new_tag()``::
@@ -2087,7 +2085,7 @@
 The behavior of a decomposed ``Tag`` or ``NavigableString`` is not
 defined and you should not use it for anything. If you're not sure
 whether something has been decomposed, you can check its
-``.decomposed`` property (new in 4.9.0)::
+``.decomposed`` property `(new in Beautiful Soup 4.9.0)`::
 
   i_tag.decomposed
   # True
@@ -2182,7 +2180,7 @@
  #  A one, a two
  # </p>
 
-The ``smooth()`` method is new in Beautiful Soup 4.8.0.
+`The ``smooth()`` method is new in Beautiful Soup 4.8.0.`
 
 Output
 ======
@@ -2406,7 +2404,7 @@
 ``get_text()``
 --------------
 
-If you only want the text part of a document or tag, you can use the
+If you only want the human-readable text inside a document or tag, you can use the
 ``get_text()`` method. It returns all the text in a document or
 beneath a tag, as a single Unicode string::
 
@@ -2436,6 +2434,12 @@
  [text for text in soup.stripped_strings]
  # [u'I linked to', u'example.com']
 
+*As of Beautiful Soup version 4.9.0, when lxml or html.parser are in
+use, the contents of <script>, <style>, and <template>
+tags are not considered to be 'text', since those tags are not part of
+the human-visible content of the page.*
+
+ 
 Specifying the parser to use
 ============================
 
@@ -2476,20 +2480,20 @@
 parsers, but each parser is different. Different parsers will create
 different parse trees from the same document. The biggest differences
 are between the HTML parsers and the XML parsers. Here's a short
-document, parsed as HTML::
+document, parsed as HTML using the parser that comes with Python::
 
- BeautifulSoup("<a><b /></a>")
- # <html><head></head><body><a><b></b></a></body></html>
+ BeautifulSoup("<a><b/></a>", "html.parser")
+ # <a><b></b></a>
 
-Since an empty <b /> tag is not valid HTML, the parser turns it into a
-<b></b> tag pair.
+Since a standalone <b/> tag is not valid HTML, html.parser turns it into
+a <b></b> tag pair.
 
 Here's the same document parsed as XML (running this requires that you
-have lxml installed). Note that the empty <b /> tag is left alone, and
+have lxml installed). Note that the standalone <b/> tag is left alone, and
 that the document is given an XML declaration instead of being put
 into an <html> tag.::
 
- BeautifulSoup("<a><b /></a>", "xml")
+ print(BeautifulSoup("<a><b/></a>", "xml"))
  # <?xml version="1.0" encoding="utf-8"?>
  # <a><b/></a>
 
@@ -2501,8 +2505,8 @@
 
 But if the document is not perfectly-formed, different parsers will
 give different results. Here's a short, invalid document parsed using
-lxml's HTML parser. Note that the dangling </p> tag is simply
-ignored::
+lxml's HTML parser. Note that the <a> tag gets wrapped in <body> and
+<html> tags, and the dangling </p> tag is simply ignored::
 
  BeautifulSoup("<a></p>", "lxml")
  # <html><body><a></a></body></html>
@@ -2513,8 +2517,8 @@
  # <html><head></head><body><a><p></p></a></body></html>
 
 Instead of ignoring the dangling </p> tag, html5lib pairs it with an
-opening <p> tag. This parser also adds an empty <head> tag to the
-document.
+opening <p> tag. html5lib also adds an empty <head> tag; lxml didn't
+bother.
 
 Here's the same document parsed with Python's built-in HTML
 parser::
@@ -2523,21 +2527,20 @@
  # <a></a>
 
 Like html5lib, this parser ignores the closing </p> tag. Unlike
-html5lib, this parser makes no attempt to create a well-formed HTML
-document by adding a <body> tag. Unlike lxml, it doesn't even bother
-to add an <html> tag.
+html5lib or lxml, this parser makes no attempt to create a
+well-formed HTML document by adding <html> or <body> tags.
 
 Since the document "<a></p>" is invalid, none of these techniques is
-the "correct" way to handle it. The html5lib parser uses techniques
+the 'correct' way to handle it. The html5lib parser uses techniques
 that are part of the HTML5 standard, so it has the best claim on being
-the "correct" way, but all three techniques are legitimate.
+the 'correct' way, but all three techniques are legitimate.
 
 Differences between parsers can affect your script. If you're planning
 on distributing your script to other people, or running it on multiple
 machines, you should specify a parser in the ``BeautifulSoup``
 constructor. That will reduce the chances that your users parse a
 document differently from the way you parse it.
-   
+
 Encodings
 =========
 
@@ -2790,7 +2793,7 @@
 Line numbers
 ============
 
-The ``html.parser` and ``html5lib`` parsers can keep track of where in
+The ``html.parser`` and ``html5lib`` parsers can keep track of where in
 the original document each Tag was found. You can access this
 information as ``Tag.sourceline`` (line number) and ``Tag.sourcepos``
 (position of the start tag within a line)::
@@ -2821,8 +2824,8 @@
    soup.p.sourceline
    # None
   
-This feature is new in 4.8.1, and the parsers based on lxml don't
-support it.
+`This feature is new in 4.8.1, and the parsers based on lxml don't
+support it.`
 
 Comparing objects for equality
 ==============================
@@ -2878,9 +2881,15 @@
 This is because two different ``Tag`` objects can't occupy the same
 space at the same time.
 
+Advanced parser customization
+=============================
+
+Beautiful Soup offers a number of ways to customize how the parser
+treats incoming HTML and XML. This section covers the most commonly
+used customization techniques.
 
 Parsing only part of a document
-===============================
+-------------------------------
 
 Let's say you want to use Beautiful Soup to look at a document's <a>
 tags. It's a waste of time and memory to parse the entire document and
@@ -2899,7 +2908,7 @@
 built-in parser.)
 
 ``SoupStrainer``
-----------------
+^^^^^^^^^^^^^^^^
 
 The ``SoupStrainer`` class takes the same arguments as a typical
 method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
@@ -2969,6 +2978,116 @@
  # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
  #  u'\n\n', u'...', u'\n']
 
+Customizing multi-valued attributes
+-----------------------------------
+
+In an HTML document, an attribute like ``class`` is given a list of
+values, and an attribute like ``id`` is given a single value, because
+the HTML specification treats those attributes differently::
+
+  markup = '<a class="cls1 cls2" id="id1 id2">'
+  soup = BeautifulSoup(markup)
+  soup.a['class']
+  # ['cls1', 'cls2']
+  soup.a['id']
+  # 'id1 id2'
+
+You can turn this off by passing in
+``multi_valued_attributes=None``. Then all attributes will be given a
+single value::
+
+  soup = BeautifulSoup(markup, multi_valued_attributes=None)
+  soup.a['class']
+  # 'cls1 cls2'
+  soup.a['id']
+  # 'id1 id2'
+
+You can customize this behavior quite a bit by passing in a
+dictionary for ``multi_valued_attributes``. If you need this, look at
+``HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES`` to see the
+configuration Beautiful Soup uses by default, which is based on the
+HTML specification.
+
+`(This is a new feature in Beautiful Soup 4.8.0.)`
+
+Handling duplicate attributes
+-----------------------------
+
+When using the ``html.parser`` parser, you can use the
+``on_duplicate_attribute`` constructor argument to customize what
+Beautiful Soup does when it encounters a tag that defines the same
+attribute more than once::
+
+  markup = '<a href="http://url1/" href="http://url2/">'
+
+The default behavior is to use the last value found for the tag::
+
+  soup = BeautifulSoup(markup, 'html.parser')
+  soup.a['href']
+  # http://url2/
+
+  soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
+  soup.a['href']
+  # http://url2/
+  
+With ``on_duplicate_attribute='ignore'`` you can tell Beautiful Soup
+to use the `first` value found and ignore the rest::
+
+  soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
+  soup.a['href']
+  # http://url1/
+
+(lxml and html5lib always do it this way; their behavior can't be
+configured from within Beautiful Soup.)
+
+If you need more, you can pass in a function that's called on each duplicate value::
+
+  def accumulate(attributes_so_far, key, value):
+      if not isinstance(attributes_so_far[key], list):
+          attributes_so_far[key] = [attributes_so_far[key]]
+      attributes_so_far[key].append(value)
+
+  soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
+  soup.a['href']
+  # ["http://url1/", "http://url2/"]
+
+`(This is a new feature in Beautiful Soup 4.9.1.)`
+
+Instantiating custom subclasses
+-------------------------------
+
+When a parser tells Beautiful Soup about a tag or a string, Beautiful
+Soup will instantiate a ``Tag`` or ``NavigableString`` object to
+contain that information. Instead of that default behavior, you can
+tell Beautiful Soup to instantiate `subclasses` of ``Tag`` or
+``NavigableString``, subclasses you define with custom behavior::
+
+  from bs4 import Tag, NavigableString
+  class MyTag(Tag):
+      pass
+  
+  class MyString(NavigableString):
+      pass
+
+  markup = "<div>some text</div>"
+  soup = BeautifulSoup(markup)
+  isinstance(soup.div, MyTag)
+  # False
+  isinstance(soup.div.string, MyString)
+  # False 
+
+  my_classes = { Tag: MyTag, NavigableString: MyString }
+  soup = BeautifulSoup(markup, element_classes=my_classes)
+  isinstance(soup.div, MyTag)
+  # True
+  isinstance(soup.div.string, MyString)
+  # True  
+
+This can be useful when incorporating Beautiful Soup into a test
+framework.
+
+`(This is a new feature in Beautiful Soup 4.8.1.)`
+
 Troubleshooting
 ===============
 
@@ -3278,6 +3397,18 @@
 * ``Tag.next`` -> ``Tag.next_element``
 * ``Tag.previous`` -> ``Tag.previous_element``
 
+These methods are left over from the Beautiful Soup 2 API. They've
+been deprecated since 2006, and should not be used at all:
+
+* ``Tag.fetchNextSiblings``
+* ``Tag.fetchPreviousSiblings``
+* ``Tag.fetchPrevious``
+* ``Tag.fetchPreviousSiblings``
+* ``Tag.fetchParents``
+* ``Tag.findChild``
+* ``Tag.findChildren``
+
+
 Generators
 ^^^^^^^^^^
 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.9.0/setup.py new/beautifulsoup4-4.9.1/setup.py
--- old/beautifulsoup4-4.9.0/setup.py   2020-04-05 21:54:18.000000000 +0200
+++ new/beautifulsoup4-4.9.1/setup.py   2020-05-17 19:53:03.000000000 +0200
@@ -16,7 +16,7 @@
     # NOTE: We can't import __version__ from bs4 because bs4/__init__.py is Python 2 code,
     # and converting it to Python 3 means going through this code to run 2to3.
     # So we have to specify it twice for the time being.
-    version = '4.9.0',
+    version = '4.9.1',
     author="Leonard Richardson",
     author_email='leona...@segfault.org',
     url="http://www.crummy.com/software/BeautifulSoup/bs4/",
