commit python-beautifulsoup4 for openSUSE:Factory

root Tue, 30 Jul 2019 04:05:27 -0700

Hello community,

here is the log from the commit of package python-beautifulsoup4 for 
openSUSE:Factory checked in at 2019-07-30 13:05:12
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
 and      /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Package is "python-beautifulsoup4"

Tue Jul 30 13:05:12 2019 rev:29 rq:717648 version:4.8.0

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes  2019-03-04 09:11:05.132700786 +0100
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126/python-beautifulsoup4.changes  2019-07-30 13:05:15.146390127 +0200
@@ -1,0 +2,23 @@
+Mon Jul 22 16:18:23 UTC 2019 - Todd R <toddrme2...@gmail.com>
+
+- Update to 4.8.0
+  * It's now possible to customize the TreeBuilder object by passing
+    keyword arguments into the BeautifulSoup constructor. The main
+    reason to do this right now is to change which attributes are
+    treated as multi-valued attributes (the way 'class' is treated by
+    default). You can do this with the `multi_valued_attributes` argument.
+  * The role of Formatter objects has been greatly expanded. The Formatter
+    class now controls the following:
+    > The function to call to perform entity substitution. (This was
+      previously Formatter's only job.)
+    > Which tags should be treated as containing CDATA and have their
+      contents exempt from entity substitution.
+    > The order in which a tag's attributes are output.
+    > Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+    All preexisting code should work as before.
+  * Added a new method to the API, Tag.smooth(), which consolidates
+    multiple adjacent NavigableString elements.
+  * &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+    recognized as a named entity and converted to a single quote.
+
+-------------------------------------------------------------------
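
The `multi_valued_attributes` entry above relies on a sentinel default, so that "argument omitted" and "argument explicitly None" can mean different things. The following is a minimal stdlib-only sketch of that pattern, not the bs4 implementation: `SketchTreeBuilder`, `attribute_value` and `make_builder` are hypothetical names used for illustration, and the default table is cut down to a single entry.

```python
class SketchTreeBuilder:
    """Hypothetical stand-in for a tree builder; not the real bs4 class."""

    # Sentinel meaning "the caller did not pass this argument".
    # None cannot play that role, because None itself means
    # "treat no attribute as multi-valued".
    USE_DEFAULT = object()

    # Cut-down default table: "*" applies to every tag.
    DEFAULT_CDATA_LIST_ATTRIBUTES = {"*": ["class"]}

    def __init__(self, multi_valued_attributes=USE_DEFAULT):
        if multi_valued_attributes is self.USE_DEFAULT:
            multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
        self.cdata_list_attributes = multi_valued_attributes

    def attribute_value(self, tag_name, attr, raw):
        """Split raw on whitespace if attr is multi-valued for tag_name."""
        table = self.cdata_list_attributes
        if not table:
            return raw
        names = list(table.get("*", [])) + list(table.get(tag_name, []))
        return raw.split() if attr in names else raw


def make_builder(builder_class, **kwargs):
    """Assumed flow: leftover keyword arguments go to the builder class."""
    return builder_class(**kwargs)


default = make_builder(SketchTreeBuilder)
disabled = make_builder(SketchTreeBuilder, multi_valued_attributes=None)
custom = make_builder(SketchTreeBuilder, multi_valued_attributes={"*": ["id"]})
```

With these three builders, only `default` splits a `class` value into a list, `disabled` leaves every value as a plain string, and `custom` splits `id` instead of `class`.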

Old:
----
  beautifulsoup4-4.7.1.tar.gz

New:
----
  beautifulsoup4-4.8.0.tar.gz
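
One of the additions in the 4.8.0 tarball is Tag.smooth(), which consolidates adjacent strings. The idea can be sketched without bs4 as below. This is a simplified illustration under stated assumptions: `Node` is a hypothetical stand-in for a tag, plain `str` objects stand in for NavigableString, and the sketch rebuilds the child list rather than merging in place. It also ignores the special string types (such as comments and CDATA) that a real implementation would leave untouched.

```python
class Node:
    """Hypothetical stand-in for a tag: just an ordered list of children."""

    def __init__(self, *children):
        self.contents = list(children)


def smooth(node):
    """Merge runs of adjacent str children, recursing into Node children."""
    merged = []
    for child in node.contents:
        if isinstance(child, Node):
            # Smooth nested nodes first, then keep the node itself.
            smooth(child)
            merged.append(child)
        elif merged and isinstance(merged[-1], str):
            # The previous child is also a string: join the two.
            merged[-1] += child
        else:
            merged.append(child)
    node.contents = merged


inner = Node("a", "b")
root = Node("x", "y", inner, "z")
smooth(root)
```

After the call, `root` holds three children (the joined string, `inner`, and the trailing string), and `inner` holds a single joined string.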

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.SDgoFB/_old  2019-07-30 13:05:15.818389950 +0200
+++ /var/tmp/diff_new_pack.SDgoFB/_new  2019-07-30 13:05:15.822389949 +0200
@@ -18,7 +18,7 @@
 
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 Name:           python-beautifulsoup4
-Version:        4.7.1
+Version:        4.8.0
 Release:        0
 Summary:        HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
 License:        MIT

++++++ beautifulsoup4-4.7.1.tar.gz -> beautifulsoup4-4.8.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/NEWS.txt new/beautifulsoup4-4.8.0/NEWS.txt
--- old/beautifulsoup4-4.7.1/NEWS.txt   2019-01-07 01:36:52.000000000 +0100
+++ new/beautifulsoup4-4.8.0/NEWS.txt   2019-07-20 01:41:41.000000000 +0200
@@ -1,3 +1,30 @@
+= 4.8.0 (20190720, "One Small Soup")
+
+* It's now possible to customize the TreeBuilder object by passing
+  keyword arguments into the BeautifulSoup constructor. The main
+  reason to do this right now is to change which attributes are
+  treated as multi-valued attributes (the way 'class' is treated by
+  default). You can do this with the `multi_valued_attributes` argument.
+  [bug=1832978]
+
+* The role of Formatter objects has been greatly expanded. The Formatter
+  class now controls the following:
+
+  - The function to call to perform entity substitution. (This was
+    previously Formatter's only job.)
+  - Which tags should be treated as containing CDATA and have their
+    contents exempt from entity substitution.
+  - The order in which a tag's attributes are output. [bug=1812422]
+  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+
+  All preexisting code should work as before.
+
+* Added a new method to the API, Tag.smooth(), which consolidates
+  multiple adjacent NavigableString elements.
+
+* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+  recognized as a named entity and converted to a single quote. [bug=1818721]
+
 = 4.7.1 (20190106)
 
 * Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/PKG-INFO new/beautifulsoup4-4.8.0/PKG-INFO
--- old/beautifulsoup4-4.7.1/PKG-INFO   2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/PKG-INFO   2019-07-20 13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO   2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO   2019-07-20 13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt        2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt        2019-07-20 13:29:22.000000000 +0200
@@ -17,6 +17,7 @@
 bs4/dammit.py
 bs4/diagnose.py
 bs4/element.py
+bs4/formatter.py
 bs4/testing.py
 bs4/builder/__init__.py
 bs4/builder/_html5lib.py
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/__init__.py new/beautifulsoup4-4.8.0/bs4/__init__.py
--- old/beautifulsoup4-4.7.1/bs4/__init__.py    2019-01-07 01:50:44.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/__init__.py    2019-07-17 03:31:51.000000000 +0200
@@ -18,7 +18,7 @@
 """
 
 __author__ = "Leonard Richardson (leona...@segfault.org)"
-__version__ = "4.7.1"
+__version__ = "4.8.0"
 __copyright__ = "Copyright (c) 2004-2019 Leonard Richardson"
 # Use of this source code is governed by the MIT license.
 __license__ = "MIT"
@@ -98,8 +98,10 @@
         name a specific parser, so that Beautiful Soup gives you the
         same results across platforms and virtual environments.
 
-        :param builder: A specific TreeBuilder to use instead of looking one
-        up based on `features`. You shouldn't need to use this.
+        :param builder: A TreeBuilder subclass to instantiate (or
+        instance to use) instead of looking one up based on
+        `features`. You only need to use this if you've implemented a
+        custom TreeBuilder.
 
         :param parse_only: A SoupStrainer. Only parts of the document
         matching the SoupStrainer will be considered. This is useful
@@ -118,11 +120,17 @@
         :param kwargs: For backwards compatibility purposes, the
         constructor accepts certain keyword arguments used in
         Beautiful Soup 3. None of these arguments do anything in
-        Beautiful Soup 4 and there's no need to actually pass keyword
-        arguments into the constructor.
+        Beautiful Soup 4; they will result in a warning and then be ignored.
+
+        Apart from this, any keyword arguments passed into the BeautifulSoup
+        constructor are propagated to the TreeBuilder constructor. This
+        makes it possible to configure a TreeBuilder beyond saying
+        which one to use.
+
         """
 
         if 'convertEntities' in kwargs:
+            del kwargs['convertEntities']
             warnings.warn(
                 "BS4 does not respect the convertEntities argument to the "
                 "BeautifulSoup constructor. Entities are always converted "
@@ -177,13 +185,17 @@
             warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
             from_encoding = None
 
-        if len(kwargs) > 0:
-            arg = kwargs.keys().pop()
-            raise TypeError(
-                "__init__() got an unexpected keyword argument '%s'" % arg)
-
-        if builder is None:
-            original_features = features
+        # We need this information to track whether or not the builder
+        # was specified well enough that we can omit the 'you need to
+        # specify a parser' warning.
+        original_builder = builder
+        original_features = features
+            
+        if isinstance(builder, type):
+            # A builder class was passed in; it needs to be instantiated.
+            builder_class = builder
+            builder = None
+        elif builder is None:
             if isinstance(features, basestring):
                 features = [features]
             if features is None or len(features) == 0:
@@ -194,9 +206,16 @@
                     "Couldn't find a tree builder with the features you "
                     "requested: %s. Do you need to install a parser library?"
                     % ",".join(features))
-            builder = builder_class()
-            if not (original_features == builder.NAME or
-                    original_features in builder.ALTERNATE_NAMES):
+
+        # At this point either we have a TreeBuilder instance in
+        # builder, or we have a builder_class that we can instantiate
+        # with the remaining **kwargs.
+        if builder is None:
+            builder = builder_class(**kwargs)
+            if not original_builder and not (
+                    original_features == builder.NAME or
+                    original_features in builder.ALTERNATE_NAMES
+            ):
                 if builder.is_xml:
                     markup_type = "XML"
                 else:
@@ -231,7 +250,10 @@
                         markup_type=markup_type
                     )
                     warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2)
-
+        else:
+            if kwargs:
+                warnings.warn("Keyword arguments to the BeautifulSoup constructor will be ignored. These would normally be passed into the TreeBuilder constructor, but a TreeBuilder instance was passed in as `builder`.")
+                    
         self.builder = builder
         self.is_xml = builder.is_xml
         self.known_xml = self.is_xml
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/__init__.py new/beautifulsoup4-4.8.0/bs4/builder/__init__.py
--- old/beautifulsoup4-4.7.1/bs4/builder/__init__.py    2018-12-31 02:49:46.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/__init__.py    2019-07-14 22:16:04.000000000 +0200
@@ -7,7 +7,6 @@
 from bs4.element import (
     CharsetMetaAttributeValue,
     ContentMetaAttributeValue,
-    HTMLAwareEntitySubstitution,
     nonwhitespace_re
     )
 
@@ -90,18 +89,40 @@
 
     is_xml = False
     picklable = False
-    preserve_whitespace_tags = set()
     empty_element_tags = None # A tag will be considered an empty-element
                               # tag when and only when it has no contents.
     
     # A value for these tag/attribute combinations is a space- or
     # comma-separated list of CDATA, rather than a single CDATA.
-    cdata_list_attributes = {}
+    DEFAULT_CDATA_LIST_ATTRIBUTES = {}
 
+    DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
+    
+    USE_DEFAULT = object()
+    
+    def __init__(self, multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT):
+        """Constructor.
 
-    def __init__(self):
-        self.soup = None
+        :param multi_valued_attributes: If this is set to None, the
+        TreeBuilder will not turn any values for attributes like
+        'class' into lists. Setting this to a dictionary will
+        customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
+        for an example.
+
+        Internally, these are called "CDATA list attributes", but that
+        probably doesn't make sense to an end-user, so the argument name
+        is `multi_valued_attributes`.
 
+        :param preserve_whitespace_tags:
+        """
+        self.soup = None
+        if multi_valued_attributes is self.USE_DEFAULT:
+            multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
+        self.cdata_list_attributes = multi_valued_attributes
+        if preserve_whitespace_tags is self.USE_DEFAULT:
+            preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
+        self.preserve_whitespace_tags = preserve_whitespace_tags
+            
     def initialize_soup(self, soup):
         """The BeautifulSoup object has been initialized and is now
         being associated with the TreeBuilder.
@@ -131,7 +152,7 @@
         if self.empty_element_tags is None:
             return True
         return tag_name in self.empty_element_tags
-        
+    
     def feed(self, markup):
         raise NotImplementedError()
 
@@ -237,7 +258,6 @@
     Such as which tags are empty-element tags.
     """
 
-    preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
     empty_element_tags = set([
         # These are from HTML5.
         'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
@@ -259,7 +279,7 @@
     # encounter one of these attributes, we will parse its value into
     # a list of values if possible. Upon output, the list will be
     # converted back into a string.
-    cdata_list_attributes = {
+    DEFAULT_CDATA_LIST_ATTRIBUTES = {
         "*" : ['class', 'accesskey', 'dropzone'],
         "a" : ['rel', 'rev'],
         "link" :  ['rel', 'rev'],
@@ -276,6 +296,8 @@
         "output" : ["for"],
         }
 
+    DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
+    
     def set_up_substitutions(self, tag):
         # We are only interested in <meta> tags
         if tag.name != 'meta':
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py   2018-12-31 02:50:27.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py   2019-07-08 03:59:55.000000000 +0200
@@ -199,7 +199,7 @@
     def __setitem__(self, name, value):
         # If this attribute is a multi-valued attribute for this element,
         # turn its value into a list.
-        list_attr = HTML5TreeBuilder.cdata_list_attributes
+        list_attr = self.element.cdata_list_attributes
         if (name in list_attr['*']
             or (self.element.name in list_attr
                 and name in list_attr[self.element.name])):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py 2018-12-24 16:32:39.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py 2019-07-07 20:09:37.000000000 +0200
@@ -214,12 +214,15 @@
     NAME = HTMLPARSER
     features = [NAME, HTML, STRICT]
 
-    def __init__(self, *args, **kwargs):
+    def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):
+        super(HTMLParserTreeBuilder, self).__init__(**kwargs)
+        parser_args = parser_args or []
+        parser_kwargs = parser_kwargs or {}
         if CONSTRUCTOR_TAKES_STRICT and not CONSTRUCTOR_STRICT_IS_DEPRECATED:
-            kwargs['strict'] = False
+            parser_kwargs['strict'] = False
         if CONSTRUCTOR_TAKES_CONVERT_CHARREFS:
-            kwargs['convert_charrefs'] = False
-        self.parser_args = (args, kwargs)
+            parser_kwargs['convert_charrefs'] = False
+        self.parser_args = (parser_args, parser_kwargs)
 
     def prepare_markup(self, markup, user_specified_encoding=None,
                        document_declared_encoding=None, exclude_encodings=None):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py       2019-01-07 00:41:32.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py       2019-07-08 03:59:55.000000000 +0200
@@ -94,7 +94,7 @@
             parser = parser(target=self, strip_cdata=False, encoding=encoding)
         return parser
 
-    def __init__(self, parser=None, empty_element_tags=None):
+    def __init__(self, parser=None, empty_element_tags=None, **kwargs):
         # TODO: Issue a warning if parser is present but not a
         # callable, since that means there's no way to create new
         # parsers for different encodings.
@@ -103,6 +103,7 @@
             self.empty_element_tags = set(empty_element_tags)
         self.soup = None
         self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
+        super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
         
     def _getNsTag(self, tag):
         # Split the namespace URL out of a fully-qualified lxml tag
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/dammit.py new/beautifulsoup4-4.8.0/bs4/dammit.py
--- old/beautifulsoup4-4.7.1/bs4/dammit.py      2018-12-24 16:31:48.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/dammit.py      2019-07-08 03:45:20.000000000 +0200
@@ -57,15 +57,24 @@
         lookup = {}
         reverse_lookup = {}
         characters_for_re = []
-        for codepoint, name in list(codepoint2name.items()):
+
+        # &apos is an XHTML entity and an HTML 5, but not an HTML 4
+        # entity. We don't want to use it, but we want to recognize it on the way in.
+        #
+        # TODO: Ideally we would be able to recognize all HTML 5 named
+        # entities, but that's a little tricky.
+        extra = [(39, 'apos')]
+        for codepoint, name in list(codepoint2name.items()) + extra:
             character = unichr(codepoint)
-            if codepoint != 34:
+            if codepoint not in (34, 39):
                 # There's no point in turning the quotation mark into
-                # &quot;, unless it happens within an attribute value, which
-                # is handled elsewhere.
+                # &quot; or the single quote into &apos;, unless it
+                # happens within an attribute value, which is handled
+                # elsewhere.
                 characters_for_re.append(character)
                 lookup[character] = name
-            # But we do want to turn &quot; into the quotation mark.
+            # But we do want to recognize those entities on the way in and
+            # convert them to Unicode characters.
             reverse_lookup[name] = character
         re_definition = "[%s]" % "".join(characters_for_re)
         return lookup, reverse_lookup, re.compile(re_definition)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/element.py new/beautifulsoup4-4.8.0/bs4/element.py
--- old/beautifulsoup4-4.7.1/bs4/element.py     2019-01-07 01:35:05.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/element.py     2019-07-16 22:46:05.000000000 +0200
@@ -16,7 +16,11 @@
         'The soupsieve package is not installed. CSS selectors cannot be used.'
     )
 
-from bs4.dammit import EntitySubstitution
+from bs4.formatter import (
+    Formatter,
+    HTMLFormatter,
+    XMLFormatter,
+)
 
 DEFAULT_OUTPUT_ENCODING = "utf-8"
 PY3K = (sys.version_info[0] > 2)
@@ -99,138 +103,71 @@
             return match.group(1) + encoding
         return self.CHARSET_RE.sub(rewrite, self.original_value)
 
-class HTMLAwareEntitySubstitution(EntitySubstitution):
-
-    """Entity substitution rules that are aware of some HTML quirks.
-
-    Specifically, the contents of <script> and <style> tags should not
-    undergo entity substitution.
-
-    Incoming NavigableString objects are checked to see if they're the
-    direct children of a <script> or <style> tag.
-    """
-
-    cdata_containing_tags = set(["script", "style"])
-
-    preformatted_tags = set(["pre"])
-
-    preserve_whitespace_tags = set(['pre', 'textarea'])
-
-    @classmethod
-    def _substitute_if_appropriate(cls, ns, f):
-        if (isinstance(ns, NavigableString)
-            and ns.parent is not None
-            and ns.parent.name in cls.cdata_containing_tags):
-            # Do nothing.
-            return ns
-        # Substitute.
-        return f(ns)
-
-    @classmethod
-    def substitute_html(cls, ns):
-        return cls._substitute_if_appropriate(
-            ns, EntitySubstitution.substitute_html)
-
-    @classmethod
-    def substitute_xml(cls, ns):
-        return cls._substitute_if_appropriate(
-            ns, EntitySubstitution.substitute_xml)
-
-class Formatter(object):
-    """Contains information about how to format a parse tree."""
-    
-    # By default, represent void elements as <tag/> rather than <tag>
-    void_element_close_prefix = '/'
-
-    def substitute_entities(self, *args, **kwargs):
-        """Transform certain characters into named entities."""
-        raise NotImplementedError()
-
-class HTMLFormatter(Formatter):
-    """The default HTML formatter."""
-    def substitute(self, *args, **kwargs):
-        return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
-
-class MinimalHTMLFormatter(Formatter):
-    """A minimal HTML formatter."""
-    def substitute(self, *args, **kwargs):
-        return HTMLAwareEntitySubstitution.substitute_xml(*args, **kwargs)
-    
-class HTML5Formatter(HTMLFormatter):
-    """An HTML formatter that omits the slash in a void tag."""
-    void_element_close_prefix = None
-
-class XMLFormatter(Formatter):
-    """Substitute only the essential XML entities."""
-    def substitute(self, *args, **kwargs):
-        return EntitySubstitution.substitute_xml(*args, **kwargs)
-
-class HTMLXMLFormatter(Formatter):
-    """Format XML using HTML rules."""
-    def substitute(self, *args, **kwargs):
-        return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
-
     
 class PageElement(object):
     """Contains the navigational information for some part of the page
     (either a tag or a piece of text)"""
+   
+    def setup(self, parent=None, previous_element=None, next_element=None,
+              previous_sibling=None, next_sibling=None):
+        """Sets up the initial relations between this element and
+        other elements."""
+        self.parent = parent
+
+        self.previous_element = previous_element
+        if previous_element is not None:
+            self.previous_element.next_element = self
+
+        self.next_element = next_element
+        if self.next_element is not None:
+            self.next_element.previous_element = self
 
-    # There are five possible values for the "formatter" argument passed in
-    # to methods like encode() and prettify():
-    #
-    # "html" - All Unicode characters with corresponding HTML entities
-    #   are converted to those entities on output.
-    # "html5" - The same as "html", but empty void tags are represented as
-    #   <tag> rather than <tag/>
-    # "minimal" - Bare ampersands and angle brackets are converted to
-    #   XML entities: &amp; &lt; &gt;
-    # None - The null formatter. Unicode characters are never
-    #   converted to entities.  This is not recommended, but it's
-    #   faster than "minimal".
-    # A callable function - it will be called on every string that needs to undergo entity substitution.
-    # A Formatter instance - Formatter.substitute(string) will be called on every string that
-    #  needs to undergo entity substitution.
-    #
-
-    # In an HTML document, the default "html", "html5", and "minimal"
-    # functions will leave the contents of <script> and <style> tags
-    # alone. For an XML document, all tags will be given the same
-    # treatment.
-
-    HTML_FORMATTERS = {
-        "html" : HTMLFormatter(),
-        "html5" : HTML5Formatter(),
-        "minimal" : MinimalHTMLFormatter(),
-        None : None
-        }
-
-    XML_FORMATTERS = {
-        "html" : HTMLXMLFormatter(),
-        "minimal" : XMLFormatter(),
-        None : None
-        }
+        self.next_sibling = next_sibling
+        if self.next_sibling is not None:
+            self.next_sibling.previous_sibling = self
 
-    def format_string(self, s, formatter='minimal'):
+        if (previous_sibling is None
+            and self.parent is not None and self.parent.contents):
+            previous_sibling = self.parent.contents[-1]
+
+        self.previous_sibling = previous_sibling
+        if previous_sibling is not None:
+            self.previous_sibling.next_sibling = self
+
+    def format_string(self, s, formatter):
         """Format the given string using the given formatter."""
-        if isinstance(formatter, basestring):
-            formatter = self._formatter_for_name(formatter)
         if formatter is None:
-            output = s
-        else:
-            if isinstance(formatter, Callable):
-                # Backwards compatibility -- you used to pass in a formatting method.
-                output = formatter(s)
-            else:
-                output = formatter.substitute(s)
+            return s
+        if not isinstance(formatter, Formatter):
+            formatter = self.formatter_for_name(formatter)
+        output = formatter.substitute(s)
         return output
 
+    def formatter_for_name(self, formatter):
+        """Look up or create a Formatter for the given identifier,
+        if necessary.
+
+        :param formatter: Can be a Formatter object (used as-is), a
+        function (used as the entity substitution hook for an
+        XMLFormatter or HTMLFormatter), or a string (used to look up
+        an XMLFormatter or HTMLFormatter in the appropriate registry).
+        """
+        if isinstance(formatter, Formatter):
+            return formatter
+        if self._is_xml:
+            c = XMLFormatter
+        else:
+            c = HTMLFormatter
+        if callable(formatter):
+            return c(entity_substitution=formatter)
+        return c.REGISTRY[formatter]
+
     @property
     def _is_xml(self):
         """Is this element part of an XML tree or an HTML tree?
 
-        This is used when mapping a formatter name ("minimal") to an
-        appropriate function (one that performs entity-substitution on
-        the contents of <script> and <style> tags, or not). It can be
+        This is used in formatter_for_name, when deciding whether an
+        XMLFormatter or HTMLFormatter is more appropriate. It can be
         inefficient, but it should be called very rarely.
         """
         if self.known_xml is not None:
@@ -248,46 +185,13 @@
             return getattr(self, 'is_xml', False)
         return self.parent._is_xml
 
-    def _formatter_for_name(self, name):
-        "Look up a formatter function based on its name and the tree."
-        if self._is_xml:
-            return self.XML_FORMATTERS.get(name, XMLFormatter())
-        else:
-            return self.HTML_FORMATTERS.get(name, HTMLFormatter())
-
-    def setup(self, parent=None, previous_element=None, next_element=None,
-              previous_sibling=None, next_sibling=None):
-        """Sets up the initial relations between this element and
-        other elements."""
-        self.parent = parent
-
-        self.previous_element = previous_element
-        if previous_element is not None:
-            self.previous_element.next_element = self
-
-        self.next_element = next_element
-        if self.next_element is not None:
-            self.next_element.previous_element = self
-
-        self.next_sibling = next_sibling
-        if self.next_sibling is not None:
-            self.next_sibling.previous_sibling = self
-
-        if (previous_sibling is None
-            and self.parent is not None and self.parent.contents):
-            previous_sibling = self.parent.contents[-1]
-
-        self.previous_sibling = previous_sibling
-        if previous_sibling is not None:
-            self.previous_sibling.next_sibling = self
-
     nextSibling = _alias("next_sibling")  # BS3
     previousSibling = _alias("previous_sibling")  # BS3
 
     def replace_with(self, replace_with):
         if self.parent is None:
             raise ValueError(
-                "Cannot replace one element with another when the"
+                "Cannot replace one element with another when the "
                 "element to be replaced is not part of a tree.")
         if replace_with is self:
             return
@@ -742,6 +646,7 @@
                     self.__class__.__name__, attr))
 
     def output_ready(self, formatter="minimal"):
+        """Run the string through the provided formatter."""
         output = self.format_string(self, formatter)
         return self.PREFIX + output + self.SUFFIX
 
@@ -760,10 +665,12 @@
     but the return value will be ignored.
     """
 
-    def output_ready(self, formatter="minimal"):
-        """CData strings are passed into the formatter.
-        But the return value is ignored."""
-        self.format_string(self, formatter)
+    def output_ready(self, formatter=None):
+        """CData strings are passed into the formatter, purely
+        for any side effects. The return value is ignored.
+        """
+        if formatter is not None:
+            ignore = self.format_string(self, formatter)
         return self.PREFIX + self + self.SUFFIX
 
 class CData(PreformattedString):
@@ -831,14 +738,6 @@
         self.name = name
         self.namespace = namespace
         self.prefix = prefix
-        if builder is not None:
-            preserve_whitespace_tags = builder.preserve_whitespace_tags
-        else:
-            if is_xml:
-                preserve_whitespace_tags = []
-            else:
-                preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
-        self.preserve_whitespace_tags = preserve_whitespace_tags
         if attrs is None:
             attrs = {}
         elif attrs:
@@ -861,12 +760,31 @@
         self.setup(parent, previous)
         self.hidden = False
 
-        # Set up any substitutions, such as the charset in a META tag.
-        if builder is not None:
+        if builder is None:
+            # In the absence of a TreeBuilder, assume this tag is nothing
+            # special.
+            self.can_be_empty_element = False
+            self.cdata_list_attributes = None
+        else:
+            # Set up any substitutions for this tag, such as the charset in a META tag.
             builder.set_up_substitutions(self)
+
+            # Ask the TreeBuilder whether this tag might be an empty-element tag.
             self.can_be_empty_element = builder.can_be_empty_element(name)
-        else:
-            self.can_be_empty_element = False
+
+            # Keep track of the list of attributes of this tag that
+            # might need to be treated as a list.
+            #
+            # For performance reasons, we store the whole data structure
+            # rather than asking the question of every tag. Asking would
+            # require building a new data structure every time, and
+            # (unlike can_be_empty_element), we almost never need
+            # to check this.
+            self.cdata_list_attributes = builder.cdata_list_attributes
+
+            # Keep track of the names that might cause this tag to be treated as a
+            # whitespace-preserved tag.
+            self.preserve_whitespace_tags = builder.preserve_whitespace_tags
             
     parserClass = _alias("parser_class")  # BS3
 
@@ -981,6 +899,43 @@
             for element in self.contents[:]:
                 element.extract()
 
+    def smooth(self):
+        """Smooth out this element's children by consolidating consecutive strings.
+
+        This makes pretty-printed output look more natural following a
+        lot of operations that modified the tree.
+        """
+        # Mark the first position of every pair of children that need
+        # to be consolidated.  Do this rather than making a copy of
+        # self.contents, since in most cases very few strings will be
+        # affected.
+        marked = []
+        for i, a in enumerate(self.contents):
+            if isinstance(a, Tag):
+                # Recursively smooth children.
+                a.smooth()
+            if i == len(self.contents)-1:
+                # This is the last item in .contents, and it's not a
+                # tag. There's no chance it needs any work.
+                continue
+            b = self.contents[i+1]
+            if (isinstance(a, NavigableString)
+                and isinstance(b, NavigableString)
+                and not isinstance(a, PreformattedString)
+                and not isinstance(b, PreformattedString)
+            ):
+                marked.append(i)
+
+        # Go over the marked positions in reverse order, so that
+        # removing items from .contents won't affect the remaining
+        # positions.
+        for i in reversed(marked):
+            a = self.contents[i]
+            b = self.contents[i+1]
+            b.extract()
+            n = NavigableString(a+b)
+            a.replace_with(n)
+
     def index(self, element):
         """
         Find the index of a child by identity, not value. Avoids issues with
@@ -1115,14 +1070,6 @@
         u = self.decode(indent_level, encoding, formatter)
         return u.encode(encoding, errors)
 
-    def _should_pretty_print(self, indent_level):
-        """Should this tag be pretty-printed?"""
-
-        return (
-            indent_level is not None
-            and self.name not in self.preserve_whitespace_tags
-        )
-
     def decode(self, indent_level=None,
                eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                formatter="minimal"):
@@ -1136,30 +1083,32 @@
            encoding.
         """
 
-        # First off, turn a string formatter into a Formatter object. This
-        # will stop the lookup from happening over and over again.
-        if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable):
-            formatter = self._formatter_for_name(formatter)
+        # First off, turn a non-Formatter `formatter` into a Formatter
+        # object. This will stop the lookup from happening over and
+        # over again.
+        if not isinstance(formatter, Formatter):
+            formatter = self.formatter_for_name(formatter)
+        attributes = formatter.attributes(self)
         attrs = []
-        if self.attrs:
-            for key, val in sorted(self.attrs.items()):
-                if val is None:
-                    decoded = key
-                else:
-                    if isinstance(val, list) or isinstance(val, tuple):
-                        val = ' '.join(val)
-                    elif not isinstance(val, basestring):
-                        val = unicode(val)
-                    elif (
+        for key, val in attributes:
+            if val is None:
+                decoded = key
+            else:
+                if isinstance(val, list) or isinstance(val, tuple):
+                    val = ' '.join(val)
+                elif not isinstance(val, basestring):
+                    val = unicode(val)
+                elif (
                         isinstance(val, AttributeValueWithCharsetSubstitution)
-                        and eventual_encoding is not None):
-                        val = val.encode(eventual_encoding)
-
-                    text = self.format_string(val, formatter)
-                    decoded = (
-                        unicode(key) + '='
-                        + EntitySubstitution.quoted_attribute_value(text))
-                attrs.append(decoded)
+                        and eventual_encoding is not None
+                ):
+                    val = val.encode(eventual_encoding)
+
+                text = formatter.attribute_value(val)
+                decoded = (
+                    unicode(key) + '='
+                    + formatter.quoted_attribute_value(text))
+            attrs.append(decoded)
         close = ''
         closeTag = ''
 
@@ -1168,9 +1117,7 @@
             prefix = self.prefix + ":"
 
         if self.is_empty_element:
-            close = ''
-            if isinstance(formatter, Formatter):
-                close = formatter.void_element_close_prefix or close
+            close = formatter.void_element_close_prefix or ''
         else:
             closeTag = '</%s%s>' % (prefix, self.name)
 
@@ -1185,7 +1132,8 @@
         else:
             indent_contents = None
         contents = self.decode_contents(
-            indent_contents, eventual_encoding, formatter)
+            indent_contents, eventual_encoding, formatter
+        )
 
         if self.hidden:
             # This is the 'document root' object.
@@ -1217,6 +1165,13 @@
             s = ''.join(s)
         return s
 
+    def _should_pretty_print(self, indent_level):
+        """Should this tag be pretty-printed?"""
+        return (
+            indent_level is not None
+            and self.name not in self.preserve_whitespace_tags
+        )
+
     def prettify(self, encoding=None, formatter="minimal"):
         if encoding is None:
             return self.decode(True, formatter=formatter)
@@ -1232,19 +1187,19 @@
            indented this many spaces.
 
         :param eventual_encoding: The tag is destined to be
-           encoded into this encoding. This method is _not_
+           encoded into this encoding. decode_contents() is _not_
            responsible for performing that encoding. This information
            is passed in so that it can be substituted in if the
            document contains a <META> tag that mentions the document's
            encoding.
 
-        :param formatter: The output formatter responsible for converting
-           entities to Unicode characters.
+        :param formatter: A Formatter object, or a string naming one of
+            the standard Formatters.
         """
         # First off, turn a string formatter into a Formatter object. This
         # will stop the lookup from happening over and over again.
-        if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable):
-            formatter = self._formatter_for_name(formatter)
+        if not isinstance(formatter, Formatter):
+            formatter = self.formatter_for_name(formatter)
 
         pretty_print = (indent_level is not None)
         s = []
@@ -1255,16 +1210,19 @@
             elif isinstance(c, Tag):
                 s.append(c.decode(indent_level, eventual_encoding,
                                   formatter))
-            if text and indent_level and not self.name == 'pre':
+            preserve_whitespace = (
+                self.preserve_whitespace_tags and self.name in self.preserve_whitespace_tags
+            )
+            if text and indent_level and not preserve_whitespace:
                 text = text.strip()
             if text:
-                if pretty_print and not self.name == 'pre':
+                if pretty_print and not preserve_whitespace:
                     s.append(" " * (indent_level - 1))
                 s.append(text)
-                if pretty_print and not self.name == 'pre':
+                if pretty_print and not preserve_whitespace:
                     s.append("\n")
         return ''.join(s)
-
+       
     def encode_contents(
         self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
         formatter="minimal"):
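Note on the element.py changes above: the new Tag.smooth() marks the first index of every mergeable pair, then merges in reverse order so earlier indices stay valid. The following is a minimal stand-alone sketch of that idea on plain Python lists, not the bs4 implementation. Nested lists stand in for child tags, and the `Preformatted` class and `smooth` function are hypothetical names used only for this illustration.

```python
class Preformatted(str):
    """Hypothetical stand-in for Comment/CData: a string that is never merged."""


def _mergeable(item):
    # Plain strings may be merged; preformatted strings and "tags" may not.
    return isinstance(item, str) and not isinstance(item, Preformatted)


def smooth(items):
    """Merge runs of adjacent plain strings in a nested list, in place."""
    marked = []
    for i, a in enumerate(items):
        if isinstance(a, list):
            # Recursively smooth children (a nested list plays the role of a Tag).
            smooth(a)
        if i == len(items) - 1:
            # Last item: there is no right-hand neighbour to merge with.
            continue
        b = items[i + 1]
        if _mergeable(a) and _mergeable(b):
            marked.append(i)

    # Walk the marked positions in reverse so that shrinking the list
    # does not shift the positions that are still to be processed.
    for i in reversed(marked):
        items[i:i + 2] = [items[i] + items[i + 1]]
    return items


tree = ["a", "b", "c",
        Preformatted("Comment 1"), Preformatted("Comment 2"),
        "d", ["1", "2"]]
smooth(tree)
```

The sample data mirrors the shape used in TestSmooth further down: seven children before smoothing and five afterwards, with the two preformatted strings left as separate items.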
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/formatter.py new/beautifulsoup4-4.8.0/bs4/formatter.py
--- old/beautifulsoup4-4.7.1/bs4/formatter.py   1970-01-01 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/formatter.py   2019-07-16 22:46:05.000000000 +0200
@@ -0,0 +1,99 @@
+from bs4.dammit import EntitySubstitution
+
+class Formatter(EntitySubstitution):
+    """Describes a strategy to use when outputting a parse tree to a string.
+
+    Some parts of this strategy come from the distinction between
+    HTML4, HTML5, and XML. Others are configurable by the user.
+    """
+    # Registries of XML and HTML formatters.
+    XML_FORMATTERS = {}
+    HTML_FORMATTERS = {}
+
+    HTML = 'html'
+    XML = 'xml'
+
+    HTML_DEFAULTS = dict(
+        cdata_containing_tags=set(["script", "style"]),
+    )
+
+    def _default(self, language, value, kwarg):
+        if value is not None:
+            return value
+        if language == self.XML:
+            return set()
+        return self.HTML_DEFAULTS[kwarg]
+
+    def __init__(
+            self, language=None, entity_substitution=None,
+            void_element_close_prefix='/', cdata_containing_tags=None,
+    ):
+        """
+
+        :param void_element_close_prefix: By default, represent void
+        elements as <tag/> rather than <tag>
+        """
+        self.language = language
+        self.entity_substitution = entity_substitution
+        self.void_element_close_prefix = void_element_close_prefix
+        self.cdata_containing_tags = self._default(
+            language, cdata_containing_tags, 'cdata_containing_tags'
+        )
+            
+    def substitute(self, ns):
+        """Process a string that needs to undergo entity substitution."""
+        if not self.entity_substitution:
+            return ns
+        from .element import NavigableString
+        if (isinstance(ns, NavigableString)
+            and ns.parent is not None
+            and ns.parent.name in self.cdata_containing_tags):
+            # Do nothing.
+            return ns
+        # Substitute.
+        return self.entity_substitution(ns)
+
+    def attribute_value(self, value):
+        """Process the value of an attribute."""
+        return self.substitute(value)
+    
+    def attributes(self, tag):
+        """Reorder a tag's attributes however you want."""
+        return sorted(tag.attrs.items())
+
+   
+class HTMLFormatter(Formatter):
+    REGISTRY = {}
+    def __init__(self, *args, **kwargs):
+        return super(HTMLFormatter, self).__init__(self.HTML, *args, **kwargs)
+
+    
+class XMLFormatter(Formatter):
+    REGISTRY = {}
+    def __init__(self, *args, **kwargs):
+        return super(XMLFormatter, self).__init__(self.XML, *args, **kwargs)
+
+
+# Set up aliases for the default formatters.
+HTMLFormatter.REGISTRY['html'] = HTMLFormatter(
+    entity_substitution=EntitySubstitution.substitute_html
+)
+HTMLFormatter.REGISTRY["html5"] = HTMLFormatter(
+    entity_substitution=EntitySubstitution.substitute_html,
+    void_element_close_prefix = None
+)
+HTMLFormatter.REGISTRY["minimal"] = HTMLFormatter(
+    entity_substitution=EntitySubstitution.substitute_xml
+)
+HTMLFormatter.REGISTRY[None] = HTMLFormatter(
+    entity_substitution=None
+)
+XMLFormatter.REGISTRY["html"] =  XMLFormatter(
+    entity_substitution=EntitySubstitution.substitute_html
+)
+XMLFormatter.REGISTRY["minimal"] = XMLFormatter(
+    entity_substitution=EntitySubstitution.substitute_xml
+)
+XMLFormatter.REGISTRY[None] = XMLFormatter(
+    entity_substitution=None
+)
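Note on formatter.py above: the Formatter now bundles three decisions, namely which function performs entity substitution, which tags hold CDATA that must be left untouched, and whether void elements get a closing slash. The sketch below restates that strategy using only the standard library. `MiniFormatter`, `escape_text` and `void_element` are hypothetical names for illustration, and `html.escape` is swapped in for bs4's EntitySubstitution functions, so the output is only an approximation of what bs4 produces.

```python
from html import escape


def escape_text(text):
    # Stand-in for EntitySubstitution.substitute_xml: escape &, < and > only.
    return escape(text, quote=False)


class MiniFormatter:
    """Hypothetical, simplified model of the Formatter strategy object."""

    def __init__(self, entity_substitution=None,
                 void_element_close_prefix='/',
                 cdata_containing_tags=None):
        self.entity_substitution = entity_substitution
        self.void_element_close_prefix = void_element_close_prefix
        if cdata_containing_tags is None:
            # Same HTML default as in the diff above.
            cdata_containing_tags = {"script", "style"}
        self.cdata_containing_tags = cdata_containing_tags

    def substitute(self, text, parent_name=None):
        """Escape text unless substitution is off or the parent holds CDATA."""
        if not self.entity_substitution:
            return text
        if parent_name in self.cdata_containing_tags:
            return text
        return self.entity_substitution(text)

    def void_element(self, name):
        """Render a void element with or without the closing slash."""
        return "<%s%s>" % (name, self.void_element_close_prefix or "")


minimal = MiniFormatter(entity_substitution=escape_text)
html5 = MiniFormatter(entity_substitution=escape_text,
                      void_element_close_prefix=None)
no_substitution = MiniFormatter(entity_substitution=None)
```

The three instances correspond loosely to the "minimal", "html5" and None registry entries: only `html5` drops the slash, and only `no_substitution` passes text through unescaped everywhere.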
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/testing.py new/beautifulsoup4-4.8.0/bs4/testing.py
--- old/beautifulsoup4-4.7.1/bs4/testing.py     2018-12-31 03:11:14.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/testing.py     2019-07-08 03:59:55.000000000 +0200
@@ -63,19 +63,19 @@
 
     @property
     def default_builder(self):
-        return default_builder()
+        return default_builder
 
     def soup(self, markup, **kwargs):
         """Build a Beautiful Soup object from markup."""
         builder = kwargs.pop('builder', self.default_builder)
         return BeautifulSoup(markup, builder=builder, **kwargs)
 
-    def document_for(self, markup):
+    def document_for(self, markup, **kwargs):
         """Turn an HTML fragment into a document.
 
         The details depend on the builder.
         """
-        return self.default_builder.test_fragment_to_document(markup)
+        return self.default_builder(**kwargs).test_fragment_to_document(markup)
 
     def assertSoupEquals(self, to_parse, compare_parsed_to=None):
         builder = self.default_builder
@@ -232,7 +232,7 @@
             soup = self.soup("")
             new_tag = soup.new_tag(name)
             self.assertEqual(True, new_tag.is_empty_element)
-    
+
     def test_pickle_and_unpickle_identity(self):
         # Pickling a tree, then unpickling it, yields a tree identical
         # to the original.
@@ -491,6 +491,12 @@
             u"<p>\u2022 AT&amp;T is in the s&amp;p 500</p>"
         )
 
+    def test_apos_entity(self):
+        self.assertSoupEquals(
+            u"<p>Bob&apos;s Bar</p>",
+            u"<p>Bob's Bar</p>",
+        )
+        
     def test_entities_in_foreign_document_encoding(self):
         # &#147; and &#148; are invalid numeric entities referencing
         # Windows-1252 characters. &#45; references a character common
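Note on testing.py above: default_builder is now a class rather than an instance, so keyword arguments can be forwarded when the builder is constructed. A rough sketch of that dispatch follows, assuming the behaviour described in test_custom_builder_class further down (a class is instantiated with the kwargs, while an instance is used as-is and the kwargs trigger a warning). `resolve_builder` and `MockBuilder` are hypothetical names, not bs4 functions; the warning text is copied from that test.

```python
import warnings


class MockBuilder:
    """Hypothetical stand-in for a TreeBuilder; it only records its kwargs."""

    def __init__(self, **kwargs):
        self.called_with = kwargs


def resolve_builder(builder, **kwargs):
    """Return a builder instance from either a builder class or an instance."""
    if isinstance(builder, type):
        # A class was passed: instantiate it with the caller's kwargs.
        return builder(**kwargs)
    if kwargs:
        # An instance was passed: it is already configured, so the
        # kwargs cannot be applied and are reported as ignored.
        warnings.warn(
            "Keyword arguments to the BeautifulSoup constructor will be ignored."
        )
    return builder


from_class = resolve_builder(MockBuilder, multi_valued_attributes=None)

ready_made = MockBuilder(var="value")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from_instance = resolve_builder(ready_made, ignored_value=True)
```

This is why the test suites below switch from `return LXMLTreeBuilder()` to `return LXMLTreeBuilder`: handing over the class leaves construction, and therefore customization, to the caller.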
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py 2018-12-23 23:16:18.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py 2019-07-07 21:54:34.000000000 +0200
@@ -22,7 +22,7 @@
 
     @property
     def default_builder(self):
-        return HTML5TreeBuilder()
+        return HTML5TreeBuilder
 
     def test_soupstrainer(self):
         # The html5lib tree builder does not support SoupStrainers.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py       2018-07-15 14:26:01.000000000 +0200
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py       2019-07-07 21:52:25.000000000 +0200
@@ -9,9 +9,7 @@
 
 class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):
 
-    @property
-    def default_builder(self):
-        return HTMLParserTreeBuilder()
+    default_builder = HTMLParserTreeBuilder
 
     def test_namespaced_system_doctype(self):
         # html.parser can't handle namespaced doctypes, so skip this one.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py     2019-01-07 00:41:32.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py     2019-07-07 21:54:54.000000000 +0200
@@ -36,7 +36,7 @@
 
     @property
     def default_builder(self):
-        return LXMLTreeBuilder()
+        return LXMLTreeBuilder
 
     def test_out_of_range_entity(self):
         self.assertSoupEquals(
@@ -79,7 +79,7 @@
 
     @property
     def default_builder(self):
-        return LXMLTreeBuilderForXML()
+        return LXMLTreeBuilderForXML
 
     def test_namespace_indexing(self):
         # We should not track un-prefixed namespaces as we can only hold one
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py     2016-07-27 03:27:42.000000000 +0200
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py     2019-07-16 22:46:05.000000000 +0200
@@ -24,6 +24,7 @@
     EncodingDetector,
 )
 from bs4.testing import (
+    default_builder,
     SoupTest,
     skipIf,
 )
@@ -54,7 +55,72 @@
         soup = self.soup(utf8_data, exclude_encodings=["utf-8"])
         self.assertEqual("windows-1252", soup.original_encoding)
 
+    def test_custom_builder_class(self):
+        # Verify that you can pass in a custom Builder class and
+        # it'll be instantiated with the appropriate keyword arguments.
+        class Mock(object):
+            def __init__(self, **kwargs):
+                self.called_with = kwargs
+                self.is_xml = True
+            def initialize_soup(self, soup):
+                pass
+            def prepare_markup(self, *args, **kwargs):
+                return ''
+                
+        kwargs = dict(
+            var="value",
+            # This is a deprecated BS3-era keyword argument, which
+            # will be stripped out.
+            convertEntities=True,
+        )
+        with warnings.catch_warnings(record=True):
+            soup = BeautifulSoup('', builder=Mock, **kwargs)
+        assert isinstance(soup.builder, Mock)
+        self.assertEqual(dict(var="value"), soup.builder.called_with)
+
+        # You can also instantiate the TreeBuilder yourself. In this
+        # case, that specific object is used and any keyword arguments
+        # to the BeautifulSoup constructor are ignored.
+        builder = Mock(**kwargs)
+        with warnings.catch_warnings(record=True) as w:
+            soup = BeautifulSoup(
+                '', builder=builder, ignored_value=True,
+            )
+        msg = str(w[0].message)
+        assert msg.startswith("Keyword arguments to the BeautifulSoup constructor will be ignored.")
+        self.assertEqual(builder, soup.builder)
+        self.assertEqual(kwargs, builder.called_with)
+
+    def test_cdata_list_attributes(self):
+        # Most attribute values are represented as scalars, but the
+        # HTML standard says that some attributes, like 'class' have
+        # space-separated lists as values.
+        markup = '<a id=" an id " class=" a class "></a>'
+        soup = self.soup(markup)
+
+        # Note that the spaces are stripped for 'class' but not for 'id'.
+        a = soup.a
+        self.assertEqual(" an id ", a['id'])
+        self.assertEqual(["a", "class"], a['class'])
+
+        # TreeBuilder takes an argument called 'multi_valued_attributes' which lets
+        # you customize or disable this. As always, you can customize the TreeBuilder
+        # by passing in a keyword argument to the BeautifulSoup constructor.
+        soup = self.soup(markup, builder=default_builder, multi_valued_attributes=None)
+        self.assertEqual(" a class ", soup.a['class'])
+
+        # Here are two ways of saying that `id` is a multi-valued
+        # attribute in this context, but 'class' is not.
+        for switcheroo in ({'*': 'id'}, {'a': 'id'}):
+            with warnings.catch_warnings(record=True) as w:
+                # This will create a warning about not explicitly
+                # specifying a parser, but we'll ignore it.
+                soup = self.soup(markup, builder=None, multi_valued_attributes=switcheroo)
+            a = soup.a
+            self.assertEqual(["an", "id"], a['id'])
+            self.assertEqual(" a class ", a['class'])
 
+        
 class TestWarnings(SoupTest):
 
     def _no_parser_specified(self, s, is_there=True):
@@ -217,7 +283,7 @@
         self.assertEqual(
             self.sub.substitute_xml_containing_entities("&Aacute;T&T"),
             "&Aacute;T&amp;T")
-
+       
     def test_quotes_not_html_substituted(self):
         """There's no need to do this except inside attribute values."""
         text = 'Bob\'s "bar"'
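Note on test_cdata_list_attributes above: the `multi_valued_attributes` mapping is keyed by tag name, with `'*'` meaning "any tag", and lists the attributes whose values should be split on whitespace. The sketch below models that rule on plain dicts. It is a simplified illustration rather than the TreeBuilder code: `split_multi_valued` is a hypothetical helper, and treating a bare string such as `'id'` as a single attribute name is an assumption made here to match the `{'*': 'id'}` form used in the test.

```python
def split_multi_valued(tag_name, attrs, multi_valued_attributes):
    """Return a copy of attrs with the configured attributes split into lists."""
    if not multi_valued_attributes:
        # None (or an empty mapping) disables splitting entirely.
        return dict(attrs)

    def names(key):
        value = multi_valued_attributes.get(key, ())
        # Assumption: a bare string names exactly one attribute.
        return {value} if isinstance(value, str) else set(value)

    # Attributes that apply to every tag, plus those specific to this tag.
    listy = names('*') | names(tag_name.lower())
    return {
        key: value.split() if key in listy else value
        for key, value in attrs.items()
    }


attrs = {'id': ' an id ', 'class': ' a class '}

default_like = split_multi_valued('a', attrs, {'*': ['class']})
disabled = split_multi_valued('a', attrs, None)
id_is_multi = split_multi_valued('a', attrs, {'a': 'id'})
```

As in the test, splitting also strips the surrounding whitespace from the affected attribute, while attributes that are not configured as multi-valued keep their value verbatim.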
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py     2019-01-07 00:46:26.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py     2019-07-16 22:46:05.000000000 +0200
@@ -25,6 +25,7 @@
     Comment,
     Declaration,
     Doctype,
+    Formatter,
     NavigableString,
     SoupStrainer,
     Tag,
@@ -416,6 +417,48 @@
         self.assertEqual([], soup.find_all(id=1, text="bar"))
 
 
+class TestSmooth(TreeTest):
+    """Test Tag.smooth."""
+
+    def test_smooth(self):
+        soup = self.soup("<div>a</div>")
+        div = soup.div
+        div.append("b")
+        div.append("c")
+        div.append(Comment("Comment 1"))
+        div.append(Comment("Comment 2"))
+        div.append("d")
+        builder = self.default_builder()
+        span = Tag(soup, builder, 'span')
+        span.append('1')
+        span.append('2')
+        div.append(span)
+
+        # At this point the tree has a bunch of adjacent
+        # NavigableStrings. This is normal, but it has no meaning in
+        # terms of HTML, so we may want to smooth things out for
+        # output.
+
+        # Since the <span> tag has two children, its .string is None.
+        self.assertEquals(None, div.span.string)
+
+        self.assertEqual(7, len(div.contents))
+        div.smooth()
+        self.assertEqual(5, len(div.contents))
+
+        # The three strings at the beginning of div.contents have been
+        # merged into one string.
+        #
+        self.assertEqual('abc', div.contents[0])
+
+        # The call is recursive -- the <span> tag was also smoothed.
+        self.assertEqual('12', div.span.string)
+
+        # The two comments have _not_ been merged, even though
+        # comments are strings. Merging comments would change the
+        # meaning of the HTML.
+        self.assertEqual('Comment 1', div.contents[1])
+        self.assertEqual('Comment 2', div.contents[2])
 
 
 class TestIndex(TreeTest):
@@ -896,7 +939,7 @@
         self.assertEqual(soup.a.contents[0].next_element, "bar")
 
     def test_insert_tag(self):
-        builder = self.default_builder
+        builder = self.default_builder()
         soup = self.soup(
             "<a><b>Find</b><c>lady!</c><d></d></a>", builder=builder)
         magic_tag = Tag(soup, builder, 'magictag')
@@ -1532,7 +1575,7 @@
         # callable is called on every string.
         self.assertEqual(
             decoded,
-            self.document_for(u"<b><FOO></b><b>BAR</b><br>"))
+            self.document_for(u"<b><FOO></b><b>BAR</b><br/>"))
 
     def test_formatter_is_run_on_attribute_values(self):
         markup = u'<a href="http://a.com?a=b&c=é";>e</a>'
@@ -1570,11 +1613,11 @@
         self.assertTrue(b"< < hey > >" in encoded)
 
     def test_prettify_leaves_preformatted_text_alone(self):
-        soup = self.soup("<div>  foo  <pre>  \tbar\n  \n  </pre>  baz  ")
+        soup = self.soup("<div>  foo  <pre>  \tbar\n  \n  </pre>  baz  <textarea> eee\nfff\t</textarea></div>")
         # Everything outside the <pre> tag is reformatted, but everything
         # inside is left alone.
         self.assertEqual(
-            u'<div>\n foo\n <pre>  \tbar\n  \n  </pre>\n baz\n</div>',
+            u'<div>\n foo\n <pre>  \tbar\n  \n  </pre>\n baz\n <textarea> eee\nfff\t</textarea>\n</div>',
             soup.div.prettify())
 
     def test_prettify_accepts_formatter_function(self):
@@ -1683,6 +1726,29 @@
         else:
             self.assertEqual(b'<b>\\u2603</b>', repr(soup))
 
+class TestFormatter(SoupTest):
+
+    def test_sort_attributes(self):
+        # Test the ability to override Formatter.attributes() to,
+        # e.g., disable the normal sorting of attributes.
+        class UnsortedFormatter(Formatter):
+            def attributes(self, tag):
+                self.called_with = tag
+                for k, v in sorted(tag.attrs.items()):
+                    if k == 'ignore':
+                        continue
+                    yield k,v
+
+        soup = self.soup('<p cval="1" aval="2" ignore="ignored"></p>')
+        formatter = UnsortedFormatter()
+        decoded = soup.decode(formatter=formatter)
+
+        # attributes() was called on the <p> tag. It filtered out one
+        # attribute and sorted the other two.
+        self.assertEquals(formatter.called_with, soup.p)
+        self.assertEquals(u'<p aval="2" cval="1"></p>', decoded)
+
+
 class TestNavigableStringSubclasses(SoupTest):
 
     def test_cdata(self):
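Note on TestFormatter above: Tag.decode() now asks the formatter for the (key, value) pairs to emit, so ordering and filtering live in one overridable hook. A minimal sketch of that loop on plain dicts follows. `render_start_tag`, `sorted_attributes` and `unsorted_without_m` are hypothetical names, and quoting is simplified to `html.escape` with double quotes, which is an assumption and not what bs4's quoted_attribute_value does in every case.

```python
from html import escape


def sorted_attributes(attrs):
    # Default hook: emit attributes sorted by name, as Formatter.attributes() does.
    return sorted(attrs.items())


def unsorted_without_m(attrs):
    # Custom hook: keep the stored order and drop the attribute named "m".
    for key, val in attrs.items():
        if key == "m":
            continue
        yield key, val


def render_start_tag(name, attrs, attributes=sorted_attributes):
    """Render an opening tag, delegating attribute order to a hook."""
    parts = [name]
    for key, val in attributes(attrs):
        if val is None:
            # A valueless attribute is written as the bare key.
            parts.append(key)
            continue
        if isinstance(val, (list, tuple)):
            # Multi-valued attributes are joined back with spaces.
            val = " ".join(val)
        elif not isinstance(val, str):
            val = str(val)
        parts.append('%s="%s"' % (key, escape(val, quote=True)))
    return "<%s>" % " ".join(parts)


attrs = {"z": "1", "m": "2", "a": "3"}
default_output = render_start_tag("p", attrs)
custom_output = render_start_tag("p", attrs, attributes=unsorted_without_m)
```

Because the hook may be a generator, as in UnsortedFormatter above, the rendering loop only iterates over its result and never indexes into it.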
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/doc/source/index.rst new/beautifulsoup4-4.8.0/doc/source/index.rst
--- old/beautifulsoup4-4.7.1/doc/source/index.rst       2018-12-31 17:52:43.000000000 +0100
+++ new/beautifulsoup4-4.8.0/doc/source/index.rst       2019-07-17 03:31:27.000000000 +0200
@@ -31,7 +31,7 @@
 
 * `这篇文档当然还有中文版. <http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/>`_
 * このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_)
-* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
+* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <https://web.archive.org/web/20150319200824/http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
 
 Getting help
 ------------
@@ -266,9 +266,9 @@
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 | Parser               | Typical usage                              | Advantages                     | Disadvantages            |
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
-| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
-|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
-|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
+| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not as fast as lxml,   |
+|                      |                                            | * Decent speed                 |   less lenient than      |
+|                      |                                            | * Lenient (As of Python 2.7.3  |   html5lib.              |
 |                      |                                            |   and 3.2.)                    |                          |
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 | lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
@@ -428,8 +428,15 @@
  print(rel_soup.p)
  # <p>Back to the <a rel="index contents">homepage</a></p>
 
-You can use ```get_attribute_list`` to get a value that's always a list,
-string, whether or not it's a multi-valued atribute
+You can disable this by passing ``multi_valued_attributes=None`` as a
+keyword argument into the ``BeautifulSoup`` constructor::
+
+  no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
+  no_list_soup.p['class']
+  # u'body strikeout'
+
+You can use ``get_attribute_list`` to get a value that's always a
+list, whether or not it's a multi-valued attribute::
 
   id_soup.p.get_attribute_list('id')
   # ["my id"]
@@ -440,8 +447,20 @@
  xml_soup.p['class']
  # u'body strikeout'
 
+Again, you can configure this using the ``multi_valued_attributes`` argument::
+
+  class_is_multi= { '*' : 'class'}
+  xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
+  xml_soup.p['class']
+  # [u'body', u'strikeout']
 
+You probably won't need to do this, but if you do, use the defaults as
+a guide. They implement the rules described in the HTML specification::
 
+  from bs4.builder import builder_registry
+  builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
+
+  
 ``NavigableString``
 -------------------
 
@@ -2093,6 +2112,40 @@
 Like ``replace_with()``, ``unwrap()`` returns the tag
 that was replaced.
 
+``smooth()``
+---------------------------
+
+After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following::
+
+  soup = BeautifulSoup("<p>A one</p>")
+  soup.p.append(", a two")
+
+  soup.p.contents
+  # [u'A one', u', a two']
+
+  print(soup.p.encode())
+  # <p>A one, a two</p>
+
+  print(soup.p.prettify())
+  # <p>
+  #  A one
+  #  , a two
+  # </p>
+
+You can call ``Tag.smooth()`` to clean up the parse tree by consolidating adjacent strings::
+
+ soup.smooth()
+
+ soup.p.contents
+ # [u'A one, a two']
+
+ print(soup.p.prettify())
+ # <p>
+ #  A one, a two
+ # </p>
+
+The ``smooth()`` method is new in Beautiful Soup 4.8.0.
+
 Output
 ======
 
@@ -2103,7 +2156,7 @@
 
 The ``prettify()`` method will turn a Beautiful Soup parse tree into a
 nicely formatted Unicode string, with a separate line for each
-tag and each string:
+tag and each string::
 
   markup = '<a href="http://example.com/";>I linked to <i>example.com</i></a>'
   soup = BeautifulSoup(markup)
@@ -2216,7 +2269,7 @@
  #  </body>
  # </html>
 
- If you pass in ``formatter="html5"``, it's the same as
+If you pass in ``formatter="html5"``, it's the same as
 ``formatter="html5"``, but Beautiful Soup will
 omit the closing slash in HTML void tags like "br"::
 
@@ -2245,16 +2298,17 @@
  print(link_soup.a.encode(formatter=None))
  # <a href="http://example.com/?foo=val1&bar=val2";>A link</a>
 
-Finally, if you pass in a function for ``formatter``, Beautiful Soup
-will call that function once for every string and attribute value in
-the document. You can do whatever you want in this function. Here's a
-formatter that converts strings to uppercase and does absolutely
-nothing else::
+If you need more sophisticated control over your output, you can
+use Beautiful Soup's ``Formatter`` class. Here's a formatter that
+converts strings to uppercase, whether they occur in a text node or in an
+attribute value::
 
+ from bs4.formatter import HTMLFormatter
  def uppercase(str):
      return str.upper()
+ formatter = HTMLFormatter(uppercase)
 
- print(soup.prettify(formatter=uppercase))
+ print(soup.prettify(formatter=formatter))
  # <html>
  #  <body>
  #   <p>
@@ -2263,34 +2317,31 @@
  #  </body>
  # </html>
 
- print(link_soup.a.prettify(formatter=uppercase))
+ print(link_soup.a.prettify(formatter=formatter))
  # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
  #  A LINK
  # </a>
 
-If you're writing your own function, you should know about the
-``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
-implements Beautiful Soup's standard formatters as class methods: the
-"html" formatter is ``EntitySubstitution.substitute_html``, and the
-"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
-use these functions to simulate ``formatter=html`` or
-``formatter==minimal``, but then do something extra.
-
-Here's an example that replaces Unicode characters with HTML entities
-whenever possible, but `also` converts all strings to uppercase::
-
- from bs4.dammit import EntitySubstitution
- def uppercase_and_substitute_html_entities(str):
-     return EntitySubstitution.substitute_html(str.upper())
-
- print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
- # <html>
- #  <body>
- #   <p>
- #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
- #   </p>
- #  </body>
- # </html>
+Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even
+more control over the output. For example, Beautiful Soup sorts the
+attributes in every tag by default::
+
+ attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>')
+ print(attr_soup.p.encode())
+ # <p a="3" m="2" z="1"></p>
+
+To turn this off, you can override the ``Formatter.attributes()``
+method, which controls which attributes are output and in what
+order. This implementation also filters out one of the attributes::
+
+ class UnsortedAttributes(HTMLFormatter):
+     def attributes(self, tag):
+         for k, v in tag.attrs.items():
+             if k == 'm':
+                 continue
+             yield k, v
+ print(attr_soup.p.encode(formatter=UnsortedAttributes()))
+ # <p z="1" a="3"></p>
 
 One last caveat: if you create a ``CData`` object, the text inside
 that object is always presented `exactly as it appears, with no
@@ -3097,6 +3148,7 @@
 * ``findPrevious`` -> ``find_previous``
 * ``findPreviousSibling`` -> ``find_previous_sibling``
 * ``findPreviousSiblings`` -> ``find_previous_siblings``
+* ``getText`` -> ``get_text``
 * ``nextSibling`` -> ``next_sibling``
 * ``previousSibling`` -> ``previous_sibling``
 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/setup.py 
new/beautifulsoup4-4.8.0/setup.py
--- old/beautifulsoup4-4.7.1/setup.py   2019-01-07 01:47:07.000000000 +0100
+++ new/beautifulsoup4-4.8.0/setup.py   2019-07-20 01:50:29.000000000 +0200
@@ -8,7 +8,7 @@
 
 setup(
     name="beautifulsoup4",
-    version = "4.7.1",
+    version = "4.8.0",
     author="Leonard Richardson",
     author_email='leona...@segfault.org',
     url="http://www.crummy.com/software/BeautifulSoup/bs4/",
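
For anyone who wants to try the attribute-ordering change from the doc diff
above, here is a minimal standalone sketch. It assumes beautifulsoup4 >= 4.8.0
is installed and uses the stdlib "html.parser" backend. Passing
EntitySubstitution.substitute_xml explicitly to the formatter, and calling
decode() rather than encode() to get a str, are choices made for this sketch
and are not taken from the diff.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


class UnsortedAttributes(HTMLFormatter):
    """Emit attributes in document order and drop the 'm' attribute."""

    def attributes(self, tag):
        # tag.attrs is a dict that keeps the order found in the markup.
        for key, value in tag.attrs.items():
            if key == 'm':
                continue
            yield key, value


# Parser named explicitly so the result does not depend on what is installed.
attr_soup = BeautifulSoup('<p z="1" m="2" a="3"></p>', 'html.parser')

# Default formatter: attributes come out sorted by name.
default_output = attr_soup.p.decode()

# Custom formatter: original order, 'm' filtered out.
# Assumption of this sketch: the entity substitution function is passed
# explicitly (the "minimal" one) instead of relying on a default.
formatter = UnsortedAttributes(
    entity_substitution=EntitySubstitution.substitute_xml)
unsorted_output = attr_soup.p.decode(formatter=formatter)

print(default_output)   # <p a="3" m="2" z="1"></p>
print(unsorted_output)  # <p z="1" a="3"></p>
```

The same formatter object can be passed to prettify() or encode(), as the
documentation hunk above does.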
