Hello community,
here is the log from the commit of package python-beautifulsoup4 for
openSUSE:Factory checked in at 2019-07-30 13:05:12
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
and /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-beautifulsoup4"
Tue Jul 30 13:05:12 2019 rev:29 rq:717648 version:4.8.0
Changes:
--------
---
/work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes
2019-03-04 09:11:05.132700786 +0100
+++
/work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126/python-beautifulsoup4.changes
2019-07-30 13:05:15.146390127 +0200
@@ -1,0 +2,23 @@
+Mon Jul 22 16:18:23 UTC 2019 - Todd R <[email protected]>
+
+- Update to 4.8.0
+ * It's now possible to customize the TreeBuilder object by passing
+ keyword arguments into the BeautifulSoup constructor. The main
+ reason to do this right now is to change which attributes are
+ treated as multi-valued attributes (the way 'class' is treated by
+ default). You can do this with the `multi_valued_attributes` argument.
+ * The role of Formatter objects has been greatly expanded. The Formatter
+ class now controls the following:
+ > The function to call to perform entity substitution. (This was
+ previously Formatter's only job.)
+ > Which tags should be treated as containing CDATA and have their
+ contents exempt from entity substitution.
+ > The order in which a tag's attributes are output.
+ > Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+ All preexisting code should work as before.
+ * Added a new method to the API, Tag.smooth(), which consolidates
+ multiple adjacent NavigableString elements.
+ * &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+ recognized as a named entity and converted to a single quote.
+
+-------------------------------------------------------------------
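
The Tag.smooth() entry above can be illustrated without Beautiful Soup itself. The following is only a sketch of the mark-then-merge approach visible in the bs4/element.py part of this diff, under these assumptions: plain `str` items stand in for NavigableString, any other item stands in for a Tag, and the recursion into child tags and the PreformattedString exclusion of the real method are left out. The function name `smooth` and the sample list are illustrative only.

```python
def smooth(contents):
    """Merge runs of adjacent strings in a list, in place.

    Stand-in for the tree logic: plain `str` items play the role of
    NavigableString, and any other item plays the role of a Tag.
    """
    # Mark the first index of every adjacent pair of strings.
    marked = [
        i for i in range(len(contents) - 1)
        if isinstance(contents[i], str) and isinstance(contents[i + 1], str)
    ]
    # Merge from the end so that earlier marked indexes stay valid.
    for i in reversed(marked):
        contents[i:i + 2] = [contents[i] + contents[i + 1]]
    return contents


children = ["a", "b", "c", ("tag",), "d", "e"]
smooth(children)
print(children)
```

Working from the last marked position backwards is the point of the design: removing an item only shifts the positions after it, so the positions still to be processed are unaffected.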
Old:
----
beautifulsoup4-4.7.1.tar.gz
New:
----
beautifulsoup4-4.8.0.tar.gz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.SDgoFB/_old 2019-07-30 13:05:15.818389950 +0200
+++ /var/tmp/diff_new_pack.SDgoFB/_new 2019-07-30 13:05:15.822389949 +0200
@@ -18,7 +18,7 @@
%{?!python_module:%define python_module() python-%{**} python3-%{**}}
Name: python-beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
Release: 0
Summary: HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
License: MIT
++++++ beautifulsoup4-4.7.1.tar.gz -> beautifulsoup4-4.8.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/NEWS.txt
new/beautifulsoup4-4.8.0/NEWS.txt
--- old/beautifulsoup4-4.7.1/NEWS.txt 2019-01-07 01:36:52.000000000 +0100
+++ new/beautifulsoup4-4.8.0/NEWS.txt 2019-07-20 01:41:41.000000000 +0200
@@ -1,3 +1,30 @@
+= 4.8.0 (20190720, "One Small Soup")
+
+* It's now possible to customize the TreeBuilder object by passing
+ keyword arguments into the BeautifulSoup constructor. The main
+ reason to do this right now is to change which attributes are
+ treated as multi-valued attributes (the way 'class' is treated by
+ default). You can do this with the `multi_valued_attributes` argument.
+ [bug=1832978]
+
+* The role of Formatter objects has been greatly expanded. The Formatter
+ class now controls the following:
+
+ - The function to call to perform entity substitution. (This was
+ previously Formatter's only job.)
+ - Which tags should be treated as containing CDATA and have their
+ contents exempt from entity substitution.
+ - The order in which a tag's attributes are output. [bug=1812422]
+ - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+
+ All preexisting code should work as before.
+
+* Added a new method to the API, Tag.smooth(), which consolidates
+ multiple adjacent NavigableString elements.
+
+* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+ recognized as a named entity and converted to a single quote. [bug=1818721]
+
= 4.7.1 (20190106)
* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/PKG-INFO
new/beautifulsoup4-4.8.0/PKG-INFO
--- old/beautifulsoup4-4.7.1/PKG-INFO 2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/PKG-INFO 2019-07-20 13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO
new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO 2019-01-07
01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO 2019-07-20
13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt
new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt
2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt
2019-07-20 13:29:22.000000000 +0200
@@ -17,6 +17,7 @@
bs4/dammit.py
bs4/diagnose.py
bs4/element.py
+bs4/formatter.py
bs4/testing.py
bs4/builder/__init__.py
bs4/builder/_html5lib.py
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/__init__.py
new/beautifulsoup4-4.8.0/bs4/__init__.py
--- old/beautifulsoup4-4.7.1/bs4/__init__.py 2019-01-07 01:50:44.000000000
+0100
+++ new/beautifulsoup4-4.8.0/bs4/__init__.py 2019-07-17 03:31:51.000000000
+0200
@@ -18,7 +18,7 @@
"""
__author__ = "Leonard Richardson ([email protected])"
-__version__ = "4.7.1"
+__version__ = "4.8.0"
__copyright__ = "Copyright (c) 2004-2019 Leonard Richardson"
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
@@ -98,8 +98,10 @@
name a specific parser, so that Beautiful Soup gives you the
same results across platforms and virtual environments.
- :param builder: A specific TreeBuilder to use instead of looking one
- up based on `features`. You shouldn't need to use this.
+ :param builder: A TreeBuilder subclass to instantiate (or
+ instance to use) instead of looking one up based on
+ `features`. You only need to use this if you've implemented a
+ custom TreeBuilder.
:param parse_only: A SoupStrainer. Only parts of the document
matching the SoupStrainer will be considered. This is useful
@@ -118,11 +120,17 @@
:param kwargs: For backwards compatibility purposes, the
constructor accepts certain keyword arguments used in
Beautiful Soup 3. None of these arguments do anything in
- Beautiful Soup 4 and there's no need to actually pass keyword
- arguments into the constructor.
+ Beautiful Soup 4; they will result in a warning and then be ignored.
+
+ Apart from this, any keyword arguments passed into the BeautifulSoup
+ constructor are propagated to the TreeBuilder constructor. This
+ makes it possible to configure a TreeBuilder beyond saying
+ which one to use.
+
"""
if 'convertEntities' in kwargs:
+ del kwargs['convertEntities']
warnings.warn(
"BS4 does not respect the convertEntities argument to the "
"BeautifulSoup constructor. Entities are always converted "
@@ -177,13 +185,17 @@
warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
from_encoding = None
- if len(kwargs) > 0:
- arg = kwargs.keys().pop()
- raise TypeError(
- "__init__() got an unexpected keyword argument '%s'" % arg)
-
- if builder is None:
- original_features = features
+ # We need this information to track whether or not the builder
+ # was specified well enough that we can omit the 'you need to
+ # specify a parser' warning.
+ original_builder = builder
+ original_features = features
+
+ if isinstance(builder, type):
+ # A builder class was passed in; it needs to be instantiated.
+ builder_class = builder
+ builder = None
+ elif builder is None:
if isinstance(features, basestring):
features = [features]
if features is None or len(features) == 0:
@@ -194,9 +206,16 @@
"Couldn't find a tree builder with the features you "
"requested: %s. Do you need to install a parser library?"
% ",".join(features))
- builder = builder_class()
- if not (original_features == builder.NAME or
- original_features in builder.ALTERNATE_NAMES):
+
+ # At this point either we have a TreeBuilder instance in
+ # builder, or we have a builder_class that we can instantiate
+ # with the remaining **kwargs.
+ if builder is None:
+ builder = builder_class(**kwargs)
+ if not original_builder and not (
+ original_features == builder.NAME or
+ original_features in builder.ALTERNATE_NAMES
+ ):
if builder.is_xml:
markup_type = "XML"
else:
@@ -231,7 +250,10 @@
markup_type=markup_type
)
warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2)
-
+ else:
+ if kwargs:
+ warnings.warn("Keyword arguments to the BeautifulSoup constructor will be ignored. These would normally be passed into the TreeBuilder constructor, but a TreeBuilder instance was passed in as `builder`.")
+
self.builder = builder
self.is_xml = builder.is_xml
self.known_xml = self.is_xml
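
The builder handling changed in the hunks above can be summarised in a small standalone sketch. This is not the Beautiful Soup code: `SketchBuilder` and `resolve_builder` are hypothetical names, the feature-based lookup that runs when `builder` is None is left out, and the warning text is shortened. It only shows the class-versus-instance decision and where the remaining keyword arguments go.

```python
import warnings


class SketchBuilder:
    """Hypothetical stand-in for a TreeBuilder subclass."""

    def __init__(self, **options):
        self.options = options


def resolve_builder(builder, **kwargs):
    """Return a builder instance, instantiating a class if one was given."""
    if isinstance(builder, type):
        # A class was passed in: instantiate it with the leftover kwargs.
        return builder(**kwargs)
    if kwargs:
        # An instance already exists, so the kwargs have nowhere to go.
        warnings.warn("Keyword arguments will be ignored.")
    return builder


from_class = resolve_builder(SketchBuilder, multi_valued_attributes=None)
existing = SketchBuilder()
from_instance = resolve_builder(existing)
```

Passing a class lets the keyword arguments reach the builder's constructor; passing an instance means the builder is already configured, which is why the diff warns instead of raising.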
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/__init__.py
new/beautifulsoup4-4.8.0/bs4/builder/__init__.py
--- old/beautifulsoup4-4.7.1/bs4/builder/__init__.py 2018-12-31
02:49:46.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/__init__.py 2019-07-14
22:16:04.000000000 +0200
@@ -7,7 +7,6 @@
from bs4.element import (
CharsetMetaAttributeValue,
ContentMetaAttributeValue,
- HTMLAwareEntitySubstitution,
nonwhitespace_re
)
@@ -90,18 +89,40 @@
is_xml = False
picklable = False
- preserve_whitespace_tags = set()
empty_element_tags = None # A tag will be considered an empty-element
# tag when and only when it has no contents.
# A value for these tag/attribute combinations is a space- or
# comma-separated list of CDATA, rather than a single CDATA.
- cdata_list_attributes = {}
+ DEFAULT_CDATA_LIST_ATTRIBUTES = {}
+ DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
+
+ USE_DEFAULT = object()
+
+ def __init__(self, multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT):
+ """Constructor.
- def __init__(self):
- self.soup = None
+ :param multi_valued_attributes: If this is set to None, the
+ TreeBuilder will not turn any values for attributes like
+ 'class' into lists. Setting this to a dictionary will
+ customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
+ for an example.
+
+ Internally, these are called "CDATA list attributes", but that
+ probably doesn't make sense to an end-user, so the argument name
+ is `multi_valued_attributes`.
+ :param preserve_whitespace_tags:
+ """
+ self.soup = None
+ if multi_valued_attributes is self.USE_DEFAULT:
+ multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
+ self.cdata_list_attributes = multi_valued_attributes
+ if preserve_whitespace_tags is self.USE_DEFAULT:
+ preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
+ self.preserve_whitespace_tags = preserve_whitespace_tags
+
def initialize_soup(self, soup):
"""The BeautifulSoup object has been initialized and is now
being associated with the TreeBuilder.
@@ -131,7 +152,7 @@
if self.empty_element_tags is None:
return True
return tag_name in self.empty_element_tags
-
+
def feed(self, markup):
raise NotImplementedError()
@@ -237,7 +258,6 @@
Such as which tags are empty-element tags.
"""
- preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
empty_element_tags = set([
# These are from HTML5.
'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
@@ -259,7 +279,7 @@
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
- cdata_list_attributes = {
+ DEFAULT_CDATA_LIST_ATTRIBUTES = {
"*" : ['class', 'accesskey', 'dropzone'],
"a" : ['rel', 'rev'],
"link" : ['rel', 'rev'],
@@ -276,6 +296,8 @@
"output" : ["for"],
}
+ DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
+
def set_up_substitutions(self, tag):
# We are only interested in <meta> tags
if tag.name != 'meta':
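
The USE_DEFAULT marker introduced in the TreeBuilder constructor above exists because None is itself a meaningful value for `multi_valued_attributes` (it turns list handling off). A minimal sketch of that sentinel pattern follows; `SketchTreeBuilder` and its one-entry default dictionary are illustrative stand-ins, not the real class.

```python
class SketchTreeBuilder:
    """Illustrative stand-in showing the sentinel-default pattern."""

    DEFAULT_CDATA_LIST_ATTRIBUTES = {"*": ["class"]}
    USE_DEFAULT = object()  # unique marker meaning "argument not supplied"

    def __init__(self, multi_valued_attributes=USE_DEFAULT):
        # An identity check separates "not supplied" from an explicit None.
        if multi_valued_attributes is self.USE_DEFAULT:
            multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
        self.cdata_list_attributes = multi_valued_attributes


default_builder = SketchTreeBuilder()
disabled_builder = SketchTreeBuilder(multi_valued_attributes=None)
custom_builder = SketchTreeBuilder(multi_valued_attributes={"*": ["id"]})
```

With None as the default, a caller could not ask for "no multi-valued attributes at all" without that request being mistaken for "use the defaults".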
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py
new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py 2018-12-31
02:50:27.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py 2019-07-08
03:59:55.000000000 +0200
@@ -199,7 +199,7 @@
def __setitem__(self, name, value):
# If this attribute is a multi-valued attribute for this element,
# turn its value into a list.
- list_attr = HTML5TreeBuilder.cdata_list_attributes
+ list_attr = self.element.cdata_list_attributes
if (name in list_attr['*']
or (self.element.name in list_attr
and name in list_attr[self.element.name])):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py
new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py 2018-12-24
16:32:39.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py 2019-07-07
20:09:37.000000000 +0200
@@ -214,12 +214,15 @@
NAME = HTMLPARSER
features = [NAME, HTML, STRICT]
- def __init__(self, *args, **kwargs):
+ def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):
+ super(HTMLParserTreeBuilder, self).__init__(**kwargs)
+ parser_args = parser_args or []
+ parser_kwargs = parser_kwargs or {}
if CONSTRUCTOR_TAKES_STRICT and not CONSTRUCTOR_STRICT_IS_DEPRECATED:
- kwargs['strict'] = False
+ parser_kwargs['strict'] = False
if CONSTRUCTOR_TAKES_CONVERT_CHARREFS:
- kwargs['convert_charrefs'] = False
- self.parser_args = (args, kwargs)
+ parser_kwargs['convert_charrefs'] = False
+ self.parser_args = (parser_args, parser_kwargs)
def prepare_markup(self, markup, user_specified_encoding=None,
document_declared_encoding=None,
exclude_encodings=None):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py
new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py
--- old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py 2019-01-07
00:41:32.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py 2019-07-08
03:59:55.000000000 +0200
@@ -94,7 +94,7 @@
parser = parser(target=self, strip_cdata=False, encoding=encoding)
return parser
- def __init__(self, parser=None, empty_element_tags=None):
+ def __init__(self, parser=None, empty_element_tags=None, **kwargs):
# TODO: Issue a warning if parser is present but not a
# callable, since that means there's no way to create new
# parsers for different encodings.
@@ -103,6 +103,7 @@
self.empty_element_tags = set(empty_element_tags)
self.soup = None
self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
+ super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
def _getNsTag(self, tag):
# Split the namespace URL out of a fully-qualified lxml tag
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/dammit.py
new/beautifulsoup4-4.8.0/bs4/dammit.py
--- old/beautifulsoup4-4.7.1/bs4/dammit.py 2018-12-24 16:31:48.000000000
+0100
+++ new/beautifulsoup4-4.8.0/bs4/dammit.py 2019-07-08 03:45:20.000000000
+0200
@@ -57,15 +57,24 @@
lookup = {}
reverse_lookup = {}
characters_for_re = []
- for codepoint, name in list(codepoint2name.items()):
+
+ # &apos is an XHTML entity and an HTML 5, but not an HTML 4
+ # entity. We don't want to use it, but we want to recognize it on the way in.
+ #
+ # TODO: Ideally we would be able to recognize all HTML 5 named
+ # entities, but that's a little tricky.
+ extra = [(39, 'apos')]
+ for codepoint, name in list(codepoint2name.items()) + extra:
character = unichr(codepoint)
- if codepoint != 34:
+ if codepoint not in (34, 39):
# There's no point in turning the quotation mark into
- # &quot;, unless it happens within an attribute value, which
- # is handled elsewhere.
+ # &quot; or the single quote into &apos;, unless it
+ # happens within an attribute value, which is handled
+ # elsewhere.
characters_for_re.append(character)
lookup[character] = name
- # But we do want to turn &quot; into the quotation mark.
+ # But we do want to recognize those entities on the way in and
+ # convert them to Unicode characters.
reverse_lookup[name] = character
re_definition = "[%s]" % "".join(characters_for_re)
return lookup, reverse_lookup, re.compile(re_definition)
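
The dammit.py change above can be reproduced with the standard library alone. The sketch below is an approximation under stated assumptions: it uses Python 3's `html.entities.codepoint2name` (the HTML 4 entity table) and `chr` in place of the Python 2 `unichr` seen in the diff, and `build_lookups` is an illustrative name. It shows that `apos` is accepted on the way in (reverse lookup) but never produced on the way out (forward lookup and regular expression).

```python
import re
from html.entities import codepoint2name


def build_lookups():
    """Build character->entity and entity->character tables."""
    lookup = {}
    reverse_lookup = {}
    characters_for_re = []
    # 'apos' is not in the HTML 4 table, so add it by hand.
    extra = [(39, "apos")]
    for codepoint, name in list(codepoint2name.items()) + extra:
        character = chr(codepoint)
        if codepoint not in (34, 39):
            # Quotes are only escaped inside attribute values, which is
            # handled elsewhere, so keep them out of the output tables.
            characters_for_re.append(character)
            lookup[character] = name
        # Every entity name, quotes included, is recognized on input.
        reverse_lookup[name] = character
    pattern = re.compile("[%s]" % "".join(characters_for_re))
    return lookup, reverse_lookup, pattern


lookup, reverse_lookup, pattern = build_lookups()
```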
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/element.py
new/beautifulsoup4-4.8.0/bs4/element.py
--- old/beautifulsoup4-4.7.1/bs4/element.py 2019-01-07 01:35:05.000000000
+0100
+++ new/beautifulsoup4-4.8.0/bs4/element.py 2019-07-16 22:46:05.000000000
+0200
@@ -16,7 +16,11 @@
'The soupsieve package is not installed. CSS selectors cannot be used.'
)
-from bs4.dammit import EntitySubstitution
+from bs4.formatter import (
+ Formatter,
+ HTMLFormatter,
+ XMLFormatter,
+)
DEFAULT_OUTPUT_ENCODING = "utf-8"
PY3K = (sys.version_info[0] > 2)
@@ -99,138 +103,71 @@
return match.group(1) + encoding
return self.CHARSET_RE.sub(rewrite, self.original_value)
-class HTMLAwareEntitySubstitution(EntitySubstitution):
-
- """Entity substitution rules that are aware of some HTML quirks.
-
- Specifically, the contents of <script> and <style> tags should not
- undergo entity substitution.
-
- Incoming NavigableString objects are checked to see if they're the
- direct children of a <script> or <style> tag.
- """
-
- cdata_containing_tags = set(["script", "style"])
-
- preformatted_tags = set(["pre"])
-
- preserve_whitespace_tags = set(['pre', 'textarea'])
-
- @classmethod
- def _substitute_if_appropriate(cls, ns, f):
- if (isinstance(ns, NavigableString)
- and ns.parent is not None
- and ns.parent.name in cls.cdata_containing_tags):
- # Do nothing.
- return ns
- # Substitute.
- return f(ns)
-
- @classmethod
- def substitute_html(cls, ns):
- return cls._substitute_if_appropriate(
- ns, EntitySubstitution.substitute_html)
-
- @classmethod
- def substitute_xml(cls, ns):
- return cls._substitute_if_appropriate(
- ns, EntitySubstitution.substitute_xml)
-
-class Formatter(object):
- """Contains information about how to format a parse tree."""
-
- # By default, represent void elements as <tag/> rather than <tag>
- void_element_close_prefix = '/'
-
- def substitute_entities(self, *args, **kwargs):
- """Transform certain characters into named entities."""
- raise NotImplementedError()
-
-class HTMLFormatter(Formatter):
- """The default HTML formatter."""
- def substitute(self, *args, **kwargs):
- return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
-
-class MinimalHTMLFormatter(Formatter):
- """A minimal HTML formatter."""
- def substitute(self, *args, **kwargs):
- return HTMLAwareEntitySubstitution.substitute_xml(*args, **kwargs)
-
-class HTML5Formatter(HTMLFormatter):
- """An HTML formatter that omits the slash in a void tag."""
- void_element_close_prefix = None
-
-class XMLFormatter(Formatter):
- """Substitute only the essential XML entities."""
- def substitute(self, *args, **kwargs):
- return EntitySubstitution.substitute_xml(*args, **kwargs)
-
-class HTMLXMLFormatter(Formatter):
- """Format XML using HTML rules."""
- def substitute(self, *args, **kwargs):
- return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
-
class PageElement(object):
"""Contains the navigational information for some part of the page
(either a tag or a piece of text)"""
+
+ def setup(self, parent=None, previous_element=None, next_element=None,
+ previous_sibling=None, next_sibling=None):
+ """Sets up the initial relations between this element and
+ other elements."""
+ self.parent = parent
+
+ self.previous_element = previous_element
+ if previous_element is not None:
+ self.previous_element.next_element = self
+
+ self.next_element = next_element
+ if self.next_element is not None:
+ self.next_element.previous_element = self
- # There are five possible values for the "formatter" argument passed in
- # to methods like encode() and prettify():
- #
- # "html" - All Unicode characters with corresponding HTML entities
- # are converted to those entities on output.
- # "html5" - The same as "html", but empty void tags are represented as
- # <tag> rather than <tag/>
- # "minimal" - Bare ampersands and angle brackets are converted to
- # XML entities: &amp; &lt; &gt;
- # None - The null formatter. Unicode characters are never
- # converted to entities. This is not recommended, but it's
- # faster than "minimal".
- # A callable function - it will be called on every string that needs to undergo entity substitution.
- # A Formatter instance - Formatter.substitute(string) will be called on every string that
- # needs to undergo entity substitution.
- #
-
- # In an HTML document, the default "html", "html5", and "minimal"
- # functions will leave the contents of <script> and <style> tags
- # alone. For an XML document, all tags will be given the same
- # treatment.
-
- HTML_FORMATTERS = {
- "html" : HTMLFormatter(),
- "html5" : HTML5Formatter(),
- "minimal" : MinimalHTMLFormatter(),
- None : None
- }
-
- XML_FORMATTERS = {
- "html" : HTMLXMLFormatter(),
- "minimal" : XMLFormatter(),
- None : None
- }
+ self.next_sibling = next_sibling
+ if self.next_sibling is not None:
+ self.next_sibling.previous_sibling = self
- def format_string(self, s, formatter='minimal'):
+ if (previous_sibling is None
+ and self.parent is not None and self.parent.contents):
+ previous_sibling = self.parent.contents[-1]
+
+ self.previous_sibling = previous_sibling
+ if previous_sibling is not None:
+ self.previous_sibling.next_sibling = self
+
+ def format_string(self, s, formatter):
"""Format the given string using the given formatter."""
- if isinstance(formatter, basestring):
- formatter = self._formatter_for_name(formatter)
if formatter is None:
- output = s
- else:
- if isinstance(formatter, Callable):
- # Backwards compatibility -- you used to pass in a formatting method.
- output = formatter(s)
- else:
- output = formatter.substitute(s)
+ return s
+ if not isinstance(formatter, Formatter):
+ formatter = self.formatter_for_name(formatter)
+ output = formatter.substitute(s)
return output
+ def formatter_for_name(self, formatter):
+ """Look up or create a Formatter for the given identifier,
+ if necessary.
+
+ :param formatter: Can be a Formatter object (used as-is), a
+ function (used as the entity substitution hook for an
+ XMLFormatter or HTMLFormatter), or a string (used to look up
+ an XMLFormatter or HTMLFormatter in the appropriate registry).
+ """
+ if isinstance(formatter, Formatter):
+ return formatter
+ if self._is_xml:
+ c = XMLFormatter
+ else:
+ c = HTMLFormatter
+ if callable(formatter):
+ return c(entity_substitution=formatter)
+ return c.REGISTRY[formatter]
+
@property
def _is_xml(self):
"""Is this element part of an XML tree or an HTML tree?
- This is used when mapping a formatter name ("minimal") to an
- appropriate function (one that performs entity-substitution on
- the contents of <script> and <style> tags, or not). It can be
+ This is used in formatter_for_name, when deciding whether an
+ XMLFormatter or HTMLFormatter is more appropriate. It can be
inefficient, but it should be called very rarely.
"""
if self.known_xml is not None:
@@ -248,46 +185,13 @@
return getattr(self, 'is_xml', False)
return self.parent._is_xml
- def _formatter_for_name(self, name):
- "Look up a formatter function based on its name and the tree."
- if self._is_xml:
- return self.XML_FORMATTERS.get(name, XMLFormatter())
- else:
- return self.HTML_FORMATTERS.get(name, HTMLFormatter())
-
- def setup(self, parent=None, previous_element=None, next_element=None,
- previous_sibling=None, next_sibling=None):
- """Sets up the initial relations between this element and
- other elements."""
- self.parent = parent
-
- self.previous_element = previous_element
- if previous_element is not None:
- self.previous_element.next_element = self
-
- self.next_element = next_element
- if self.next_element is not None:
- self.next_element.previous_element = self
-
- self.next_sibling = next_sibling
- if self.next_sibling is not None:
- self.next_sibling.previous_sibling = self
-
- if (previous_sibling is None
- and self.parent is not None and self.parent.contents):
- previous_sibling = self.parent.contents[-1]
-
- self.previous_sibling = previous_sibling
- if previous_sibling is not None:
- self.previous_sibling.next_sibling = self
-
nextSibling = _alias("next_sibling") # BS3
previousSibling = _alias("previous_sibling") # BS3
def replace_with(self, replace_with):
if self.parent is None:
raise ValueError(
- "Cannot replace one element with another when the"
+ "Cannot replace one element with another when the "
"element to be replaced is not part of a tree.")
if replace_with is self:
return
@@ -742,6 +646,7 @@
self.__class__.__name__, attr))
def output_ready(self, formatter="minimal"):
+ """Run the string through the provided formatter."""
output = self.format_string(self, formatter)
return self.PREFIX + output + self.SUFFIX
@@ -760,10 +665,12 @@
but the return value will be ignored.
"""
- def output_ready(self, formatter="minimal"):
- """CData strings are passed into the formatter.
- But the return value is ignored."""
- self.format_string(self, formatter)
+ def output_ready(self, formatter=None):
+ """CData strings are passed into the formatter, purely
+ for any side effects. The return value is ignored.
+ """
+ if formatter is not None:
+ ignore = self.format_string(self, formatter)
return self.PREFIX + self + self.SUFFIX
class CData(PreformattedString):
@@ -831,14 +738,6 @@
self.name = name
self.namespace = namespace
self.prefix = prefix
- if builder is not None:
- preserve_whitespace_tags = builder.preserve_whitespace_tags
- else:
- if is_xml:
- preserve_whitespace_tags = []
- else:
- preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
- self.preserve_whitespace_tags = preserve_whitespace_tags
if attrs is None:
attrs = {}
elif attrs:
@@ -861,12 +760,31 @@
self.setup(parent, previous)
self.hidden = False
- # Set up any substitutions, such as the charset in a META tag.
- if builder is not None:
+ if builder is None:
+ # In the absence of a TreeBuilder, assume this tag is nothing
+ # special.
+ self.can_be_empty_element = False
+ self.cdata_list_attributes = None
+ else:
+ # Set up any substitutions for this tag, such as the charset in a META tag.
builder.set_up_substitutions(self)
+
+ # Ask the TreeBuilder whether this tag might be an empty-element tag.
self.can_be_empty_element = builder.can_be_empty_element(name)
- else:
- self.can_be_empty_element = False
+
+ # Keep track of the list of attributes of this tag that
+ # might need to be treated as a list.
+ #
+ # For performance reasons, we store the whole data structure
+ # rather than asking the question of every tag. Asking would
+ # require building a new data structure every time, and
+ # (unlike can_be_empty_element), we almost never need
+ # to check this.
+ self.cdata_list_attributes = builder.cdata_list_attributes
+
+ # Keep track of the names that might cause this tag to be treated as a
+ # whitespace-preserved tag.
+ self.preserve_whitespace_tags = builder.preserve_whitespace_tags
parserClass = _alias("parser_class") # BS3
@@ -981,6 +899,43 @@
for element in self.contents[:]:
element.extract()
+ def smooth(self):
+ """Smooth out this element's children by consolidating consecutive
strings.
+
+ This makes pretty-printed output look more natural following a
+ lot of operations that modified the tree.
+ """
+ # Mark the first position of every pair of children that need
+ # to be consolidated. Do this rather than making a copy of
+ # self.contents, since in most cases very few strings will be
+ # affected.
+ marked = []
+ for i, a in enumerate(self.contents):
+ if isinstance(a, Tag):
+ # Recursively smooth children.
+ a.smooth()
+ if i == len(self.contents)-1:
+ # This is the last item in .contents, and it's not a
+ # tag. There's no chance it needs any work.
+ continue
+ b = self.contents[i+1]
+ if (isinstance(a, NavigableString)
+ and isinstance(b, NavigableString)
+ and not isinstance(a, PreformattedString)
+ and not isinstance(b, PreformattedString)
+ ):
+ marked.append(i)
+
+ # Go over the marked positions in reverse order, so that
+ # removing items from .contents won't affect the remaining
+ # positions.
+ for i in reversed(marked):
+ a = self.contents[i]
+ b = self.contents[i+1]
+ b.extract()
+ n = NavigableString(a+b)
+ a.replace_with(n)
+
def index(self, element):
"""
Find the index of a child by identity, not value. Avoids issues with
@@ -1115,14 +1070,6 @@
u = self.decode(indent_level, encoding, formatter)
return u.encode(encoding, errors)
- def _should_pretty_print(self, indent_level):
- """Should this tag be pretty-printed?"""
-
- return (
- indent_level is not None
- and self.name not in self.preserve_whitespace_tags
- )
-
def decode(self, indent_level=None,
eventual_encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
@@ -1136,30 +1083,32 @@
encoding.
"""
- # First off, turn a string formatter into a Formatter object. This
- # will stop the lookup from happening over and over again.
- if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable):
- formatter = self._formatter_for_name(formatter)
+ # First off, turn a non-Formatter `formatter` into a Formatter
+ # object. This will stop the lookup from happening over and
+ # over again.
+ if not isinstance(formatter, Formatter):
+ formatter = self.formatter_for_name(formatter)
+ attributes = formatter.attributes(self)
attrs = []
- if self.attrs:
- for key, val in sorted(self.attrs.items()):
- if val is None:
- decoded = key
- else:
- if isinstance(val, list) or isinstance(val, tuple):
- val = ' '.join(val)
- elif not isinstance(val, basestring):
- val = unicode(val)
- elif (
+ for key, val in attributes:
+ if val is None:
+ decoded = key
+ else:
+ if isinstance(val, list) or isinstance(val, tuple):
+ val = ' '.join(val)
+ elif not isinstance(val, basestring):
+ val = unicode(val)
+ elif (
isinstance(val, AttributeValueWithCharsetSubstitution)
- and eventual_encoding is not None):
- val = val.encode(eventual_encoding)
-
- text = self.format_string(val, formatter)
- decoded = (
- unicode(key) + '='
- + EntitySubstitution.quoted_attribute_value(text))
- attrs.append(decoded)
+ and eventual_encoding is not None
+ ):
+ val = val.encode(eventual_encoding)
+
+ text = formatter.attribute_value(val)
+ decoded = (
+ unicode(key) + '='
+ + formatter.quoted_attribute_value(text))
+ attrs.append(decoded)
close = ''
closeTag = ''
@@ -1168,9 +1117,7 @@
prefix = self.prefix + ":"
if self.is_empty_element:
- close = ''
- if isinstance(formatter, Formatter):
- close = formatter.void_element_close_prefix or close
+ close = formatter.void_element_close_prefix or ''
else:
closeTag = '</%s%s>' % (prefix, self.name)
@@ -1185,7 +1132,8 @@
else:
indent_contents = None
contents = self.decode_contents(
- indent_contents, eventual_encoding, formatter)
+ indent_contents, eventual_encoding, formatter
+ )
if self.hidden:
# This is the 'document root' object.
@@ -1217,6 +1165,13 @@
s = ''.join(s)
return s
+ def _should_pretty_print(self, indent_level):
+ """Should this tag be pretty-printed?"""
+ return (
+ indent_level is not None
+ and self.name not in self.preserve_whitespace_tags
+ )
+
def prettify(self, encoding=None, formatter="minimal"):
if encoding is None:
return self.decode(True, formatter=formatter)
@@ -1232,19 +1187,19 @@
indented this many spaces.
:param eventual_encoding: The tag is destined to be
- encoded into this encoding. This method is _not_
+ encoded into this encoding. decode_contents() is _not_
responsible for performing that encoding. This information
is passed in so that it can be substituted in if the
document contains a <META> tag that mentions the document's
encoding.
- :param formatter: The output formatter responsible for converting
- entities to Unicode characters.
+ :param formatter: A Formatter object, or a string naming one of
+ the standard Formatters.
"""
# First off, turn a string formatter into a Formatter object. This
# will stop the lookup from happening over and over again.
- if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable):
- formatter = self._formatter_for_name(formatter)
+ if not isinstance(formatter, Formatter):
+ formatter = self.formatter_for_name(formatter)
pretty_print = (indent_level is not None)
s = []
@@ -1255,16 +1210,19 @@
elif isinstance(c, Tag):
s.append(c.decode(indent_level, eventual_encoding,
formatter))
- if text and indent_level and not self.name == 'pre':
+ preserve_whitespace = (
+ self.preserve_whitespace_tags and self.name in self.preserve_whitespace_tags
+ )
+ if text and indent_level and not preserve_whitespace:
text = text.strip()
if text:
- if pretty_print and not self.name == 'pre':
+ if pretty_print and not preserve_whitespace:
s.append(" " * (indent_level - 1))
s.append(text)
- if pretty_print and not self.name == 'pre':
+ if pretty_print and not preserve_whitespace:
s.append("\n")
return ''.join(s)
-
+
def encode_contents(
self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
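The reworked attribute loop in decode() can be illustrated with a small standalone model. This is only a sketch that mirrors the control flow shown in the hunk above: it does not import bs4, it skips entity substitution and quote escaping, and the names render_open_tag, default_attributes and unsorted_without_m are hypothetical.

```python
# Standalone model of the attribute loop in decode(); bs4 is not imported.
# All function names here are hypothetical and exist only for illustration.

def default_attributes(attrs):
    """Mirror the default Formatter.attributes(): sorted (key, value) pairs."""
    return sorted(attrs.items())


def unsorted_without_m(attrs):
    """An override: keep insertion order and drop the attribute named 'm'."""
    for key, val in attrs.items():
        if key == "m":
            continue
        yield key, val


def render_open_tag(name, attrs, attributes=default_attributes):
    """Build an opening tag from whatever pairs `attributes` yields."""
    parts = []
    for key, val in attributes(attrs):
        if val is None:
            # A valueless attribute is written as the bare key.
            parts.append(key)
            continue
        if isinstance(val, (list, tuple)):
            # Multi-valued attributes are joined back with spaces.
            val = " ".join(val)
        # Simplification: no entity substitution or quote escaping here.
        parts.append('%s="%s"' % (key, val))
    return "<%s%s>" % (name, "".join(" " + p for p in parts))


attrs = {"z": "1", "m": "2", "a": "3", "class": ["x", "y"]}
sorted_tag = render_open_tag("p", attrs)
unsorted_tag = render_open_tag("p", attrs, unsorted_without_m)
print(sorted_tag)    # <p a="3" class="x y" m="2" z="1">
print(unsorted_tag)  # <p z="1" a="3" class="x y">
```

The point of the design is that ordering and filtering live in one overridable hook, so the serializer itself never needs to know why an attribute was dropped or moved.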
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/formatter.py new/beautifulsoup4-4.8.0/bs4/formatter.py
--- old/beautifulsoup4-4.7.1/bs4/formatter.py 1970-01-01 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/formatter.py 2019-07-16 22:46:05.000000000 +0200
@@ -0,0 +1,99 @@
+from bs4.dammit import EntitySubstitution
+
+class Formatter(EntitySubstitution):
+ """Describes a strategy to use when outputting a parse tree to a string.
+
+ Some parts of this strategy come from the distinction between
+ HTML4, HTML5, and XML. Others are configurable by the user.
+ """
+ # Registries of XML and HTML formatters.
+ XML_FORMATTERS = {}
+ HTML_FORMATTERS = {}
+
+ HTML = 'html'
+ XML = 'xml'
+
+ HTML_DEFAULTS = dict(
+ cdata_containing_tags=set(["script", "style"]),
+ )
+
+ def _default(self, language, value, kwarg):
+ if value is not None:
+ return value
+ if language == self.XML:
+ return set()
+ return self.HTML_DEFAULTS[kwarg]
+
+ def __init__(
+ self, language=None, entity_substitution=None,
+ void_element_close_prefix='/', cdata_containing_tags=None,
+ ):
+ """
+
+ :param void_element_close_prefix: By default, represent void
+ elements as <tag/> rather than <tag>
+ """
+ self.language = language
+ self.entity_substitution = entity_substitution
+ self.void_element_close_prefix = void_element_close_prefix
+ self.cdata_containing_tags = self._default(
+ language, cdata_containing_tags, 'cdata_containing_tags'
+ )
+
+ def substitute(self, ns):
+ """Process a string that needs to undergo entity substitution."""
+ if not self.entity_substitution:
+ return ns
+ from element import NavigableString
+ if (isinstance(ns, NavigableString)
+ and ns.parent is not None
+ and ns.parent.name in self.cdata_containing_tags):
+ # Do nothing.
+ return ns
+ # Substitute.
+ return self.entity_substitution(ns)
+
+ def attribute_value(self, value):
+ """Process the value of an attribute."""
+ return self.substitute(value)
+
+ def attributes(self, tag):
+ """Reorder a tag's attributes however you want."""
+ return sorted(tag.attrs.items())
+
+
+class HTMLFormatter(Formatter):
+ REGISTRY = {}
+ def __init__(self, *args, **kwargs):
+ return super(HTMLFormatter, self).__init__(self.HTML, *args, **kwargs)
+
+
+class XMLFormatter(Formatter):
+ REGISTRY = {}
+ def __init__(self, *args, **kwargs):
+ return super(XMLFormatter, self).__init__(self.XML, *args, **kwargs)
+
+
+# Set up aliases for the default formatters.
+HTMLFormatter.REGISTRY['html'] = HTMLFormatter(
+ entity_substitution=EntitySubstitution.substitute_html
+)
+HTMLFormatter.REGISTRY["html5"] = HTMLFormatter(
+ entity_substitution=EntitySubstitution.substitute_html,
+ void_element_close_prefix = None
+)
+HTMLFormatter.REGISTRY["minimal"] = HTMLFormatter(
+ entity_substitution=EntitySubstitution.substitute_xml
+)
+HTMLFormatter.REGISTRY[None] = HTMLFormatter(
+ entity_substitution=None
+)
+XMLFormatter.REGISTRY["html"] = XMLFormatter(
+ entity_substitution=EntitySubstitution.substitute_html
+)
+XMLFormatter.REGISTRY["minimal"] = XMLFormatter(
+ entity_substitution=EntitySubstitution.substitute_xml
+)
+XMLFormatter.REGISTRY[None] = Formatter(
+ Formatter(Formatter.XML, entity_substitution=None)
+)
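The interaction between the per-language default for cdata_containing_tags and substitute() can be modelled without bs4. The sketch below is an assumption-laden stand-in: Node and SketchFormatter are hypothetical classes, and html.escape from the standard library stands in for the EntitySubstitution functions.

```python
from html import escape


class Node:
    """Hypothetical stand-in for a NavigableString with a parent tag name."""

    def __init__(self, text, parent_name=None):
        self.text = text
        self.parent_name = parent_name


class SketchFormatter:
    """Model of the defaulting and substitution rules in the new Formatter."""

    HTML_DEFAULTS = {"cdata_containing_tags": {"script", "style"}}

    def __init__(self, language="html", entity_substitution=None,
                 cdata_containing_tags=None):
        self.entity_substitution = entity_substitution
        if cdata_containing_tags is not None:
            # An explicit value always wins.
            self.cdata_containing_tags = cdata_containing_tags
        elif language == "xml":
            # XML has no tags whose contents are implicitly CDATA.
            self.cdata_containing_tags = set()
        else:
            self.cdata_containing_tags = self.HTML_DEFAULTS["cdata_containing_tags"]

    def substitute(self, node):
        if not self.entity_substitution:
            return node.text
        if node.parent_name in self.cdata_containing_tags:
            # Contents of <script> and <style> are left alone in HTML.
            return node.text
        return self.entity_substitution(node.text)


html_fmt = SketchFormatter("html", entity_substitution=escape)
xml_fmt = SketchFormatter("xml", entity_substitution=escape)
in_script = Node("a < b", parent_name="script")
in_p = Node("a < b", parent_name="p")

print(html_fmt.substitute(in_script))  # a < b
print(html_fmt.substitute(in_p))       # a &lt; b
print(xml_fmt.substitute(in_script))   # a &lt; b
```

The same string is therefore escaped or not depending on both the formatter's language and the tag that contains it.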
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/testing.py new/beautifulsoup4-4.8.0/bs4/testing.py
--- old/beautifulsoup4-4.7.1/bs4/testing.py 2018-12-31 03:11:14.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/testing.py 2019-07-08 03:59:55.000000000 +0200
@@ -63,19 +63,19 @@
@property
def default_builder(self):
- return default_builder()
+ return default_builder
def soup(self, markup, **kwargs):
"""Build a Beautiful Soup object from markup."""
builder = kwargs.pop('builder', self.default_builder)
return BeautifulSoup(markup, builder=builder, **kwargs)
- def document_for(self, markup):
+ def document_for(self, markup, **kwargs):
"""Turn an HTML fragment into a document.
The details depend on the builder.
"""
- return self.default_builder.test_fragment_to_document(markup)
+ return self.default_builder(**kwargs).test_fragment_to_document(markup)
def assertSoupEquals(self, to_parse, compare_parsed_to=None):
builder = self.default_builder
@@ -232,7 +232,7 @@
soup = self.soup("")
new_tag = soup.new_tag(name)
self.assertEqual(True, new_tag.is_empty_element)
-
+
def test_pickle_and_unpickle_identity(self):
# Pickling a tree, then unpickling it, yields a tree identical
# to the original.
@@ -491,6 +491,12 @@
u"<p>\u2022 AT&T is in the s&p 500</p>"
)
+ def test_apos_entity(self):
+ self.assertSoupEquals(
+ u"<p>Bob&apos;s Bar</p>",
+ u"<p>Bob's Bar</p>",
+ )
+
def test_entities_in_foreign_document_encoding(self):
# &#147; and &#148; are invalid numeric entities referencing
# Windows-1252 characters. &#45; references a character common
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py 2018-12-23 23:16:18.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py 2019-07-07 21:54:34.000000000 +0200
@@ -22,7 +22,7 @@
@property
def default_builder(self):
- return HTML5TreeBuilder()
+ return HTML5TreeBuilder
def test_soupstrainer(self):
# The html5lib tree builder does not support SoupStrainers.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py 2018-07-15 14:26:01.000000000 +0200
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py 2019-07-07 21:52:25.000000000 +0200
@@ -9,9 +9,7 @@
class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):
- @property
- def default_builder(self):
- return HTMLParserTreeBuilder()
+ default_builder = HTMLParserTreeBuilder
def test_namespaced_system_doctype(self):
# html.parser can't handle namespaced doctypes, so skip this one.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py 2019-01-07 00:41:32.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py 2019-07-07 21:54:54.000000000 +0200
@@ -36,7 +36,7 @@
@property
def default_builder(self):
- return LXMLTreeBuilder()
+ return LXMLTreeBuilder
def test_out_of_range_entity(self):
self.assertSoupEquals(
@@ -79,7 +79,7 @@
@property
def default_builder(self):
- return LXMLTreeBuilderForXML()
+ return LXMLTreeBuilderForXML
def test_namespace_indexing(self):
# We should not track un-prefixed namespaces as we can only hold one
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py 2016-07-27 03:27:42.000000000 +0200
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py 2019-07-16 22:46:05.000000000 +0200
@@ -24,6 +24,7 @@
EncodingDetector,
)
from bs4.testing import (
+ default_builder,
SoupTest,
skipIf,
)
@@ -54,7 +55,72 @@
soup = self.soup(utf8_data, exclude_encodings=["utf-8"])
self.assertEqual("windows-1252", soup.original_encoding)
+ def test_custom_builder_class(self):
+ # Verify that you can pass in a custom Builder class and
+ # it'll be instantiated with the appropriate keyword arguments.
+ class Mock(object):
+ def __init__(self, **kwargs):
+ self.called_with = kwargs
+ self.is_xml = True
+ def initialize_soup(self, soup):
+ pass
+ def prepare_markup(self, *args, **kwargs):
+ return ''
+
+ kwargs = dict(
+ var="value",
+ # This is a deprecated BS3-era keyword argument, which
+ # will be stripped out.
+ convertEntities=True,
+ )
+ with warnings.catch_warnings(record=True):
+ soup = BeautifulSoup('', builder=Mock, **kwargs)
+ assert isinstance(soup.builder, Mock)
+ self.assertEqual(dict(var="value"), soup.builder.called_with)
+
+ # You can also instantiate the TreeBuilder yourself. In this
+ # case, that specific object is used and any keyword arguments
+ # to the BeautifulSoup constructor are ignored.
+ builder = Mock(**kwargs)
+ with warnings.catch_warnings(record=True) as w:
+ soup = BeautifulSoup(
+ '', builder=builder, ignored_value=True,
+ )
+ msg = str(w[0].message)
+ assert msg.startswith("Keyword arguments to the BeautifulSoup constructor will be ignored.")
+ self.assertEqual(builder, soup.builder)
+ self.assertEqual(kwargs, builder.called_with)
+
+ def test_cdata_list_attributes(self):
+ # Most attribute values are represented as scalars, but the
+ # HTML standard says that some attributes, like 'class' have
+ # space-separated lists as values.
+ markup = '<a id=" an id " class=" a class "></a>'
+ soup = self.soup(markup)
+
+ # Note that the spaces are stripped for 'class' but not for 'id'.
+ a = soup.a
+ self.assertEqual(" an id ", a['id'])
+ self.assertEqual(["a", "class"], a['class'])
+
+ # TreeBuilder takes an argument called 'multi_valued_attributes' which lets
+ # you customize or disable this. As always, you can customize the TreeBuilder
+ # by passing in a keyword argument to the BeautifulSoup constructor.
+ soup = self.soup(markup, builder=default_builder, multi_valued_attributes=None)
+ self.assertEqual(" a class ", soup.a['class'])
+
+ # Here are two ways of saying that `id` is a multi-valued
+ # attribute in this context, but 'class' is not.
+ for switcheroo in ({'*': 'id'}, {'a': 'id'}):
+ with warnings.catch_warnings(record=True) as w:
+ # This will create a warning about not explicitly
+ # specifying a parser, but we'll ignore it.
+ soup = self.soup(markup, builder=None, multi_valued_attributes=switcheroo)
+ a = soup.a
+ self.assertEqual(["an", "id"], a['id'])
+ self.assertEqual(" a class ", a['class'])
+
class TestWarnings(SoupTest):
def _no_parser_specified(self, s, is_there=True):
@@ -217,7 +283,7 @@
self.assertEqual(
self.sub.substitute_xml_containing_entities("&Aacute;T&T"),
"&Aacute;T&amp;T")
-
+
def test_quotes_not_html_substituted(self):
"""There's no need to do this except inside attribute values."""
text = 'Bob\'s "bar"'
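The splitting rule that test_cdata_list_attributes exercises can be sketched on its own. The helper below is hypothetical (bs4 performs this step inside its TreeBuilder), and it assumes the mapping values are sets of attribute names; the '*' key applies to every tag, as in the test above.

```python
import re


def split_multi_valued(tag_name, attrs, multi_valued_attributes):
    """Return a copy of attrs with configured attributes split on whitespace.

    Hypothetical helper for illustration only. Passing None (or an empty
    mapping) disables splitting entirely.
    """
    if not multi_valued_attributes:
        return dict(attrs)
    universal = multi_valued_attributes.get("*", ())
    tag_specific = multi_valued_attributes.get(tag_name.lower(), ())
    result = {}
    for key, value in attrs.items():
        if isinstance(value, str) and (key in universal or key in tag_specific):
            # Surrounding and repeated whitespace disappears in the split.
            result[key] = re.findall(r"\S+", value)
        else:
            result[key] = value
    return result


attrs = {"id": " an id ", "class": " a class "}

html_like = split_multi_valued("a", attrs, {"*": {"class"}})
disabled = split_multi_valued("a", attrs, None)
swapped = split_multi_valued("a", attrs, {"a": {"id"}})

print(html_like)  # {'id': ' an id ', 'class': ['a', 'class']}
print(disabled)   # {'id': ' an id ', 'class': ' a class '}
print(swapped)    # {'id': ['an', 'id'], 'class': ' a class '}
```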
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py
--- old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py 2019-01-07 00:46:26.000000000 +0100
+++ new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py 2019-07-16 22:46:05.000000000 +0200
@@ -25,6 +25,7 @@
Comment,
Declaration,
Doctype,
+ Formatter,
NavigableString,
SoupStrainer,
Tag,
@@ -416,6 +417,48 @@
self.assertEqual([], soup.find_all(id=1, text="bar"))
+class TestSmooth(TreeTest):
+ """Test Tag.smooth."""
+
+ def test_smooth(self):
+ soup = self.soup("<div>a</div>")
+ div = soup.div
+ div.append("b")
+ div.append("c")
+ div.append(Comment("Comment 1"))
+ div.append(Comment("Comment 2"))
+ div.append("d")
+ builder = self.default_builder()
+ span = Tag(soup, builder, 'span')
+ span.append('1')
+ span.append('2')
+ div.append(span)
+
+ # At this point the tree has a bunch of adjacent
+ # NavigableStrings. This is normal, but it has no meaning in
+ # terms of HTML, so we may want to smooth things out for
+ # output.
+
+ # Since the <span> tag has two children, its .string is None.
+ self.assertEquals(None, div.span.string)
+
+ self.assertEqual(7, len(div.contents))
+ div.smooth()
+ self.assertEqual(5, len(div.contents))
+
+ # The three strings at the beginning of div.contents have been
+ # merged into one string.
+ #
+ self.assertEqual('abc', div.contents[0])
+
+ # The call is recursive -- the <span> tag was also smoothed.
+ self.assertEqual('12', div.span.string)
+
+ # The two comments have _not_ been merged, even though
+ # comments are strings. Merging comments would change the
+ # meaning of the HTML.
+ self.assertEqual('Comment 1', div.contents[1])
+ self.assertEqual('Comment 2', div.contents[2])
class TestIndex(TreeTest):
@@ -896,7 +939,7 @@
self.assertEqual(soup.a.contents[0].next_element, "bar")
def test_insert_tag(self):
- builder = self.default_builder
+ builder = self.default_builder()
soup = self.soup(
"<a><b>Find</b><c>lady!</c><d></d></a>", builder=builder)
magic_tag = Tag(soup, builder, 'magictag')
@@ -1532,7 +1575,7 @@
# callable is called on every string.
self.assertEqual(
decoded,
- self.document_for(u"<b><FOO></b><b>BAR</b><br>"))
+ self.document_for(u"<b><FOO></b><b>BAR</b><br/>"))
def test_formatter_is_run_on_attribute_values(self):
markup = u'<a href="http://a.com?a=b&c=é">e</a>'
@@ -1570,11 +1613,11 @@
self.assertTrue(b"< < hey > >" in encoded)
def test_prettify_leaves_preformatted_text_alone(self):
- soup = self.soup("<div> foo <pre> \tbar\n \n </pre> baz ")
+ soup = self.soup("<div> foo <pre> \tbar\n \n </pre> baz <textarea> eee\nfff\t</textarea></div>")
# Everything outside the <pre> tag is reformatted, but everything
# inside is left alone.
self.assertEqual(
- u'<div>\n foo\n <pre> \tbar\n \n </pre>\n baz\n</div>',
+ u'<div>\n foo\n <pre> \tbar\n \n </pre>\n baz\n <textarea> eee\nfff\t</textarea>\n</div>',
soup.div.prettify())
def test_prettify_accepts_formatter_function(self):
@@ -1683,6 +1726,29 @@
else:
self.assertEqual(b'<b>\\u2603</b>', repr(soup))
+class TestFormatter(SoupTest):
+
+ def test_sort_attributes(self):
+ # Test the ability to override Formatter.attributes() to,
+ # e.g., disable the normal sorting of attributes.
+ class UnsortedFormatter(Formatter):
+ def attributes(self, tag):
+ self.called_with = tag
+ for k, v in sorted(tag.attrs.items()):
+ if k == 'ignore':
+ continue
+ yield k,v
+
+ soup = self.soup('<p cval="1" aval="2" ignore="ignored"></p>')
+ formatter = UnsortedFormatter()
+ decoded = soup.decode(formatter=formatter)
+
+ # attributes() was called on the <p> tag. It filtered out one
+ # attribute and sorted the other two.
+ self.assertEquals(formatter.called_with, soup.p)
+ self.assertEquals(u'<p aval="2" cval="1"></p>', decoded)
+
+
class TestNavigableStringSubclasses(SoupTest):
def test_cdata(self):
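The merging rule checked by TestSmooth (adjacent plain strings are joined, comments are left alone, and the operation recurses into child tags) can be sketched as a standalone model. This is not bs4's implementation: Comment and Element below are hypothetical stand-ins, and smooth is a free function, not the Tag method.

```python
class Comment(str):
    """Stand-in for bs4's Comment: a string subclass that is never merged."""


class Element:
    """Stand-in for a Tag: a name plus an ordered list of children."""

    def __init__(self, name, contents=None):
        self.name = name
        self.contents = list(contents or [])


def smooth(element):
    """Merge runs of adjacent plain strings, recursing into child elements."""
    merged = []
    for child in element.contents:
        if isinstance(child, Element):
            smooth(child)
            merged.append(child)
        elif type(child) is str and merged and type(merged[-1]) is str:
            # Only exact str instances merge; Comment is a subclass and is skipped.
            merged[-1] = merged[-1] + child
        else:
            merged.append(child)
    element.contents = merged


span = Element("span", ["1", "2"])
div = Element(
    "div",
    ["a", "b", "c", Comment("Comment 1"), Comment("Comment 2"), "d", span],
)
before = len(div.contents)
smooth(div)
after = len(div.contents)

print(before, after)     # 7 5
print(div.contents[0])   # abc
print(span.contents)     # ['12']
```

The counts match the assertions in the test above: seven children before smoothing, five afterwards, with both comments still separate.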
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/doc/source/index.rst new/beautifulsoup4-4.8.0/doc/source/index.rst
--- old/beautifulsoup4-4.7.1/doc/source/index.rst 2018-12-31 17:52:43.000000000 +0100
+++ new/beautifulsoup4-4.8.0/doc/source/index.rst 2019-07-17 03:31:27.000000000 +0200
@@ -31,7 +31,7 @@
* `这篇文档当然还有中文版. <http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/>`_
* このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_)
-* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
+* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <https://web.archive.org/web/20150319200824/http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
Getting help
------------
@@ -266,9 +266,9 @@
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 | Parser               | Typical usage                              | Advantages                     | Disadvantages            |
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
-| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
-|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
-|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
+| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not as fast as lxml,   |
+|                      |                                            | * Decent speed                 |   less lenient than      |
+|                      |                                            | * Lenient (As of Python 2.7.3  |   html5lib.              |
 |                      |                                            |   and 3.2.)                    |                          |
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 | lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
@@ -428,8 +428,15 @@
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
-You can use ```get_attribute_list`` to get a value that's always a list,
-string, whether or not it's a multi-valued atribute
+ You can disable this by passing ``multi_valued_attributes=None`` as a
+keyword argument into the ``BeautifulSoup`` constructor::
+
+ no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
+ no_list_soup.p['class']
+ # u'body strikeout'
+
+You can use ``get_attribute_list`` to get a value that's always a
+list, whether or not it's a multi-valued attribute::
id_soup.p.get_attribute_list('id')
# ["my id"]
@@ -440,8 +447,20 @@
xml_soup.p['class']
# u'body strikeout'
+Again, you can configure this using the ``multi_valued_attributes`` argument::
+
+ class_is_multi= { '*' : 'class'}
+ xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
+ xml_soup.p['class']
+ # [u'body', u'strikeout']
+You probably won't need to do this, but if you do, use the defaults as
+a guide. They implement the rules described in the HTML specification::
+ from bs4.builder import builder_registry
+ builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
+
+
``NavigableString``
-------------------
@@ -2093,6 +2112,40 @@
Like ``replace_with()``, ``unwrap()`` returns the tag
that was replaced.
+``smooth()``
+---------------------------
+
+After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following::
+
+ soup = BeautifulSoup("<p>A one</p>")
+ soup.p.append(", a two")
+
+ soup.p.contents
+ # [u'A one', u', a two']
+
+ print(soup.p.encode())
+ # <p>A one, a two</p>
+
+ print(soup.p.prettify())
+ # <p>
+ # A one
+ # , a two
+ # </p>
+
+You can call ``Tag.smooth()`` to clean up the parse tree by consolidating adjacent strings::
+
+ soup.smooth()
+
+ soup.p.contents
+ # [u'A one, a two']
+
+ print(soup.p.prettify())
+ # <p>
+ # A one, a two
+ # </p>
+
+The ``smooth()`` method is new in Beautiful Soup 4.8.0.
+
Output
======
@@ -2103,7 +2156,7 @@
The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with a separate line for each
-tag and each string:
+tag and each string::
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
@@ -2216,7 +2269,7 @@
# </body>
# </html>
- If you pass in ``formatter="html5"``, it's the same as
+If you pass in ``formatter="html5"``, it's the same as
``formatter="html"``, but Beautiful Soup will
omit the closing slash in HTML void tags like "br"::
@@ -2245,16 +2298,17 @@
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
-Finally, if you pass in a function for ``formatter``, Beautiful Soup
-will call that function once for every string and attribute value in
-the document. You can do whatever you want in this function. Here's a
-formatter that converts strings to uppercase and does absolutely
-nothing else::
+If you need more sophisticated control over your output, you can
+use Beautiful Soup's ``Formatter`` class. Here's a formatter that
+converts strings to uppercase, whether they occur in a text node or in an
+attribute value::
+ from bs4.formatter import HTMLFormatter
def uppercase(str):
return str.upper()
+ formatter = HTMLFormatter(uppercase)
- print(soup.prettify(formatter=uppercase))
+ print(soup.prettify(formatter=formatter))
# <html>
# <body>
# <p>
@@ -2263,34 +2317,31 @@
# </body>
# </html>
- print(link_soup.a.prettify(formatter=uppercase))
+ print(link_soup.a.prettify(formatter=formatter))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
# A LINK
# </a>
-If you're writing your own function, you should know about the
-``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
-implements Beautiful Soup's standard formatters as class methods: the
-"html" formatter is ``EntitySubstitution.substitute_html``, and the
-"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
-use these functions to simulate ``formatter=html`` or
-``formatter==minimal``, but then do something extra.
-
-Here's an example that replaces Unicode characters with HTML entities
-whenever possible, but `also` converts all strings to uppercase::
-
- from bs4.dammit import EntitySubstitution
- def uppercase_and_substitute_html_entities(str):
- return EntitySubstitution.substitute_html(str.upper())
-
- print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
- # <html>
- # <body>
- # <p>
- # IL A DIT <<SACRÉ BLEU!>>
- # </p>
- # </body>
- # </html>
+Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even
+more control over the output. For example, Beautiful Soup sorts the
+attributes in every tag by default::
+
+ attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>')
+ print(attr_soup.p.encode())
+ # <p a="3" m="2" z="1"></p>
+
+To turn this off, you can override the ``Formatter.attributes()``
+method, which controls which attributes are output and in what
+order. This implementation also filters out one of the attributes::
+
+ class UnsortedAttributes(HTMLFormatter):
+ def attributes(self, tag):
+ for k, v in tag.attrs.items():
+ if k == 'm':
+ continue
+ yield k, v
+ print(attr_soup.p.encode(formatter=UnsortedAttributes()))
+ # <p z="1" a="3"></p>
One last caveat: if you create a ``CData`` object, the text inside
that object is always presented `exactly as it appears, with no
@@ -3097,6 +3148,7 @@
* ``findPrevious`` -> ``find_previous``
* ``findPreviousSibling`` -> ``find_previous_sibling``
* ``findPreviousSiblings`` -> ``find_previous_siblings``
+* ``getText`` -> ``get_text``
* ``nextSibling`` -> ``next_sibling``
* ``previousSibling`` -> ``previous_sibling``
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/setup.py new/beautifulsoup4-4.8.0/setup.py
--- old/beautifulsoup4-4.7.1/setup.py 2019-01-07 01:47:07.000000000 +0100
+++ new/beautifulsoup4-4.8.0/setup.py 2019-07-20 01:50:29.000000000 +0200
@@ -8,7 +8,7 @@
setup(
name="beautifulsoup4",
- version = "4.7.1",
+ version = "4.8.0",
author="Leonard Richardson",
author_email='[email protected]',
url="http://www.crummy.com/software/BeautifulSoup/bs4/",