Hello community,
here is the log from the commit of package python-beautifulsoup4 for
openSUSE:Factory checked in at 2018-08-08 14:45:25
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
and /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-beautifulsoup4"
Wed Aug 8 14:45:25 2018 rev:26 rq:627527 version:4.6.1
Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes 2018-07-21 10:08:10.975198769 +0200
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new/python-beautifulsoup4.changes 2018-08-08 14:45:27.748759390 +0200
@@ -1,0 +2,54 @@
+Sun Aug 5 11:02:25 UTC 2018 - [email protected]
+
+- update to 4.6.1:
+ * Stop data loss when encountering an empty numeric entity, and
+ possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]
+
+ * Preserve XML namespaces introduced inside an XML document, not just
+ the ones introduced at the top level. [bug=1718787]
+
+ * Added a new formatter, "html5", which represents void elements
+ as "<element>" rather than "<element/>". [bug=1716272]
+
+ * Fixed a problem where the html.parser tree builder interpreted
+ a string like "&foo " as the character entity "&foo;" [bug=1728706]
+
+ * Correctly handle invalid HTML numeric character entities like “
+ which reference code points that are not Unicode code points. Note
+ that this is only fixed when Beautiful Soup is used with the
+ html.parser parser -- html5lib already worked and I couldn't fix it
+ with lxml. [bug=1782933]
+
+ * Improved the warning given when no parser is specified. [bug=1780571]
+
+ * When markup contains duplicate elements, a select() call that
+ includes multiple match clauses will match all relevant
+ elements. [bug=1770596]
+
+ * Fixed code that was causing deprecation warnings in recent Python 3
+ versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]
+
+ * Fixed a Windows crash in diagnose() when checking whether a long
+ markup string is a filename. [bug=1737121]
+
+ * Stopped HTMLParser from raising an exception in very rare cases of
+ bad markup. [bug=1708831]
+
+ * Fixed a bug where find_all() was not working when asked to find a
+ tag with a namespaced name in an XML document that was parsed as
+ HTML. [bug=1723783]
+
+ * You can get finer control over formatting by subclassing
+ bs4.element.Formatter and passing a Formatter instance into (e.g.)
+ encode(). [bug=1716272]
+
+ * You can pass a dictionary of `attrs` into
+ BeautifulSoup.new_tag. This makes it possible to create a tag with
+ an attribute like 'name' that would otherwise be masked by another
+ argument of new_tag. [bug=1779276]
+
+ * Clarified the deprecation warning when accessing tag.fooTag, to cover
+ the possibility that you might really have been looking for a tag
+ called 'fooTag'.
+
+-------------------------------------------------------------------
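As a quick illustration for reviewers, here is a sketch of two of the 4.6.1 changes listed above: the new "html5" formatter and the `attrs` dictionary now accepted by new_tag(). This snippet is not part of the package sources; it assumes beautifulsoup4 >= 4.6.1 is installed and uses the stdlib html.parser.

```python
# Sketch of two 4.6.1 changes from the changelog above (not package code).
# Assumes beautifulsoup4 >= 4.6.1.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi<br></p>", "html.parser")

# The new "html5" formatter represents void elements as <br>,
# while the default "minimal" formatter still writes <br/>.
print(soup.decode(formatter="html5"))
print(soup.decode(formatter="minimal"))

# new_tag() now accepts an attrs dict, so attribute names that would
# collide with new_tag's own keyword arguments (like "name") can be set.
tag = soup.new_tag("input", attrs={"name": "q"})
print(tag)
```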
Old:
----
beautifulsoup4-4.6.0.tar.gz
New:
----
beautifulsoup4-4.6.1.tar.gz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.fRScod/_old 2018-08-08 14:45:28.352760373 +0200
+++ /var/tmp/diff_new_pack.fRScod/_new 2018-08-08 14:45:28.356760380 +0200
@@ -18,7 +18,7 @@
%{?!python_module:%define python_module() python-%{**} python3-%{**}}
Name: python-beautifulsoup4
-Version: 4.6.0
+Version: 4.6.1
Release: 0
Summary: HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
License: MIT
++++++ beautifulsoup4-4.6.0.tar.gz -> beautifulsoup4-4.6.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/LICENSE new/beautifulsoup4-4.6.1/LICENSE
--- old/beautifulsoup4-4.6.0/LICENSE 1970-01-01 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.6.1/LICENSE 2016-07-16 17:25:45.000000000 +0200
@@ -0,0 +1,27 @@
+Beautiful Soup is made available under the MIT license:
+
+ Copyright (c) 2004-2016 Leonard Richardson
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+Beautiful Soup incorporates code from the html5lib library, which is
+also made available under the MIT license. Copyright (c) 2006-2013
+James Graham and other contributors
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/MANIFEST.in new/beautifulsoup4-4.6.1/MANIFEST.in
--- old/beautifulsoup4-4.6.0/MANIFEST.in 2015-06-28 22:13:43.000000000 +0200
+++ new/beautifulsoup4-4.6.1/MANIFEST.in 2018-07-21 17:17:01.000000000 +0200
@@ -1,5 +1,6 @@
include test-all-versions
include convert-py3k
+include LICENSE
include *.txt
include doc*/Makefile
include doc*/source/*.py
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/NEWS.txt new/beautifulsoup4-4.6.1/NEWS.txt
--- old/beautifulsoup4-4.6.0/NEWS.txt 2017-05-07 15:49:34.000000000 +0200
+++ new/beautifulsoup4-4.6.1/NEWS.txt 2018-07-29 01:02:57.000000000 +0200
@@ -1,3 +1,55 @@
+= 4.6.1 (20180728)
+
+* Stop data loss when encountering an empty numeric entity, and
+ possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]
+
+* Preserve XML namespaces introduced inside an XML document, not just
+ the ones introduced at the top level. [bug=1718787]
+
+* Added a new formatter, "html5", which represents void elements
+ as "<element>" rather than "<element/>". [bug=1716272]
+
+* Fixed a problem where the html.parser tree builder interpreted
+ a string like "&foo " as the character entity "&foo;" [bug=1728706]
+
+* Correctly handle invalid HTML numeric character entities like “
+ which reference code points that are not Unicode code points. Note
+ that this is only fixed when Beautiful Soup is used with the
+ html.parser parser -- html5lib already worked and I couldn't fix it
+ with lxml. [bug=1782933]
+
+* Improved the warning given when no parser is specified. [bug=1780571]
+
+* When markup contains duplicate elements, a select() call that
+ includes multiple match clauses will match all relevant
+ elements. [bug=1770596]
+
+* Fixed code that was causing deprecation warnings in recent Python 3
+ versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]
+
+* Fixed a Windows crash in diagnose() when checking whether a long
+ markup string is a filename. [bug=1737121]
+
+* Stopped HTMLParser from raising an exception in very rare cases of
+ bad markup. [bug=1708831]
+
+* Fixed a bug where find_all() was not working when asked to find a
+ tag with a namespaced name in an XML document that was parsed as
+ HTML. [bug=1723783]
+
+* You can get finer control over formatting by subclassing
+ bs4.element.Formatter and passing a Formatter instance into (e.g.)
+ encode(). [bug=1716272]
+
+* You can pass a dictionary of `attrs` into
+ BeautifulSoup.new_tag. This makes it possible to create a tag with
+ an attribute like 'name' that would otherwise be masked by another
+ argument of new_tag. [bug=1779276]
+
+* Clarified the deprecation warning when accessing tag.fooTag, to cover
+ the possibility that you might really have been looking for a tag
+ called 'fooTag'.
+
= 4.6.0 (20170507) =
* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/PKG-INFO new/beautifulsoup4-4.6.1/PKG-INFO
--- old/beautifulsoup4-4.6.0/PKG-INFO 2017-05-07 15:52:33.000000000 +0200
+++ new/beautifulsoup4-4.6.1/PKG-INFO 2018-07-29 01:29:27.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: beautifulsoup4
-Version: 4.6.0
+Version: 4.6.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/beautifulsoup4.egg-info/PKG-INFO new/beautifulsoup4-4.6.1/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.6.0/beautifulsoup4.egg-info/PKG-INFO 2017-05-07 15:52:32.000000000 +0200
+++ new/beautifulsoup4-4.6.1/beautifulsoup4.egg-info/PKG-INFO 2018-07-29 01:29:26.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: beautifulsoup4
-Version: 4.6.0
+Version: 4.6.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/beautifulsoup4.egg-info/SOURCES.txt new/beautifulsoup4-4.6.1/beautifulsoup4.egg-info/SOURCES.txt
--- old/beautifulsoup4-4.6.0/beautifulsoup4.egg-info/SOURCES.txt 2017-05-07 15:52:33.000000000 +0200
+++ new/beautifulsoup4-4.6.1/beautifulsoup4.egg-info/SOURCES.txt 2018-07-29 01:29:27.000000000 +0200
@@ -1,5 +1,6 @@
AUTHORS.txt
COPYING.txt
+LICENSE
MANIFEST.in
NEWS.txt
README.txt
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/__init__.py new/beautifulsoup4-4.6.1/bs4/__init__.py
--- old/beautifulsoup4-4.6.0/bs4/__init__.py 2017-05-07 15:48:18.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/__init__.py 2018-07-29 01:00:46.000000000 +0200
@@ -21,14 +21,15 @@
# found in the LICENSE file.
__author__ = "Leonard Richardson ([email protected])"
-__version__ = "4.6.0"
-__copyright__ = "Copyright (c) 2004-2017 Leonard Richardson"
+__version__ = "4.6.1"
+__copyright__ = "Copyright (c) 2004-2018 Leonard Richardson"
__license__ = "MIT"
__all__ = ['BeautifulSoup']
import os
import re
+import sys
import traceback
import warnings
@@ -82,14 +83,46 @@
ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
- NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, change code that looks like this:\n\n BeautifulSoup(YOUR_MARKUP})\n\nto this:\n\n BeautifulSoup(YOUR_MARKUP, \"%(parser)s\")\n"
+ NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
def __init__(self, markup="", features=None, builder=None,
parse_only=None, from_encoding=None, exclude_encodings=None,
**kwargs):
- """The Soup object is initialized as the 'root tag', and the
- provided markup (which can be a string or a file-like object)
- is fed into the underlying parser."""
+ """Constructor.
+
+ :param markup: A string or a file-like object representing
+ markup to be parsed.
+
+ :param features: Desirable features of the parser to be used. This
+ may be the name of a specific parser ("lxml", "lxml-xml",
+ "html.parser", or "html5lib") or it may be the type of markup
+ to be used ("html", "html5", "xml"). It's recommended that you
+ name a specific parser, so that Beautiful Soup gives you the
+ same results across platforms and virtual environments.
+
+ :param builder: A specific TreeBuilder to use instead of looking one
+ up based on `features`. You shouldn't need to use this.
+
+ :param parse_only: A SoupStrainer. Only parts of the document
+ matching the SoupStrainer will be considered. This is useful
+ when parsing part of a document that would otherwise be too
+ large to fit into memory.
+
+ :param from_encoding: A string indicating the encoding of the
+ document to be parsed. Pass this in if Beautiful Soup is
+ guessing wrongly about the document's encoding.
+
+ :param exclude_encodings: A list of strings indicating
+ encodings known to be wrong. Pass this in if you don't know
+ the document's encoding but you know Beautiful Soup's guess is
+ wrong.
+
+ :param kwargs: For backwards compatibility purposes, the
+ constructor accepts certain keyword arguments used in
+ Beautiful Soup 3. None of these arguments do anything in
+ Beautiful Soup 4 and there's no need to actually pass keyword
+ arguments into the constructor.
+ """
if 'convertEntities' in kwargs:
warnings.warn(
@@ -171,14 +204,35 @@
else:
markup_type = "HTML"
- caller = traceback.extract_stack()[0]
- filename = caller[0]
- line_number = caller[1]
- warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % dict(
- filename=filename,
- line_number=line_number,
- parser=builder.NAME,
- markup_type=markup_type))
+ # This code adapted from warnings.py so that we get the same line
+ # of code as our warnings.warn() call gets, even if the answer is wrong
+ # (as it may be in a multithreading situation).
+ caller = None
+ try:
+ caller = sys._getframe(1)
+ except ValueError:
+ pass
+ if caller:
+ globals = caller.f_globals
+ line_number = caller.f_lineno
+ else:
+ globals = sys.__dict__
+ line_number= 1
+ filename = globals.get('__file__')
+ if filename:
+ fnl = filename.lower()
+ if fnl.endswith((".pyc", ".pyo")):
+ filename = filename[:-1]
+ if filename:
+ # If there is no filename at all, the user is most likely in a REPL,
+ # and the warning is not necessary.
+ values = dict(
+ filename=filename,
+ line_number=line_number,
+ parser=builder.NAME,
+ markup_type=markup_type
+ )
+ warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2)
self.builder = builder
self.is_xml = builder.is_xml
@@ -302,9 +356,10 @@
self.preserve_whitespace_tag_stack = []
self.pushTag(self)
- def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
+ def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, **kwattrs):
"""Create a new tag associated with this soup."""
- return Tag(None, self.builder, name, namespace, nsprefix, attrs)
+ kwattrs.update(attrs)
+ return Tag(None, self.builder, name, namespace, nsprefix, kwattrs)
def new_string(self, s, subclass=NavigableString):
"""Create a new NavigableString associated with this soup."""
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/builder/__init__.py new/beautifulsoup4-4.6.1/bs4/builder/__init__.py
--- old/beautifulsoup4-4.6.0/bs4/builder/__init__.py 2017-05-06 19:31:05.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/builder/__init__.py 2018-07-29 00:53:19.000000000 +0200
@@ -93,7 +93,7 @@
preserve_whitespace_tags = set()
empty_element_tags = None # A tag will be considered an empty-element
# tag when and only when it has no contents.
-
+
# A value for these tag/attribute combinations is a space- or
# comma-separated list of CDATA, rather than a single CDATA.
cdata_list_attributes = {}
@@ -125,7 +125,7 @@
if self.empty_element_tags is None:
return True
return tag_name in self.empty_element_tags
-
+
def feed(self, markup):
raise NotImplementedError()
@@ -235,11 +235,11 @@
empty_element_tags = set([
# These are from HTML5.
'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
-
- # These are from HTML4, removed in HTML5.
- 'spacer', 'frame'
+
+ # These are from earlier versions of HTML and are removed in HTML5.
+ 'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer'
])
-
+
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/builder/_htmlparser.py new/beautifulsoup4-4.6.1/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.6.0/bs4/builder/_htmlparser.py 2017-05-07 13:08:16.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/builder/_htmlparser.py 2018-07-28 22:48:02.000000000 +0200
@@ -1,3 +1,4 @@
+# encoding: utf-8
"""Use the HTMLParser library to parse HTML files that aren't too bad."""
# Use of this source code is governed by a BSD-style license that can be
@@ -64,7 +65,18 @@
# order. It's a list of closing tags we've already handled and
# will ignore, assuming they ever show up.
self.already_closed_empty_element = []
-
+
+ def error(self, msg):
+ """In Python 3, HTMLParser subclasses must implement error(), although
this
+ requirement doesn't appear to be documented.
+
+ In Python 2, HTMLParser implements error() as raising an exception.
+
+ In any event, this method is called only on very strange markup and our best strategy
+ is to pretend it didn't happen and keep going.
+ """
+ warnings.warn(msg)
+
def handle_startendtag(self, name, attrs):
# This is only called when the markup looks like
# <tag/>.
@@ -129,11 +141,26 @@
else:
real_name = int(name)
- try:
- data = unichr(real_name)
- except (ValueError, OverflowError), e:
- data = u"\N{REPLACEMENT CHARACTER}"
-
+ data = None
+ if real_name < 256:
+ # HTML numeric entities are supposed to reference Unicode
+ # code points, but sometimes they reference code points in
+ # some other encoding (ahem, Windows-1252). E.g. “
+ # instead of É for LEFT DOUBLE QUOTATION MARK. This
+ # code tries to detect this situation and compensate.
+ for encoding in (self.soup.original_encoding, 'windows-1252'):
+ if not encoding:
+ continue
+ try:
+ data = bytearray([real_name]).decode(encoding)
+ except UnicodeDecodeError, e:
+ pass
+ if not data:
+ try:
+ data = unichr(real_name)
+ except (ValueError, OverflowError), e:
+ pass
+ data = data or u"\N{REPLACEMENT CHARACTER}"
self.handle_data(data)
def handle_entityref(self, name):
@@ -141,7 +168,12 @@
if character is not None:
data = character
else:
- data = "&%s;" % name
+ # If this were XML, it would be ambiguous whether "&foo"
+ # was an character entity reference with a missing
+ # semicolon or the literal string "&foo". Since this is
+ # HTML, we have a complete list of all character entity references,
+ # and this one wasn't found, so assume it's the literal string "&foo".
+ data = "&%s" % name
self.handle_data(data)
def handle_comment(self, data):
@@ -213,6 +245,7 @@
parser.soup = self.soup
try:
parser.feed(markup)
+ parser.close()
except HTMLParseError, e:
warnings.warn(RuntimeWarning(
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/builder/_lxml.py new/beautifulsoup4-4.6.1/bs4/builder/_lxml.py
--- old/beautifulsoup4-4.6.0/bs4/builder/_lxml.py 2017-05-07 15:47:45.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/builder/_lxml.py 2018-07-28 22:13:54.000000000 +0200
@@ -5,9 +5,13 @@
'LXMLTreeBuilder',
]
+try:
+ from collections.abc import Callable # Python 3.6
+except ImportError , e:
+ from collections import Callable
+
from io import BytesIO
from StringIO import StringIO
-import collections
from lxml import etree
from bs4.element import (
Comment,
@@ -58,7 +62,7 @@
# Use the default parser.
parser = self.default_parser(encoding)
- if isinstance(parser, collections.Callable):
+ if isinstance(parser, Callable):
# Instantiate the parser with default arguments
parser = parser(target=self, strip_cdata=False, encoding=encoding)
return parser
@@ -147,11 +151,11 @@
attrs = dict(attrs)
nsprefix = None
# Invert each namespace map as it comes in.
- if len(self.nsmaps) > 1:
- # There are no new namespaces for this tag, but
- # non-default namespaces are in play, so we need a
- # separate tag stack to know when they end.
- self.nsmaps.append(None)
+ if len(nsmap) == 0 and len(self.nsmaps) > 1:
+ # There are no new namespaces for this tag, but
+ # non-default namespaces are in play, so we need a
+ # separate tag stack to know when they end.
+ self.nsmaps.append(None)
elif len(nsmap) > 0:
# A new namespace mapping has come into play.
inverted_nsmap = dict((value, key) for key, value in nsmap.items())
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/dammit.py new/beautifulsoup4-4.6.1/bs4/dammit.py
--- old/beautifulsoup4-4.6.0/bs4/dammit.py 2017-05-07 04:35:39.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/dammit.py 2018-07-28 22:22:07.000000000 +0200
@@ -46,9 +46,9 @@
pass
xml_encoding_re = re.compile(
- '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
+ '^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'.encode(), re.I)
html_meta_re = re.compile(
- '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
+ '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
class EntitySubstitution(object):
@@ -82,7 +82,7 @@
}
BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
- "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
+ "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
")")
AMPERSAND_OR_BRACKET = re.compile("([<>&])")
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/diagnose.py new/beautifulsoup4-4.6.1/bs4/diagnose.py
--- old/beautifulsoup4-4.6.0/bs4/diagnose.py 2016-07-27 03:25:45.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/diagnose.py 2018-07-14 20:23:02.000000000 +0200
@@ -37,7 +37,7 @@
name)
if 'lxml' in basic_parsers:
- basic_parsers.append(["lxml", "xml"])
+ basic_parsers.append("lxml-xml")
try:
from lxml import etree
print "Found lxml version %s" %
".".join(map(str,etree.LXML_VERSION))
@@ -56,21 +56,27 @@
if hasattr(data, 'read'):
data = data.read()
- elif os.path.exists(data):
- print '"%s" looks like a filename. Reading data from the file.' % data
- with open(data) as fp:
- data = fp.read()
elif data.startswith("http:") or data.startswith("https:"):
print '"%s" looks like a URL. Beautiful Soup is not an HTTP client.' % data
print "You need to use some other library to get the document behind the URL, and feed that document to Beautiful Soup."
return
- print
+ else:
+ try:
+ if os.path.exists(data):
+ print '"%s" looks like a filename. Reading data from the file.' % data
+ with open(data) as fp:
+ data = fp.read()
+ except ValueError:
+ # This can happen on some platforms when the 'filename' is
+ # too long. Assume it's data and not a filename.
+ pass
+ print
for parser in basic_parsers:
print "Trying to parse your markup with %s" % parser
success = False
try:
- soup = BeautifulSoup(data, parser)
+ soup = BeautifulSoup(data, features=parser)
success = True
except Exception, e:
print "%s could not parse the markup." % parser
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/element.py new/beautifulsoup4-4.6.1/bs4/element.py
--- old/beautifulsoup4-4.6.0/bs4/element.py 2017-05-07 14:15:39.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/element.py 2018-07-29 00:20:13.000000000 +0200
@@ -2,7 +2,10 @@
# found in the LICENSE file.
__license__ = "MIT"
-import collections
+try:
+ from collections.abc import Callable # Python 3.6
+except ImportError , e:
+ from collections import Callable
import re
import shlex
import sys
@@ -12,7 +15,7 @@
DEFAULT_OUTPUT_ENCODING = "utf-8"
PY3K = (sys.version_info[0] > 2)
-whitespace_re = re.compile("\s+")
+whitespace_re = re.compile(r"\s+")
def _alias(attr):
"""Alias one attribute name to another for backward compatibility"""
@@ -69,7 +72,7 @@
The value of the 'content' attribute will be one of these objects.
"""
- CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
+ CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
def __new__(cls, original_value):
match = cls.CHARSET_RE.search(original_value)
@@ -123,6 +126,41 @@
return cls._substitute_if_appropriate(
ns, EntitySubstitution.substitute_xml)
+class Formatter(object):
+ """Contains information about how to format a parse tree."""
+
+ # By default, represent void elements as <tag/> rather than <tag>
+ void_element_close_prefix = '/'
+
+ def substitute_entities(self, *args, **kwargs):
+ """Transform certain characters into named entities."""
+ raise NotImplementedError()
+
+class HTMLFormatter(Formatter):
+ """The default HTML formatter."""
+ def substitute(self, *args, **kwargs):
+ return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
+
+class MinimalHTMLFormatter(Formatter):
+ """A minimal HTML formatter."""
+ def substitute(self, *args, **kwargs):
+ return HTMLAwareEntitySubstitution.substitute_xml(*args, **kwargs)
+
+class HTML5Formatter(HTMLFormatter):
+ """An HTML formatter that omits the slash in a void tag."""
+ void_element_close_prefix = None
+
+class XMLFormatter(Formatter):
+ """Substitute only the essential XML entities."""
+ def substitute(self, *args, **kwargs):
+ return EntitySubstitution.substitute_xml(*args, **kwargs)
+
+class HTMLXMLFormatter(Formatter):
+ """Format XML using HTML rules."""
+ def substitute(self, *args, **kwargs):
+ return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)
+
+
class PageElement(object):
"""Contains the navigational information for some part of the page
(either a tag or a piece of text)"""
@@ -131,40 +169,49 @@
# to methods like encode() and prettify():
#
# "html" - All Unicode characters with corresponding HTML entities
- # are converted to those entities on output.
- # "minimal" - Bare ampersands and angle brackets are converted to
+ # are converted to those entities on output.
+ # "html5" - The same as "html", but empty void tags are represented as
+ # <tag> rather than <tag/>
+ # "minimal" - Bare ampersands and angle brackets are converted to
# XML entities: & < >
# None - The null formatter. Unicode characters are never
# converted to entities. This is not recommended, but it's
# faster than "minimal".
- # A function - This function will be called on every string that
+ # A callable function - it will be called on every string that needs to undergo entity substitution.
+ # A Formatter instance - Formatter.substitute(string) will be called on every string that
# needs to undergo entity substitution.
#
- # In an HTML document, the default "html" and "minimal" functions
- # will leave the contents of <script> and <style> tags alone. For
- # an XML document, all tags will be given the same treatment.
+ # In an HTML document, the default "html", "html5", and "minimal"
+ # functions will leave the contents of <script> and <style> tags
+ # alone. For an XML document, all tags will be given the same
+ # treatment.
HTML_FORMATTERS = {
- "html" : HTMLAwareEntitySubstitution.substitute_html,
- "minimal" : HTMLAwareEntitySubstitution.substitute_xml,
+ "html" : HTMLFormatter(),
+ "html5" : HTML5Formatter(),
+ "minimal" : MinimalHTMLFormatter(),
None : None
}
XML_FORMATTERS = {
- "html" : EntitySubstitution.substitute_html,
- "minimal" : EntitySubstitution.substitute_xml,
+ "html" : HTMLXMLFormatter(),
+ "minimal" : XMLFormatter(),
None : None
}
def format_string(self, s, formatter='minimal'):
"""Format the given string using the given formatter."""
- if not callable(formatter):
+ if isinstance(formatter, basestring):
formatter = self._formatter_for_name(formatter)
if formatter is None:
output = s
else:
- output = formatter(s)
+ if callable(formatter):
+ # Backwards compatibility -- you used to pass in a formatting method.
+ output = formatter(s)
+ else:
+ output = formatter.substitute(s)
return output
@property
@@ -194,11 +241,9 @@
def _formatter_for_name(self, name):
"Look up a formatter function based on its name and the tree."
if self._is_xml:
- return self.XML_FORMATTERS.get(
- name, EntitySubstitution.substitute_xml)
+ return self.XML_FORMATTERS.get(name, XMLFormatter())
else:
- return self.HTML_FORMATTERS.get(
- name, HTMLAwareEntitySubstitution.substitute_xml)
+ return self.HTML_FORMATTERS.get(name, HTMLFormatter())
def setup(self, parent=None, previous_element=None, next_element=None,
previous_sibling=None, next_sibling=None):
@@ -316,6 +361,14 @@
and not isinstance(new_child, NavigableString)):
new_child = NavigableString(new_child)
+ from bs4 import BeautifulSoup
+ if isinstance(new_child, BeautifulSoup):
+ # We don't want to end up with a situation where one BeautifulSoup
+ # object contains another. Insert the children one at a time.
+ for subchild in list(new_child.contents):
+ self.insert(position, subchild)
+ position += 1
+ return
position = min(position, len(self.contents))
if hasattr(new_child, 'parent') and new_child.parent is not None:
# We're 'inserting' an element that's already one
@@ -536,14 +589,21 @@
elif isinstance(name, basestring):
# Optimization to find all tags with a given name.
if name.count(':') == 1:
- # This is a name with a prefix.
- prefix, name = name.split(':', 1)
+ # This is a name with a prefix. If this is a namespace-aware document,
+ # we need to match the local name against tag.name. If not,
+ # we need to match the fully-qualified name against tag.name.
+ prefix, local_name = name.split(':', 1)
else:
prefix = None
+ local_name = name
result = (element for element in generator
if isinstance(element, Tag)
- and element.name == name
- and (prefix is None or element.prefix == prefix)
+ and (
+ element.name == name
+ ) or (
+ element.name == local_name
+ and (prefix is None or element.prefix == prefix)
+ )
)
return ResultSet(strainer, result)
results = ResultSet(strainer)
@@ -862,7 +922,7 @@
self.can_be_empty_element = builder.can_be_empty_element(name)
else:
self.can_be_empty_element = False
-
+
parserClass = _alias("parser_class") # BS3
def __copy__(self):
@@ -1046,8 +1106,10 @@
# BS3: soup.aTag -> "soup.find("a")
tag_name = tag[:-3]
warnings.warn(
- '.%sTag is deprecated, use .find("%s") instead.' % (
- tag_name, tag_name))
'.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
+ name=tag_name
+ )
+ )
return self.find(tag_name)
# We special case contents to avoid recursion.
elif not tag.startswith("__") and not tag == "contents":
@@ -1129,11 +1191,10 @@
encoding.
"""
- # First off, turn a string formatter into a function. This
+ # First off, turn a string formatter into a Formatter object. This
# will stop the lookup from happening over and over again.
- if not callable(formatter):
+ if not isinstance(formatter, Formatter) and not callable(formatter):
formatter = self._formatter_for_name(formatter)
-
attrs = []
if self.attrs:
for key, val in sorted(self.attrs.items()):
@@ -1162,7 +1223,7 @@
prefix = self.prefix + ":"
if self.is_empty_element:
- close = '/'
+ close = formatter.void_element_close_prefix or ''
else:
closeTag = '</%s%s>' % (prefix, self.name)
@@ -1233,9 +1294,9 @@
:param formatter: The output formatter responsible for converting
entities to Unicode characters.
"""
- # First off, turn a string formatter into a function. This
+ # First off, turn a string formatter into a Formatter object. This
# will stop the lookup from happening over and over again.
- if not callable(formatter):
+ if not isinstance(formatter, Formatter) and not callable(formatter):
formatter = self._formatter_for_name(formatter)
pretty_print = (indent_level is not None)
@@ -1348,15 +1409,29 @@
# Handle grouping selectors if ',' exists, ie: p,a
if ',' in selector:
context = []
- for partial_selector in selector.split(','):
- partial_selector = partial_selector.strip()
+ selectors = [x.strip() for x in selector.split(",")]
+
+ # If a selector is mentioned multiple times we don't want
+ # to use it more than once.
+ used_selectors = set()
+
+ # We also don't want to select the same element more than once,
+ # if it's matched by multiple selectors.
+ selected_object_ids = set()
+ for partial_selector in selectors:
if partial_selector == '':
raise ValueError('Invalid group selection syntax: %s' % selector)
+ if partial_selector in used_selectors:
+ continue
+ used_selectors.add(partial_selector)
candidates = self.select(partial_selector, limit=limit)
for candidate in candidates:
- if candidate not in context:
+ # This lets us distinguish between distinct tags that
+ # represent the same markup.
+ object_id = id(candidate)
+ if object_id not in selected_object_ids:
context.append(candidate)
-
+ selected_object_ids.add(object_id)
if limit and len(context) >= limit:
break
return context
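The hunk above deduplicates selectors by string value but matched elements by object identity (`id()`), so two distinct tags that happen to render identical markup are both kept. A stdlib-only sketch of that strategy (the helper and the fake index are hypothetical, not bs4's code):

```python
# Sketch of grouped selection with two dedup sets: one for repeated
# selector strings, one for already-selected objects (by identity).
def select_group(selectors, select_one):
    used_selectors = set()
    selected_object_ids = set()
    context = []
    for partial in (s.strip() for s in selectors.split(',')):
        if partial == '':
            raise ValueError('Invalid group selection syntax: %s' % selectors)
        if partial in used_selectors:
            continue
        used_selectors.add(partial)
        for candidate in select_one(partial):
            if id(candidate) not in selected_object_ids:
                context.append(candidate)
                selected_object_ids.add(id(candidate))
    return context

# Two equal-but-distinct lists stand in for duplicate tags: a == b,
# but id(a) != id(b), so identity-based dedup keeps both.
a, b = ["div"], ["div"]
fake_index = {".c1": [a, b], ".c2": [b]}
result = select_group(".c1, .c2, .c1", fake_index.get)
print(len(result))  # 2
```

Deduplicating by `==` instead would collapse `a` and `b` into one result, which is exactly the bug [bug=1770596] describes.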
@@ -1418,7 +1493,7 @@
if tag_name == '':
raise ValueError(
"A pseudo-class must be prefixed with a tag name.")
- pseudo_attributes = re.match('([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
+ pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
found = []
if pseudo_attributes is None:
pseudo_type = pseudo
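The raw-string change above is one of the deprecation-warning fixes: recent Python 3 versions warn about invalid escape sequences like `\d` in plain string literals, while an `r'...'` literal leaves the regex text unchanged. The pattern itself behaves identically:

```python
import re

# Same regex as in the hunk, written as a raw string; it splits a
# pseudo-class like "nth-of-type(2)" into its name and argument.
pattern = r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)'
match = re.match(pattern, 'nth-of-type(2)')
print(match.groups())  # ('nth-of-type', '2')
```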
@@ -1652,7 +1727,7 @@
markup = markup_name
markup_attrs = markup
call_function_with_tag_data = (
- isinstance(self.name, collections.Callable)
+ isinstance(self.name, Callable)
and not isinstance(markup_name, Tag))
if ((not self.name)
@@ -1732,7 +1807,7 @@
# True matches any non-None value.
return markup is not None
- if isinstance(match_against, collections.Callable):
+ if isinstance(match_against, Callable):
return match_against(markup)
# Custom callables take the tag as an argument, but all
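The `collections.Callable` changes above address another deprecation: Python 3.3 moved the ABCs into `collections.abc`, and the old aliases in `collections` were later removed entirely. Importing from `collections.abc` works on all modern Python 3 versions:

```python
# collections.abc.Callable is the supported location for the ABC used
# in the isinstance() checks above.
from collections.abc import Callable

assert isinstance(len, Callable)
assert isinstance(lambda markup: markup, Callable)
assert not isinstance("a string", Callable)
print("Callable checks passed")
```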
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/testing.py new/beautifulsoup4-4.6.1/bs4/testing.py
--- old/beautifulsoup4-4.6.0/bs4/testing.py 2017-05-07 14:16:59.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/testing.py 2018-07-28 22:58:10.000000000 +0200
@@ -1,3 +1,4 @@
+# encoding: utf-8
"""Helper classes for tests."""
# Use of this source code is governed by a BSD-style license that can be
@@ -150,6 +151,14 @@
soup.encode("utf-8").replace(b"\n", b""),
markup.replace(b"\n", b""))
+ def test_namespaced_html(self):
+ """When a namespaced XML document is parsed as HTML it should
+ be treated as HTML with weird tag names.
+ """
+ markup = b"""<ns1:foo>content</ns1:foo><ns1:foo/><ns2:foo/>"""
+ soup = self.soup(markup)
+ self.assertEqual(2, len(soup.find_all("ns1:foo")))
+
def test_processing_instruction(self):
# We test both Unicode and bytestring to verify that
# process_markup correctly sets processing_instruction_class
@@ -311,6 +320,26 @@
def test_angle_brackets_in_attribute_values_are_escaped(self):
self.assertSoupEquals('<a b="<a>"></a>', '<a b="&lt;a&gt;"></a>')
+ def test_strings_resembling_character_entity_references(self):
+ # "&T" and "&p" look like incomplete character entities, but they are
+ # not.
+ self.assertSoupEquals(
+ u"<p>&bull; AT&T is in the s&p 500</p>",
+ u"<p>\u2022 AT&amp;T is in the s&amp;p 500</p>"
+ )
+
+ def test_entities_in_foreign_document_encoding(self):
+ # &#147; and &#148; are invalid numeric entities referencing
+ # Windows-1252 characters. &#45; references a character common
+ # to Windows-1252 and Unicode, and &#9731; references a
+ # character only found in Unicode.
+ #
+ # All of these entities should be converted to Unicode
+ # characters.
+ markup = "<p>&#147;Hello&#148; &#45;&#9731;</p>"
+ soup = self.soup(markup)
+ self.assertEquals(u"“Hello” -☃", soup.p.string)
+
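The Windows-1252 mapping that the test above exercises through Beautiful Soup is the same HTML5 rule that the standard library's `html.unescape` implements: numeric references in the 0x80-0x9F range are invalid as-is and get remapped to their Windows-1252 characters. A quick stdlib demonstration:

```python
from html import unescape

# &#147;/&#148; are Windows-1252 curly quotes; &#9731; is a plain
# Unicode snowman that needs no remapping.
print(unescape("&#147;Hello&#148;"))  # “Hello”
print(unescape("&#9731;"))            # ☃
```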
def test_entities_in_attributes_converted_to_unicode(self):
expect = u'<p id="pi\N{LATIN SMALL LETTER N WITH TILDE}ata"></p>'
self.assertSoupEquals('<p id="piñata"></p>', expect)
@@ -334,7 +363,7 @@
self.assertSoupEquals("�", expect)
self.assertSoupEquals("�", expect)
self.assertSoupEquals("�", expect)
-
+
def test_multipart_strings(self):
"Mostly to prevent a recurrence of a bug in the html5lib treebuilder."
soup = self.soup("<html><h2>\nfoo</h2><p></p></html>")
@@ -624,6 +653,17 @@
self.assertEqual(
soup.encode("utf-8"), markup)
+ def test_nested_namespaces(self):
+ doc = b"""<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+<parent xmlns="http://ns1/">
+<child xmlns="http://ns2/" xmlns:ns3="http://ns3/">
+<grandchild ns3:attr="value" xmlns="http://ns4/"/>
+</child>
+</parent>"""
+ soup = self.soup(doc)
+ self.assertEqual(doc, soup.encode())
+
def test_formatter_processes_script_tag_for_xml_documents(self):
doc = """
<script type="text/javascript">
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/tests/test_htmlparser.py new/beautifulsoup4-4.6.1/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.6.0/bs4/tests/test_htmlparser.py 2017-05-07 03:30:50.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/tests/test_htmlparser.py 2018-07-15 14:26:01.000000000 +0200
@@ -5,6 +5,7 @@
import pickle
from bs4.testing import SoupTest, HTMLTreeBuilderSmokeTest
from bs4.builder import HTMLParserTreeBuilder
+from bs4.builder._htmlparser import BeautifulSoupHTMLParser
class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):
@@ -32,3 +33,17 @@
def test_redundant_empty_element_closing_tags(self):
self.assertSoupEquals('<br></br><br></br><br></br>', "<br/><br/><br/>")
self.assertSoupEquals('</br></br></br>', "")
+
+ def test_empty_element(self):
+ # This verifies that any buffered data present when the parser
+ # finishes working is handled.
+ self.assertSoupEquals("foo &# bar", "foo &amp;# bar")
+
+
+class TestHTMLParserSubclass(SoupTest):
+ def test_error(self):
+ """Verify that our HTMLParser subclass implements error() in a way
+ that doesn't cause a crash.
+ """
+ parser = BeautifulSoupHTMLParser()
+ parser.error("don't crash")
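The new `TestHTMLParserSubclass` test can be mirrored with a stdlib-only sketch. `TolerantParser` below is hypothetical, not bs4's `BeautifulSoupHTMLParser`; it just shows why overriding `error()` matters, since the base `html.parser` machinery historically left it unimplemented and a call to it would otherwise crash:

```python
from html.parser import HTMLParser

# A minimal subclass that records the error instead of raising,
# mirroring what the test above verifies for bs4's parser subclass.
class TolerantParser(HTMLParser):
    def error(self, msg):
        self.last_error = msg

parser = TolerantParser()
parser.error("don't crash")
print(parser.last_error)  # don't crash
```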
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/tests/test_lxml.py new/beautifulsoup4-4.6.1/bs4/tests/test_lxml.py
--- old/beautifulsoup4-4.6.0/bs4/tests/test_lxml.py 2016-07-27 03:45:19.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/tests/test_lxml.py 2018-07-29 00:06:35.000000000 +0200
@@ -46,6 +46,12 @@
self.assertSoupEquals(
"<p>foo�bar</p>", "<p>foobar</p>")
+ def test_entities_in_foreign_document_encoding(self):
+ # We can't implement this case correctly because by the time we
+ # hear about markup like "“", it's been (incorrectly) converted
into
+ # a string like u'\x93'
+ pass
+
# In lxml < 2.3.5, an empty doctype causes a segfault. Skip this
# test if an old version of lxml is installed.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/bs4/tests/test_tree.py new/beautifulsoup4-4.6.1/bs4/tests/test_tree.py
--- old/beautifulsoup4-4.6.0/bs4/tests/test_tree.py 2017-05-07 03:38:18.000000000 +0200
+++ new/beautifulsoup4-4.6.1/bs4/tests/test_tree.py 2018-07-29 00:17:17.000000000 +0200
@@ -605,7 +605,7 @@
</html>'''
# All that whitespace looks good but makes the tests more
# difficult. Get rid of it.
- markup = re.compile("\n\s*").sub("", markup)
+ markup = re.compile(r"\n\s*").sub("", markup)
self.tree = self.soup(markup)
@@ -703,12 +703,12 @@
"""Test the ability to create new tags."""
def test_new_tag(self):
soup = self.soup("")
- new_tag = soup.new_tag("foo", bar="baz")
+ new_tag = soup.new_tag("foo", bar="baz", attrs={"name": "a name"})
self.assertTrue(isinstance(new_tag, Tag))
self.assertEqual("foo", new_tag.name)
- self.assertEqual(dict(bar="baz"), new_tag.attrs)
+ self.assertEqual(dict(bar="baz", name="a name"), new_tag.attrs)
self.assertEqual(None, new_tag.parent)
-
+
def test_tag_inherits_self_closing_rules_from_builder(self):
if XML_BUILDER_PRESENT:
xml_soup = BeautifulSoup("", "lxml-xml")
@@ -821,6 +821,26 @@
soup = self.soup(text)
self.assertRaises(ValueError, soup.a.insert, 0, soup.a)
+ def test_insert_beautifulsoup_object_inserts_children(self):
+ """Inserting one BeautifulSoup object into another actually inserts all
+ of its children -- you'll never combine BeautifulSoup objects.
+ """
+ soup = self.soup("<p>And now, a word:</p><p>And we're back.</p>")
+
+ text = "<p>p2</p><p>p3</p>"
+ to_insert = self.soup(text)
+ soup.insert(1, to_insert)
+
+ for i in soup.descendants:
+ assert not isinstance(i, BeautifulSoup)
+
+ p1, p2, p3, p4 = list(soup.children)
+ self.assertEquals("And now, a word:", p1.string)
+ self.assertEquals("p2", p2.string)
+ self.assertEquals("p3", p3.string)
+ self.assertEquals("And we're back.", p4.string)
+
+
def test_replace_with_maintains_next_element_throughout(self):
soup = self.soup('<p><a>one</a><b>three</b></p>')
a = soup.a
@@ -1186,7 +1206,7 @@
tag = soup.bTag
self.assertEqual(soup.b, tag)
self.assertEqual(
- '.bTag is deprecated, use .find("b") instead.',
+ '.bTag is deprecated, use .find("b") instead. If you really were looking for a tag called bTag, use .find("bTag")',
str(w[0].message))
def test_has_attr(self):
@@ -1419,13 +1439,21 @@
u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"))
def test_formatter_html(self):
- markup = u"<b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"
+ markup = u"<br><b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"
soup = self.soup(markup)
decoded = soup.decode(formatter="html")
self.assertEqual(
decoded,
- self.document_for("<b>&lt;&lt;Sacr&eacute; bleu!&gt;&gt;</b>"))
+ self.document_for("<br/><b>&lt;&lt;Sacr&eacute; bleu!&gt;&gt;</b>"))
+ def test_formatter_html5(self):
+ markup = u"<br><b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"
+ soup = self.soup(markup)
+ decoded = soup.decode(formatter="html5")
+ self.assertEqual(
+ decoded,
+ self.document_for("<br><b>&lt;&lt;Sacr&eacute; bleu!&gt;&gt;</b>"))
+
def test_formatter_minimal(self):
markup = u"<b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"
soup = self.soup(markup)
@@ -1498,7 +1526,7 @@
u'<div>\n foo\n <pre> \tbar\n \n </pre>\n baz\n</div>',
soup.div.prettify())
- def test_prettify_accepts_formatter(self):
+ def test_prettify_accepts_formatter_function(self):
soup = BeautifulSoup("<html><body>foo</body></html>", 'html.parser')
pretty = soup.prettify(formatter = lambda x: x.upper())
self.assertTrue("FOO" in pretty)
@@ -2046,5 +2074,17 @@
def test_multiple_select_nested(self):
self.assertSelects('body > div > x, y > z', ['xid', 'zidb'])
-
+ def test_select_duplicate_elements(self):
+ # When markup contains duplicate elements, a multiple select
+ # will find all of them.
+ markup = '<div class="c1"/><div class="c2"/><div class="c1"/>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ selected = soup.select(".c1, .c2")
+ self.assertEquals(3, len(selected))
+
+ # Verify that find_all finds the same elements, though because
+ # of an implementation detail it finds them in a different
+ # order.
+ for element in soup.find_all(class_=['c1', 'c2']):
+ assert element in selected
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/doc/source/index.rst new/beautifulsoup4-4.6.1/doc/source/index.rst
--- old/beautifulsoup4-4.6.0/doc/source/index.rst 2017-05-07 03:37:18.000000000 +0200
+++ new/beautifulsoup4-4.6.1/doc/source/index.rst 2018-07-20 04:41:47.000000000 +0200
@@ -1271,7 +1271,7 @@
You can't use a keyword argument to search for HTML's 'name' element,
because Beautiful Soup uses the ``name`` argument to contain the name
of the tag itself. Instead, you can give a value to 'name' in the
-``attrs`` argument.
+``attrs`` argument::
name_soup = BeautifulSoup('<input name="email"/>')
name_soup.find_all(name="email")
@@ -1732,7 +1732,7 @@
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
-Find tags that match any selector from a list of selectors:
+Find tags that match any selector from a list of selectors::
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
@@ -2145,7 +2145,7 @@
You can change this behavior by providing a value for the
``formatter`` argument to ``prettify()``, ``encode()``, or
-``decode()``. Beautiful Soup recognizes four possible values for
+``decode()``. Beautiful Soup recognizes six possible values for
``formatter``.
The default is ``formatter="minimal"``. Strings will only be processed
@@ -2174,6 +2174,18 @@
# </body>
# </html>
+If you pass in ``formatter="html5"``, it's the same as
+``formatter="html"``, but Beautiful Soup will
+omit the closing slash in HTML void tags like "br"::
+
+ soup = BeautifulSoup("<br>")
+
+ print(soup.encode(formatter="html"))
+ # <html><body><br/></body></html>
+
+ print(soup.encode(formatter="html5"))
+ # <html><body><br></body></html>
+
If you pass in ``formatter=None``, Beautiful Soup will not modify
strings at all on output. This is the fastest option, but it may lead
to Beautiful Soup generating invalid HTML/XML, as in these examples::
@@ -2418,7 +2430,7 @@
as ``from_encoding``.
Here's a document written in ISO-8859-8. The document is so short that
-Unicode, Dammit can't get a good lock on it, and misidentifies it as
+Unicode, Dammit can't get a lock on it, and misidentifies it as
ISO-8859-7::
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.6.0/setup.py new/beautifulsoup4-4.6.1/setup.py
--- old/beautifulsoup4-4.6.0/setup.py 2017-05-07 15:49:03.000000000 +0200
+++ new/beautifulsoup4-4.6.1/setup.py 2018-07-29 01:14:04.000000000 +0200
@@ -5,7 +5,7 @@
setup(
name="beautifulsoup4",
- version = "4.6.0",
+ version = "4.6.1",
author="Leonard Richardson",
author_email='[email protected]',
url="http://www.crummy.com/software/BeautifulSoup/bs4/",
++++++ beautifulsoup4-lxml-fixes.patch ++++++
--- /var/tmp/diff_new_pack.fRScod/_old 2018-08-08 14:45:28.488760595 +0200
+++ /var/tmp/diff_new_pack.fRScod/_new 2018-08-08 14:45:28.492760601 +0200
@@ -1,7 +1,8 @@
-diff -ruN a/bs4/testing.py b/bs4/testing.py
---- a/bs4/testing.py 2013-06-10 15:16:25.000000000 +0200
-+++ b/bs4/testing.py 2014-01-08 16:09:35.845681062 +0100
-@@ -493,7 +493,7 @@
+Index: beautifulsoup4-4.6.1/bs4/testing.py
+===================================================================
+--- beautifulsoup4-4.6.1.orig/bs4/testing.py
++++ beautifulsoup4-4.6.1/bs4/testing.py
+@@ -677,7 +677,7 @@ class XMLTreeBuilderSmokeTest(object):
self.assertTrue(b"< < hey > >" in encoded)
def test_can_parse_unicode_document(self):
@@ -10,10 +11,11 @@
soup = self.soup(markup)
self.assertEqual(u'Sacr\xe9 bleu!', soup.root.string)
-diff -ruN a/bs4/tests/test_lxml.py b/bs4/tests/test_lxml.py
---- a/bs4/tests/test_lxml.py 2013-08-19 16:29:42.000000000 +0200
-+++ b/bs4/tests/test_lxml.py 2014-01-08 16:10:33.157497450 +0100
-@@ -61,6 +61,7 @@
+Index: beautifulsoup4-4.6.1/bs4/tests/test_lxml.py
+===================================================================
+--- beautifulsoup4-4.6.1.orig/bs4/tests/test_lxml.py
++++ beautifulsoup4-4.6.1/bs4/tests/test_lxml.py
+@@ -67,6 +67,7 @@ class LXMLTreeBuilderSmokeTest(SoupTest,
# Make sure that the deprecated BSS class uses an xml builder
# if one is installed.
with warnings.catch_warnings(record=True) as w: