Hello community,

here is the log from the commit of package python-w3lib for openSUSE:Leap:15.2 checked in at 2020-03-02 13:24:44

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2/python-w3lib (Old)
 and      /work/SRC/openSUSE:Leap:15.2/.python-w3lib.new.26092 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-w3lib" Mon Mar 2 13:24:44 2020 rev:11 rq:777269 version:1.21.0 Changes: -------- --- /work/SRC/openSUSE:Leap:15.2/python-w3lib/python-w3lib.changes 2020-01-15 15:54:08.947622176 +0100 +++ /work/SRC/openSUSE:Leap:15.2/.python-w3lib.new.26092/python-w3lib.changes 2020-03-02 13:24:44.714563932 +0100 @@ -1,0 +2,40 @@ +Thu Aug 29 13:15:56 UTC 2019 - Marketa Calabkova <[email protected]> + +- update to 1.21.1 + * Add the "encoding" and "path_encoding" parameters to + w3lib.url.safe_download_url (issue #118) + * w3lib.url.safe_url_string now also removes tabs and new lines + (issue #133) + * w3lib.html.remove_comments now also removes truncated comments + (issue #129) + * w3lib.html.remove_tags_with_content no longer removes tags which + start with the same text as one of the specified tags (issue #114) + +------------------------------------------------------------------- +Fri Mar 29 09:53:27 UTC 2019 - [email protected] + +- version update to 1.20.0 + * Fix url_query_cleaner to do not append "?" to urls without a + query string (issue #109) + * Add support for Python 3.7 and drop Python 3.3 (issue #113) + * Add `w3lib.url.add_or_replace_parameters` helper (issue #117) + * Documentation fixes (issue #115) + +------------------------------------------------------------------- +Tue Dec 4 12:56:15 UTC 2018 - Matej Cepl <[email protected]> + +- Remove superfluous devel dependency for noarch package + +------------------------------------------------------------------- +Fri Nov 16 18:49:26 UTC 2018 - Todd R <[email protected]> + +- Update to version 1.19.0 + * Add a workaround for CPython segfault (https://bugs.python.org/issue32583) + which affect w3lib.encoding functions. This is technically **backwards + incompatible** because it changes the way non-decodable bytes are replaced + (in some cases instead of two ``\ufffd`` chars you can get one). + As a side effect, the fix speeds up decoding in Python 3.4+. + * Add 'encoding' parameter for w3lib.http.basic_auth_header. + * Fix pypy testing setup, add pypy3 to CI. + +------------------------------------------------------------------- Old: ---- w3lib-1.18.0.tar.gz New: ---- w3lib-1.21.0.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ python-w3lib.spec ++++++ --- /var/tmp/diff_new_pack.tX3T3v/_old 2020-03-02 13:24:45.054564608 +0100 +++ /var/tmp/diff_new_pack.tX3T3v/_new 2020-03-02 13:24:45.058564616 +0100 @@ -1,7 +1,7 @@ # # spec file for package python-w3lib # -# Copyright (c) 2017 SUSE LINUX GmbH, Nuernberg, Germany. +# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany. # # All modifications and additions to the file contributed by third parties # remain the property of their copyright owners, unless otherwise agreed @@ -12,25 +12,25 @@ # license that conforms to the Open Source Definition (Version 1.9) # published by the Open Source Initiative. 
-# Please submit bugfixes or comments via http://bugs.opensuse.org/
+# Please submit bugfixes or comments via https://bugs.opensuse.org/
 #
 
 
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 Name:           python-w3lib
-Version:        1.18.0
+Version:        1.21.0
 Release:        0
 Summary:        Library of Web-Related Functions
 License:        BSD-3-Clause
 Group:          Development/Languages/Python
 Url:            http://github.com/scrapy/w3lib
 Source:         https://files.pythonhosted.org/packages/source/w/w3lib/w3lib-%{version}.tar.gz
-BuildRequires:  %{python_module devel}
 BuildRequires:  %{python_module setuptools}
 BuildRequires:  %{python_module six} >= 1.4.1
 BuildRequires:  fdupes
 BuildRequires:  python-rpm-macros
 BuildArch:      noarch
+
 %python_subpackages
 
 %description
@@ -70,7 +70,8 @@
 %python_exec setup.py test
 
 %files %{python_files}
-%doc README.rst LICENSE
+%doc README.rst
+%license LICENSE
 %{python_sitelib}/*
 
 %changelog

++++++ w3lib-1.18.0.tar.gz -> w3lib-1.21.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/PKG-INFO new/w3lib-1.21.0/PKG-INFO
--- old/w3lib-1.18.0/PKG-INFO   2017-08-03 15:25:28.000000000 +0200
+++ new/w3lib-1.21.0/PKG-INFO   2019-08-09 13:00:36.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: w3lib
-Version: 1.18.0
+Version: 1.21.0
 Summary: Library of web-related functions
 Home-page: https://github.com/scrapy/w3lib
 Author: Scrapy project
@@ -15,10 +15,10 @@
 Classifier: Programming Language :: Python :: 2
 Classifier: Programming Language :: Python :: 2.7
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.3
 Classifier: Programming Language :: Python :: 3.4
 Classifier: Programming Language :: Python :: 3.5
 Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
 Classifier: Topic :: Internet :: WWW/HTTP
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/README.rst new/w3lib-1.21.0/README.rst
--- old/w3lib-1.18.0/README.rst 2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/README.rst 2019-08-09 13:00:00.000000000 +0200
@@ -27,7 +27,7 @@
 Requirements
 ============
 
-Python 2.7 or Python 3.3+
+Python 2.7 or Python 3.4+
 
 Install
 =======
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/docs/conf.py new/w3lib-1.21.0/docs/conf.py
--- old/w3lib-1.18.0/docs/conf.py       2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/docs/conf.py       2019-08-09 13:00:00.000000000 +0200
@@ -53,7 +53,7 @@
 # built documents.
 #
 # The full version, including alpha/beta/rc tags.
-release = '1.18.0'
+release = '1.21.0'
 # The short X.Y version.
 version = '.'.join(release.split('.')[:2])
 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/docs/index.rst new/w3lib-1.21.0/docs/index.rst
--- old/w3lib-1.18.0/docs/index.rst     2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/docs/index.rst     2019-08-09 13:00:00.000000000 +0200
@@ -8,7 +8,7 @@
 
 * remove comments, or tags from HTML snippets
 * extract base url from HTML snippets
-* translate entites on HTML strings
+* translate entities on HTML strings
 * convert raw HTTP headers to dicts and vice-versa
 * construct HTTP auth header
 * converting HTML pages to unicode
@@ -39,7 +39,7 @@
 Tests
 =====
 
-`nose`_ is the preferred way to run tests. Just run: ``nosetests`` from the
+`pytest`_ is the preferred way to run tests. Just run: ``pytest`` from the
 root directory to execute tests using the default Python interpreter.
 
 `tox`_ could be used to run tests for all supported Python versions.
@@ -48,7 +48,7 @@
 Python interpreters.
 
 .. _tox: http://tox.testrun.org
-.. _nose: http://readthedocs.org/docs/nose/en/latest/
+.. _pytest: https://docs.pytest.org/en/latest/
 
 
 Changelog
@@ -74,4 +74,3 @@
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`
-
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/setup.py new/w3lib-1.21.0/setup.py
--- old/w3lib-1.18.0/setup.py   2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/setup.py   2019-08-09 13:00:00.000000000 +0200
@@ -3,7 +3,7 @@
 
 setup(
     name='w3lib',
-    version='1.18.0',
+    version='1.21.0',
     license='BSD',
     description='Library of web-related functions',
     author='Scrapy project',
@@ -21,10 +21,10 @@
         'Programming Language :: Python :: 2',
         'Programming Language :: Python :: 2.7',
         'Programming Language :: Python :: 3',
-        'Programming Language :: Python :: 3.3',
         'Programming Language :: Python :: 3.4',
         'Programming Language :: Python :: 3.5',
         'Programming Language :: Python :: 3.6',
+        'Programming Language :: Python :: 3.7',
         'Programming Language :: Python :: Implementation :: CPython',
         'Programming Language :: Python :: Implementation :: PyPy',
         'Topic :: Internet :: WWW/HTTP',
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/tests/test_encoding.py new/w3lib-1.21.0/tests/test_encoding.py
--- old/w3lib-1.18.0/tests/test_encoding.py     2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/tests/test_encoding.py     2019-08-09 13:00:00.000000000 +0200
@@ -144,9 +144,9 @@
     def test_invalid_utf8_encoded_body_with_valid_utf8_BOM(self):
         # unlike scrapy, the BOM is stripped
         self._assert_encoding('utf-8', b"\xef\xbb\xbfWORD\xe3\xabWORD2",
-                              'utf-8', u'WORD\ufffd\ufffdWORD2')
+                              'utf-8', u'WORD\ufffdWORD2')
         self._assert_encoding(None, b"\xef\xbb\xbfWORD\xe3\xabWORD2",
-                              'utf-8', u'WORD\ufffd\ufffdWORD2')
+                              'utf-8', u'WORD\ufffdWORD2')
 
     def test_utf8_unexpected_end_of_data_with_valid_utf8_BOM(self):
         # Python implementations handle unexpected end of UTF8 data
@@ -220,6 +220,18 @@
         self._assert_encoding('utf-16', u"hi".encode('utf-16-be'), 'utf-16-be', u"hi")
         self._assert_encoding('utf-32', u"hi".encode('utf-32-be'), 'utf-32-be', u"hi")
 
+    def test_python_crash(self):
+        import random
+        from io import BytesIO
+        random.seed(42)
+        buf = BytesIO()
+        for i in range(150000):
+            buf.write(bytes([random.randint(0, 255)]))
+        to_unicode(buf.getvalue(), 'utf-16-le')
+        to_unicode(buf.getvalue(), 'utf-16-be')
+        to_unicode(buf.getvalue(), 'utf-32-le')
+        to_unicode(buf.getvalue(), 'utf-32-be')
+
     def test_html_encoding(self):
         # extracting the encoding from raw html is tested elsewhere
         body = b"""blah blah < meta http-equiv="Content-Type"
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/tests/test_html.py new/w3lib-1.21.0/tests/test_html.py
--- old/w3lib-1.18.0/tests/test_html.py 2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/tests/test_html.py 2019-08-09 13:00:00.000000000 +0200
@@ -106,6 +106,8 @@
         self.assertEqual(remove_comments(b"test <!--textcoment--> whatever"), u'test whatever')
         self.assertEqual(remove_comments(b"test <!--\ntextcoment\n--> whatever"), u'test whatever')
 
+        self.assertEqual(remove_comments(b"test <!--"), u'test ')
+
 
 class RemoveTagsTest(unittest.TestCase):
     def test_returns_unicode(self):
@@ -184,6 +186,10 @@
         # text with empty tags
         self.assertEqual(remove_tags_with_content(u'<br/>a<br />', which_ones=('br',)), u'a')
 
+    def test_tags_with_shared_prefix(self):
+        # https://github.com/scrapy/w3lib/issues/114
+        self.assertEqual(remove_tags_with_content(u'<span></span><s></s>', which_ones=('s',)), u'<span></span>')
+
 
 class ReplaceEscapeCharsTest(unittest.TestCase):
     def test_returns_unicode(self):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/tests/test_http.py new/w3lib-1.21.0/tests/test_http.py
--- old/w3lib-1.18.0/tests/test_http.py 2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/tests/test_http.py 2019-08-09 13:00:00.000000000 +0200
@@ -1,3 +1,5 @@
+# -*- coding: utf-8 -*-
+
 import unittest
 from collections import OrderedDict
 from w3lib.http import (basic_auth_header,
@@ -14,6 +16,13 @@
         self.assertEqual(b'Basic c29tZXVzZXI6QDx5dTk-Jm8_UQ==',
                          basic_auth_header('someuser', '@<yu9>&o?Q'))
 
+    def test_basic_auth_header_encoding(self):
+        self.assertEqual(b'Basic c29tw6Z1c8Oocjpzw7htZXDDpHNz',
+                         basic_auth_header(u'somæusèr', u'sømepäss', encoding='utf8'))
+        # default encoding (ISO-8859-1)
+        self.assertEqual(b'Basic c29t5nVz6HI6c_htZXDkc3M=',
+                         basic_auth_header(u'somæusèr', u'sømepäss'))
+
     def test_headers_raw_dict_none(self):
         self.assertIsNone(headers_raw_to_dict(None))
         self.assertIsNone(headers_dict_to_raw(None))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/tests/test_url.py new/w3lib-1.21.0/tests/test_url.py
--- old/w3lib-1.18.0/tests/test_url.py  2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/tests/test_url.py  2019-08-09 13:00:00.000000000 +0200
@@ -5,7 +5,7 @@
 from w3lib.url import (is_url, safe_url_string, safe_download_url,
     url_query_parameter, add_or_replace_parameter, url_query_cleaner,
     file_uri_to_path, parse_data_uri, path_to_file_uri, any_to_uri,
-    urljoin_rfc, canonicalize_url, parse_url)
+    urljoin_rfc, canonicalize_url, parse_url, add_or_replace_parameters)
 from six.moves.urllib.parse import urlparse
 
 
@@ -59,6 +59,20 @@
         self.assertTrue(isinstance(safe_url_string(b'http://example.com/'), str))
 
+    def test_safe_url_string_remove_ascii_tab_and_newlines(self):
+        self.assertEqual(safe_url_string("http://example.com/test\n.html"),
+                         "http://example.com/test.html")
+        self.assertEqual(safe_url_string("http://example.com/test\t.html"),
+                         "http://example.com/test.html")
+        self.assertEqual(safe_url_string("http://example.com/test\r.html"),
+                         "http://example.com/test.html")
+        self.assertEqual(safe_url_string("http://example.com/test\r.html\n"),
+                         "http://example.com/test.html")
+        self.assertEqual(safe_url_string("http://example.com/test\r\n.html\t"),
+                         "http://example.com/test.html")
+        self.assertEqual(safe_url_string("http://example.com/test\a\n.html"),
+                         "http://example.com/test%07.html")
+
     def test_safe_url_string_unsafe_chars(self):
         safeurl = safe_url_string(r"http://localhost:8001/unwise{,},|,\,^,[,],`?|=[]&[]=|")
         self.assertEqual(safeurl, r"http://localhost:8001/unwise%7B,%7D,|,%5C,%5E,[,],%60?|=[]&[]=|")
@@ -203,6 +217,19 @@
                          'http://www.example.org/image')
         self.assertEqual(safe_download_url('http://www.example.org/dir/'),
                          'http://www.example.org/dir/')
+        self.assertEqual(safe_download_url(b'http://www.example.org/dir/'),
+                         'http://www.example.org/dir/')
+
+        # Encoding related tests
+        self.assertEqual(safe_download_url(b'http://www.example.org?\xa3',
+                                           encoding='latin-1', path_encoding='latin-1'),
+                         'http://www.example.org/?%A3')
+        self.assertEqual(safe_download_url(b'http://www.example.org?\xc2\xa3',
+                                           encoding='utf-8', path_encoding='utf-8'),
+                         'http://www.example.org/?%C2%A3')
+        self.assertEqual(safe_download_url(b'http://www.example.org/\xc2\xa3?\xc2\xa3',
+                                           encoding='utf-8', path_encoding='latin-1'),
+                         'http://www.example.org/%A3?%C2%A3')
 
     def test_is_url(self):
         self.assertTrue(is_url('http://www.example.org'))
@@ -283,7 +310,21 @@
         self.assertEqual(add_or_replace_parameter(url, 'pageurl', 'test'),
                          'http://example.com/?version=1&pageurl=test&param2=value2')
 
+    def test_add_or_replace_parameters(self):
+        url = 'http://domain/test'
+        self.assertEqual(add_or_replace_parameters(url, {'arg': 'v'}),
+                         'http://domain/test?arg=v')
+        url = 'http://domain/test?arg1=v1&arg2=v2&arg3=v3'
+        self.assertEqual(add_or_replace_parameters(url, {'arg4': 'v4'}),
+                         'http://domain/test?arg1=v1&arg2=v2&arg3=v3&arg4=v4')
+        self.assertEqual(add_or_replace_parameters(url, {'arg4': 'v4', 'arg3': 'v3new'}),
+                         'http://domain/test?arg1=v1&arg2=v2&arg3=v3new&arg4=v4')
+
     def test_url_query_cleaner(self):
+        self.assertEqual('product.html',
+                         url_query_cleaner("product.html?"))
+        self.assertEqual('product.html',
+                         url_query_cleaner("product.html?&"))
         self.assertEqual('product.html?id=200',
                          url_query_cleaner("product.html?id=200&foo=bar&name=wired", ['id']))
         self.assertEqual('product.html?id=200',
@@ -308,6 +349,10 @@
                          url_query_cleaner("product.html?id=2&foo=bar&name=wired", ['id', 'foo'], remove=True))
         self.assertEqual('product.html?foo=bar&name=wired',
                          url_query_cleaner("product.html?id=2&foo=bar&name=wired", ['id', 'footo'], remove=True))
+        self.assertEqual('product.html',
+                         url_query_cleaner("product.html", ['id'], remove=True))
+        self.assertEqual('product.html',
+                         url_query_cleaner("product.html?&", ['id'], remove=True))
         self.assertEqual('product.html?foo=bar',
                          url_query_cleaner("product.html?foo=bar&name=wired", 'foo'))
         self.assertEqual('product.html?foobar=wired',
@@ -321,7 +366,7 @@
 
     def test_path_to_file_uri(self):
         if os.name == 'nt':
-            self.assertEqual(path_to_file_uri("C:\\windows\clock.avi"),
+            self.assertEqual(path_to_file_uri(r"C:\\windows\clock.avi"),
                              "file:///C:/windows/clock.avi")
         else:
             self.assertEqual(path_to_file_uri("/some/path.txt"),
@@ -329,13 +374,13 @@
 
         fn = "test.txt"
         x = path_to_file_uri(fn)
-        self.assert_(x.startswith('file:///'))
+        self.assertTrue(x.startswith('file:///'))
         self.assertEqual(file_uri_to_path(x).lower(), os.path.abspath(fn).lower())
 
     def test_file_uri_to_path(self):
         if os.name == 'nt':
             self.assertEqual(file_uri_to_path("file:///C:/windows/clock.avi"),
-                             "C:\\windows\clock.avi")
+                             r"C:\\windows\clock.avi")
             uri = "file:///C:/windows/clock.avi"
             uri2 = path_to_file_uri(file_uri_to_path(uri))
             self.assertEqual(uri, uri2)
@@ -353,7 +398,7 @@
 
     def test_any_to_uri(self):
         if os.name == 'nt':
-            self.assertEqual(any_to_uri("C:\\windows\clock.avi"),
+            self.assertEqual(any_to_uri(r"C:\\windows\clock.avi"),
                              "file:///C:/windows/clock.avi")
         else:
             self.assertEqual(any_to_uri("/some/path.txt"),
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/tox.ini new/w3lib-1.21.0/tox.ini
--- old/w3lib-1.18.0/tox.ini    2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/tox.ini    2019-08-09 13:00:00.000000000 +0200
@@ -4,7 +4,7 @@
 # and then run "tox" from this directory.
 
 [tox]
-envlist = py27, pypy, py33, py34, py35, py36
+envlist = py27, pypy, py34, py35, py36, py37, pypy3
 
 [testenv]
 deps =
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib/__init__.py new/w3lib-1.21.0/w3lib/__init__.py
--- old/w3lib-1.18.0/w3lib/__init__.py  2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/w3lib/__init__.py  2019-08-09 13:00:00.000000000 +0200
@@ -1,3 +1,3 @@
-__version__ = "1.18.0"
+__version__ = "1.21.0"
 version_info = tuple(int(v) if v.isdigit() else v
                      for v in __version__.split('.'))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib/encoding.py new/w3lib-1.21.0/w3lib/encoding.py
--- old/w3lib-1.18.0/w3lib/encoding.py  2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/w3lib/encoding.py  2019-08-09 13:00:00.000000000 +0200
@@ -3,6 +3,7 @@
 Functions for handling encoding of web pages
 """
 import re, codecs, encodings
+from sys import version_info
 
 _HEADER_ENCODING_RE = re.compile(r'charset=([\w-]+)', re.I)
 
@@ -22,7 +23,7 @@
 
 # regexp for parsing HTTP meta tags
 _TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
-_SKIP_ATTRS = '''(?x)(?:\\s+
+_SKIP_ATTRS = '''(?:\\s+
 [^=<>/\\s"'\x00-\x1f\x7f]+ # Attribute name
 (?:\\s*=\\s*
 (?: # ' and " are entity encoded (&apos;, &quot;), so no need for \', \"
@@ -32,7 +33,7 @@
 |
 [^'"\\s]+ # attr having no ' nor "
 ))?
-)*?'''
+)*?''' # must be used with re.VERBOSE flag
 _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')
 _CONTENT_RE = _TEMPLATE % ('content', r'(?P<mime>[^;]+);\s*charset=(?P<charset>[\w-]+)')
 _CONTENT2_RE = _TEMPLATE % ('charset', r'(?P<charset2>[\w-]+)')
@@ -41,8 +42,9 @@
 # check for meta tags, or xml decl. and stop search if a body tag is encountered
 _BODY_ENCODING_PATTERN = r'<\s*(?:meta%s(?:(?:\s+%s|\s+%s){2}|\s+%s)|\?xml\s[^>]+%s|body)' % (
     _SKIP_ATTRS, _HTTPEQUIV_RE, _CONTENT_RE, _CONTENT2_RE, _XML_ENCODING_RE)
-_BODY_ENCODING_STR_RE = re.compile(_BODY_ENCODING_PATTERN, re.I)
-_BODY_ENCODING_BYTES_RE = re.compile(_BODY_ENCODING_PATTERN.encode('ascii'), re.I)
+_BODY_ENCODING_STR_RE = re.compile(_BODY_ENCODING_PATTERN, re.I | re.VERBOSE)
+_BODY_ENCODING_BYTES_RE = re.compile(_BODY_ENCODING_PATTERN.encode('ascii'),
+                                     re.I | re.VERBOSE)
 
 def html_body_declared_encoding(html_body_str):
     '''Return the encoding specified in meta tags in the html body,
@@ -173,7 +175,7 @@
 
 # Python decoder doesn't follow unicode standard when handling
 # bad utf-8 encoded strings. see http://bugs.python.org/issue8271
-codecs.register_error('w3lib_replace', lambda exc: (u'\ufffd', exc.start+1))
+codecs.register_error('w3lib_replace', lambda exc: (u'\ufffd', exc.end))
 
 def to_unicode(data_str, encoding):
     """Convert a str object to unicode using the encoding given
@@ -181,7 +183,7 @@
     Characters that cannot be converted will be converted to ``\\ufffd``
     (the unicode replacement character).
     """
-    return data_str.decode(encoding, 'w3lib_replace')
+    return data_str.decode(encoding, 'replace' if version_info[0:2] >= (3, 3) else 'w3lib_replace')
 
 def html_to_unicode(content_type_header, html_body_str,
                     default_encoding='utf8', auto_detect_fun=None):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib/html.py new/w3lib-1.21.0/w3lib/html.py
--- old/w3lib-1.18.0/w3lib/html.py      2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/w3lib/html.py      2019-08-09 13:00:00.000000000 +0200
@@ -122,7 +122,7 @@
 
     return _tag_re.sub(token, to_unicode(text, encoding))
 
-_REMOVECOMMENTS_RE = re.compile(u'<!--.*?-->', re.DOTALL)
+_REMOVECOMMENTS_RE = re.compile(u'<!--.*?(?:-->|$)', re.DOTALL)
 def remove_comments(text, encoding=None):
     """ Remove HTML Comments.
 
@@ -220,7 +220,7 @@
 
     text = to_unicode(text, encoding)
     if which_ones:
-        tags = '|'.join([r'<%s.*?</%s>|<%s\s*/>' % (tag, tag, tag) for tag in which_ones])
+        tags = '|'.join([r'<%s\b.*?</%s>|<%s\s*/>' % (tag, tag, tag) for tag in which_ones])
         retags = re.compile(tags, re.DOTALL | re.IGNORECASE)
         text = retags.sub(u'', text)
     return text
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib/http.py new/w3lib-1.21.0/w3lib/http.py
--- old/w3lib-1.18.0/w3lib/http.py      2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/w3lib/http.py      2019-08-09 13:00:00.000000000 +0200
@@ -78,7 +78,7 @@
 
     return b'\r\n'.join(raw_lines)
 
-def basic_auth_header(username, password):
+def basic_auth_header(username, password, encoding='ISO-8859-1'):
     """
     Return an `Authorization` header field value for
     `HTTP Basic Access Authentication (RFC 2617)`_
@@ -95,5 +95,5 @@
     # XXX: RFC 2617 doesn't define encoding, but ISO-8859-1
     # seems to be the most widely used encoding here. See also:
     # http://greenbytes.de/tech/webdav/draft-ietf-httpauth-basicauth-enc-latest.html
-    auth = auth.encode('ISO-8859-1')
+    auth = auth.encode(encoding)
     return b'Basic ' + urlsafe_b64encode(auth)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib/url.py new/w3lib-1.21.0/w3lib/url.py
--- old/w3lib-1.18.0/w3lib/url.py       2017-08-03 15:24:36.000000000 +0200
+++ new/w3lib-1.21.0/w3lib/url.py       2019-08-09 13:00:00.000000000 +0200
@@ -9,7 +9,7 @@
 import posixpath
 import warnings
 import string
-from collections import namedtuple
+from collections import namedtuple, OrderedDict
 import six
 from six.moves.urllib.parse import (urljoin, urlsplit, urlunsplit,
                                     urldefrag, urlencode, urlparse,
@@ -34,9 +34,12 @@
 
 _safe_chars = RFC3986_RESERVED + RFC3986_UNRESERVED + EXTRA_SAFE_CHARS + b'%'
 
+_ascii_tab_newline_re = re.compile(r'[\t\n\r]')  # see https://infra.spec.whatwg.org/#ascii-tab-or-newline
+
 def safe_url_string(url, encoding='utf8', path_encoding='utf8'):
     """Convert the given URL into a legal URL by escaping unsafe characters
-    according to RFC-3986.
+    according to RFC-3986. Also, ASCII tabs and newlines are removed
+    as per https://url.spec.whatwg.org/#url-parsing.
 
     If a bytes URL is given, it is first converted to `str` using the given
     encoding (which defaults to 'utf-8'). 'utf-8' encoding is used for
@@ -56,8 +59,8 @@
     #   encoded with the supplied encoding (or UTF8 by default)
     # - if the supplied (or default) encoding chokes,
     #   percent-encode offending bytes
-    parts = urlsplit(to_unicode(url, encoding=encoding,
-                                errors='percentencode'))
+    decoded = to_unicode(url, encoding=encoding, errors='percentencode')
+    parts = urlsplit(_ascii_tab_newline_re.sub('', decoded))
 
     # IDNA encoding can fail for too long labels (>63 characters)
     # or missing labels (e.g. http://.example.com)
@@ -84,7 +87,7 @@
 
 _parent_dirs = re.compile(r'/?(\.\./)+')
 
-def safe_download_url(url):
+def safe_download_url(url, encoding='utf8', path_encoding='utf8'):
     """ Make a url for download. This will call safe_url_string
     and then strip the fragment, if one exists. The path will
     be normalised.
@@ -92,11 +95,11 @@
     If the path is outside the document root, it will be changed
     to be within the document root.
     """
-    safe_url = safe_url_string(url)
+    safe_url = safe_url_string(url, encoding, path_encoding)
     scheme, netloc, path, query, _ = urlsplit(safe_url)
     if path:
         path = _parent_dirs.sub('', posixpath.normpath(path))
-        if url.endswith('/') and not path.endswith('/'):
+        if safe_url.endswith('/') and not path.endswith('/'):
             path += '/'
     else:
         path = '/'
@@ -182,6 +185,8 @@
     seen = set()
     querylist = []
     for ksv in query.split(sep):
+        if not ksv:
+            continue
         k, _, _ = ksv.partition(kvsep)
         if unique and k in seen:
             continue
@@ -198,6 +203,17 @@
 
     return url
 
+def _add_or_replace_parameters(url, params):
+    parsed = urlsplit(url)
+    args = parse_qsl(parsed.query, keep_blank_values=True)
+
+    new_args = OrderedDict(args)
+    new_args.update(params)
+
+    query = urlencode(new_args)
+    return urlunsplit(parsed._replace(query=query))
+
+
 def add_or_replace_parameter(url, name, new_value):
     """Add or remove a parameter to a given url
 
@@ -211,23 +227,22 @@
     >>>
 
     """
-    parsed = urlsplit(url)
-    args = parse_qsl(parsed.query, keep_blank_values=True)
+    return _add_or_replace_parameters(url, {name: new_value})
 
-    new_args = []
-    found = False
-    for name_, value_ in args:
-        if name_ == name:
-            new_args.append((name_, new_value))
-            found = True
-        else:
-            new_args.append((name_, value_))
-    if not found:
-        new_args.append((name, new_value))
 
-    query = urlencode(new_args)
-    return urlunsplit(parsed._replace(query=query))
+def add_or_replace_parameters(url, new_parameters):
+    """Add or remove a parameters to a given url
+
+    >>> import w3lib.url
+    >>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php', {'arg': 'v'})
+    'http://www.example.com/index.php?arg=v'
+    >>> args = {'arg4': 'v4', 'arg3': 'v3new'}
+    >>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', args)
+    'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3new&arg4=v4'
+    >>>
+
+    """
+    return _add_or_replace_parameters(url, new_parameters)
 
 
 def path_to_file_uri(path):
@@ -291,6 +306,7 @@
 
 _ParseDataURIResult = namedtuple("ParseDataURIResult",
                                  "media_type media_type_parameters data")
 
+
 def parse_data_uri(uri):
     """
 
@@ -355,6 +371,7 @@
 
 
 __all__ = ["add_or_replace_parameter",
+           "add_or_replace_parameters",
            "any_to_uri",
            "canonicalize_url",
            "file_uri_to_path",
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/w3lib-1.18.0/w3lib.egg-info/PKG-INFO new/w3lib-1.21.0/w3lib.egg-info/PKG-INFO
--- old/w3lib-1.18.0/w3lib.egg-info/PKG-INFO    2017-08-03 15:25:28.000000000 +0200
+++ new/w3lib-1.21.0/w3lib.egg-info/PKG-INFO    2019-08-09 13:00:36.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: w3lib
-Version: 1.18.0
+Version: 1.21.0
 Summary: Library of web-related functions
 Home-page: https://github.com/scrapy/w3lib
 Author: Scrapy project
@@ -15,10 +15,10 @@
 Classifier: Programming Language :: Python :: 2
 Classifier: Programming Language :: Python :: 2.7
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.3
 Classifier: Programming Language :: Python :: 3.4
 Classifier: Programming Language :: Python :: 3.5
 Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
 Classifier: Topic :: Internet :: WWW/HTTP
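
As a quick smoke test of the URL helpers touched by this update (the new add_or_replace_parameters helper, the encoding/path_encoding parameters of safe_download_url, the tab/newline stripping in safe_url_string, and the url_query_cleaner fix), the short sketch below can be run against the installed package. The expected values are copied from tests/test_url.py in the tarball above; the script itself is only an illustration and is not shipped or executed by the spec file.

    # Exercise the w3lib 1.21.0 URL helpers; expected values come from
    # tests/test_url.py shipped in the new tarball.
    from w3lib.url import (add_or_replace_parameters, safe_download_url,
                           safe_url_string, url_query_cleaner)

    # New helper: update several query parameters at once (issue #117).
    assert add_or_replace_parameters(
        'http://domain/test?arg1=v1&arg2=v2&arg3=v3',
        {'arg4': 'v4', 'arg3': 'v3new'},
    ) == 'http://domain/test?arg1=v1&arg2=v2&arg3=v3new&arg4=v4'

    # New encoding/path_encoding parameters of safe_download_url (issue #118).
    assert safe_download_url(
        b'http://www.example.org?\xc2\xa3',
        encoding='utf-8', path_encoding='utf-8',
    ) == 'http://www.example.org/?%C2%A3'

    # ASCII tabs and newlines are now stripped by safe_url_string (issue #133).
    assert safe_url_string("http://example.com/test\r\n.html\t") == \
        "http://example.com/test.html"

    # url_query_cleaner no longer appends "?" to URLs without a query (issue #109).
    assert url_query_cleaner("product.html?&") == 'product.html'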

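The HTML and HTTP changes can be checked the same way. Again, the expected values are taken from tests/test_html.py and tests/test_http.py in the tarball; this is only an illustrative sketch, not part of the package.

    # -*- coding: utf-8 -*-
    # Exercise the w3lib 1.21.0 HTML/HTTP behaviour changes; expected values
    # come from tests/test_html.py and tests/test_http.py above.
    from w3lib.html import remove_comments, remove_tags_with_content
    from w3lib.http import basic_auth_header

    # Truncated comments are now removed as well (issue #129).
    assert remove_comments(b"test <!--") == u'test '

    # Tags that merely share a prefix with a listed tag are kept (issue #114).
    assert remove_tags_with_content(u'<span></span><s></s>',
                                    which_ones=('s',)) == u'<span></span>'

    # basic_auth_header() gained an 'encoding' parameter; ISO-8859-1 stays the default.
    assert basic_auth_header(u'somæusèr', u'sømepäss', encoding='utf8') == \
        b'Basic c29tw6Z1c8Oocjpzw7htZXDDpHNz'
    assert basic_auth_header(u'somæusèr', u'sømepäss') == \
        b'Basic c29t5nVz6HI6c_htZXDkc3M='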