Script 'mail_helper' called by obssrc
Hello community,

here is the log from the commit of package python-charset-normalizer for openSUSE:Factory checked in at 2022-09-18 17:31:58
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-charset-normalizer (Old)
 and      /work/SRC/openSUSE:Factory/.python-charset-normalizer.new.2083 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-charset-normalizer"

Sun Sep 18 17:31:58 2022 rev:15 rq:1004361 version:2.1.1

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-charset-normalizer/python-charset-normalizer.changes	2022-08-20 20:27:51.741219415 +0200
+++ /work/SRC/openSUSE:Factory/.python-charset-normalizer.new.2083/python-charset-normalizer.changes	2022-09-18 17:32:00.929734658 +0200
@@ -1,0 +2,7 @@
+Sat Sep 17 15:46:10 UTC 2022 - Dirk Müller <dmuel...@suse.com>
+
+- update to 2.1.1:
+  * Function `normalize` scheduled for removal in 3.0
+  * Removed useless call to decode in fn is_unprintable (#206)
+
+-------------------------------------------------------------------

Old:
----
  charset_normalizer-2.1.0.tar.gz

New:
----
  charset_normalizer-2.1.1.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-charset-normalizer.spec ++++++
--- /var/tmp/diff_new_pack.1YRidd/_old	2022-09-18 17:32:01.469736234 +0200
+++ /var/tmp/diff_new_pack.1YRidd/_new	2022-09-18 17:32:01.477736257 +0200
@@ -19,7 +19,7 @@
 %{?!python_module:%define python_module() python3-%{**}}
 %define skip_python2 1
 Name:           python-charset-normalizer
-Version:        2.1.0
+Version:        2.1.1
 Release:        0
 Summary:        Python Universal Charset detector
 License:        MIT

++++++ charset_normalizer-2.1.0.tar.gz -> charset_normalizer-2.1.1.tar.gz ++++++

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/.github/workflows/lint.yml new/charset_normalizer-2.1.1/.github/workflows/lint.yml
--- old/charset_normalizer-2.1.0/.github/workflows/lint.yml	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/.github/workflows/lint.yml	2022-08-20 00:06:12.000000000 +0200
@@ -28,7 +28,7 @@
         python setup.py install
     - name: Type checking (Mypy)
       run: |
-        mypy charset_normalizer
+        mypy --strict charset_normalizer
     - name: Import sorting check (isort)
       run: |
         isort --check charset_normalizer
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/CHANGELOG.md new/charset_normalizer-2.1.1/CHANGELOG.md
--- old/charset_normalizer-2.1.0/CHANGELOG.md	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/CHANGELOG.md	2022-08-20 00:06:12.000000000 +0200
@@ -2,6 +2,17 @@
 All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)
+
+### Deprecated
+- Function `normalize` scheduled for removal in 3.0
+
+### Changed
+- Removed useless call to decode in fn is_unprintable (#206)
+
+### Fixed
+- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)
+
 ## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)
 
 ### Added
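For downstream code still calling the `normalize` function deprecated above, a minimal migration sketch built on `from_path` (the helper name and the output-file naming below are illustrative, not library API):

    from pathlib import Path

    from charset_normalizer import from_path

    def normalize_to_utf8(path: str) -> Path:
        # Detect the charset, as normalize() did internally.
        best = from_path(path).best()
        if best is None:
            raise IOError("Unable to detect a suitable charset for " + path)
        source = Path(path)
        # Write a UTF-8 copy next to the original file; str(best) is the
        # payload decoded with the detected encoding.
        target = source.with_name(source.stem + "-utf8" + source.suffix)
        target.write_bytes(str(best).encode("utf-8"))
        return target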
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/README.md new/charset_normalizer-2.1.1/README.md
--- old/charset_normalizer-2.1.0/README.md	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/README.md	2022-08-20 00:06:12.000000000 +0200
@@ -29,11 +29,12 @@
 | `Universal**` | ❌ | :heavy_check_mark: | ❌ |
 | `Reliable` **without** distinguishable standards | ❌ | :heavy_check_mark: | :heavy_check_mark: |
 | `Reliable` **with** distinguishable standards | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
-| `Free & Open` | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
-| `License` | LGPL-2.1 | MIT | MPL-1.1
+| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
 | `Native Python` | :heavy_check_mark: | :heavy_check_mark: | ❌ |
 | `Detect spoken language` | ❌ | :heavy_check_mark: | N/A |
-| `Supported Encoding` | 30 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40
+| `UnicodeDecodeError Safety` | ❌ | :heavy_check_mark: | ❌ |
+| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
+| `Supported Encoding` | 33 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40
 
 <p align="center">
 <img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
@@ -51,7 +52,7 @@
 
 | Package | Accuracy | Mean per file (ms) | File per sec (est) |
 | ------------- | :-------------: | :------------------: | :------------------: |
-| [chardet](https://github.com/chardet/chardet) | 92 % | 200 ms | 5 file/sec |
+| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
 | charset-normalizer | **98 %** | **39 ms** | 26 file/sec |
 
 | Package | 99th percentile | 95th percentile | 50th percentile |
@@ -64,6 +65,8 @@
 
 > Stats are generated using 400+ files using default parameters. More details on used files, see GHA workflows.
 > And yes, these results might change at any time. The dataset can be updated to include more files.
 > The actual delays heavily depends on your CPU capabilities. The factors should remain the same.
+> Keep in mind that the stats are generous and that Chardet accuracy vs our is measured using Chardet initial capability
+> (eg. Supported Encoding) Challenge-them if you want.
 
 [cchardet](https://github.com/PyYoshi/cChardet) is a non-native (cpp binding) and unmaintained faster alternative with a better accuracy than chardet but lower than this package. If speed is the most important factor, you should try it.
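As a concrete illustration of the "UnicodeDecodeError Safety" row in the table above, a short usage sketch (the sample payload is illustrative):

    from charset_normalizer import from_bytes

    payload = "Всеки човек има право на образование.".encode("cp1251")

    best = from_bytes(payload).best()
    if best is not None:
        # The returned encoding is guaranteed to decode the payload, hence
        # "UnicodeDecodeError Safety". It may be any cp1251-compatible code
        # page, not necessarily cp1251 itself.
        print(best.encoding)
        print(str(best))  # payload decoded with the detected encoding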
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/__init__.py new/charset_normalizer-2.1.1/charset_normalizer/__init__.py
--- old/charset_normalizer-2.1.0/charset_normalizer/__init__.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/__init__.py	2022-08-20 00:06:12.000000000 +0200
@@ -1,4 +1,4 @@
-# -*- coding: utf_8 -*-
+# -*- coding: utf-8 -*-
 """
 Charset-Normalizer
 ~~~~~~~~~~~~~~
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/api.py new/charset_normalizer-2.1.1/charset_normalizer/api.py
--- old/charset_normalizer-2.1.0/charset_normalizer/api.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/api.py	2022-08-20 00:06:12.000000000 +0200
@@ -1,7 +1,8 @@
 import logging
+import warnings
 from os import PathLike
 from os.path import basename, splitext
-from typing import BinaryIO, List, Optional, Set
+from typing import Any, BinaryIO, List, Optional, Set
 
 from .cd import (
     coherence_ratio,
@@ -36,8 +37,8 @@
     steps: int = 5,
     chunk_size: int = 512,
     threshold: float = 0.2,
-    cp_isolation: List[str] = None,
-    cp_exclusion: List[str] = None,
+    cp_isolation: Optional[List[str]] = None,
+    cp_exclusion: Optional[List[str]] = None,
     preemptive_behaviour: bool = True,
     explain: bool = False,
 ) -> CharsetMatches:
@@ -486,8 +487,8 @@
     steps: int = 5,
     chunk_size: int = 512,
     threshold: float = 0.20,
-    cp_isolation: List[str] = None,
-    cp_exclusion: List[str] = None,
+    cp_isolation: Optional[List[str]] = None,
+    cp_exclusion: Optional[List[str]] = None,
     preemptive_behaviour: bool = True,
     explain: bool = False,
 ) -> CharsetMatches:
@@ -508,12 +509,12 @@
 
 
 def from_path(
-    path: PathLike,
+    path: "PathLike[Any]",
     steps: int = 5,
     chunk_size: int = 512,
     threshold: float = 0.20,
-    cp_isolation: List[str] = None,
-    cp_exclusion: List[str] = None,
+    cp_isolation: Optional[List[str]] = None,
+    cp_exclusion: Optional[List[str]] = None,
     preemptive_behaviour: bool = True,
     explain: bool = False,
 ) -> CharsetMatches:
@@ -535,17 +536,22 @@
 
 
 def normalize(
-    path: PathLike,
+    path: "PathLike[Any]",
     steps: int = 5,
     chunk_size: int = 512,
     threshold: float = 0.20,
-    cp_isolation: List[str] = None,
-    cp_exclusion: List[str] = None,
+    cp_isolation: Optional[List[str]] = None,
+    cp_exclusion: Optional[List[str]] = None,
     preemptive_behaviour: bool = True,
 ) -> CharsetMatch:
     """
     Take a (text-based) file path and try to create another file next to it, this time using UTF-8.
     """
+    warnings.warn(
+        "normalize is deprecated and will be removed in 3.0",
+        DeprecationWarning,
+    )
+
     results = from_path(
         path,
         steps,
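The recurring `List[str] = None` to `Optional[List[str]] = None` changes follow from the switch to `mypy --strict` in lint.yml above: strict mode disables implicit Optional, so a `None` default no longer implies an Optional parameter. A minimal sketch of the rule (function names are illustrative):

    from typing import List, Optional

    # Rejected under mypy --strict: a None default does not imply Optional.
    def old_style(cp_isolation: List[str] = None) -> None:
        ...

    # Accepted: the Optional is spelled out.
    def new_style(cp_isolation: Optional[List[str]] = None) -> None:
        ...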
""" + warnings.warn( + "normalize is deprecated and will be removed in 3.0", + DeprecationWarning, + ) + results = from_path( path, steps, diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/assets/__init__.py new/charset_normalizer-2.1.1/charset_normalizer/assets/__init__.py --- old/charset_normalizer-2.1.0/charset_normalizer/assets/__init__.py 2022-06-19 23:55:20.000000000 +0200 +++ new/charset_normalizer-2.1.1/charset_normalizer/assets/__init__.py 2022-08-20 00:06:12.000000000 +0200 @@ -1,4 +1,4 @@ -# -*- coding: utf_8 -*- +# -*- coding: utf-8 -*- from typing import Dict, List FREQUENCIES: Dict[str, List[str]] = { diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/cd.py new/charset_normalizer-2.1.1/charset_normalizer/cd.py --- old/charset_normalizer-2.1.0/charset_normalizer/cd.py 2022-06-19 23:55:20.000000000 +0200 +++ new/charset_normalizer-2.1.1/charset_normalizer/cd.py 2022-08-20 00:06:12.000000000 +0200 @@ -2,7 +2,7 @@ from codecs import IncrementalDecoder from collections import Counter from functools import lru_cache -from typing import Dict, List, Optional, Tuple +from typing import Counter as TypeCounter, Dict, List, Optional, Tuple from .assets import FREQUENCIES from .constant import KO_NAMES, LANGUAGE_SUPPORTED_COUNT, TOO_SMALL_SEQUENCE, ZH_NAMES @@ -24,7 +24,9 @@ if is_multi_byte_encoding(iana_name): raise IOError("Function not supported on multi-byte code page") - decoder = importlib.import_module("encodings.{}".format(iana_name)).IncrementalDecoder # type: ignore + decoder = importlib.import_module( + "encodings.{}".format(iana_name) + ).IncrementalDecoder p: IncrementalDecoder = decoder(errors="ignore") seen_ranges: Dict[str, int] = {} @@ -307,7 +309,7 @@ lg_inclusion_list.remove("Latin Based") for layer in alpha_unicode_split(decoded_sequence): - sequence_frequencies: Counter = Counter(layer) + sequence_frequencies: TypeCounter[str] = Counter(layer) most_common = sequence_frequencies.most_common() character_count: int = sum(o for c, o in most_common) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/cli/normalizer.py new/charset_normalizer-2.1.1/charset_normalizer/cli/normalizer.py --- old/charset_normalizer-2.1.0/charset_normalizer/cli/normalizer.py 2022-06-19 23:55:20.000000000 +0200 +++ new/charset_normalizer-2.1.1/charset_normalizer/cli/normalizer.py 2022-08-20 00:06:12.000000000 +0200 @@ -3,7 +3,7 @@ from json import dumps from os.path import abspath from platform import python_version -from typing import List +from typing import List, Optional try: from unicodedata2 import unidata_version @@ -48,7 +48,7 @@ sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n") -def cli_detect(argv: List[str] = None) -> int: +def cli_detect(argv: Optional[List[str]] = None) -> int: """ CLI assistant using ARGV and ArgumentParser :param argv: diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/models.py new/charset_normalizer-2.1.1/charset_normalizer/models.py --- old/charset_normalizer-2.1.0/charset_normalizer/models.py 2022-06-19 23:55:20.000000000 +0200 +++ new/charset_normalizer-2.1.1/charset_normalizer/models.py 2022-08-20 00:06:12.000000000 +0200 @@ -4,7 +4,16 @@ from hashlib import sha256 from json import dumps from re 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/cli/normalizer.py new/charset_normalizer-2.1.1/charset_normalizer/cli/normalizer.py
--- old/charset_normalizer-2.1.0/charset_normalizer/cli/normalizer.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/cli/normalizer.py	2022-08-20 00:06:12.000000000 +0200
@@ -3,7 +3,7 @@
 from json import dumps
 from os.path import abspath
 from platform import python_version
-from typing import List
+from typing import List, Optional
 
 try:
     from unicodedata2 import unidata_version
@@ -48,7 +48,7 @@
         sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")
 
 
-def cli_detect(argv: List[str] = None) -> int:
+def cli_detect(argv: Optional[List[str]] = None) -> int:
     """
     CLI assistant using ARGV and ArgumentParser
     :param argv:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/models.py new/charset_normalizer-2.1.1/charset_normalizer/models.py
--- old/charset_normalizer-2.1.0/charset_normalizer/models.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/models.py	2022-08-20 00:06:12.000000000 +0200
@@ -4,7 +4,16 @@
 from hashlib import sha256
 from json import dumps
 from re import sub
-from typing import Any, Dict, Iterator, List, Optional, Tuple, Union
+from typing import (
+    Any,
+    Counter as TypeCounter,
+    Dict,
+    Iterator,
+    List,
+    Optional,
+    Tuple,
+    Union,
+)
 
 from .constant import NOT_PRINTABLE_PATTERN, TOO_BIG_SEQUENCE
 from .md import mess_ratio
@@ -95,7 +104,7 @@
         return 0.0
 
     @property
-    def w_counter(self) -> Counter:
+    def w_counter(self) -> TypeCounter[str]:
         """
         Word counter instance on decoded text.
         Notice: Will be removed in 3.0
@@ -280,7 +289,7 @@
     Act like a list(iterable) but does not implements all related methods.
     """
 
-    def __init__(self, results: List[CharsetMatch] = None):
+    def __init__(self, results: Optional[List[CharsetMatch]] = None):
         self._results: List[CharsetMatch] = sorted(results) if results else []
 
     def __iter__(self) -> Iterator[CharsetMatch]:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/utils.py new/charset_normalizer-2.1.1/charset_normalizer/utils.py
--- old/charset_normalizer-2.1.0/charset_normalizer/utils.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/utils.py	2022-08-20 00:06:12.000000000 +0200
@@ -13,7 +13,7 @@
 from re import findall
 from typing import Generator, List, Optional, Set, Tuple, Union
 
-from _multibytecodec import MultibyteIncrementalDecoder  # type: ignore
+from _multibytecodec import MultibyteIncrementalDecoder
 
 from .constant import (
     ENCODING_MARKS,
@@ -206,7 +206,7 @@
 
         character.isspace() is False  # includes \n \t \r \v
         and character.isprintable() is False
         and character != "\x1A"  # Why? Its the ASCII substitute character.
-        and character != b"\xEF\xBB\xBF".decode("utf_8")  # bug discovered in Python,
+        and character != "\ufeff"  # bug discovered in Python,
         # Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space.
     )
@@ -231,6 +231,9 @@
     for specified_encoding in results:
         specified_encoding = specified_encoding.lower().replace("-", "_")
 
+        encoding_alias: str
+        encoding_iana: str
+
         for encoding_alias, encoding_iana in aliases.items():
             if encoding_alias == specified_encoding:
                 return encoding_iana
@@ -256,7 +259,7 @@
         "utf_32_be",
         "utf_7",
     } or issubclass(
-        importlib.import_module("encodings.{}".format(name)).IncrementalDecoder,  # type: ignore
+        importlib.import_module("encodings.{}".format(name)).IncrementalDecoder,
         MultibyteIncrementalDecoder,
     )
 
@@ -286,6 +289,9 @@
 def iana_name(cp_name: str, strict: bool = True) -> str:
     cp_name = cp_name.lower().replace("-", "_")
 
+    encoding_alias: str
+    encoding_iana: str
+
     for encoding_alias, encoding_iana in aliases.items():
         if cp_name in [encoding_alias, encoding_iana]:
             return encoding_iana
@@ -315,8 +321,12 @@
     if is_multi_byte_encoding(iana_name_a) or is_multi_byte_encoding(iana_name_b):
         return 0.0
 
-    decoder_a = importlib.import_module("encodings.{}".format(iana_name_a)).IncrementalDecoder  # type: ignore
-    decoder_b = importlib.import_module("encodings.{}".format(iana_name_b)).IncrementalDecoder  # type: ignore
+    decoder_a = importlib.import_module(
+        "encodings.{}".format(iana_name_a)
+    ).IncrementalDecoder
+    decoder_b = importlib.import_module(
+        "encodings.{}".format(iana_name_b)
+    ).IncrementalDecoder
 
     id_a: IncrementalDecoder = decoder_a(errors="ignore")
     id_b: IncrementalDecoder = decoder_b(errors="ignore")
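The `is_unprintable` change above swaps a byte-decoding round trip for the equivalent string literal; a quick check of that equivalence:

    # "\ufeff" (ZERO WIDTH NO-BREAK SPACE, the BOM) is exactly the character
    # the old expression decoded on every call: EF BB BF is its UTF-8 form.
    assert b"\xEF\xBB\xBF".decode("utf-8") == "\ufeff"
    assert len(b"\xEF\xBB\xBF".decode("utf-8")) == 1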
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/charset_normalizer/version.py new/charset_normalizer-2.1.1/charset_normalizer/version.py
--- old/charset_normalizer-2.1.0/charset_normalizer/version.py	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/charset_normalizer/version.py	2022-08-20 00:06:12.000000000 +0200
@@ -2,5 +2,5 @@
 Expose version
 """
 
-__version__ = "2.1.0"
+__version__ = "2.1.1"
 VERSION = __version__.split(".")
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/dev-requirements.txt new/charset_normalizer-2.1.1/dev-requirements.txt
--- old/charset_normalizer-2.1.0/dev-requirements.txt	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/dev-requirements.txt	2022-08-20 00:06:12.000000000 +0200
@@ -1,10 +1,10 @@
 pytest
 pytest-cov
 codecov
-chardet==4.0.*
-Flask>=2.0,<3.0; python_version >= '3.6'
-requests>=2.26,<3.0; python_version >= '3.6'
-black==22.3.0; python_version >= '3.6'
-flake8==4.0.1; python_version >= '3.6'
-mypy==0.961; python_version >= '3.6'
-isort; python_version >= '3.6'
+chardet>=5.0,<5.1
+Flask>=2.0,<3.0
+requests>=2.26,<3.0
+black==22.6.0
+flake8==5.0.4
+mypy==0.971
+isort
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/docs/community/faq.rst new/charset_normalizer-2.1.1/docs/community/faq.rst
--- old/charset_normalizer-2.1.0/docs/community/faq.rst	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/docs/community/faq.rst	2022-08-20 00:06:12.000000000 +0200
@@ -23,6 +23,10 @@
 The real debate is to state if the detection is an HTTP client matter or not.
 That is more complicated and not my field.
 
+Some individuals keep insisting that the *whole* Internet is UTF-8 ready. Those are absolutely wrong and very Europe and North America-centered,
+In my humble experience, the countries in the world are very disparate in this evolution. And the Internet is not just about HTML content.
+Having a thorough analysis of this is very scary.
+
 Should I bother using detection?
 --------------------------------
 
@@ -36,11 +40,10 @@
 
 Then this change is mostly backward-compatible, exception of a thing:
 
  - This new library support way more code pages (x3) than its counterpart Chardet.
- - Based on the 30-ich charsets that Chardet support, expect roughly 90% BC results https://github.com/Ousret/charset_normalizer/pull/77/checks?check_run_id=3244585065
+ - Based on the 30-ich charsets that Chardet support, expect roughly 85% BC results https://github.com/Ousret/charset_normalizer/pull/77/checks?check_run_id=3244585065
 
 We do not guarantee this BC exact percentage through time. May vary but not by much.
 
-
 Isn't it the same as Chardet?
 -----------------------------
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/charset_normalizer-2.1.0/docs/community/why_migrate.rst new/charset_normalizer-2.1.1/docs/community/why_migrate.rst
--- old/charset_normalizer-2.1.0/docs/community/why_migrate.rst	2022-06-19 23:55:20.000000000 +0200
+++ new/charset_normalizer-2.1.1/docs/community/why_migrate.rst	2022-08-20 00:06:12.000000000 +0200
@@ -4,13 +4,13 @@
 There is so many reason to migrate your current project. Here are some of them:
 
 - Remove ANY license ambiguity/restriction for projects bundling Chardet (even indirectly).
-- X5 faster than Chardet in average and X2 faster in 99% of the cases AND support 3 times more encoding.
+- X5 faster than Chardet in average and X3 faster in 99% of the cases AND support 3 times more encoding.
 - Never return a encoding if not suited for the given decoder. Eg. Never get UnicodeDecodeError!
 - Actively maintained, open to contributors.
 - Have the backward compatible function ``detect`` that come from Chardet.
 - Truly detect the language used in the text.
 - It is, for the first time, really universal! As there is no specific probe per charset.
-- The package size is X4 lower than Chardet's (4.0)!
+- The package size is X4 lower than Chardet's (5.0)!
 - Propose much more options/public kwargs to tweak the detection as you sees fit!
 - Using static typing to ease your development.
 - Detect Unicode content better than Chardet or cChardet does.
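Regarding the backward compatible ``detect`` function mentioned in the migration notes above, a drop-in usage sketch (the sample bytes are illustrative):

    from charset_normalizer import detect

    # Same call shape as chardet.detect(); returns a dict with
    # 'encoding', 'language' and 'confidence' keys.
    result = detect("Déjà vu, comme on dit.".encode("cp1252"))
    print(result["encoding"], result["confidence"])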