Hello community,
here is the log from the commit of package python-html2text for
openSUSE:Leap:15.2 checked in at 2020-04-15 20:07:22
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2/python-html2text (Old)
and /work/SRC/openSUSE:Leap:15.2/.python-html2text.new.2738 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-html2text"
Wed Apr 15 20:07:22 2020 rev:18 rq: version:2019.8.11
Changes:
--------
--- /work/SRC/openSUSE:Leap:15.2/python-html2text/python-html2text.changes
2020-04-14 14:24:16.617422083 +0200
+++
/work/SRC/openSUSE:Leap:15.2/.python-html2text.new.2738/python-html2text.changes
2020-04-15 20:08:28.638186050 +0200
@@ -2,21 +1,0 @@
-Thu Apr 9 11:17:36 UTC 2020 - Marketa Calabkova <[email protected]>
-
-- Update to 2020.1.16
- * Add type annotations.
- * Add support for Python 3.8.
- * Performance improvements when ``wrap_links`` is ``False`` (the default).
- * Configure setuptools using setup.cfg.
-
--------------------------------------------------------------------
-Fri Dec 13 13:43:47 UTC 2019 - Matthias Fehring <[email protected]>
-
-- Update to 2019.9.26:
- * Fix long blockquotes wrapping.
- * Remove the trailing whitespaces that were added after wrapping list items
& blockquotes.
- * Remove support for Python <= 3.4. Now requires Python 3.5+.
- * Fix memory leak when processing a document containing a <abbr> tag.
- * Fix AttributeError when reading text from stdin.
- * Fix UnicodeEncodeError when writing output to stdout.
-- Disable build for Python 2
-
--------------------------------------------------------------------
Old:
----
html2text-2020.1.16.tar.gz
New:
----
html2text-2019.8.11.tar.gz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-html2text.spec ++++++
--- /var/tmp/diff_new_pack.iuv4gt/_old 2020-04-15 20:08:29.010186303 +0200
+++ /var/tmp/diff_new_pack.iuv4gt/_new 2020-04-15 20:08:29.014186306 +0200
@@ -1,7 +1,7 @@
#
# spec file for package python-html2text
#
-# Copyright (c) 2020 SUSE LLC
+# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -17,10 +17,9 @@
%define upname html2text
-%define skip_python2 1
%{?!python_module:%define python_module() python-%{**} python3-%{**}}
Name: python-%{upname}
-Version: 2020.1.16
+Version: 2019.8.11
Release: 0
Summary: Python script for turning HTML into Markdown text
License: GPL-3.0-only
@@ -64,8 +63,6 @@
%python_uninstall_alternative html2text
%check
-# otherwise python 3.6 does not automatically select UTF-8 for console output
-export LANG=en_US.UTF-8
%pytest
%files %{python_files}
++++++ html2text-2020.1.16.tar.gz -> html2text-2019.8.11.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/ChangeLog.rst
new/html2text-2019.8.11/ChangeLog.rst
--- old/html2text-2020.1.16/ChangeLog.rst 2020-01-16 15:20:17.000000000
+0100
+++ new/html2text-2019.8.11/ChangeLog.rst 2019-08-11 21:33:38.000000000
+0200
@@ -1,25 +1,3 @@
-2020.1.16
-=========
-----
-
-* Add type annotations.
-* Add support for Python 3.8.
-* Performance improvements when ``wrap_links`` is ``False`` (the default).
-* Configure setuptools using setup.cfg.
-
-
-2019.9.26
-=========
-----
-
-* Fix long blockquotes wrapping.
-* Remove the trailing whitespaces that were added after wrapping list items &
blockquotes.
-* Remove support for Python ≤ 3.4. Now requires Python 3.5+.
-* Fix memory leak when processing a document containing a ``<abbr>`` tag.
-* Fix ``AttributeError`` when reading text from stdin.
-* Fix ``UnicodeEncodeError`` when writing output to stdout.
-
-
2019.8.11
=========
----
@@ -32,16 +10,13 @@
* Add ``__main__.py`` module to allow running the CLI using ``python -m
html2text ...``.
* Fix #238: correct spacing when a HTML entity follows a non-stressed tags
which follow a stressed tag.
* Remove unused or deprecated:
-
* ``html2text.compat.escape()``
* ``html2text.config.RE_UNESCAPE``
* ``html2text.HTML2Text.replaceEntities()``
* ``html2text.HTML2Text.unescape()``
* ``html2text.unescape()``
-
* Fix #208: handle LEFT-TO-RIGHT MARK after a stressed tag.
-
2018.1.9
========
----
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/PKG-INFO
new/html2text-2019.8.11/PKG-INFO
--- old/html2text-2020.1.16/PKG-INFO 2020-01-16 15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/PKG-INFO 2019-08-11 21:36:00.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Name: html2text
-Version: 2020.1.16
+Version: 2019.8.11
Summary: Turn HTML into equivalent Markdown-structured text.
Home-page: https://github.com/Alir3z4/html2text/
Author: Aaron Swartz
@@ -101,13 +101,14 @@
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 2
+Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
-Requires-Python: >=3.5
+Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*
Description-Content-Type: text/markdown
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/__init__.py
new/html2text-2019.8.11/html2text/__init__.py
--- old/html2text-2020.1.16/html2text/__init__.py 2020-01-16
15:20:17.000000000 +0100
+++ new/html2text-2019.8.11/html2text/__init__.py 2019-08-11
21:35:55.000000000 +0200
@@ -1,16 +1,14 @@
+# coding: utf-8
"""html2text: Turn HTML into equivalent Markdown-structured text."""
+from __future__ import division, unicode_literals
-import html.entities
-import html.parser
import re
-import urllib.parse as urlparse
+import sys
from textwrap import wrap
-from typing import Dict, List, Optional, Tuple, Union
-from . import config
-from .elements import AnchorElement, ListElement
-from .typing import OutCallback
-from .utils import (
+from html2text import config
+from html2text.compat import HTMLParser, urlparse
+from html2text.utils import (
dumb_css_parser,
element_style,
escape_md,
@@ -21,32 +19,38 @@
google_text_emphasis,
hn,
list_numbering_start,
+ name2cp,
pad_tables_in_text,
skipwrap,
unifiable_n,
)
-__version__ = (2020, 1, 16)
+try:
+ chr = unichr
+ nochr = unicode("")
+except NameError:
+ # python3 uses chr
+ nochr = str("")
+
+__version__ = (2019, 8, 11)
# TODO:
# Support decoded entities with UNIFIABLE.
-class HTML2Text(html.parser.HTMLParser):
- def __init__(
- self,
- out: Optional[OutCallback] = None,
- baseurl: str = "",
- bodywidth: int = config.BODY_WIDTH,
- ) -> None:
+class HTML2Text(HTMLParser.HTMLParser):
+ def __init__(self, out=None, baseurl="", bodywidth=config.BODY_WIDTH):
"""
Input parameters:
out: possible custom replacement for self.outtextf (which
appends lines of text).
baseurl: base URL of the document we process
"""
- super().__init__(convert_charrefs=False)
+ kwargs = {}
+ if sys.version_info >= (3, 4):
+ kwargs["convert_charrefs"] = False
+ HTMLParser.HTMLParser.__init__(self, **kwargs)
# Config options
self.split_next_td = False
@@ -90,20 +94,20 @@
self.out = out
# empty list to store output characters before they are "joined"
- self.outtextlist = [] # type: List[str]
+ self.outtextlist = []
self.quiet = 0
self.p_p = 0 # number of newline character to print before next output
self.outcount = 0
self.start = True
self.space = False
- self.a = [] # type: List[AnchorElement]
- self.astack = [] # type: List[Optional[Dict[str, Optional[str]]]]
- self.maybe_automatic_link = None # type: Optional[str]
+ self.a = []
+ self.astack = []
+ self.maybe_automatic_link = None
self.empty_link = False
self.absolute_url_matcher = re.compile(r"^[a-zA-Z+]+://")
self.acount = 0
- self.list = [] # type: List[ListElement]
+ self.list = []
self.blockquote = 0
self.pre = False
self.startpre = False
@@ -113,57 +117,52 @@
self.lastWasNL = False
self.lastWasList = False
self.style = 0
- self.style_def = {} # type: Dict[str, Dict[str, str]]
- self.tag_stack = (
- []
- ) # type: List[Tuple[str, Dict[str, Optional[str]], Dict[str, str]]]
+ self.style_def = {}
+ self.tag_stack = []
self.emphasis = 0
self.drop_white_space = 0
self.inheader = False
- # Current abbreviation definition
- self.abbr_title = None # type: Optional[str]
- # Last inner HTML (for abbr being defined)
- self.abbr_data = None # type: Optional[str]
- # Stack of abbreviations to write later
- self.abbr_list = {} # type: Dict[str, str]
+ self.abbr_title = None # current abbreviation definition
+ self.abbr_data = None # last inner HTML (for abbr being defined)
+ self.abbr_list = {} # stack of abbreviations to write later
self.baseurl = baseurl
self.stressed = False
self.preceding_stressed = False
- self.preceding_data = ""
- self.current_tag = ""
+ self.preceding_data = None
+ self.current_tag = None
config.UNIFIABLE["nbsp"] = " _place_holder;"
- def feed(self, data: str) -> None:
+ def feed(self, data):
data = data.replace("</' + 'script>", "</ignore>")
- super().feed(data)
+ HTMLParser.HTMLParser.feed(self, data)
- def handle(self, data: str) -> str:
+ def handle(self, data):
self.feed(data)
self.feed("")
- markdown = self.optwrap(self.finish())
+ markdown = self.optwrap(self.close())
if self.pad_tables:
return pad_tables_in_text(markdown)
else:
return markdown
- def outtextf(self, s: str) -> None:
+ def outtextf(self, s):
self.outtextlist.append(s)
if s:
self.lastWasNL = s[-1] == "\n"
- def finish(self) -> str:
- self.close()
+ def close(self):
+ HTMLParser.HTMLParser.close(self)
self.pbr()
self.o("", force="end")
- outtext = "".join(self.outtextlist)
+ outtext = nochr.join(self.outtextlist)
if self.unicode_snob:
- nbsp = html.entities.html5["nbsp;"]
+ nbsp = chr(name2cp("nbsp"))
else:
- nbsp = " "
+ nbsp = chr(32)
outtext = outtext.replace(" _place_holder;", nbsp)
# Clear self.outtextlist to avoid memory leak of its content to
@@ -172,10 +171,10 @@
return outtext
- def handle_charref(self, c: str) -> None:
+ def handle_charref(self, c):
self.handle_data(self.charref(c), True)
- def handle_entityref(self, c: str) -> None:
+ def handle_entityref(self, c):
ref = self.entityref(c)
# ref may be an empty string (e.g. for ‎/‏ markers that should
@@ -187,13 +186,13 @@
if ref:
self.handle_data(ref, True)
- def handle_starttag(self, tag: str, attrs: List[Tuple[str,
Optional[str]]]) -> None:
- self.handle_tag(tag, dict(attrs), start=True)
+ def handle_starttag(self, tag, attrs):
+ self.handle_tag(tag, attrs, 1)
- def handle_endtag(self, tag: str) -> None:
- self.handle_tag(tag, {}, start=False)
+ def handle_endtag(self, tag):
+ self.handle_tag(tag, None, 0)
- def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]:
+ def previousIndex(self, attrs):
"""
:type attrs: dict
@@ -203,15 +202,17 @@
"""
if "href" not in attrs:
return None
+ i = -1
+ for a in self.a:
+ i += 1
+ match = False
- match = False
- for i, a in enumerate(self.a):
- if "href" in a.attrs and a.attrs["href"] == attrs["href"]:
- if "title" in a.attrs or "title" in attrs:
+ if "href" in a and a["href"] == attrs["href"]:
+ if "title" in a or "title" in attrs:
if (
- "title" in a.attrs
+ "title" in a
and "title" in attrs
- and a.attrs["title"] == attrs["title"]
+ and a["title"] == attrs["title"]
):
match = True
else:
@@ -219,11 +220,8 @@
if match:
return i
- return None
- def handle_emphasis(
- self, start: bool, tag_style: Dict[str, str], parent_style: Dict[str,
str]
- ) -> None:
+ def handle_emphasis(self, start, tag_style, parent_style):
"""
Handles various text emphases
"""
@@ -294,10 +292,13 @@
if strikethrough:
self.quiet -= 1
- def handle_tag(
- self, tag: str, attrs: Dict[str, Optional[str]], start: bool
- ) -> None:
+ def handle_tag(self, tag, attrs, start):
self.current_tag = tag
+ # attrs is None for endtags
+ if attrs is None:
+ attrs = {}
+ else:
+ attrs = dict(attrs)
if self.tag_callback is not None:
if self.tag_callback(self, tag, attrs, start) is True:
@@ -320,7 +321,7 @@
# need the attributes of the parent nodes in order to get a
# complete style description for the current element. we assume
# that google docs export well formed html.
- parent_style = {} # type: Dict[str, str]
+ parent_style = {}
if start:
if self.tag_stack:
parent_style = self.tag_stack[-1][2]
@@ -389,10 +390,8 @@
self.blockquote -= 1
self.p()
- def no_preceding_space(self: HTML2Text) -> bool:
- return bool(
- self.preceding_data and re.match(r"[^\s]",
self.preceding_data[-1])
- )
+ def no_preceding_space(self):
+ return self.preceding_data and re.match(r"[^\s]",
self.preceding_data[-1])
if tag in ["em", "i", "u"] and not self.ignore_emphasis:
if start and no_preceding_space(self):
@@ -441,10 +440,9 @@
self.abbr_title = attrs["title"]
else:
if self.abbr_title is not None:
- assert self.abbr_data is not None
self.abbr_list[self.abbr_data] = self.abbr_title
self.abbr_title = None
- self.abbr_data = None
+ self.abbr_data = ""
if tag == "q":
if not self.quote:
@@ -453,7 +451,7 @@
self.o(self.close_quote)
self.quote = not self.quote
- def link_url(self: HTML2Text, link: str, title: str = "") -> None:
+ def link_url(self, link, title=""):
url = urlparse.urljoin(self.baseurl, link)
title = ' "{}"'.format(title) if title.strip() else ""
self.o("]({url}{title})".format(url=escape_md(url), title=title))
@@ -478,28 +476,31 @@
if self.maybe_automatic_link and not self.empty_link:
self.maybe_automatic_link = None
elif a:
- assert a["href"] is not None
if self.empty_link:
self.o("[")
self.empty_link = False
self.maybe_automatic_link = None
if self.inline_links:
- title = a.get("title") or ""
- title = escape_md(title)
- link_url(self, a["href"], title)
+ try:
+ title = a["title"] if a["title"] else ""
+ title = escape_md(title)
+ except KeyError:
+ link_url(self, a["href"], "")
+ else:
+ link_url(self, a["href"], title)
else:
i = self.previousIndex(a)
if i is not None:
- a_props = self.a[i]
+ a = self.a[i]
else:
self.acount += 1
- a_props = AnchorElement(a, self.acount,
self.outcount)
- self.a.append(a_props)
- self.o("][" + str(a_props.count) + "]")
+ a["count"] = self.acount
+ a["outcount"] = self.outcount
+ self.a.append(a)
+ self.o("][" + str(a["count"]) + "]")
if tag == "img" and start and not self.ignore_images:
if "src" in attrs:
- assert attrs["src"] is not None
if not self.images_to_alt:
attrs["href"] = attrs["src"]
alt = attrs.get("alt") or self.default_image_alt
@@ -511,10 +512,8 @@
):
self.o("<img src='" + attrs["src"] + "' ")
if "width" in attrs:
- assert attrs["width"] is not None
self.o("width='" + attrs["width"] + "' ")
if "height" in attrs:
- assert attrs["height"] is not None
self.o("height='" + attrs["height"] + "' ")
if alt:
self.o("alt='" + alt + "' ")
@@ -551,12 +550,13 @@
else:
i = self.previousIndex(attrs)
if i is not None:
- a_props = self.a[i]
+ attrs = self.a[i]
else:
self.acount += 1
- a_props = AnchorElement(attrs, self.acount,
self.outcount)
- self.a.append(a_props)
- self.o("[" + str(a_props.count) + "]")
+ attrs["count"] = self.acount
+ attrs["outcount"] = self.outcount
+ self.a.append(attrs)
+ self.o("[" + str(attrs["count"]) + "]")
if tag == "dl" and start:
self.p()
@@ -569,7 +569,7 @@
if tag in ["ol", "ul"]:
# Google Docs create sub lists as top level lists
- if not self.list and not self.lastWasList:
+ if (not self.list) and (not self.lastWasList):
self.p()
if start:
if self.google_doc:
@@ -577,11 +577,11 @@
else:
list_style = tag
numbering_start = list_numbering_start(attrs)
- self.list.append(ListElement(list_style, numbering_start))
+ self.list.append({"name": list_style, "num": numbering_start})
else:
if self.list:
self.list.pop()
- if not self.google_doc and not self.list:
+ if (not self.google_doc) and (not self.list):
self.o("\n")
self.lastWasList = True
else:
@@ -593,18 +593,18 @@
if self.list:
li = self.list[-1]
else:
- li = ListElement("ul", 0)
+ li = {"name": "ul", "num": 0}
if self.google_doc:
nest_count = self.google_nest_count(tag_style)
else:
nest_count = len(self.list)
# TODO: line up <ol><li>s > 9 correctly.
self.o(" " * nest_count)
- if li.name == "ul":
+ if li["name"] == "ul":
self.o(self.ul_item_mark + " ")
- elif li.name == "ol":
- li.num += 1
- self.o(str(li.num) + ". ")
+ elif li["name"] == "ol":
+ li["num"] += 1
+ self.o(str(li["num"]) + ". ")
self.start = True
if tag in ["table", "tr", "td", "th"]:
@@ -671,23 +671,21 @@
self.p()
# TODO: Add docstring for these one letter functions
- def pbr(self) -> None:
+ def pbr(self):
"Pretty print has a line break"
if self.p_p == 0:
self.p_p = 1
- def p(self) -> None:
+ def p(self):
"Set pretty print to 1 or 2 lines"
self.p_p = 1 if self.single_line_break else 2
- def soft_br(self) -> None:
+ def soft_br(self):
"Soft breaks"
self.pbr()
self.br_toggle = " "
- def o(
- self, data: str, puredata: bool = False, force: Union[bool, str] =
False
- ) -> None:
+ def o(self, data, puredata=False, force=False):
"""
Deal with indentation and whitespace
"""
@@ -732,7 +730,8 @@
if not self.list:
bq += " "
# else: list content is already partially indented
- bq += " " * len(self.list)
+ for i in range(len(self.list)):
+ bq += " "
data = data.replace("\n", "\n" + bq)
if self.startpre:
@@ -770,16 +769,15 @@
newa = []
for link in self.a:
- if self.outcount > link.outcount:
+ if self.outcount > link["outcount"]:
self.out(
" ["
- + str(link.count)
+ + str(link["count"])
+ "]: "
- + urlparse.urljoin(self.baseurl,
link.attrs["href"])
+ + urlparse.urljoin(self.baseurl, link["href"])
)
- if "title" in link.attrs:
- assert link.attrs["title"] is not None
- self.out(" (" + link.attrs["title"] + ")")
+ if "title" in link:
+ self.out(" (" + link["title"] + ")")
self.out("\n")
else:
newa.append(link)
@@ -798,7 +796,7 @@
self.out(data)
self.outcount += 1
- def handle_data(self, data: str, entity_char: bool = False) -> None:
+ def handle_data(self, data, entity_char=False):
if not data:
# Data may be empty for some HTML entities. For example,
# LEFT-TO-RIGHT MARK.
@@ -841,7 +839,7 @@
self.preceding_data = data
self.o(data, puredata=True)
- def charref(self, name: str) -> str:
+ def charref(self, name):
if name[0] in ["x", "X"]:
c = int(name[1:], 16)
else:
@@ -855,16 +853,21 @@
except ValueError: # invalid unicode
return ""
- def entityref(self, c: str) -> str:
+ def entityref(self, c):
if not self.unicode_snob and c in config.UNIFIABLE:
return config.UNIFIABLE[c]
- try:
- ch = html.entities.html5[c + ";"]
- except KeyError:
- return "&" + c + ";"
- return config.UNIFIABLE[c] if c == "nbsp" else ch
+ else:
+ try:
+ cp = name2cp(c)
+ except KeyError:
+ return "&" + c + ";"
+ else:
+ if c == "nbsp":
+ return config.UNIFIABLE[c]
+ else:
+ return chr(cp)
- def google_nest_count(self, style: Dict[str, str]) -> int:
+ def google_nest_count(self, style):
"""
Calculate the nesting count of google doc lists
@@ -878,7 +881,7 @@
return nest_count
- def optwrap(self, text: str) -> str:
+ def optwrap(self, text):
"""
Wrap all paragraphs in the provided text.
@@ -901,13 +904,7 @@
if not skipwrap(para, self.wrap_links, self.wrap_list_items):
indent = ""
if para.startswith(" " + self.ul_item_mark):
- # list item continuation: add a double indent to the
- # new lines
- indent = " "
- elif para.startswith("> "):
- # blockquote continuation: add the greater than symbol
- # to the new lines
- indent = "> "
+ indent = " " # For list items.
wrapped = wrap(
para,
self.body_width,
@@ -915,12 +912,9 @@
subsequent_indent=indent,
)
result += "\n".join(wrapped)
- if para.endswith(" "):
+ if indent or para.endswith(" "):
result += " \n"
newlines = 1
- elif indent:
- result += "\n"
- newlines = 1
else:
result += "\n\n"
newlines = 2
@@ -939,7 +933,7 @@
return result
-def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None)
-> str:
+def html2text(html, baseurl="", bodywidth=None):
if bodywidth is None:
bodywidth = config.BODY_WIDTH
h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/__main__.py
new/html2text-2019.8.11/html2text/__main__.py
--- old/html2text-2020.1.16/html2text/__main__.py 2019-10-12
17:55:30.000000000 +0200
+++ new/html2text-2019.8.11/html2text/__main__.py 2019-08-11
21:27:39.000000000 +0200
@@ -1,3 +1,3 @@
-from .cli import main
+from html2text.cli import main
main()
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/cli.py
new/html2text-2019.8.11/html2text/cli.py
--- old/html2text-2020.1.16/html2text/cli.py 2019-10-12 18:20:41.000000000
+0200
+++ new/html2text-2019.8.11/html2text/cli.py 2019-08-11 21:27:39.000000000
+0200
@@ -1,10 +1,10 @@
import argparse
-import sys
-from . import HTML2Text, __version__, config
+from html2text import HTML2Text, __version__, config
+from html2text.utils import wrap_read, wrapwrite
-def main() -> None:
+def main():
baseurl = ""
class bcolors:
@@ -256,10 +256,10 @@
with open(args.filename, "rb") as fp:
data = fp.read()
else:
- data = sys.stdin.buffer.read()
+ data = wrap_read()
try:
- html = data.decode(args.encoding, args.decode_errors)
+ data = data.decode(args.encoding, args.decode_errors)
except UnicodeDecodeError as err:
warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
warning += " Use the " + bcolors.OKGREEN
@@ -303,4 +303,4 @@
h.open_quote = args.open_quote
h.close_quote = args.close_quote
- sys.stdout.write(h.handle(html))
+ wrapwrite(h.handle(data))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/compat.py
new/html2text-2019.8.11/html2text/compat.py
--- old/html2text-2020.1.16/html2text/compat.py 1970-01-01 01:00:00.000000000
+0100
+++ new/html2text-2019.8.11/html2text/compat.py 2019-08-11 21:27:39.000000000
+0200
@@ -0,0 +1,12 @@
+import sys
+
+if sys.version_info[0] == 2:
+ import htmlentitydefs
+ import urlparse
+ import HTMLParser
+else:
+ import urllib.parse as urlparse
+ import html.entities as htmlentitydefs
+ import html.parser as HTMLParser
+
+__all__ = ["HTMLParser", "htmlentitydefs", "urlparse"]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/config.py
new/html2text-2019.8.11/html2text/config.py
--- old/html2text-2020.1.16/html2text/config.py 2019-08-15 12:56:54.000000000
+0200
+++ new/html2text-2019.8.11/html2text/config.py 2019-08-11 21:27:39.000000000
+0200
@@ -1,3 +1,5 @@
+from __future__ import unicode_literals
+
import re
# Use Unicode characters instead of their ascii pseudo-replacements
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/elements.py
new/html2text-2019.8.11/html2text/elements.py
--- old/html2text-2020.1.16/html2text/elements.py 2019-10-12
18:20:41.000000000 +0200
+++ new/html2text-2019.8.11/html2text/elements.py 1970-01-01
01:00:00.000000000 +0100
@@ -1,18 +0,0 @@
-from typing import Dict, Optional
-
-
-class AnchorElement:
- __slots__ = ["attrs", "count", "outcount"]
-
- def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount:
int):
- self.attrs = attrs
- self.count = count
- self.outcount = outcount
-
-
-class ListElement:
- __slots__ = ["name", "num"]
-
- def __init__(self, name: str, num: int):
- self.name = name
- self.num = num
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/typing.py
new/html2text-2019.8.11/html2text/typing.py
--- old/html2text-2020.1.16/html2text/typing.py 2019-10-12 18:20:41.000000000
+0200
+++ new/html2text-2019.8.11/html2text/typing.py 1970-01-01 01:00:00.000000000
+0100
@@ -1,3 +0,0 @@
-class OutCallback:
- def __call__(self, s: str) -> None:
- ...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text/utils.py
new/html2text-2019.8.11/html2text/utils.py
--- old/html2text-2020.1.16/html2text/utils.py 2020-01-16 15:08:28.000000000
+0100
+++ new/html2text-2019.8.11/html2text/utils.py 2019-08-11 21:27:39.000000000
+0200
@@ -1,16 +1,20 @@
-import html.entities
-from typing import Dict, List, Optional
+import sys
-from . import config
+from html2text import config
+from html2text.compat import htmlentitydefs
-unifiable_n = {
- html.entities.name2codepoint[k]: v
- for k, v in config.UNIFIABLE.items()
- if k != "nbsp"
-}
+def name2cp(k):
+ """Return sname to codepoint"""
+ if k == "apos":
+ return ord("'")
+ return htmlentitydefs.name2codepoint[k]
-def hn(tag: str) -> int:
+
+unifiable_n = {name2cp(k): v for k, v in config.UNIFIABLE.items() if k !=
"nbsp"}
+
+
+def hn(tag):
if tag[0] == "h" and len(tag) == 2:
n = tag[1]
if "0" < n <= "9":
@@ -18,7 +22,7 @@
return 0
-def dumb_property_dict(style: str) -> Dict[str, str]:
+def dumb_property_dict(style):
"""
:returns: A hash of css attributes
"""
@@ -28,7 +32,7 @@
}
-def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
+def dumb_css_parser(data):
"""
:type data: str
@@ -45,20 +49,16 @@
# parse the css. reverted from dictionary comprehension in order to
# support older pythons
- pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
+ elements = [x.split("{") for x in data.split("}") if "{" in x.strip()]
try:
- elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
+ elements = {a.strip(): dumb_property_dict(b) for a, b in elements}
except ValueError:
elements = {} # not that important
return elements
-def element_style(
- attrs: Dict[str, Optional[str]],
- style_def: Dict[str, Dict[str, str]],
- parent_style: Dict[str, str],
-) -> Dict[str, str]:
+def element_style(attrs, style_def, parent_style):
"""
:type attrs: dict
:type style_def: dict
@@ -69,19 +69,17 @@
"""
style = parent_style.copy()
if "class" in attrs:
- assert attrs["class"] is not None
for css_class in attrs["class"].split():
css_style = style_def.get("." + css_class, {})
style.update(css_style)
if "style" in attrs:
- assert attrs["style"] is not None
immediate_style = dumb_property_dict(attrs["style"])
style.update(immediate_style)
return style
-def google_list_style(style: Dict[str, str]) -> str:
+def google_list_style(style):
"""
Finds out whether this is an ordered or unordered list
@@ -97,7 +95,7 @@
return "ol"
-def google_has_height(style: Dict[str, str]) -> bool:
+def google_has_height(style):
"""
Check if the style of the element has the 'height' attribute
explicitly defined
@@ -109,7 +107,7 @@
return "height" in style
-def google_text_emphasis(style: Dict[str, str]) -> List[str]:
+def google_text_emphasis(style):
"""
:type style: dict
@@ -127,7 +125,7 @@
return emphasis
-def google_fixed_width_font(style: Dict[str, str]) -> bool:
+def google_fixed_width_font(style):
"""
Check if the css of the current element defines a fixed width font
@@ -141,7 +139,7 @@
return "courier new" == font_family or "consolas" == font_family
-def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
+def list_numbering_start(attrs):
"""
Extract numbering from list element attributes
@@ -150,7 +148,6 @@
:rtype: int or None
"""
if "start" in attrs:
- assert attrs["start"] is not None
try:
return int(attrs["start"]) - 1
except ValueError:
@@ -159,10 +156,10 @@
return 0
-def skipwrap(para: str, wrap_links: bool, wrap_list_items: bool) -> bool:
+def skipwrap(para, wrap_links, wrap_list_items):
# If it appears to contain a link
# don't wrap
- if not wrap_links and config.RE_LINK.search(para):
+ if (len(config.RE_LINK.findall(para)) > 0) and not wrap_links:
return True
# If the text begins with four spaces or one tab, it's a code block;
# don't wrap
@@ -190,7 +187,25 @@
)
-def escape_md(text: str) -> str:
+def wrapwrite(text):
+ text = text.encode("utf-8")
+ try: # Python3
+ sys.stdout.buffer.write(text)
+ except AttributeError:
+ sys.stdout.write(text)
+
+
+def wrap_read():
+ """
+ :rtype: str
+ """
+ try:
+ return sys.stdin.read()
+ except AttributeError:
+ return sys.stdin.buffer.read()
+
+
+def escape_md(text):
"""
Escapes markdown-sensitive characters within other markdown
constructs.
@@ -198,7 +213,7 @@
return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
-def escape_md_section(text: str, snob: bool = False) -> str:
+def escape_md_section(text, snob=False):
"""
Escapes markdown-sensitive characters across whole document sections.
"""
@@ -214,7 +229,7 @@
return text
-def reformat_table(lines: List[str], right_margin: int) -> List[str]:
+def reformat_table(lines, right_margin):
"""
Given the lines of a table
padds the cells and returns the new lines
@@ -257,13 +272,12 @@
return new_lines
-def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
+def pad_tables_in_text(text, right_margin=1):
"""
Provide padding for tables in the text
"""
lines = text.split("\n")
- table_buffer = [] # type: List[str]
- table_started = False
+ table_buffer, table_started = [], False
new_lines = []
for line in lines:
# Toggle table started
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/PKG-INFO
new/html2text-2019.8.11/html2text.egg-info/PKG-INFO
--- old/html2text-2020.1.16/html2text.egg-info/PKG-INFO 2020-01-16
15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/html2text.egg-info/PKG-INFO 2019-08-11
21:35:58.000000000 +0200
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Name: html2text
-Version: 2020.1.16
+Version: 2019.8.11
Summary: Turn HTML into equivalent Markdown-structured text.
Home-page: https://github.com/Alir3z4/html2text/
Author: Aaron Swartz
@@ -101,13 +101,14 @@
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 2
+Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
-Requires-Python: >=3.5
+Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*
Description-Content-Type: text/markdown
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/SOURCES.txt
new/html2text-2019.8.11/html2text.egg-info/SOURCES.txt
--- old/html2text-2020.1.16/html2text.egg-info/SOURCES.txt 2020-01-16
15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/html2text.egg-info/SOURCES.txt 2019-08-11
21:35:59.000000000 +0200
@@ -9,16 +9,13 @@
html2text/__init__.py
html2text/__main__.py
html2text/cli.py
+html2text/compat.py
html2text/config.py
-html2text/elements.py
-html2text/py.typed
-html2text/typing.py
html2text/utils.py
html2text.egg-info/PKG-INFO
html2text.egg-info/SOURCES.txt
html2text.egg-info/dependency_links.txt
html2text.egg-info/entry_points.txt
-html2text.egg-info/not-zip-safe
html2text.egg-info/top_level.txt
test/GoogleDocMassDownload.html
test/GoogleDocMassDownload.md
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/not-zip-safe
new/html2text-2019.8.11/html2text.egg-info/not-zip-safe
--- old/html2text-2020.1.16/html2text.egg-info/not-zip-safe 2020-01-16
15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/html2text.egg-info/not-zip-safe 1970-01-01
01:00:00.000000000 +0100
@@ -1 +0,0 @@
-
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/setup.cfg
new/html2text-2019.8.11/setup.cfg
--- old/html2text-2020.1.16/setup.cfg 2020-01-16 15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/setup.cfg 2019-08-11 21:36:00.000000000 +0200
@@ -1,42 +1,5 @@
-[metadata]
-name = html2text
-version = attr: html2text.__version__
-description = Turn HTML into equivalent Markdown-structured text.
-long_description = file: README.md
-long_description_content_type = text/markdown
-url = https://github.com/Alir3z4/html2text/
-author = Aaron Swartz
-author_email = [email protected]
-maintainer = Alireza Savand
-maintainer_email = [email protected]
-license = GNU GPL 3
-classifiers =
- Development Status :: 5 - Production/Stable
- Intended Audience :: Developers
- License :: OSI Approved :: GNU General Public License (GPL)
- Operating System :: OS Independent
- Programming Language :: Python
- Programming Language :: Python :: 3
- Programming Language :: Python :: 3.5
- Programming Language :: Python :: 3.6
- Programming Language :: Python :: 3.7
- Programming Language :: Python :: 3.8
- Programming Language :: Python :: 3 :: Only
- Programming Language :: Python :: Implementation :: CPython
- Programming Language :: Python :: Implementation :: PyPy
-platform = OS Independent
-
-[options]
-zip_safe = False
-packages = html2text
-python_requires = >=3.5
-
-[options.entry_points]
-console_scripts =
- html2text = html2text.cli:main
-
-[options.package_data]
-html2text = py.typed
+[bdist_wheel]
+universal = 1
[flake8]
max_line_length = 88
@@ -50,9 +13,6 @@
line_length = 88
multi_line_output = 3
-[mypy]
-python_version = 3.5
-
[egg_info]
tag_build =
tag_date = 0
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/setup.py
new/html2text-2019.8.11/setup.py
--- old/html2text-2020.1.16/setup.py 2019-10-31 19:37:31.000000000 +0100
+++ new/html2text-2019.8.11/setup.py 2019-08-11 21:27:39.000000000 +0200
@@ -1,3 +1,42 @@
+# coding: utf-8
from setuptools import setup
-setup()
+
+def readall(f):
+ with open(f) as fp:
+ return fp.read()
+
+
+setup(
+ name="html2text",
+ version=".".join(map(str, __import__("html2text").__version__)),
+ description="Turn HTML into equivalent Markdown-structured text.",
+ long_description=readall("README.md"),
+ long_description_content_type="text/markdown",
+ author="Aaron Swartz",
+ author_email="[email protected]",
+ maintainer="Alireza Savand",
+ maintainer_email="[email protected]",
+ url="https://github.com/Alir3z4/html2text/",
+ platforms="OS Independent",
+ classifiers=[
+ "Development Status :: 5 - Production/Stable",
+ "Intended Audience :: Developers",
+ "License :: OSI Approved :: GNU General Public License (GPL)",
+ "Operating System :: OS Independent",
+ "Programming Language :: Python",
+ "Programming Language :: Python :: 2",
+ "Programming Language :: Python :: 2.7",
+ "Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.4",
+ "Programming Language :: Python :: 3.5",
+ "Programming Language :: Python :: 3.6",
+ "Programming Language :: Python :: 3.7",
+ "Programming Language :: Python :: Implementation :: CPython",
+ "Programming Language :: Python :: Implementation :: PyPy",
+ ],
+ python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
+ entry_points={"console_scripts": ["html2text = html2text.cli:main"]},
+ license="GNU GPL 3",
+ packages=["html2text"],
+)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/test/blockquote_example.html
new/html2text-2019.8.11/test/blockquote_example.html
--- old/html2text-2020.1.16/test/blockquote_example.html 2019-08-15
12:56:54.000000000 +0200
+++ new/html2text-2019.8.11/test/blockquote_example.html 2019-08-11
21:27:39.000000000 +0200
@@ -1,3 +1,3 @@
<blockquote>
-"The time has come", the Walrus said, "To talk of many things: Of shoes - and
ships - and sealing wax - Of cabbages - and kings- And why the sea is boiling
hot - And whether pigs have wings."
+The time has come, the Walrus said, to speak of many things.
</blockquote>
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/test/blockquote_example.md
new/html2text-2019.8.11/test/blockquote_example.md
--- old/html2text-2020.1.16/test/blockquote_example.md 2019-09-25
10:07:55.000000000 +0200
+++ new/html2text-2019.8.11/test/blockquote_example.md 2019-08-11
21:27:39.000000000 +0200
@@ -1,4 +1,2 @@
-> "The time has come", the Walrus said, "To talk of many things: Of shoes -
-> and ships - and sealing wax - Of cabbages - and kings- And why the sea is
-> boiling hot - And whether pigs have wings."
+> The time has come, the Walrus said, to speak of many things.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/test/test_html2text.py
new/html2text-2019.8.11/test/test_html2text.py
--- old/html2text-2020.1.16/test/test_html2text.py 2020-01-16
15:08:28.000000000 +0100
+++ new/html2text-2019.8.11/test/test_html2text.py 2019-08-11
21:27:39.000000000 +0200
@@ -1,3 +1,4 @@
+import codecs
import glob
import os
import re
@@ -40,7 +41,8 @@
if base_fn.find("unicode") >= 0:
module_args["unicode_snob"] = True
- cmdline_args.append("--unicode-snob")
+ # There is no command-line option to control unicode_snob.
+ cmdline_args = skip
func_args = skip
if base_fn.find("flip_emphasis") >= 0:
@@ -187,7 +189,7 @@
result = get_baseline(fn)
out = subprocess.check_output(cmd)
- actual = out.decode()
+ actual = out.decode("utf8")
actual = cleanup_eol(actual)
@@ -208,7 +210,7 @@
def get_baseline(fn):
name = get_baseline_name(fn)
- with open(name, encoding="utf-8") as f:
+ with codecs.open(name, mode="r", encoding="utf8") as f:
out = f.read()
return cleanup_eol(out)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/test/test_memleak.py
new/html2text-2019.8.11/test/test_memleak.py
--- old/html2text-2020.1.16/test/test_memleak.py 2019-09-25
09:41:57.000000000 +0200
+++ new/html2text-2019.8.11/test/test_memleak.py 2019-08-11
21:27:39.000000000 +0200
@@ -17,10 +17,3 @@
h2t.handle(INSTR)
# And even less when the input is empty.
assert h2t.handle("") == "\n\n"
-
-
-def test_abbr_data():
- h2t = html2text.HTML2Text()
- result = h2t.handle('<p>foo <abbr title="Three Letter Acronym">TLA</abbr>
bar</p>')
- assert result == "foo TLA bar\n\n *[TLA]: Three Letter Acronym\n\n"
- assert h2t.abbr_data is None
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/test/wrap_list_items_example.md
new/html2text-2019.8.11/test/wrap_list_items_example.md
--- old/html2text-2020.1.16/test/wrap_list_items_example.md 2019-09-25
10:07:55.000000000 +0200
+++ new/html2text-2019.8.11/test/wrap_list_items_example.md 2019-08-11
21:27:39.000000000 +0200
@@ -1,14 +1,14 @@
* One two three four five six seven eight nine ten eleven twelve thirteen
- fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+ fourteen fifteen sixteen seventeen eighteen nineteen twenty.
* One two three four five six seven eight nine ten eleven twelve thirteen
- fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+ fourteen fifteen sixteen seventeen eighteen nineteen twenty.
Text between lists.
* One two three four five six seven eight nine ten eleven twelve thirteen
- fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+ fourteen fifteen sixteen seventeen eighteen nineteen twenty.
* One two three four five six seven eight nine ten eleven twelve thirteen
- fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+ fourteen fifteen sixteen seventeen eighteen nineteen twenty.
Text after list.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/html2text-2020.1.16/tox.ini
new/html2text-2019.8.11/tox.ini
--- old/html2text-2020.1.16/tox.ini 2019-10-31 19:37:31.000000000 +0100
+++ new/html2text-2019.8.11/tox.ini 2019-08-11 21:27:39.000000000 +0200
@@ -3,8 +3,7 @@
black
flake8
isort
- mypy
- py{35,36,37,38,py3}
+ py{27,34,35,36,37,py,py3}
minversion = 1.9
[testenv]
@@ -17,7 +16,7 @@
[testenv:black]
basepython = python3
commands =
- black --target-version py35 --check --diff .
+ black --check --diff .
deps =
black
skip_install = true
@@ -37,8 +36,3 @@
deps =
isort
skip_install = true
-
-[testenv:mypy]
-commands = mypy --strict html2text
-deps = mypy
-skip_install = true