commit urlscan for openSUSE:Factory

Source-Sync Mon, 17 May 2021 09:46:02 -0700

Script 'mail_helper' called by obssrc
Hello community,

here is the log from the commit of package urlscan for openSUSE:Factory checked in at 2021-05-17 18:45:05
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/urlscan (Old)
 and      /work/SRC/openSUSE:Factory/.urlscan.new.2988 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Package is "urlscan"

Mon May 17 18:45:05 2021 rev:8 rq:893552 version:0.9.6

Changes:
--------
--- /work/SRC/openSUSE:Factory/urlscan/urlscan.changes  2020-08-07 14:22:37.590321213 +0200
+++ /work/SRC/openSUSE:Factory/.urlscan.new.2988/urlscan.changes        2021-05-17 18:45:22.416608611 +0200
@@ -1,0 +2,14 @@
+Wed May 12 22:03:24 UTC 2021 - Dirk Müller <[email protected]>
+
+- update to 0.9.6:
+  * Python 3.6+ required
+  * Convert to newer email.message.EmailMessage format for processing. Closes #98
+  * Hopefully fix #105. Escapes every "&" in the URL
+  * Attempt --run-safe implementation
+  * Fixes #106
+  * Scan a selection of email headers for URLs. Closes #97.
+  * Add option for custom regex. Closes #79.
+  * Allow $ as an acceptable trailing character
+  * Fix urwid reverse error. Thanks to @pavoljuhas. Closes #99 
+  
+-------------------------------------------------------------------
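
For readers skimming the changelog, here is a minimal sketch of the idea behind the new --run-safe option, based on the urlchoose.py hunk further down: the expression is split into arguments first and the URL is substituted afterwards, so no shell ever parses the URL. The helper name build_safe_command and the sample URL are made up for illustration and are not part of urlscan.

```python
import shlex


def build_safe_command(expression, url):
    """Hypothetical helper: split first, then substitute the URL per token."""
    # Splitting before substitution means spaces or shell metacharacters
    # inside the URL cannot create extra arguments or commands.
    return [token.format(url) for token in shlex.split(expression)]


# The sample URL contains characters a shell would treat specially.
cmd = build_safe_command("tmux set buffer {}", "https://example.com/?a=1&b=2;id")
print(cmd)
```

The resulting list could be handed to subprocess.Popen(cmd) without shell=True, which is also why shell features such as | and > are not available in this mode.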

Old:
----
  urlscan-0.9.5.tar.gz

New:
----
  urlscan-0.9.6.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ urlscan.spec ++++++
--- /var/tmp/diff_new_pack.fhOftc/_old  2021-05-17 18:45:23.068605845 +0200
+++ /var/tmp/diff_new_pack.fhOftc/_new  2021-05-17 18:45:23.068605845 +0200
@@ -1,7 +1,7 @@
 #
 # spec file for package urlscan
 #
-# Copyright (c) 2020 SUSE LLC
+# Copyright (c) 2021 SUSE LLC
 #
 # All modifications and additions to the file contributed by third parties
 # remain the property of their copyright owners, unless otherwise agreed
@@ -16,8 +16,9 @@
 #
 
 
+%define python_flavor python3
 Name:           urlscan
-Version:        0.9.5
+Version:        0.9.6
 Release:        0
 Summary:        An other URL extractor/viewer
 License:        GPL-2.0-or-later
@@ -25,16 +26,14 @@
 URL:            https://github.com/firecat53/urlscan
 Source0:        https://github.com/firecat53/urlscan/archive/%{version}.tar.gz#/%{name}-%{version}.tar.gz
 Source1:        muttrc
-Requires:       python3
-Requires:       python3-base
-Requires:       python3-urwid
 BuildRequires:  python3-base
 BuildRequires:  python3-devel
 BuildRequires:  python3-rpm-macros
 BuildRequires:  python3-setuptools
-BuildRoot:      %{_tmppath}/%{name}-%{version}-build
+Requires:       python3
+Requires:       python3-base
+Requires:       python3-urwid
 BuildArch:      noarch
-%define python_flavor python3
 
 %description
 The urlscan utility displays URLs found in an email message with
@@ -50,18 +49,17 @@
 
 %install
 python3 setup.py install --prefix=%{_prefix} --root=%{buildroot}
-rm -rf %{buildroot}/usr/share/doc/%{name}*
+rm -rf %{buildroot}%{_datadir}/doc/%{name}*
 mkdir -p %{buildroot}%{_defaultdocdir}/%{name}
-install -m 0644 %{S:1} %{buildroot}%{_defaultdocdir}/%{name}
+install -m 0644 %{SOURCE1} %{buildroot}%{_defaultdocdir}/%{name}
 rm -rvf %{buildroot}%{python_sitelib}/%{name}-%{version}-*-info
 
 %files
-%defattr(-,root,root)
 %license COPYING
 %doc README.rst
 %{_bindir}/%{name}
 %{python_sitelib}/%{name}
-%{_mandir}/man1/%{name}.1.gz
+%{_mandir}/man1/%{name}.1%{?ext_man}
 %doc %{_defaultdocdir}/%{name}/muttrc
 
 %changelog
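
As background for the source diff that follows, this is a small sketch of the parsing approach the new bin/urlscan adopts: reading the message as bytes with BytesParser and policy.default, so that the result is an EmailMessage whose get_content() returns decoded text. The sample message is invented, and the snippet uses parsebytes() for brevity where the diff uses parse() on a file object.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Made-up sample message; the body contains a non-ASCII UTF-8 character.
raw = (
    b"From: alice@example.org\n"
    b"Date: Mon, 17 May 2021 18:45:05 +0200\n"
    b"List-Archive: <https://lists.example.org/archive>\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"See https://example.org/caf\xc3\xa9 for details.\n"
)

# Same policy as in the diff: the modern policy with utf8 enabled.
parser = BytesParser(policy=policy.default.clone(utf8=True))
msg = parser.parsebytes(raw)

# get_content() decodes the body using the declared charset.
body = msg.get_content()
# Headers such as List-Archive are what --headers looks at.
archive = msg.get("List-Archive")
print(type(msg).__name__, archive)
print(body)
```

Because decoding is left to the email package, the per-encoding retry loop removed from process_input is no longer needed at this stage.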

++++++ urlscan-0.9.5.tar.gz -> urlscan-0.9.6.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/README.rst new/urlscan-0.9.6/README.rst
--- old/urlscan-0.9.5/README.rst        2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/README.rst        2021-03-23 06:00:17.000000000 +0100
@@ -19,7 +19,7 @@
 
 *NOTE* The last version that is Python 2 compatible is 0.9.3.
 
-Requires: Python 3.3+ and the python-urwid library
+Requires: Python 3.6+ and the python-urwid library
 
 Features
 --------
@@ -50,7 +50,7 @@
 
 - Use `l` to cycle through whether URLs are opened using the Python webbrowser
   module (default), xdg-open (if installed) or opened by a function passed on
-  the command line with `--run`.
+  the command line with `--run` or `--run-safe`.
 
 - Configure colors and keybindings via ~/.config/urlscan/config.json. Generate
   default config file for editing by running `urlscan -g`. Cycle through
@@ -64,6 +64,13 @@
 
 - Show complete help menu with `F1`. Hide header on startup with `--nohelp`.
 
+- Use a custom regular expression with `-E` for matching urls or any
+  other pattern. In junction with `-r`, this effectively turns urlscan
+  into a general purpose CLI selector-type utility.
+
+- Scan certain email headers for URLs. Currently `Link`, `Archived-At` and
+  `List-*` are scanned when `--headers` is passed.
+
 Installation and setup
 ----------------------
 
@@ -102,7 +109,7 @@
 
 ::
 
-    urlscan [-g, --genconf] [-n, --no-browser] [-c, --compact] [-d, --dedupe] [-r, --run <expression>] [-R, --reverse] [-s, --single] [-p, --pipe] [-w, --width] [-H, --nohelp] <file>
+    urlscan [-g, --genconf] [-n, --no-browser] [-c, --compact] [-d, --dedupe] [--headers] [-r, --run <expression>] [-f, --run-safe <expression>] [-R, --reverse] [-s, --single] [-p, --pipe] [-w, --width] [-H, --nohelp] [-E, --regex <expression>] <file>
 
 Urlscan can extract URLs and email addresses from emails or any text file.
 Calling with no flags will start the curses browser. Calling with '-n' will just
@@ -113,11 +120,11 @@
 urlscan` or `urlscan < <something>`
 
 Instead of opening a web browser, the selected URL can be passed as the argument
-to a command using `--run "<command> {}"`. Note the use of `{}` in the command
-string to denote the selected URL. Alternatively, the URL can be piped to the
-command using `--run <command> --pipe`. Using --run with --pipe is preferred if
-the command supports it, as it is marginally more secure and tolerant of special
-characters in the URL.
+to a command using `--run-safe "<command> {}"` or `--run "<command> {}"`. Note
+the use of `{}` in the command string to denote the selected URL. Alternatively,
+the URL can be piped to the command using `--run-safe <command> --pipe` (or
+`--run`). Using --run-safe with --pipe is preferred if the command supports it,
+as it is marginally more secure and tolerant of special characters in the URL.
 
 Theming
 -------
@@ -148,7 +155,7 @@
 - `context` -- show/hide context (default: `c`)
 - `down` -- cursor down (default: `j`)
 - `help_menu` -- show/hide help menu (default: `F1`)
-- `link_handler` -- cycle link handling (webbrowser, xdg-open or --run) (default: `l`)
+- `link_handler` -- cycle link handling (webbrowser, xdg-open, --run-safe or --run) (default: `l`)
 - `open_url` -- open selected URL (default: `space` or `enter`)
 - `palette` -- cycle through palettes (default: `p`)
 - `quit` -- quit (default: `q` or `Q`)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/bin/urlscan new/urlscan-0.9.6/bin/urlscan
--- old/urlscan-0.9.5/bin/urlscan       2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/bin/urlscan       2021-03-23 06:00:17.000000000 +0100
@@ -1,11 +1,11 @@
 #!/usr/bin/env python3
 """ A simple urlview replacement that handles things like quoted-printable
-properly.  aka "urlview minus teh suck"
+properly.
 
 """
 #
 #   Copyright (C) 2006-2007 Daniel Burrows
-#   Copyright (C) 2019 Scott Hansen
+#   Copyright (C) 2021 Scott Hansen
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License
@@ -21,17 +21,13 @@
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
 
-from __future__ import unicode_literals
 import argparse
 import io
-import locale
 import os
 import sys
+from email import policy
+from email.parser import BytesParser
 from urlscan import urlchoose, urlscan
-try:
-    from email.Parser import Parser as parser
-except ImportError:
-    from email.parser import Parser as parser
 
 
 def parse_arguments():
@@ -56,14 +52,25 @@
     arg_parse.add_argument('--dedupe', '-d', dest="dedupe",
                            action='store_true', default=False,
                            help="Remove duplicate URLs from list")
+    arg_parse.add_argument('--regex', '-E',
+                           help="Alternate custom regex to be used for all "
+                           "kinds of matching. "
+                           "For example: --regex 'https?://.+\.\w+'")
     arg_parse.add_argument('--run', '-r',
                            help="Alternate command to run on selected URL "
                            "instead of opening URL in browser. Use {} to "
                            "represent the URL value in the expression. "
                            "For example: --run 'echo {} | xclip -i'")
+    arg_parse.add_argument('--run-safe', '-f', dest="runsafe",
+                           help="Alternate command to run on selected URL "
+                           "instead of opening URL in browser. Use {} to "
+                           "represent the URL value in the expression. Safest "
+                           "run option but uses `shell=False` which does not "
+                           "allow use of shell features like | or >. Can use "
+                           "with --pipe.")
     arg_parse.add_argument('--pipe', '-p', dest='pipe',
                            action='store_true', default=False,
-                           help='Pipe URL into the command specified by --run')
+                           help="Pipe URL into the command specified by --run or --run-safe")
     arg_parse.add_argument('--nohelp', '-H', dest='nohelp',
                            action='store_true', default=False,
                            help='Hide help menu by default')
@@ -73,6 +80,9 @@
     arg_parse.add_argument('--width', '-w', dest='width',
                            type=int, default=0,
                            help='Set width to display')
+    arg_parse.add_argument('--headers', dest='headers',
+                           action='store_true', default=False,
+                           help='Scan certain message headers for URLs.')
     arg_parse.add_argument('message', nargs='?', default=sys.stdin,
                            help="Filename of the message to parse")
     return arg_parse.parse_args()
@@ -98,16 +108,9 @@
     file encoding differences.
 
         Args: fname - filename or sys.stdin
-        Returns: mesg - parsed (email parser) text of the message with the
-            correct encoding set
+        Returns: mesg - EmailMessage object
 
     """
-    enc_list = ['UTF-8', 'LATIN-1', 'iso8859-1', 'iso8859-2',
-                'UTF-16', 'CP720', 'CP437']
-    locale.setlocale(locale.LC_ALL, '')
-    code = locale.getpreferredencoding()
-    if code not in enc_list:
-        enc_list.insert(0, code)
     if fname is sys.stdin:
         try:
             stdin_file = fname.buffer.read()
@@ -115,34 +118,23 @@
             stdin_file = fname.read()
     else:
         stdin_file = None
-    for enc in enc_list:
-        try:
-            if stdin_file is not None:
-                fobj = io.StringIO(stdin_file.decode(enc))
-            else:
-                fobj = io.open(fname, mode='r', encoding=(enc))
-            f_keep = fobj
-            mesg = parser().parse(fobj)
-            if 'From' not in mesg.keys() and 'Date' not in mesg.keys():
-                # If it's not an email message, don't let the email parser
-                # delete the first line. If it is, let the parser do its job so
-                # we don't get mailto: links for all the To and From addresses
-                fobj = _fix_first_line(f_keep)
-                mesg = parser().parse(fobj)
-
-        except (UnicodeDecodeError, UnicodeError):
-            continue
-        else:
-            break
-        finally:
-            try:
-                fobj.close()
-            except NameError:
-                pass
-        raise Exception("Encoding not detected. Please pass encoding value manually")
+    if stdin_file is not None:
+        fobj = io.BytesIO(stdin_file)
+    else:
+        fobj = io.open(fname, mode='rb')
+    f_keep = fobj
+    mesg = BytesParser(policy=policy.default.clone(utf8=True)).parse(fobj)
+    if 'From' not in mesg.keys() and 'Date' not in mesg.keys():
+        # If it's not an email message, don't let the email parser
+        # delete the first line. If it is, let the parser do its job so
+        # we don't get mailto: links for all the To and From addresses
+        fobj = _fix_first_line(f_keep)
+        mesg = BytesParser(policy=policy.default.clone(utf8=True)).parse(fobj)
+    try:
+        fobj.close()
+    except NameError:
+        pass
     close_stdin()
-    # Handle multiple nested message parts
-    _msg_set_charset(mesg, enc)
     return mesg
 
 
@@ -151,37 +143,15 @@
     the URLs on that line will not be parsed by email.Parser. Add a blank line
     at the top of the file to ensure everything is read in a non-email file.
 
-      1. Take the file object 'f'.
-      2. Create a new StringIO object that starts with a blank line and read the
-      file into that. Return as open StringIO object 'f'
-      3. Return 'f'
-
     """
     fline.seek(0)
-    new = io.StringIO()
-    new.write("\n{}".format(fline.read()))
+    new = io.BytesIO()
+    new.write(b"\n" + fline.read())
     fline.close()
     new.seek(0)
     return new
 
 
-def _msg_set_charset(mesg, encoding):
-    """Recursive function to set the charset of nested message parts.
-
-    """
-    encoding = mesg.get_content_charset() or encoding
-    try:
-        mesg.set_charset(encoding)
-    except (AttributeError, TypeError):
-        for part in mesg.get_payload():
-            try:
-                # Try once to set correct encoding on the message part, then
-                # continue without crashing if it fails
-                _msg_set_charset(part, encoding)
-            except UnicodeEncodeError:
-                continue
-
-
 def main():
     """Entrypoint function for urlscan
 
@@ -192,18 +162,19 @@
         return
     msg = process_input(args.message)
     if args.nobrowser is False:
-        tui = urlchoose.URLChooser(urlscan.msgurls(msg),
+        tui = urlchoose.URLChooser(urlscan.msgurls(msg, regex=args.regex, headers=args.headers),
                                    compact=args.compact,
                                    reverse=args.reverse,
                                    nohelp=args.nohelp,
                                    dedupe=args.dedupe,
                                    run=args.run,
+                                   runsafe=args.runsafe,
                                    single=args.single,
                                    width=args.width,
                                    pipe=args.pipe)
         tui.main()
     else:
-        out = urlchoose.URLChooser(urlscan.msgurls(msg),
+        out = urlchoose.URLChooser(urlscan.msgurls(msg, regex=args.regex, headers=args.headers),
                                    dedupe=args.dedupe,
                                    reverse=args.reverse,
                                    shorten=False)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/setup.py new/urlscan-0.9.6/setup.py
--- old/urlscan-0.9.5/setup.py  2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/setup.py  2021-03-23 06:00:17.000000000 +0100
@@ -3,17 +3,30 @@
 from setuptools import setup
 
 setup(name="urlscan",
-      version="0.9.5",
+      version="0.9.6",
       description="View/select the URLs in an email message or file",
       author="Scott Hansen",
       author_email="[email protected]",
       url="https://github.com/firecat53/urlscan",
-      download_url="https://github.com/firecat53/urlscan/archive/0.9.5.zip",
+      download_url="https://github.com/firecat53/urlscan/archive/0.9.6.zip",
       packages=['urlscan'],
       scripts=['bin/urlscan'],
       package_data={'urlscan': ['assets/*']},
       data_files=[('share/doc/urlscan', ['README.rst', 'COPYING']),
                   ('share/man/man1', ['urlscan.1'])],
       license="GPLv2",
-      install_requires=["urwid>=1.2.1"]
+      install_requires=["urwid>=1.2.1"],
+      classifiers=[
+          'Development Status :: 4 - Beta',
+          'Environment :: Console',
+          'Environment :: Console :: Curses',
+          'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
+          'Operating System :: OS Independent',
+          'Programming Language :: Python',
+          'Programming Language :: Python :: 3.6',
+          'Programming Language :: Python :: 3.7',
+          'Programming Language :: Python :: 3.8',
+          'Programming Language :: Python :: 3.9',
+          'Topic :: Utilities'],
+      keywords=("urlscan urlview email mutt tmux"),
       )
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/urlscan/urlchoose.py new/urlscan-0.9.6/urlscan/urlchoose.py
--- old/urlscan-0.9.5/urlscan/urlchoose.py      2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/urlscan/urlchoose.py      2021-03-23 06:00:17.000000000 +0100
@@ -1,5 +1,5 @@
 #   Copyright (C) 2006-2007 Daniel Burrows
-#   Copyright (C) 2020 Scott Hansen
+#   Copyright (C) 2021 Scott Hansen
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License
@@ -92,7 +92,8 @@
 class URLChooser:
 
     def __init__(self, extractedurls, compact=False, reverse=False, nohelp=False, dedupe=False,
-                 shorten=True, run="", single=False, pipe=False, genconf=False, width=0):
+                 shorten=True, run="", runsafe="", single=False, pipe=False,
+                 genconf=False, width=0):
         self.conf = expanduser("~/.config/urlscan/config.json")
         self.keys = {'/': self._search_key,
                      '0': self._digits,
@@ -178,6 +179,7 @@
         self.shorten = shorten
         self.compact = compact
         self.run = run
+        self.runsafe = runsafe
         self.single = single
         self.pipe = pipe
         self.search = False
@@ -208,7 +210,9 @@
                        "/ - search :: "
                        "URL opening mode - {}")
         self.link_open_modes = ["Web Browser", "Xdg-Open"] if self.xdg is True else ["Web Browser"]
-        if self.run:
+        if self.runsafe:
+            self.link_open_modes.insert(0, self.runsafe)
+        elif self.run:
             self.link_open_modes.insert(0, self.run)
         self.nohelp = nohelp
         if nohelp is False:
@@ -323,8 +327,8 @@
 
     def _open_url(self):
         """<Enter> or <space>"""
-        load_text = "Loading URL..." if self.link_open_modes[0] != self.run \
-            else "Executing: {}".format(self.run)
+        load_text = "Loading URL..." if self.link_open_modes[0] != (self.run or self.runsafe) \
+            else "Executing: {}".format(self.run or self.runsafe)
         if os.environ.get('BROWSER') not in ['elinks', 'links', 'w3m', 'lynx']:
             self._footer_display(load_text, 5)
 
@@ -462,7 +466,7 @@
     def _reverse(self):
         """ R """
         # Reverse items
-        fpo = self.top.body.focus_position
+        fpo = self.top.base_widget.body.focus_position
         if self.compact is True:
             self.items.reverse()
         else:
@@ -475,8 +479,8 @@
                 else:
                     rev.insert(2, item)
             self.items = rev
-        self.top.body = urwid.ListBox(self.items)
-        self.top.body.focus_position = self._cur_focus(fpo)
+        self.top.base_widget.body = urwid.ListBox(self.items)
+        self.top.base_widget.body.focus_position = self._cur_focus(fpo)
 
     def _context(self):
         """ c """
@@ -505,7 +509,7 @@
         cmds = COPY_COMMANDS_PRIMARY if pri else COPY_COMMANDS
         for cmd in cmds:
             try:
-                proc = Popen(shlex.split(cmd), stdin=PIPE)
+                proc = Popen(shlex.split(cmd), stdin=PIPE, stdout=DEVNULL, stderr=DEVNULL)
                 proc.communicate(input=url.encode(sys.getdefaultencoding()))
                 self._footer_display("Copied url to {} selection".format(
                     "primary" if pri is True else "clipboard"), 5)
@@ -635,7 +639,7 @@
 
     def _link_handler(self):
         """Function to cycle through opening links via webbrowser module,
-        xdg-open or custom expression passed with --run.
+        xdg-open or custom expression passed with --run-safe or --run.
 
         """
         mode = self.link_open_modes.pop()
@@ -659,10 +663,17 @@
                     self.search = False
                     self.enter = False
             elif self.link_open_modes[0] == "Web Browser":
-                webbrowser.open(url)
+                webbrowser.open(url.replace('&', '\&'))
             elif self.link_open_modes[0] == "Xdg-Open":
                 run = 'xdg-open "{}"'.format(url)
                 process = Popen(shlex.split(run), stdout=PIPE, stdin=PIPE)
+            elif self.link_open_modes[0] == self.runsafe:
+                if self.pipe:
+                    process = Popen(shlex.split(self.runsafe), stdout=PIPE, stdin=PIPE)
+                    process.communicate(input=url.encode(sys.getdefaultencoding()))
+                else:
+                    cmd = [i.format(url) for i in shlex.split(self.runsafe)]
+                    Popen(cmd).communicate()
             elif self.link_open_modes[0] == self.run and self.pipe:
                 process = Popen(shlex.split(self.run), stdout=PIPE, stdin=PIPE)
                 process.communicate(input=url.encode(sys.getdefaultencoding()))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/urlscan/urlscan.py new/urlscan-0.9.6/urlscan/urlscan.py
--- old/urlscan-0.9.5/urlscan/urlscan.py        2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/urlscan/urlscan.py        2021-03-23 06:00:17.000000000 +0100
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 #   Copyright (C) 2006-2007 Daniel Burrows
-#   Copyright (C) 2020 Scott Hansen
+#   Copyright (C) 2021 Scott Hansen
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License
@@ -18,18 +18,10 @@
 
 """Contains the backend logic that scans messages for URLs and context."""
 
+from html.parser import HTMLParser
+import locale
 import os
 import re
-from html.parser import HTMLParser
-
-
-def get_charset(message, default="utf-8"):
-    """Get the message charset"""
-    if message.get_content_charset():
-        return message.get_content_charset()
-    if message.get_charset():
-        return message.get_charset()
-    return default
 
 
 class Chunk:
@@ -255,7 +247,7 @@
 
 
 URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
-URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
+URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#$]'
 HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                   r'*' + URLTRAILINGPATTERN + r')')
 # Used to guess that blah.blah.blah.TLD is a URL.
@@ -302,7 +294,7 @@
 assert not URLRE.match('blah.baz.obviouslynotarealdomain')
 
 
-def parse_text_urls(mesg):
+def parse_text_urls(mesg, regex=None):
     """Parse a block of text, splitting it into its url and non-url
     components."""
 
@@ -310,16 +302,24 @@
 
     loc = 0
 
+    global URLRE
+
+    if regex:
+        URLRE = re.compile(regex)
+
     for match in URLRE.finditer(mesg):
         if loc < match.start():
             rval.append(Chunk(mesg[loc:match.start()], None))
         # Turn email addresses into mailto: links
-        email = match.group("email")
-        if email and "mailto" not in email:
-            mailto = "mailto:{}".format(email)
+        if regex:
+            rval.append(Chunk(None, match.group(0)))
         else:
-            mailto = match.group(1)
-        rval.append(Chunk(None, mailto))
+            email = match.group("email")
+            if email and "mailto" not in email:
+                mailto = "mailto:{}".format(email)
+            else:
+                mailto = match.group(1)
+            rval.append(Chunk(None, mailto))
         loc = match.end()
 
     if loc < len(mesg):
@@ -393,7 +393,7 @@
 NLRE = re.compile('\r\n|\n|\r')
 
 
-def extracturls(mesg):
+def extracturls(mesg, regex=None):
     """Given a text message, extract all the URLs found in the message, along
     with their surrounding context.  The output is a list of sequences of Chunk
     objects, corresponding to the contextual regions extracted from the string.
@@ -412,7 +412,7 @@
     # lines with more than one entry or one entry that's
     # a URL are the only lines containing URLs.
 
-    linechunks = [parse_text_urls(l) for l in lines]
+    linechunks = [parse_text_urls(l, regex=regex) for l in lines]
 
     return extract_with_context(linechunks,
                                 lambda chunk: len(chunk) > 1 or
@@ -439,41 +439,68 @@
     return extract_with_context(chunk.rval, somechunkisurl, 1, 1)
 
 
-def decode_bytes(byt, enc='utf-8'):
-    """Given a string or bytes input, return a string.
+def msgheaders(msg):
+    """ Process email message headers for URLs
 
-        Args: bytes - bytes or string
-              enc - encoding to use for decoding the byte string.
+    Args: msg - email message object
+    Returns: list
 
     """
-    try:
-        strg = byt.decode(enc)
-    except UnicodeDecodeError as err:
-        strg = "Unable to decode message:\n{}\n{}".format(str(byt), err)
-    except (AttributeError, UnicodeEncodeError):
-        # If byt is already a string, just return it
-        return byt
-    return strg
+    headers = ('Archived-At',
+               'Link',
+               'List-Archive',
+               'List-ID',
+               'List-Help',
+               'List-Owner',
+               'List-Post',
+               'List-Subscribe',
+               'List-Unsubscribe',
+               'List-Unsubscribe-Post')
+    res = []
+    for hdr in headers:
+        hdri = msg.get(hdr)
+        if hdri:
+            res.append(hdri)
+    return res
+
+
+def set_charset(message):
+    """Get and/or set the message or message part charset. Try the
+    content-charset or charset if it exists, or attempt to decode the message
+    with a variety of charsets to find the correct one.
 
+        Args: message - EmailMessage object
+        Returns: message - EmailMessage object
 
-def decode_msg(msg, enc='utf-8'):
     """
-    Decodes a message fragment.
-
-    Args: msg - A Message object representing the fragment
-          enc - The encoding to use for decoding the message
-    """
-    # We avoid the get_payload decoding machinery for raw
-    # content-transfer-encodings potentially containing non-ascii characters,
-    # such as 8bit or binary, as these are encoded using raw-unicode-escape which
-    # seems to prevent subsequent utf-8 decoding.
-    cte = str(msg.get('content-transfer-encoding', '')).lower()
-    decode = cte not in ("8bit", "7bit", "binary")
-    res = msg.get_payload(decode=decode)
-    return decode_bytes(res, enc)
+    if message.get_content_charset():
+        return message
+    if message.get_charset():
+        return message
+    enc_list = ['UTF-8', 'LATIN-1', 'iso8859-1', 'iso8859-2',
+                'UTF-16', 'CP1252', 'CP720', 'CP437']
+    locale.setlocale(locale.LC_ALL, '')
+    code = locale.getpreferredencoding()
+    if code not in enc_list:
+        enc_list.insert(0, code)
+    for enc in enc_list:
+        try:
+            message.as_bytes().decode(enc)
+        except (UnicodeDecodeError, UnicodeError):
+            continue
+        else:
+            try:
+                message.set_param('charset', enc)
+            except (KeyError, UnicodeEncodeError):
+                # Try once to set correct encoding on the message part, then
+                # continue without crashing if it fails
+                continue
+            break
+        raise Exception("Encoding not detected.")
+    return message
 
 
-def msgurls(msg, urlidx=1):
+def msgurls(msg, urlidx=1, regex=None, headers=False):
     """Main entry function for urlscan.py
 
     """
@@ -481,19 +508,22 @@
     # one subpart in the future (e.g., for
     # multipart/alternative).  Actually, I might even add
     # a browser for the message structure?
-    enc = get_charset(msg)
+    if headers is True:
+        for part in msgheaders(set_charset(msg)):
+            for chunk in extracturls(part):
+                urlidx += 1
+                yield chunk
+    msg = set_charset(msg)
     if msg.is_multipart():
-        for part in msg.get_payload():
-            for chunk in msgurls(part, urlidx):
+        for part in msg.iter_parts():
+            for chunk in msgurls(set_charset(part), urlidx, regex=regex):
                 urlidx += 1
                 yield chunk
     elif msg.get_content_type() == "text/plain":
-        decoded = decode_msg(msg, enc)
-        for chunk in extracturls(decoded):
+        for chunk in extracturls(msg.get_content(), regex=regex):
             urlidx += 1
             yield chunk
     elif msg.get_content_type() == "text/html":
-        decoded = decode_msg(msg, enc)
-        for chunk in extracthtmlurls(decoded):
+        for chunk in extracthtmlurls(msg.get_content()):
             urlidx += 1
             yield chunk
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/urlscan-0.9.5/urlscan.1 new/urlscan-0.9.6/urlscan.1
--- old/urlscan-0.9.5/urlscan.1 2020-07-09 19:25:47.000000000 +0200
+++ new/urlscan-0.9.6/urlscan.1 2021-03-23 06:00:17.000000000 +0100
@@ -1,6 +1,6 @@
 .\"                                      Hey, EMACS: -*- nroff -*-
 
-.TH URLSCAN 1 "15 May 2020"
+.TH URLSCAN 1 "6 March 2021"
 
 .SH NAME
 urlscan \- browse the URLs in an email message from a terminal
@@ -14,8 +14,8 @@
 .SH DESCRIPTION
 \fBurlscan\fR accepts a single email message on standard
 input, then displays a terminal-based list of the URLs in the given
-message.  Selecting a URL uses the Python webbrowser module to 
-determine which browser to open. The \fBBROWSER\fR environment 
+message.  Selecting a URL uses the Python webbrowser module to
+determine which browser to open. The \fBBROWSER\fR environment
 variable will be used if it is set.
 
 \fBurlscan\fR is primarily intended to be used with the
@@ -53,15 +53,19 @@
 \fB7.\fR \fBu\fR will unescape the highlighted URL if necessary.
 
 \fB8.\fR Run a command with the selected URL as the argument or pipe the
-selected URL to a command using the \fB--run\fR and \fB--pipe\fR arguments.
+selected URL to a command using the \fB--run-safe\fR, \fB--run\fR and
+\fB--pipe\fR arguments.
 
 \fB9.\fR Use \fBl\fR to cycle through whether URLs are opened using the Python
 webbrowser module (default), xdg-open (if installed) or a function passed on the
-command line with \fB--run\fR. The \fB--run\fR function will respect the value
-of \fB--pipe\fR.
+command line with \fB--run-safe\fR or \fB--run\fR. The \fB--run\fR and
+\fB--run-safe\fR functions will respect the value of \fB--pipe\fR.
 
 \fB10.\fR \fBF1\fR shows the help menu.
 
+\fB11.\fR Scan certain email headers for URLs. Currently \fBLink\fR,
+\fBArchived-At\fR and \fBList-*\fR are scanned when \fB--headers\fR is passed.
+
 .SH OPTIONS
 .TP
 .B \-g, \-\-genconf
@@ -81,19 +85,25 @@
 Disables the selection interface and print the links to standard output.
 Useful for scripting (implies \fB\-\-compact\fR).
 .TP
-.B \-r, \-\-run \<expression\>
+.B \-f, \-\-run\-safe \<expression\>
 Execute \<expression\> in place of opening URL with a browser. Use {} in
 \<expression\> to substitute in the URL. Examples:
 
+    $ urlscan --run-safe 'tmux set buffer {}'
+.TP
+.B \-r, \-\-run \<expression\>
+Execute \<expression\> in place of opening URL with a browser. Use {} in
+\<expression\> to substitute in the URL. Shell features such as \| and \> can be
+used, but it is less secure. Examples:
+
     $ urlscan --run 'echo {} | xclip -i' file.txt
-    $ urlscan --run 'tmux set buffer {}'
 .TP
 .B \-p, \-\-pipe
-Pipe the selected URL to the command specified by `--run`. This is preferred
-when the command supports it, as it is more secure and tolerant of special
-characters in the URL. Example:
+Pipe the selected URL to the command specified by `--run-safe` or `--run`. This
+is preferred when the command supports it, as it is more secure and tolerant of
+special characters in the URL. Example:
 
-    $ urlscan --run 'xclip -i' --pipe file.txt
+    $ urlscan --run-safe 'xclip -i' --pipe file.txt
 .TP
 .B \-R, \-\-reverse
 Reverse displayed order of URLs.
@@ -105,6 +115,18 @@
 .TP
 .B \-w, \-\-width
 Set display width.
+.TP
+.B \-E, \-\-regex \<expression\>
+Use \<expression\> in place of the default set of regular expressions,
+to be used for any kind of matching. This is useful for example when
+selectively avoiding 'mailto:' links or any other pattern that urlscan
+could interpret as urls (such as '<filename>.<extension>'). Usage
+example:
+
+    $ urlscan --regex 'https?://.+\.\w+' file.txt
+.TP
+.B \-\-headers
+Scan email headers for URLs.
 
 .SH MUTT INTEGRATION
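
To illustrate the --regex option documented in the man page hunk above, here is a rough sketch of what a custom pattern does to ordinary text. It only uses the re module and the example pattern from the man page; the sample lines are invented, and the list comprehension is a simplification of the regex branch in parse_text_urls, which keeps match.group(0) for each match.

```python
import re

# The pattern is the example given in the man page text above.
custom = re.compile(r"https?://.+\.\w+")

lines = [
    "Contact bob@example.org for access.",
    "Docs: https://example.org/guide/index.html (draft)",
]

# Collect the full match from every line, mirroring the use of
# match.group(0) when a custom regex is supplied.
found = [m.group(0) for line in lines for m in custom.finditer(line)]
print(found)
```

With this pattern the email address on the first line is not reported, which matches the man page's remark about selectively avoiding 'mailto:' links.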
