Package: dodgy
Version: 0.1.9-3
Severity: important
File: /usr/lib/python3/dist-packages/dodgy/run.py

dodgy basically does this:

  • recursively find all regular files under ./
  • for each file,
    • if its MIME type appears to be text/*,
      • assume it is UTF-8
      • assume it is uncompressed
      • search it for "bad" regular expressions, and report any matches

This misdetects compressed text files:

    bash4$ with-temp-dir
    with-temp-dir: entering directory `/tmp/with-temp-dir.stBmJY'
    This directory will be deleted when you exit.

    bash4$ wget -nv https://secure.eicar.org/eicar.com.txt
    2019-01-09 13:54:31 URL:https://secure.eicar.org/eicar.com.txt [68/68] -> "eicar.com.txt" [1]
    bash4$ gzip eicar.com.txt
    bash4$ file --mime eicar.com.txt.gz
    eicar.com.txt.gz: application/gzip; charset=binary
    bash4$ python3 -m mimetypes eicar.com.txt.gz
    type: text/plain encoding: gzip
    bash4$ dodgy
    Traceback (most recent call last):
      File "/usr/bin/dodgy", line 4, in <module>
        dodgy.run.run()
      File "/usr/lib/python3/dist-packages/dodgy/run.py", line 56, in run
        warnings = run_checks(os.getcwd())
      File "/usr/lib/python3/dist-packages/dodgy/run.py", line 44, in run_checks
        for msg_parts in check_file(filepath):
      File "/usr/lib/python3/dist-packages/dodgy/checks.py", line 72, in check_file
        return check_file_contents(to_check.read())
      File "/usr/lib/python3.7/codecs.py", line 701, in read
        return self.reader.read(size)
      File "/usr/lib/python3.7/codecs.py", line 504, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This happens because dodgy expects compressed files to be
application/*, but mimetypes reports them as text/* (with the
compression reported separately, in the "encoding" field).

To fix this, dodgy can be extended to check the encoding property and
then use different open()/gzip.open()/lzma.open() calls depending on
what kind of compression is used.

I think dodgy's assumption of UTF-8 will also produce crashes and
false negatives if UTF-16 is used for Windows/Java compatibility
reasons. However, I was not able to produce a trivial example.

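A rough sketch of the encoding check suggested above, in case it is
useful. This is not dodgy's actual code: open_text() is a hypothetical
helper name, the set of supported compressions is my own choice, and
errors="replace" is an assumption that avoids the crash but does
nothing about the UTF-16 false negatives.

```python
import bz2
import gzip
import lzma
import mimetypes

# Map the "encoding" value returned by mimetypes.guess_type() to a
# function that can open that kind of file in text mode.
# Encodings not listed here (e.g. "compress") are treated as unsupported.
_OPENERS = {
    None: open,          # not compressed
    "gzip": gzip.open,
    "bzip2": bz2.open,
    "xz": lzma.open,
}


def open_text(path):
    """Hypothetical helper: return a text-mode file object for path,
    or None if path does not look like (possibly compressed) text."""
    mime_type, encoding = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("text/"):
        return None
    opener = _OPENERS.get(encoding)
    if opener is None:
        return None
    # Assumption: decode as UTF-8 and replace undecodable bytes
    # instead of raising UnicodeDecodeError.
    return opener(path, "rt", encoding="utf-8", errors="replace")
```

With something like this, check_file() could skip a file when the
helper returns None, and otherwise pass the result of .read() to
check_file_contents() as it does now.
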
You may also want to switch from mimetypes to python3-magic, which
uses libmagic1 (i.e. file(1)). This will make dodgy guess the MIME
type based on the file's *CONTENTS*, rather than just the file's
*NAME*.

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages dodgy depends on:
ii  python3  3.7.1-3

dodgy recommends no packages.

dodgy suggests no packages.

-- no debconf information