Hi all, Parsing is hard. So, let's see what happens if we try parsing mail. All the code discussed is on GitHub , and links are provided through-out.
Firstly, let's look at how we currently parse individual mails. parsemail.py does the hard yards here. However, it doesn't start with modern Django and python 3, so let's fix that first . parsemail.py can take input in either a file name or in stdin. parsemail.sh then invokes parsemail.py. However, it's designed for python2 and production settings. I'd like to: - be able to test with either python version using $PW_PYTHON. - Only set the settings module if it isn't already set the environment. (In docker this is preset to the dev settings.) - pass arguments to parsemail.py, like --list-id This is all done in . Finally, parsemail-batch.sh can be used to invoke parsemail.sh in batch mode. Importantly, it feeds input to parsemail.sh through stdin, not filenames. It needs a small fixup to pass arguments through to parsemail.sh . OK, so now we're in the position to start testing stuff. My test corpus is: - The email that Thomas found caused a bug. - 310 messages from the patchwork list I've stuck this on github too, in a corpus directory . So, let's see how this parses. Firstly, let's try parsemail-batch.sh. This feeds mail in through stdin. I reset the database each time. So: PW_PYTHON=python2 patchwork/bin/parsemail-batch.sh corpus/ --list-id=patchwork.ozlabs.org PW_PYTHON=python3 patchwork/bin/parsemail-batch.sh corpus/ --list-id=patchwork.ozlabs.org With both python2 and python2, we get only one failure: mail-not-parsed.mbox Traceback (most recent call last): File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 115, in <module> sys.exit(main(sys.argv)) File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 103, in main result = parse_mail(mail, args['list_id']) File "/home/patchwork/patchwork/patchwork/parser.py", line 627, in parse_mail headers = find_headers(mail) File "/home/patchwork/patchwork/patchwork/parser.py", line 161, in find_headers for (k, v) in list(mail.items())]) File "/home/patchwork/patchwork/patchwork/parser.py", line 161, in <listcomp> for (k, v) in list(mail.items())]) File "/usr/lib/python3.5/email/header.py", line 217, in __init__ self.append(s, charset, errors) File "/usr/lib/python3.5/email/header.py", line 295, in append s = s.decode(input_charset, errors) AttributeError: 'Header' object has no attribute 'decode' Let's fix that with , which is based on what Stephen and I posted to the list, and many, many hours of [wat]. This stuff is _hard_. So, now lets add a test to stop that happening again. That's . (We also add a related test for when a valid, but unescaped, UTF-8 char is in a header.) Much to our dismay, that test errors out on Python3. Wat? Traceback (most recent call last): File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 541, in test_invalid_header_char '0012-invalid-header-char.mbox') File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 88, in _find_content mail = read_mail(mbox_filename, project=self.project) File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 54, in read_mail mail = message_from_file(open(file_path)) File "/usr/lib/python3.5/email/__init__.py", line 54, in message_from_file return Parser(*args, **kws).parse(fp) File "/usr/lib/python3.5/email/parser.py", line 54, in parse data = fp.read(8192) File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 1019: ordinal not in range(128) Here we're getting skewered by the way we're reading the file. So, lets do things the way they're supposed to be done in Python3 and use message_from_binary_file. This now works: . Tox verifies this works with py27 as well. So, now we can parse mail with parsemail-batch. Remember though, this is not the only way we can ingest mail. parsemail.sh can take input *either* on stdin, *or* as a file name argument. parsemail-batch provides it through stdin. What happens if we try to parse our corpus mail by mail using the file name argument? for x in corpus/*; do echo $x PW_PYTHON=python2 patchwork/bin/parsemail.sh --list-id=patchwork.ozlabs.org $x done Now, we succeed in Python 2, but in Python 3, we get errors like this: corpus/1470109248.8244_5.possimpible,U=6567:2,RS Traceback (most recent call last): File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 115, in <module> sys.exit(main(sys.argv)) File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 101, in main mail = message_from_file(args['infile']) File "/usr/lib/python3.5/email/__init__.py", line 54, in message_from_file return Parser(*args, **kws).parse(fp) File "/usr/lib/python3.5/email/parser.py", line 54, in parse data = fp.read(8192) File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2036: ordinal not in range(128) So, why does it work on stdin, but not in a file open? The python documentation suggests that stdin is automatically decoded: , and this can be controlled by PYTHONIOENCODING environment variable: . The default encoding is based on the locale. In the docker container, the IO encoding defaults to US-ASCII (8-bit ascii rather than 7-bit). Sure enough, if we force it to 7-bit ASCII with PYTHONIOENCODING=ascii:strict, things break again: for x in corpus/*; do echo $x; PW_PYTHON=python3 PYTHONIOENCODING=ascii:strict patchwork/bin/parsemail.sh --list-id=patchwork.ozlabs.org < $x; done corpus/1470109248.8244_5.possimpible,U=6567:2,RS ... UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2036: ordinal not in range(128) So, there are now 2 issues on Python 3: - file open doesn't work - stdin behaviour is dependent on your locale. So, let's use the binary file mode of stdin, and binary files on open in Py3. This is . Yay, so now parsing of mail works, for the old-style parse setup, for the samples we know about. And it only took us 8 patches! Now we have Stephen's series to contend with. I've only been looking at parsemail, so I've only pulled that in. After fixing up some merge conflicts, it's . This has of course undone  - the use of message_from_binary_file, and obsoleted  - which just fixed an error in the old parsemail.py. This means we get some failures on py3. They're fixed in  in much the same way. So now we can use Stephen's new managment command to import mail in Py2 and Py3 with both stdin and filenames, and it passes the corpus. However, in so doing we've broken the tests in 2 inter-related ways. Firstly, the stdin tests pass files in instead of file names. I just broke that by changing the behaviour of the argument, so that's not surprising. However, that revealed that the tests marked stdin don't reopen stdin, so they don't actually test stdin behaviour. Fix that up . Finally, lets add tests for of a UTF-8 issue when invoked through the management command: . We reuse the UTF-8 email we created for . (Ideally we'd like a patch with UTF-8 in the body of the mail, but I can't find one yet.) This tests that the managment command deals correctly with files and stdin containing non-ASCII content. This also gives us a great scaffold to test other interesting email that we come across. So now things work in the 2x2 matrix of [py2, py3] x [stdin, filename]. It only took 12 patches! Yay! I've cleaned up the patches into a new series: , which weighs in at a much more manageable 6 patches. I'll submit that soon. Given that it contains mbox files, bits of it will get lost on the list, so I'll probably do a git pull request with a signed tag as well. Regards, Daniel : https://github.com/daxtens/patchwork/commits/parse-demo : https://github.com/daxtens/patchwork/commit/daa09b806427d7d207e3da503cc59ad8655041eb : https://github.com/daxtens/patchwork/commit/018d3f92c03ed54023a90622c1be3a1f52e39e2f : https://github.com/daxtens/patchwork/commit/8863a2d7ad9aa771df45650dc523aa5087c8fe56 : https://github.com/daxtens/patchwork/commit/5e9271231d7b01cf0f9a97028a2630a3a148303e : https://github.com/daxtens/patchwork/commit/88af79565e9bc85f587c076ed6eddb692f48b02e : https://github.com/daxtens/patchwork/commit/3a3886f083398f904ae7bf318fa0c910c900bde5 [wat]: https://www.destroyallsoftware.com/talks/wat : https://github.com/daxtens/patchwork/commit/09d9c91e3acfb1151b89f6f380f30248e2adcb62 : https://docs.python.org/3/library/sys.html#sys.stdin : https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING : https://github.com/daxtens/patchwork/commit/a791cf66ccad4017a8c4ba5cb10bea01faf40a5f : https://github.com/daxtens/patchwork/commit/b9e7dc75bb3008ed4eb8ef14271d5cd175f1b575 : https://github.com/daxtens/patchwork/commit/3614100bd1b03f287262f386fe0d193b22f83381 : https://github.com/daxtens/patchwork/commit/01a21af2ed500f6efc753ef347f0196a13882041 : https://github.com/daxtens/patchwork/commit/7c465c49112a4627f19fd5a45615516520997686 : https://github.com/daxtens/patchwork/commits/parse-clean
Description: PGP signature
_______________________________________________ Patchwork mailing list Patchwork@lists.ozlabs.org https://lists.ozlabs.org/listinfo/patchwork