> +    def handle(self, *args, **options):
> +        # Attempt to parse the path if provided, and fallback to stdin if not
> +        if args:
> +            logger.info('Parsing mail loaded by filename')
> +            with open(args[0]) as file_:
> +                mail = message_from_file(file_)
> +        else:
> +            logger.info('Parsing mail loaded from stdin')
> +            mail = message_from_file(sys.stdin)
> +

So, I have found an interesting case here, not strictly related to this
patch but related to parsing messages from files.

I have been testing with some messages from this list from earlier this
month. One [0] includes the following sequence:

000018f0  69 65 73 20 76 69 65 77  29 20 3f c2 a0 20 48 6f  |ies view) ?.. Ho|

Note the sequence "c2 a0". Both bytes are greater than 127 and therefore
not part of 7-bit ASCII. Apparently this is the UTF-8 encoding of a
non-breaking space: http://stackoverflow.com/a/2774507/463510

email.message_from_file does not handle this well. It boils down to:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 6395: ordinal not in range(128)

I imagine this hasn't hit us in production because most (all?)
production users use Python 2, which doesn't have the bytes/string
distinction that Python 3 has.

Anyway, the only way I've found to work around this is to do something
like this:

    with open(args[0], 'rb') as file_:
        decoded_mail = file_.read().decode('utf-8')
        mail = email.message_from_string(decoded_mail)

This is super ugly, but works in Py3. Ironically it doesn't work in Py2,
but it's a start.

Could you include something like this in this patch set? I think
parsearchive will require something similar too.

I'm going to start collecting these "interesting" emails to make a test
suite.

Regards,
Daniel

[0] https://lists.ozlabs.org/pipermail/patchwork/2016-August/003158.html
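
As a possible alternative to decoding by hand, here is a minimal sketch that hands the raw bytes to the email package on Python 3 and falls back to message_from_file on Python 2. The helper name parse_mail_bytes and the sample message are made up for illustration and are not part of the patch; email.message_from_binary_file is in the standard library from Python 3.2 onwards. It has not been run against the message in [0].

```python
import email
import io
import sys


def parse_mail_bytes(stream):
    """Parse a mail from a stream opened in binary mode (hypothetical helper)."""
    if sys.version_info[0] >= 3:
        # The bytes parser does not decode as strict ASCII, so non-ASCII
        # bytes such as c2 a0 do not raise UnicodeDecodeError.
        return email.message_from_binary_file(stream)
    # Python 2 has no bytes/str split, so the plain parser is enough.
    return email.message_from_file(stream)


# Made-up sample containing the same c2 a0 sequence as the hexdump.
raw = b"Subject: test\n\n(series view) ?\xc2\xa0 Ho\n"
mail = parse_mail_bytes(io.BytesIO(raw))

# decode=True returns the payload as bytes, which can then be decoded.
body = mail.get_payload(decode=True).decode("utf-8")
```

For a real file the stream would come from open(args[0], 'rb'); for stdin on Python 3 the binary stream is sys.stdin.buffer.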
