Hi Stephen, > > Excellent work: this works well and the tests are much appreciated :) > I spent some time reviewing this and I have one question: should we > extend this to other header parsing, such as 'Subject'? I ask because I > noticed that we have a 'clean_header' function which is already used to > handle some unicode headers and it should probably handle things like > invalid characters too. I did some hacking to see if it could slot in > place of the above changes: > > def find_headers(mail): > + return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, > value > + in mail.items()]) > # We have some Py2/Py3 issues here. > > On running this, this failed with a LookupError on Python 3 and the > UnicodeDecodeError on Python 2, so I tried to handle these: > > def clean_header(header): > """Decode (possibly non-ascii) headers.""" > def decode(fragment): > - (frag_str, frag_encoding) = fragment > + frag_str, frag_encoding = fragment > if frag_encoding: > - return frag_str.decode(frag_encoding) > + if frag_encoding != 'unknown-8bit': > + return frag_str.decode(frag_encoding) > + else: > + return frag_str.decode('ascii', errors='replace') > elif isinstance(frag_str, six.binary_type): # python 2 > - return frag_str.decode() > + try: > + return frag_str.decode() > + except UnicodeDecodeError: > + return frag_str.decode('ascii', errors='replace') > + > return frag_str > > fragments = [decode(x) for x in decode_header(header)] > > When I make this change, all tests pass. Perhaps we should go about > integrating your changes in this function and adding tests for things > like invalid subjects lines to make sure this does what it says on the > tin?
So it turns out that: - we need the change to clean_header() you identified to properly handle invalid characters in fields that are parsed by clean_header() - clean_header() returns a decoded (8-bit) string, without wrapping. find_headers() on the other hand, needs to return encoded (7-bit) wrapped header lines. I'll probably add a helper function that deals with the decoding properly, then I'll pipe that into both clean_header() and find_header(). Regards, Daniel _______________________________________________ Patchwork mailing list Patchwork@lists.ozlabs.org https://lists.ozlabs.org/listinfo/patchwork