>> +def sanitise_header(header_contents, header_name=None): >> + """Given a header with header_contents, optionally labelled >> + header_name, decode it with decode_header, sanitise it to make >> + sure it decodes correctly and contains no invalid characters. >> + Then encode the result with make_header() >> + """ >> + >> + # We have some Py2/Py3 issues here. >> + # >> + # Firstly, the email parser (before we get here) >> + # Python 3: headers with weird chars are email.header.Header >> + # class, others as str >> + # Python 2: every header is an str >> + # >> + # Secondly, the behaviour of decode_header: >> + # Python 3: weird headers are labelled as unknown-8bit >> + # Python 2: weird headers are not labelled differently >> + # >> + # Lastly, aking matters worse, in Python2, unknown-8bit doesn't >> + # seem to be supported as an input to make_header, so not only do >> + # we have to detect dodgy headers, we have to fix them ourselves. >> + # >> + # We solve this by catching any Unicode errors, and then manually >> + # handling any interesting headers. > > I'm going to move all the above into the docstring, if that's OK by > you.
I don't mind, but my understanding was that the docstring described the function, rather than the implementation details, hence why I had it in a comment originally. >> + # python2 - no support in make_header for unknown-8bit >> + # We should do unknown-8bit coding ourselves. >> + # For now, we're just going to replace any dubious >> + # chars with ?. >> + # >> + # TODO: replace it with a proper QP unknown-8bit codec. > > How could we resolve this TODO in the future? > I think we use email.charset. Possibly we can copy some of the Py3 code that implements unknown-8bit. It all looks quite finnicky. Also, any time we get a message with an invalid header and we're relying on the patchwork header display to figure it out is an edge case on top of another edge case, so I didn't think it worth much more time. >> + # on Py2, we want to do unicode(), on Py3, str(). >> + # That gets us the decoded, un-wrapped header. >> + if six.PY2: >> + header_str = unicode(sane_header) >> + else: >> + header_str = str(sane_header) > > nit: this looks like something we could use the 'six.u()' function for? > > https://pythonhosted.org/six/#six.u > I thought this too. Then I looked at the code for django.utils.six: https://github.com/django/django/blob/master/django/utils/six.py#L646 def u(s): return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape") That won't work because we're passing in a header class to u(). The header class, as far as I know, doesn't provide .replace(), and even if it did it's not clear we want to call it. >> + strings = [('%s: %s' % (key, header.encode())) > > How come '.encode' is suitable here, yet we need to call 'unicode'/'str' > in the 'clean_header' function? Because the header class overrides .encode(), __unicode__ and __str__, and they do different things. For headers: - str (py3) and unicode (py2) provides a non-line-wrapped version of the header, in unicode - that is, not in quoted-printable or base64 7-bit form. - encode provides a line-wrapped, 7-bit encoded form of the header. For find_headers, we're creating the headers that are displayed when you click on the 'show' link on the patch display, under Details, and just beneath the message id and state. These are supposed to be fully encoded and line wrapped: they're supposed to look like mail headers. For clean_header, we're looking for the human readable form: decoded From quoted printable or base64, converted from whatever charset they were in, and not wrapped. Hence the code. Admittedly this is not super clear, but I blame the API of the email.Header module. Regards, Daniel
signature.asc
Description: PGP signature
_______________________________________________ Patchwork mailing list [email protected] https://lists.ozlabs.org/listinfo/patchwork
