[PATCH] parser: leniently parse headers as UTF-8

Daniel Axtens Mon, 19 Sep 2016 08:10:26 -0700

If a header contains a non-ASCII character, parsing fails,
even on Python 2.7.

Try to decode headers as UTF-8, but if that fails, replace the
offending bytes with a character marking that decoding failed.
See:
https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

This is handy for mails with malformed headers containing weird
bytes.
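
As a rough illustration of the behaviour described above (a Python 3
sketch, not part of this patch; the sample header bytes are made up),
strict decoding rejects a malformed value, while errors='replace'
substitutes U+FFFD and carries on:

```python
from email.header import Header

# Hypothetical header value: 0xe9 is Latin-1 'e acute', which is not
# a valid byte sequence in UTF-8.
raw = b'Caf\xe9 menu'

# Strict decoding (the default) refuses the malformed byte.
try:
    Header(raw, charset='utf-8', header_name='Subject')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors='replace' the bad byte becomes U+FFFD and parsing continues.
lenient = Header(raw, charset='utf-8', errors='replace',
                 header_name='Subject', continuation_ws='\t')
decoded = str(lenient)
```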

Reported-by: Thomas Monjalon <[email protected]>
Signed-off-by: Daniel Axtens <[email protected]>

---

Many thanks to Thomas for his help debugging this.

Happy to bikeshed whether we want 'replace' or perhaps
'backslashreplace'. Not keen on 'ignore'; it has an interesting
security history - but willing to entertain convincing arguments.
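
For reference, a small Python 3 sketch of how the three handlers differ
on the same malformed input (the sample bytes are made up, and
'backslashreplace' only works for decoding on Python 3.5 and later).
The last case shows one commonly cited concern with 'ignore', which may
or may not be the history alluded to above: silently dropping bytes can
turn a harmless-looking string into something else.

```python
# Hypothetical input: 0xe9 is not valid UTF-8 on its own.
raw = b'Caf\xe9'

# 'replace' marks the spot with U+FFFD REPLACEMENT CHARACTER.
replaced = raw.decode('utf-8', errors='replace')

# 'backslashreplace' keeps the original byte value visible as an escape.
escaped = raw.decode('utf-8', errors='backslashreplace')

# 'ignore' drops the byte without leaving any trace.
ignored = raw.decode('utf-8', errors='ignore')

# Dropping bytes can reassemble text that an earlier filter, working on
# the raw bytes, would not have matched.
smuggled = b'<scr\xffipt>'.decode('utf-8', errors='ignore')
```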

This should probably go to a stable branch too. We'll need to start
some discussion about how to handle bug fixes for people not running
git mainline (like ozlabs.org and kernel.org).

Tests to prevent this from recurring will follow, as will Python 3
patches.
---
 patchwork/parser.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/patchwork/parser.py b/patchwork/parser.py
index 1805df8cda7f..d3f55634f530 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -157,6 +157,7 @@ def find_date(mail):
 def find_headers(mail):
     return reduce(operator.__concat__,
                   ['%s: %s\n' % (k, Header(v, header_name=k,
+                                           charset='utf-8', errors='replace',
                                            continuation_ws='\t').encode())
                    for (k, v) in list(mail.items())])
 
-- 
2.7.4

_______________________________________________
Patchwork mailing list
[email protected]
https://lists.ozlabs.org/listinfo/patchwork
