New submission from Tony Nelson <tony_nel...@users.sourceforge.net>:

feedparser.py does not parse mixed newlines properly. NLCRE_eol, which is used to search for the various newlines at End Of Line, uses $ to match the end of string, but $ also matches just before a final '\n' at the end of the string, due to a wise long-ago patch by the Effbot. Given a line ending in '\r\n\n', this causes feedparser to match the '\r\n' rather than the final '\n', and then to remove the last two characters (the '\n\n'), leaving '\r', thus eating up a line.

Such mixed line endings can occur if a message with CRLF line endings is parsed, written out, and then parsed again.

When explicitly searching for various newlines, the \Z end-of-string marker should be used instead. There are two improper uses of $ in feedparser.py. I don't see any others in the email package.

    NLCRE_eol = re.compile('(\r\n|\r|\n)$')

should be:

    NLCRE_eol = re.compile('(\r\n|\r|\n)\Z')

and boundary_re also needs the fix.

I can write a test. Where exactly should it be put?

----------
components: Library (Lib)
files: feedparser_crlflf.patch
keywords: patch
messages: 84595
nosy: barry, tony_nelson
severity: normal
status: open
title: email feedparser.py CRLFLF bug: $ vs \Z
versions: Python 2.6

Added file: http://bugs.python.org/file13476/feedparser_crlflf.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5610>
_______________________________________
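
The difference between the two anchors can be checked in isolation. The following is a minimal sketch in Python 3, not the code from feedparser.py: the two patterns are the ones quoted in the report, written here as raw strings, and strip_eol is a hypothetical helper that assumes the matched newline is removed by length from the end of the line, as the report describes.

```python
import re

# The two patterns from the report, as raw strings for Python 3.
eol_dollar = re.compile(r'(\r\n|\r|\n)$')   # $ also matches before a final '\n'
eol_z = re.compile(r'(\r\n|\r|\n)\Z')       # \Z matches only at the very end


def strip_eol(line, pattern):
    """Hypothetical helper: cut len(match) characters off the end of line."""
    mo = pattern.search(line)
    if mo is None:
        return line
    return line[:-len(mo.group(0))]


line = 'text\r\n\n'  # CRLF followed by LF, the mixed ending from the report

# With $, the search stops at '\r\n' because $ succeeds before the last '\n'.
print(repr(eol_dollar.search(line).group(0)))  # '\r\n'
print(repr(strip_eol(line, eol_dollar)))       # 'text\r'

# With \Z, only the final '\n' is matched and removed.
print(repr(eol_z.search(line).group(0)))       # '\n'
print(repr(strip_eol(line, eol_z)))            # 'text\r\n'
```

With $, two characters are cut and a stray '\r' is left behind; with \Z, only the final newline is removed and the CRLF before it survives.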