Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On August 25, 2005 at 7:34PM +0900, tats (at vega.ocn.ne.jp) wrote:

> To fix this bug, Nedko Arnaudov revised the patch, and I sorted out
> and revised it. I can now recommend the attached patch.
>
> * feedparser.py (_sync_author_detail): Replace '<>' with ''.
> * rss2email.py (header7bit): Use email.Header instead of mimify.
> * rss2email.py (header7bit_ifnonatom): New function.
> * rss2email.py (run): Encode `From:' with header7bit_ifnonatom(),
>   and don't encode `To:'.
> * rss2email.py (run): Insert `Mime-Version:' and
>   `Content-Transfer-Encoding:'.

I've revised the patch again. Please use the attached patch in place of
the previous one. (Sorry, the patch in my previous mail is broken.)

The attached patch uses us-ascii instead of utf-8 for header fields and
body whenever they contain only ASCII characters.

* feedparser.py (_sync_author_detail): Replace '<>' with ''.
* rss2email.py (header7bit): Use email.Header instead of mimify.
* rss2email.py (header7bit_phrase): New function.
* rss2email.py (run): Encode `From:' with header7bit_phrase(), and
  don't encode `To:'.
* rss2email.py (run): Insert `MIME-Version:' and
  `Content-Transfer-Encoding:'.
* rss2email.py (run): Set charset to us-ascii or utf-8.

-- 
Tatsuya Kinoshita

--- rss2email-2.55-1/feedparser.py
+++ rss2email-2.55/feedparser.py
@@ -811,6 +811,7 @@
             # probably a better way to do the following, but it passes all the tests
             author = author.replace(email, '')
             author = author.replace('()', '')
+            author = author.replace('<>', '')
             author = author.strip()
             if author and (author[0] == '('):
                 author = author[1:]
--- rss2email-2.55-1/rss2email.py
+++ rss2email-2.55/rss2email.py
@@ -107,6 +107,8 @@
 for e in ['error', 'gaierror']:
     if hasattr(socket, e): socket_errors.append(getattr(socket, e))
 import mimify; from StringIO import StringIO as SIO; mimify.CHARSET = 'utf-8'
+from email.Header import Header
+import re
 if SMTP_SEND: import smtplib; smtpserver = smtplib.SMTP(SMTP_SERVER)
 else: smtpserver = None
@@ -135,13 +137,27 @@
     """Quote names in email according to RFC822."""
     return '"' + unu(s).replace("\\", "\\\\").replace('"', '\\"') + '"'

+nonascii = re.compile('[^\000-\177]')
+nonatom = re.compile('[^a-zA-Z0-9\011\012\015\040\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~]') # ref. RFC2822, atom. comment is not supported
+
 def header7bit(s):
     """QP_CORRUPT headers."""
-    #return mimify.mime_encode_header(s + ' ')[:-1]
-    # XXX due to mime_encode_header bug
-    import re
-    p = re.compile('=\n([^ \t])');
-    return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
+    charset = 'us-ascii'
+    if nonascii.search(s):
+        charset = 'utf-8'
+    h = Header(s, charset, 50)
+    return h.encode()
+
+def header7bit_phrase(s):
+    """QP_CORRUPT headers for phrase."""
+    if nonascii.search(s):
+        charset = 'utf-8'
+    else:
+        charset = 'us-ascii'
+    if nonatom.search(s):
+        s = quote822(s)
+    h = Header(s, charset, 50)
+    return h.encode()

 ### Parsing Utilities ###
@@ -405,12 +421,13 @@
             from_addr = unu(getEmail(r.feed, entry))

             message = (
-            "From: " + quote822(header7bit(getName(r, entry))) + " <"+from_addr+">" +
-            "\nTo: " + header7bit(unu(f.to or default_to)) + # set a default email!
+            "From: " + header7bit_phrase(unu(getName(r, entry))) + " <"+from_addr+">" +
+            "\nTo: " + unu(f.to or default_to) + # set a default email!
             "\nSubject: " + header7bit(title) +
             "\nDate: " + time.strftime("%a, %d %b %Y %H:%M:%S -0000", datetime) +
             "\nUser-Agent: rss2email" + # really should be X-Mailer
             BONUS_HEADER +
+            "\nMIME-Version: 1.0" +
             "\nContent-Type: ") # but backwards-compatibility

             if ishtml(content):
@@ -425,7 +442,11 @@
                 message += "text/plain"
                 content = unu(content).strip() + "\n\nURL: "+link

-            message += '; charset=utf-8\n\n' + content + "\n"
+            if nonascii.search(content):
+                message += '; charset=utf-8\nContent-Transfer-Encoding: 8bit'
+            else:
+                message += '; charset=us-ascii\nContent-Transfer-Encoding: 7bit'
+
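For readers on Python 3, where the module is spelled `email.header` and neither `mimify` nor rss2email's `unu()` helper is available, the two helpers from the patch might be sketched roughly as follows. This is an illustration under those assumptions, not the code that was applied to rss2email: the `NON_ATOM` pattern is written from RFC 2822 `atext` plus whitespace, and the sender name and address are made up.

```python
import re
from email.header import Header

# Any character outside 7-bit ASCII.
NON_ASCII = re.compile(r'[^\x00-\x7f]')
# Anything that is not RFC 2822 "atext", tab, LF, CR or space
# (assumed to mirror the patch's `nonatom` pattern).
NON_ATOM = re.compile(r"[^a-zA-Z0-9\t\n\r !#$%&'*+\-/=?^_`{|}~]")


def quote822(s):
    """Wrap s in double quotes, escaping backslashes and double quotes."""
    return '"' + s.replace('\\', '\\\\').replace('"', '\\"') + '"'


def header7bit(s):
    """Encode unstructured header text, using utf-8 only when needed."""
    charset = 'utf-8' if NON_ASCII.search(s) else 'us-ascii'
    return Header(s, charset, 50).encode()


def header7bit_phrase(s):
    """Encode a display name, quoting it first if it is not a plain atom."""
    charset = 'utf-8' if NON_ASCII.search(s) else 'us-ascii'
    if NON_ATOM.search(s):
        s = quote822(s)
    return Header(s, charset, 50).encode()


# Hypothetical sender and title, for illustration only.
from_line = 'From: ' + header7bit_phrase('Hess, Joey') + ' <joey@example.org>'
subject_line = 'Subject: ' + header7bit(('caf\u00e9 ' * 20).strip())
print(from_line)
print(subject_line)
```

Only the display name goes through the encoder; the address is appended afterwards, which is the same separation the patch makes for `From:`.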
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
reopen 320185
tags 320185 + patch
thanks

On August 20, 2005 at 1:22PM +0900, tats (at vega.ocn.ne.jp) wrote:

>>> I actually encountered raw non-ASCII bytes in From field.
>>
>> http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch
>>
>> This patch uses email/Header.py instead of mimify.py, and it fixes
>> both the newline bug and the raw non-ASCII bug.
>
> Oops, the above patch is not fine in `From:' and `To:'.
>
> After applying the patch, `From:' has encoded text between `"' and
> `"', and `To:' encodes the email addresses incorrectly.
>
> email/Header.py seems to be better than mimify.py. However, to use
> email/Header.py, we should have more modification in rss2email.py.

To fix this bug, Nedko Arnaudov revised the patch, and I tidied it up
and revised it further. I can now recommend the attached patch.

* feedparser.py (_sync_author_detail): Replace '<>' with ''.
* rss2email.py (header7bit): Use email.Header instead of mimify.
* rss2email.py (header7bit_ifnonatom): New function.
* rss2email.py (run): Encode `From:' with header7bit_ifnonatom(), and
  don't encode `To:'.
* rss2email.py (run): Insert `Mime-Version:' and
  `Content-Transfer-Encoding:'.

-- 
Tatsuya Kinoshita

--- rss2email-2.55-1/feedparser.py
+++ rss2email-2.55/feedparser.py
@@ -811,6 +811,7 @@
             # probably a better way to do the following, but it passes all the tests
             author = author.replace(email, '')
             author = author.replace('()', '')
+            author = author.replace('<>', '')
             author = author.strip()
             if author and (author[0] == '('):
                 author = author[1:]
--- rss2email-2.55-1/rss2email.py
+++ rss2email-2.55/rss2email.py
@@ -107,6 +107,8 @@
 for e in ['error', 'gaierror']:
     if hasattr(socket, e): socket_errors.append(getattr(socket, e))
 import mimify; from StringIO import StringIO as SIO; mimify.CHARSET = 'utf-8'
+from email.Header import Header
+import re
 if SMTP_SEND: import smtplib; smtpserver = smtplib.SMTP(SMTP_SERVER)
 else: smtpserver = None
@@ -135,13 +137,24 @@
     """Quote names in email according to RFC822."""
     return '"' + unu(s).replace("\\", "\\\\").replace('"', '\\"') + '"'

+nonascii = re.compile('[^\000-\177]')
+nonatom = re.compile('[^a-zA-Z0-9\012\015\040\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~]') # ref. RFC2822, atom. comment is not supported
+
 def header7bit(s):
     """QP_CORRUPT headers."""
-    #return mimify.mime_encode_header(s + ' ')[:-1]
-    # XXX due to mime_encode_header bug
-    import re
-    p = re.compile('=\n([^ \t])');
-    return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
+    charset = 'us-ascii'
+    if nonascii.search(s):
+        charset = 'utf-8'
+    h = Header(s, charset, 50)
+    return h.encode()
+
+def header7bit_ifnonatom(s):
+    """QP_CORRUPT headers if non-atom character exists."""
+    charset = 'us-ascii'
+    if nonatom.search(s):
+        charset = 'utf-8'
+    h = Header(s, charset, 50)
+    return h.encode()

 ### Parsing Utilities ###
@@ -405,12 +418,14 @@
             from_addr = unu(getEmail(r.feed, entry))

             message = (
-            "From: " + quote822(header7bit(getName(r, entry))) + " <"+from_addr+">" +
-            "\nTo: " + header7bit(unu(f.to or default_to)) + # set a default email!
+            "From: " + header7bit_ifnonatom(unu(getName(r, entry))) + " <"+from_addr+">" +
+            "\nTo: " + unu(f.to or default_to) + # set a default email!
             "\nSubject: " + header7bit(title) +
             "\nDate: " + time.strftime("%a, %d %b %Y %H:%M:%S -0000", datetime) +
             "\nUser-Agent: rss2email" + # really should be X-Mailer
             BONUS_HEADER +
+            "\nMime-Version: 1.0" +
+            "\nContent-Transfer-Encoding: 8bit" +
             "\nContent-Type: ") # but backwards-compatibility

             if ishtml(content):
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On August 18, 2005 at 9:24PM +0900, tats (at vega.ocn.ne.jp) wrote:

>> I actually encountered raw non-ASCII bytes in From field.
>
> I google'd rss2email mimify, then I found a solution of this problem.
>
> Please consider applying the following patch instead of my patch.
>
> http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch
>
> This patch uses email/Header.py instead of mimify.py, and it fixes
> both the newline bug and the raw non-ASCII bug.

Oops, the above patch does not handle `From:' and `To:' correctly.

After applying the patch, `From:' has encoded text between `"' and `"',
and `To:' encodes the email addresses incorrectly.

email/Header.py seems to be better than mimify.py. However, to use
email/Header.py, rss2email.py needs further modification.

-- 
Tatsuya Kinoshita
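The `To:' problem described here arises when a whole mailbox, address included, is passed through the header encoder. As a rough illustration in present-day Python 3 (not the approach taken in this thread), the standard library's `email.utils.formataddr` encodes only the display name and leaves the address untouched. The helper name `format_mailbox` and the names and addresses below are placeholders.

```python
from email.utils import formataddr, parseaddr


def format_mailbox(name, address):
    """Return 'name <address>' with only the display name RFC 2047 encoded."""
    return formataddr((name, address), charset='utf-8')


# Placeholder values, not taken from any real feed.
ascii_from = format_mailbox('Hess, Joey', 'joey@example.org')
utf8_from = format_mailbox('Jos\u00e9 Example', 'jose@example.org')

print('From: ' + ascii_from)
print('From: ' + utf8_from)
# The address survives unchanged and can be parsed back out.
print(parseaddr(utf8_from)[1])
```

An ASCII name containing a comma is quoted rather than encoded, while a non-ASCII name becomes an encoded word placed outside any quotes.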
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
Hi Joey,

On July 29, 2005 at 6:02AM +0900, tats (at vega.ocn.ne.jp) wrote:

> I actually encountered raw non-ASCII bytes in From field.

I googled for "rss2email mimify" and found a solution to this problem.

Please consider applying the following patch instead of my patch.

http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch

This patch uses email/Header.py instead of mimify.py, and it fixes both
the newline bug and the raw non-ASCII bug.

Thanks,
-- 
Tatsuya Kinoshita
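To illustrate the folding behaviour, here is a small sketch of how the header class handles a long non-ASCII word with no spaces. It uses Python 3's `email.header.Header`, the successor of the `email/Header.py` module mentioned above, so the exact output may differ from what Python 2.3 or 2.4 produced. The 50-column limit is chosen only for the example.

```python
from email.header import Header, decode_header, make_header

# The title from the bug report: one non-ASCII letter followed by
# 80 digits with no spaces.
title = '\u00e1' + '1234567890' * 8

# Fold at 50 columns; the long word is split into several encoded words,
# one per physical line, instead of being cut by a bare "=\n".
encoded = Header(title, 'utf-8', maxlinelen=50).encode()
print(encoded)

# Decoding the folded header gives back the original title.
decoded = str(make_header(decode_header(encoded)))
print(decoded == title)
```

Each continuation line starts with a space and carries a complete `=?utf-8?...?=` word, so a mail reader can rejoin the pieces.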
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
Tatsuya Kinoshita wrote:
> Package: rss2email
> Version: 1:2.54-6
> Severity: normal
>
> I've tried using rss2email and found a bug.
>
> A Subject field is encoded incorrectly if the RSS feed contains
> non-ASCII characters in the title and the word is too long.
>
> For instance,
>
> <title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>
>
> is converted to
>
> Subject: =?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
> 901234567890?=
>
> It seems that =\n is inserted incorrectly.
>
> This bug might be in Python's mimify.py. Anyway, to prevent this
> problem, I've applied the following patch to rss2email.py.
>
> --- rss2email.py.orig
> +++ rss2email.py
> @@ -137,7 +137,11 @@
>
>  def header7bit(s):
>      """QP_CORRUPT headers."""
> -    return mimify.mime_encode_header(s + ' ')[:-1]
> +    #return mimify.mime_encode_header(s + ' ')[:-1]
> +    # XXX due to mime_encode_header bug
> +    import re
> +    p = re.compile('=\n([^ \t])');
> +    return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
>
>  ### Parsing Utilities ###
>
> This problem typically appears in Japanese documents, because
> Japanese multibyte words are not separated by spaces.
>
> Thanks,

I've actually seen this once or twice with English feeds, but never took
the time to track it down.

-- 
see shy jo
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On July 27, 2005 at 11:07PM -0400, joeyh (at debian.org) wrote:

> I'd like to pass this bug on to python besides working around it, but
> I can't seem to reproduce it with a simple test case like this:
>
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> import mimify
> print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")

To reproduce the problem, add " ", "foobar" or so to the string, as
follows:

print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

-- 
Tatsuya Kinoshita
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
Tatsuya Kinoshita wrote:
> To reproduce the problem, add " ", "foobar" or so to the string, as
> follows:
>
> print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

Hmm, still not working:

[EMAIL PROTECTED]:~python foo
=?ISO-8859-1?Q?=E11234567890123456789012345678901234567890123456789012?= 3456789012345678901234567890 foobar

Can you attach a test case?

-- 
see shy jo
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On July 28, 2005 at 9:46AM -0400, joeyh (at debian.org) wrote:

> Hmm, still not working:
>
> [EMAIL PROTECTED]:~python foo
> =?ISO-8859-1?Q?=E11234567890123456789012345678901234567890123456789012?= 3456789012345678901234567890 foobar
>
> Can you attach a test case?

Hmm, I tried the following code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import mimify; mimify.CHARSET = 'utf-8'
print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 foobar")

and got these results:

$ python2.3 -V
Python 2.3.5
$ python2.3 foo.py
=?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?= foobar
$ python2.4 -V
Python 2.4.1
$ python2.4 foo.py
=?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?= foobar

-- 
Tatsuya Kinoshita
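Since the symptom is purely textual, a regression check for it does not need `mimify` or a feed. The following sketch (Python 3) flags pieces of an RFC 2047 encoded word that have been cut apart by a line break, using the output shown above as input. The function name and the regular expression are this example's own and are not part of rss2email or the standard library.

```python
import re

# A complete encoded word: =?charset?encoding?text?= with no whitespace inside.
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')


def broken_fragments(header):
    """Return tokens that carry encoded-word delimiters but are not
    complete encoded words."""
    leftovers = ENCODED_WORD.sub(' ', header)
    return [tok for tok in leftovers.split() if '=?' in tok or '?=' in tok]


digits = '1234567890' * 8
# Output reported in this thread: "=\n" lands inside the encoded word.
reported = '=?utf-8?Q?=C3=A1' + digits[:68] + '=\n' + digits[68:] + '?= foobar'
# The same header with the encoded word left in one piece.
intact = '=?utf-8?Q?=C3=A1' + digits + '?= foobar'

print(broken_fragments(reported))
print(broken_fragments(intact))
```

The reported output yields two dangling fragments, one on each side of the stray newline, while the intact header yields none.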
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On July 28, 2005 at 6:54PM +0900, tats (at vega.ocn.ne.jp) wrote:

>> print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")
>
> To reproduce the problem, add " ", "foobar" or so to the string, as
> follows:
>
> print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

BTW, the output of the former code contains a raw non-ASCII character,
which is invalid.

I also found the problem in the From field with rss2email. Even if " "
is added, the output of the following code contains a raw non-ASCII
character.

print mimify.mime_encode_header("\"012á34 56789\" ")

-- 
Tatsuya Kinoshita
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
On July 28, 2005 at 10:36PM +0900, tats (at vega.ocn.ne.jp) wrote:

> I also found the problem in From field with rss2email. Even if " "
> is added, the value of the following code contains non-ASCII raw
> character.
>
> print mimify.mime_encode_header("\"012á34 56789\" ")

Oops, I misunderstood. Please ignore the above.

I actually encountered raw non-ASCII bytes in the From field. However,
`"' is added by quote822() in rss2email.py, so the cause of the problem
must be somewhere else.

-- 
Tatsuya Kinoshita
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
Package: rss2email
Version: 1:2.54-6
Severity: normal

I've tried using rss2email and found a bug.

A Subject field is encoded incorrectly if the RSS feed contains
non-ASCII characters in the title and the word is too long.

For instance,

<title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>

is converted to

Subject: =?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?=

It seems that =\n is inserted incorrectly.

This bug might be in Python's mimify.py. Anyway, to prevent this
problem, I've applied the following patch to rss2email.py.

--- rss2email.py.orig
+++ rss2email.py
@@ -137,7 +137,11 @@

 def header7bit(s):
     """QP_CORRUPT headers."""
-    return mimify.mime_encode_header(s + ' ')[:-1]
+    #return mimify.mime_encode_header(s + ' ')[:-1]
+    # XXX due to mime_encode_header bug
+    import re
+    p = re.compile('=\n([^ \t])');
+    return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])

 ### Parsing Utilities ###

This problem typically appears in Japanese documents, because Japanese
multibyte words are not separated by spaces.

Thanks,
-- 
Tatsuya Kinoshita
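The effect of this workaround can be checked in isolation. The sketch below is written for Python 3, which has no `mimify`, so the broken header value is built by hand from the output quoted in the report. It applies the same regular expression as the patch; the helper name `strip_soft_breaks` is made up for the example.

```python
import re

digits = '1234567890' * 8
# Header value as reported: "=\n" appears inside the encoded word.
broken = '=?utf-8?Q?=C3=A1' + digits[:68] + '=\n' + digits[68:] + '?='

# Same pattern as the patch: "=" plus newline followed by a character that
# is neither a space nor a tab, i.e. a break that is not header folding.
soft_break = re.compile('=\n([^ \t])')


def strip_soft_breaks(value):
    """Drop '=' + newline pairs that are not followed by folding whitespace."""
    return soft_break.sub(r'\1', value)


fixed = strip_soft_breaks(broken)
print(fixed)
print(len(fixed))
```

The result is a single encoded word of 98 characters. RFC 2047 limits an encoded word to 75 characters, so the workaround removes the stray line break but still leaves an over-long word.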
Bug#320185: rss2email: non-ASCII long header is encoded incorrectly
Tatsuya Kinoshita wrote:
> I've tried using rss2email and found a bug.
>
> A Subject field is encoded incorrectly if the RSS feed contains
> non-ASCII characters in the title and the word is too long.
>
> For instance,
>
> <title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>
>
> is converted to
>
> Subject: =?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
> 901234567890?=
>
> It seems that =\n is inserted incorrectly.

I'd like to pass this bug on to Python as well as working around it, but
I can't seem to reproduce it with a simple test case like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import mimify
print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")

Do you have any ideas for a simple test case that does not involve
running rss2email on a feed?

-- 
see shy jo