Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-08-26 Thread Tatsuya Kinoshita
On August 25, 2005 at 7:34PM +0900,
tats (at vega.ocn.ne.jp) wrote:

 To fix this bug, Nedko Arnaudov revised the patch, and I sorted out
 and revised it.

 I can now recommend the attached patch.

 * feedparser.py (_sync_author_detail): Replace '<>' with ''.
 * rss2email.py (header7bit): Use email.Header instead of mimify.
 * rss2email.py (header7bit_ifnonatom): New function.
 * rss2email.py (run): Encode `From:' with header7bit_ifnonatom(), and don't
   encode `To:'.
 * rss2email.py (run): Insert `Mime-Version:' and `Content-Transfer-Encoding:'.

I've revised the patch.  Please replace it with the attached patch.
(Sorry, the patch of the previous mail is broken.)

The attached patch tries to use us-ascii instead of utf-8 for
header fields and body.
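
For the body, the charset rule above can be sketched as follows.  This is
a minimal Python 3 illustration, not the code from the patch: the names
body_charset() and content_type_lines() are hypothetical, and the patch
itself does the same test inline in run().

```python
import re

# Matches any character outside the 7-bit ASCII range
# (same idea as the `nonascii' pattern in the patch).
NONASCII = re.compile(r'[^\x00-\x7f]')

def body_charset(content):
    """Return (charset, content-transfer-encoding) for a text body.

    Hypothetical helper: pure ASCII bodies are labelled us-ascii/7bit,
    anything else utf-8/8bit.
    """
    if NONASCII.search(content):
        return 'utf-8', '8bit'
    return 'us-ascii', '7bit'

def content_type_lines(subtype, content):
    """Build the Content-Type and Content-Transfer-Encoding header lines."""
    charset, cte = body_charset(content)
    return ('Content-Type: text/%s; charset=%s\n'
            'Content-Transfer-Encoding: %s' % (subtype, charset, cte))
```

Labelling a pure ASCII body as us-ascii keeps the message readable by
clients that do not understand utf-8 at all.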

* feedparser.py (_sync_author_detail): Replace '<>' with ''.
* rss2email.py (header7bit): Use email.Header instead of mimify.
* rss2email.py (header7bit_phrase): New function.
* rss2email.py (run): Encode `From:' with header7bit_phrase(), and don't
  encode `To:'.
* rss2email.py (run): Insert `MIME-Version:' and `Content-Transfer-Encoding:'.
* rss2email.py (run): Set charset to us-ascii or utf-8.
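
The two header helpers listed above might look roughly like this in
Python 3, where the module is spelled email.header.  This is a sketch
under assumptions: the NONATOM pattern is a simplified reading of the RFC
2822 atom characters, quote822() is re-implemented here for
self-containment, and quoting only the pure ASCII case is one possible
reading of the patch, whose indentation was lost in this archive.

```python
import re
from email.header import Header

# Any character outside 7-bit ASCII.
NONASCII = re.compile(r'[^\x00-\x7f]')
# Anything that is neither RFC 2822 atext nor whitespace (simplified;
# comments are not supported, as in the patch).
NONATOM = re.compile(r"[^A-Za-z0-9\t\n\r !#$%&'*+\-/=?^_`{|}~]")

def quote822(s):
    """Quote a display name as an RFC 822 quoted-string."""
    return '"' + s.replace('\\', '\\\\').replace('"', '\\"') + '"'

def header7bit(s):
    """Encode an unstructured header value (e.g. Subject) as 7-bit text."""
    charset = 'utf-8' if NONASCII.search(s) else 'us-ascii'
    # 50 is the maximum line length used in the patch.
    return Header(s, charset, 50).encode()

def header7bit_phrase(s):
    """Encode a phrase (e.g. the display name in From) as 7-bit text."""
    if NONASCII.search(s):
        # Encoded-words must not be wrapped in a quoted-string.
        return Header(s, 'utf-8', 50).encode()
    if NONATOM.search(s):
        s = quote822(s)
    return Header(s, 'us-ascii', 50).encode()
```

With these definitions, header7bit_phrase('Hess, Joey') gives a quoted
string, while a name containing non-ASCII characters becomes one or more
encoded-words with no surrounding quotes.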

--
Tatsuya Kinoshita
--- rss2email-2.55-1/feedparser.py
+++ rss2email-2.55/feedparser.py
@@ -811,6 +811,7 @@
 # probably a better way to do the following, but it passes all the tests
 author = author.replace(email, '')
 author = author.replace('()', '')
+author = author.replace('<>', '')
 author = author.strip()
 if author and (author[0] == '('):
 author = author[1:]
--- rss2email-2.55-1/rss2email.py
+++ rss2email-2.55/rss2email.py
@@ -107,6 +107,8 @@
 for e in ['error', 'gaierror']:
if hasattr(socket, e): socket_errors.append(getattr(socket, e))
 import mimify; from StringIO import StringIO as SIO; mimify.CHARSET = 'utf-8'
+from email.Header import Header
+import re
 if SMTP_SEND: import smtplib; smtpserver = smtplib.SMTP(SMTP_SERVER)
 else: smtpserver = None
 
@@ -135,13 +137,27 @@
Quote names in email according to RFC822.
return '' + unu(s).replace(\\, ).replace('', '\\') + ''
 
+nonascii = re.compile('[^\000-\177]')
+nonatom = re.compile('[^a-zA-Z0-9\011\012\015\040\!\#\$\%\\'\*\+\-\/\=\?\^\_\`\{\|\}\~]') # ref. RFC2822, atom.  comment is not supported
+
 def header7bit(s):
QP_CORRUPT headers.
-   #return mimify.mime_encode_header(s + ' ')[:-1]
-   # XXX due to mime_encode_header bug
-   import re
-   p = re.compile('=\n([^ \t])');
-   return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
+   charset = 'us-ascii'
+   if nonascii.search(s):
+   charset = 'utf-8'
+   h = Header(s, charset, 50)
+   return h.encode()
+
+def header7bit_phrase(s):
+   QP_CORRUPT headers for phrase.
+   if nonascii.search(s):
+   charset = 'utf-8'
+   else:
+   charset = 'us-ascii'
+   if nonatom.search(s):
+   s = quote822(s)
+   h = Header(s, charset, 50)
+   return h.encode()
 
 ### Parsing Utilities ###
 
@@ -405,12 +421,13 @@
from_addr = unu(getEmail(r.feed, entry))
 
message = (
-   From:  + quote822(header7bit(getName(r, entry))) +  +from_addr+ +
-   \nTo:  + header7bit(unu(f.to or default_to)) + # set a default email!
+   From:  + header7bit_phrase(unu(getName(r, entry))) +  +from_addr+ +
+   \nTo:  + unu(f.to or default_to) + # set a default email!
\nSubject:  + header7bit(title) +
\nDate:  + time.strftime(%a, %d %b %Y %H:%M:%S -, datetime) +
\nUser-Agent: rss2email + # really should be X-Mailer
BONUS_HEADER +
+   \nMIME-Version: 1.0 +
\nContent-Type: ) # but backwards-compatibility

if ishtml(content):
@@ -425,7 +442,11 @@
message += text/plain
content = unu(content).strip() + \n\nURL: +link

-   message += '; charset=utf-8\n\n' + content + \n
+   if nonascii.search(content):
+   message += '; charset=utf-8\nContent-Transfer-Encoding: 8bit'
+   else:
+   message += '; charset=us-ascii\nContent-Transfer-Encoding: 7bit'
+   

Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-08-25 Thread Tatsuya Kinoshita
reopen 320185
tags 320185 + patch
thanks

On August 20, 2005 at 1:22PM +0900,
tats (at vega.ocn.ne.jp) wrote:

   I actually encountered raw non-ASCII bytes in From field.

http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch
 
  This patch uses email/Header.py instead of mimify.py, and it fixes
  both the newline bug and the raw non-ASCII bug.

 Oops, the above patch does not handle `From:' and `To:' correctly.  After
 applying the patch, `From:' has encoded text between `"' and `"',
 and `To:' encodes the email addresses incorrectly.

 email/Header.py seems to be better than mimify.py.  However, to use
 email/Header.py, rss2email.py needs further modification.

To fix this bug, Nedko Arnaudov revised the patch, and I sorted out
and revised it.

I can now recommend the attached patch.

* feedparser.py (_sync_author_detail): Replace '<>' with ''.
* rss2email.py (header7bit): Use email.Header instead of mimify.
* rss2email.py (header7bit_ifnonatom): New function.
* rss2email.py (run): Encode `From:' with header7bit_ifnonatom(), and don't
  encode `To:'.
* rss2email.py (run): Insert `Mime-Version:' and `Content-Transfer-Encoding:'.

--
Tatsuya Kinoshita
--- rss2email-2.55-1/feedparser.py
+++ rss2email-2.55/feedparser.py
@@ -811,6 +811,7 @@
 # probably a better way to do the following, but it passes all the tests
 author = author.replace(email, '')
 author = author.replace('()', '')
+author = author.replace('<>', '')
 author = author.strip()
 if author and (author[0] == '('):
 author = author[1:]
--- rss2email-2.55-1/rss2email.py
+++ rss2email-2.55/rss2email.py
@@ -107,6 +107,8 @@
 for e in ['error', 'gaierror']:
if hasattr(socket, e): socket_errors.append(getattr(socket, e))
 import mimify; from StringIO import StringIO as SIO; mimify.CHARSET = 'utf-8'
+from email.Header import Header
+import re
 if SMTP_SEND: import smtplib; smtpserver = smtplib.SMTP(SMTP_SERVER)
 else: smtpserver = None

@@ -135,13 +137,24 @@
Quote names in email according to RFC822.
return '' + unu(s).replace(\\, ).replace('', '\\') + ''

+nonascii = re.compile('[^\000-\177]')
+nonatom = re.compile('[^a-zA-Z0-9\012\015\040\!\#\$\%\\'\*\+\-\/\=\?\^\_\`\{\|\}\~]') # ref. RFC2822, atom.  comment is not supported
+
 def header7bit(s):
QP_CORRUPT headers.
-   #return mimify.mime_encode_header(s + ' ')[:-1]
-   # XXX due to mime_encode_header bug
-   import re
-   p = re.compile('=\n([^ \t])');
-   return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
+   charset = 'us-ascii'
+   if nonascii.search(s):
+   charset = 'utf-8'
+   h = Header(s, charset, 50)
+   return h.encode()
+
+def header7bit_ifnonatom(s):
+   QP_CORRUPT headers if non-atom character exists.
+   charset = 'us-ascii'
+   if nonatom.search(s):
+   charset = 'utf-8'
+   h = Header(s, charset, 50)
+   return h.encode()

 ### Parsing Utilities ###

@@ -405,12 +418,14 @@
from_addr = unu(getEmail(r.feed, entry))

message = (
-   From:  + quote822(header7bit(getName(r, entry))) +  +from_addr+ +
-   \nTo:  + header7bit(unu(f.to or default_to)) + # set a default email!
+   From:  + header7bit_ifnonatom(unu(getName(r, entry))) +  +from_addr+ +
+   \nTo:  + unu(f.to or default_to) + # set a default email!
\nSubject:  + header7bit(title) +
\nDate:  + time.strftime(%a, %d %b %Y %H:%M:%S -, datetime) +
\nUser-Agent: rss2email + # really should be X-Mailer
BONUS_HEADER +
+   \nMime-Version: 1.0 +
+   \nContent-Transfer-Encoding: 8bit +
\nContent-Type: ) # but backwards-compatibility

if ishtml(content):




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-08-19 Thread Tatsuya Kinoshita
On August 18, 2005 at 9:24PM +0900,
tats (at vega.ocn.ne.jp) wrote:

  I actually encountered raw non-ASCII bytes in From field.

 I googled "rss2email mimify" and found a solution to this
 problem.

 Please consider applying the following patch instead of my patch.

   http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch

 This patch uses email/Header.py instead of mimify.py, and it fixes
 both the newline bug and the raw non-ASCII bug.

Oops, the above patch does not handle `From:' and `To:' correctly.  After
applying the patch, `From:' has encoded text between `"' and `"',
and `To:' encodes the email addresses incorrectly.

email/Header.py seems to be better than mimify.py.  However, to use
email/Header.py, rss2email.py needs further modification.
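
The point about `From:' and `To:' is that only the display name of an
address may be MIME-encoded; the address itself has to stay as it is.
In the Python 3 standard library (not the Python 2 email/Header.py the
thread is about), email.utils.formataddr() already draws that line.  A
small sketch, where from_header() and the example.org addresses are
made up for illustration:

```python
from email.utils import formataddr, parseaddr

def from_header(name, addr):
    """Build a From value, encoding only the display name if needed.

    Hypothetical wrapper around email.utils.formataddr(); the charset
    argument is used only when the name contains non-ASCII characters.
    """
    return formataddr((name, addr), charset='utf-8')

plain = from_header('Joey Hess', 'joeyh@example.org')
quoted = from_header('Hess, Joey', 'joeyh@example.org')
encoded = from_header('Jos\u00e9', 'jose@example.org')

# The address part is left untouched in every case.
addresses = [parseaddr(v)[1] for v in (plain, quoted, encoded)]
```

An ASCII name with special characters is quoted, a non-ASCII name becomes
an encoded-word without quotes, and the part in angle brackets is never
encoded.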

--
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-08-18 Thread Tatsuya Kinoshita
Hi Joey,

On July 29, 2005 at 6:02AM +0900,
tats (at vega.ocn.ne.jp) wrote:

 I actually encountered raw non-ASCII bytes in From field.

I googled "rss2email mimify" and found a solution to this
problem.

Please consider applying the following patch instead of my patch.

  http://nedko.arnaudov.name/soft/rss2email-2.55-folding.patch

This patch uses email/Header.py instead of mimify.py, and it fixes
both the newline bug and the raw non-ASCII bug.

Thanks,
--
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Joey Hess
Tatsuya Kinoshita wrote:
 Package: rss2email
 Version: 1:2.54-6
 Severity: normal
 
 I've tried using rss2email and found a bug.
 
 A Subject field is encoded incorrectly if the RSS feed contains
 non-ASCII characters in the title and the word is too long.
 
 For instance,
 
 <title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>
 
 is converted to
 
 Subject: 
 =?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
 901234567890?=
 
 It seems that =\n is inserted incorrectly.
 
 This bug might be in Python's mimify.py.  Anyway, to prevent this
 problem, I've applied the following patch to rss2email.py.
 
 
 --- rss2email.py.orig
 +++ rss2email.py
 @@ -137,7 +137,11 @@
  
  def header7bit(s):
   QP_CORRUPT headers.
 - return mimify.mime_encode_header(s + ' ')[:-1]
 + #return mimify.mime_encode_header(s + ' ')[:-1]
 + # XXX due to mime_encode_header bug
 + import re
 + p = re.compile('=\n([^ \t])');
 + return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
  
  ### Parsing Utilities ###
  
 
 
 Typically, this problem appears in Japanese documents, because
 Japanese multibyte words are not separated by space characters.

Thanks, I've actually seen this once or twice with English feeds, never
took the time to track it down.

-- 
see shy jo




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Tatsuya Kinoshita
On July 27, 2005 at 11:07PM -0400,
joeyh (at debian.org) wrote:

 I'd like to pass this bug on to python besides working around it, but I
 can't seem to reproduce it with a simple test case like this:
 
 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 import mimify
 print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")

To reproduce the problem, add " ", " foobar" or so to the string,
as follows:

print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

-- 
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Joey Hess
Tatsuya Kinoshita wrote:
 To reproduce the problem, add " ", " foobar" or so to the string,
 as follows:
 
 print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

Hmm, still not working:

[EMAIL PROTECTED]:~python foo 
=?ISO-8859-1?Q?=E11234567890123456789012345678901234567890123456789012?= 
3456789012345678901234567890 foobar

Can you attach a test case?

-- 
see shy jo




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Tatsuya Kinoshita
On July 28, 2005 at 9:46AM -0400,
joeyh (at debian.org) wrote:

 Hmm, still not working:
 
 [EMAIL PROTECTED]:~python foo 
 =?ISO-8859-1?Q?=E11234567890123456789012345678901234567890123456789012?= 
 3456789012345678901234567890 foobar
 
 Can you attach a test case?

Hmm, I tried the following code:

#!/usr/bin/python
# -*- coding: utf-8-*-
import mimify; mimify.CHARSET = 'utf-8'
print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 foobar")

and got the results:

$ python2.3 -V
Python 2.3.5
$ python2.3 foo.py
=?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?= foobar
$ python2.4 -V
Python 2.4.1
$ python2.4 foo.py
=?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?= foobar

-- 
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Tatsuya Kinoshita
On July 28, 2005 at 6:54PM +0900,
tats (at vega.ocn.ne.jp) wrote:

  print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")
 
 To reproduce the problem, add " ", " foobar" or so to the string,
 as follows:
 
 print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890 ")

BTW, the output of the former code contains a raw non-ASCII character,
which is invalid.

I also found the problem in the From field with rss2email.  Even if  
is added, the output of the following code contains a raw non-ASCII character.

print mimify.mime_encode_header(\012á34 56789\ )

-- 
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-28 Thread Tatsuya Kinoshita
On July 28, 2005 at 10:36PM +0900,
tats (at vega.ocn.ne.jp) wrote:

 I also found the problem in the From field with rss2email.  Even if  
 is added, the output of the following code contains a raw non-ASCII character.
 
 print mimify.mime_encode_header(\012á34 56789\ )

Oops, I had a misunderstanding.  Please ignore the above.

I actually encountered raw non-ASCII bytes in the From field.  However,
`"' is added by quote822() in rss2email.py.  So, the cause of the
problem should be elsewhere.

-- 
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-27 Thread Tatsuya Kinoshita
Package: rss2email
Version: 1:2.54-6
Severity: normal

I've tried using rss2email and found a bug.

A Subject field is encoded incorrectly if the RSS feed contains
non-ASCII characters in the title and the word is too long.

For instance,

<title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>

is converted to

Subject: 
=?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
901234567890?=

It seems that =\n is inserted incorrectly.
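
A trailing `=' followed by a newline is the soft line break of a
quoted-printable body; inside a header it leaves a continuation line that
does not start with whitespace, which is not valid folding.  The
following Python 3 sketch shows one way to spot that, together with raw
non-ASCII characters.  The function name header_problems() is
hypothetical and the check is deliberately minimal.

```python
import re

def header_problems(value):
    """Return a list of problems found in an encoded header value.

    Hypothetical checker covering only the two symptoms discussed in
    this report; it is not a full RFC 2047 validator.
    """
    problems = []
    if re.search(r'[^\x00-\x7f]', value):
        problems.append('raw non-ASCII character')
    # A folded header may only continue on a line that starts with
    # a space or a tab.
    if re.search(r'\n(?![ \t])', value):
        problems.append('line break not followed by whitespace')
    return problems

digits = '1234567890' * 8
# The Subject value from the report, rebuilt from its two lines.
broken = '=?utf-8?Q?=C3=A1' + digits[:68] + '=\n' + digits[68:] + '?='
report = header_problems(broken)
```

Here report names the line-break problem only, since the value from the
report is otherwise pure ASCII.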

This bug might be in Python's mimify.py.  Anyway, to prevent this
problem, I've applied the following patch to rss2email.py.


--- rss2email.py.orig
+++ rss2email.py
@@ -137,7 +137,11 @@
 
 def header7bit(s):
QP_CORRUPT headers.
-   return mimify.mime_encode_header(s + ' ')[:-1]
+   #return mimify.mime_encode_header(s + ' ')[:-1]
+   # XXX due to mime_encode_header bug
+   import re
+   p = re.compile('=\n([^ \t])');
+   return p.sub(r'\1', mimify.mime_encode_header(s + ' ')[:-1])
 
 ### Parsing Utilities ###
 


Typically, this problem appears in Japanese documents, because
Japanese multibyte words are not separated by space characters.
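
The effect of the regular expression in the patch above can be tried
without mimify (which no longer exists in Python 3) by applying it to the
broken value from this report.  This is only an illustration in Python 3:
strip_soft_breaks() is a hypothetical name, and the decoding step uses
email.header merely to confirm that the repaired value is readable.

```python
import re
from email.header import decode_header, make_header

# Same pattern as in the patch: a soft line break followed by a
# character that is not folding whitespace.
SOFT_BREAK = re.compile('=\n([^ \t])')

def strip_soft_breaks(encoded):
    """Remove quoted-printable soft line breaks from a header value."""
    return SOFT_BREAK.sub(r'\1', encoded)

digits = '1234567890' * 8
broken = '=?utf-8?Q?=C3=A1' + digits[:68] + '=\n' + digits[68:] + '?='
fixed = strip_soft_breaks(broken)

# Decode the repaired encoded-word back to text.
decoded = str(make_header(decode_header(fixed)))
```

Note that the repaired value is a single encoded-word of 98 characters,
longer than the 75 characters RFC 2047 allows, so this is a workaround
rather than a complete fix.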

Thanks,
-- 
Tatsuya Kinoshita




Bug#320185: rss2email: non-ASCII long header is encoded incorrectly

2005-07-27 Thread Joey Hess
Tatsuya Kinoshita wrote:
 I've tried using rss2email and found a bug.
 
 A Subject field is encoded incorrectly if the RSS feed contains
 non-ASCII characters in the title and the word is too long.
 
 For instance,
 
 <title>á12345678901234567890123456789012345678901234567890123456789012345678901234567890</title>
 
 is converted to
 
 Subject: 
 =?utf-8?Q?=C3=A112345678901234567890123456789012345678901234567890123456789012345678=
 901234567890?=
 
 It seems that =\n is inserted incorrectly.

I'd like to pass this bug on to python besides working around it, but I
can't seem to reproduce it with a simple test case like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import mimify
print mimify.mime_encode_header("á12345678901234567890123456789012345678901234567890123456789012345678901234567890")

Do you have any ideas for a simple test case that does not involve
running rss2email on a feed?

-- 
see shy jo

