Re: Issues when reading mailboxes from alioth-lists.debian.net

2020-08-20 Thread Gregor Riepl
>   File "/usr/lib/python3.8/mailbox.py", line 781, in get_message
>     msg.set_from(from_line[5:].decode('ascii'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 37: ordinal not in range(128)
> Exit code:   1
> 
> IMHO it is a bug if those mailboxes can't be read.  Am I missing
> something?

I would humbly suggest that the Python2 version is silently ignoring an
encoding error here.

RFC 4155 states that the "default" mbox format is strictly 7-bit-safe
(usually ASCII), but mboxes may also be in a "local" format with
different or even mixed encodings.

Since mailbox only seems to accept the "default" mbox format, it's
quite possible that the Alioth mboxes are *not* in this format,
strictly speaking.
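
One way to check this for a given archive is to look at the "From " separator lines as raw bytes. The following is only a minimal sketch and not part of the mailbox module: the function name non_ascii_from_lines is made up for illustration, and it assumes that every line starting with "From " is a message separator.

```python
def non_ascii_from_lines(path):
    """Return (line_number, raw_line) for each "From " line that is not 7-bit clean."""
    hits = []
    with open(path, 'rb') as fh:
        for number, line in enumerate(fh, start=1):
            # Any byte above 0x7F cannot be decoded with the 'ascii' codec.
            if line.startswith(b'From ') and any(byte > 0x7F for byte in line):
                hits.append((number, line.rstrip(b'\r\n')))
    return hits
```

An empty result would suggest the separator lines are 7-bit clean; any hit is a line that the ASCII decode in get_message() would be expected to reject.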



Re: Issues when reading mailboxes from alioth-lists.debian.net

2020-08-19 Thread Andreas Tille
On Wed, Aug 19, 2020 at 10:31:55PM +0530, Nilesh Patra wrote:
> 
> For me the error goes away when I change line 781 in
> /usr/lib/python3.8/mailbox.py to:
> 
> msg.set_from(from_line[5:].decode('utf-8'))
> 
> Maybe this is a minor feature enhancement, since at the moment messages with
> non-ASCII characters don't seem to be decoded.
> Or there's an API change which I'm not aware of.
> 
> Either way this should work as a temporary fix for now. Let me know if this
> doesn't seem right.

BTW, it's a regression compared to Python2.  When calling

python2 test_mbox.py

everything works.

Kind regards

  Andreas.

-- 
http://fam-tille.de



Re: Issues when reading mailboxes from alioth-lists.debian.net

2020-08-19 Thread Nilesh Patra
Hi,

> Traceback (most recent call last):
>   File "./test_mbox.py", line 6, in <module>
>     if mbox_file.items() != []:
>   File "/usr/lib/python3.8/mailbox.py", line 132, in items
>     return list(self.iteritems())
>   File "/usr/lib/python3.8/mailbox.py", line 125, in iteritems
>     value = self[key]
>   File "/usr/lib/python3.8/mailbox.py", line 73, in __getitem__
>     return self.get_message(key)
>   File "/usr/lib/python3.8/mailbox.py", line 781, in get_message
>     msg.set_from(from_line[5:].decode('ascii'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 37: ordinal not in range(128)
> Exit code:   1

For me the error goes away when I change line 781 in
/usr/lib/python3.8/mailbox.py to:

msg.set_from(from_line[5:].decode('utf-8'))

Maybe this is a minor feature enhancement, since at the moment messages with
non-ASCII characters don't seem to be decoded.
Or there's an API change which I'm not aware of.

Either way this should work as a temporary fix for now. Let me know if this
doesn't seem right.
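
An alternative that leaves /usr/lib/python3.8/mailbox.py untouched might be to skip get_message() and parse the raw bytes of each message instead. This is only a sketch under assumptions: the helper name iter_messages_lenient is made up here, and it assumes the "From " line itself is not needed, since mbox.get_bytes() leaves it out by default.

```python
import email
import mailbox


def iter_messages_lenient(path):
    """Yield (key, message) pairs without decoding the "From " line."""
    mbox = mailbox.mbox(path, create=False)
    try:
        for key in mbox.keys():
            # get_bytes() returns headers and body only, so the ASCII decode
            # of the separator line in get_message() is never reached.
            yield key, email.message_from_bytes(mbox.get_bytes(key))
    finally:
        mbox.close()
```

Note that the yielded objects are plain email.message.Message instances rather than mailbox.mboxMessage, so get_from() and the mbox flags are not available on them.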

Kind regards,
Nilesh



Issues when reading mailboxes from alioth-lists.debian.net

2020-08-19 Thread Andreas Tille
Hi,

in the teammetrics project I'm trying to parse mailboxes.  This worked
with Python2, but after porting the code to Python3 I get some encoding
trouble.  One specific problem seems to be an error in the mailbox module.
Please run the attached script test_mbox, which downloads one of the
problematic mbox files from alioth-lists.debian.net and calls the also
attached simple Python3 script, which ends in:

Traceback (most recent call last):
  File "./test_mbox.py", line 6, in <module>
    if mbox_file.items() != []:
  File "/usr/lib/python3.8/mailbox.py", line 132, in items
    return list(self.iteritems())
  File "/usr/lib/python3.8/mailbox.py", line 125, in iteritems
    value = self[key]
  File "/usr/lib/python3.8/mailbox.py", line 73, in __getitem__
    return self.get_message(key)
  File "/usr/lib/python3.8/mailbox.py", line 781, in get_message
    msg.set_from(from_line[5:].decode('ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 37: ordinal not in range(128)
Exit code:   1

IMHO it is a bug if those mailboxes can't be read.  Am I missing
something?
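
For reference, the failing call can be reproduced in isolation. The sketch below uses a made-up "From " line (the real one from the archive is not shown in this thread) and assumes the 0xe2 byte starts a UTF-8 sequence such as a typographic apostrophe, which is a guess based on the byte value alone.

```python
# Hypothetical separator line; the name contains U+2019 encoded as UTF-8.
from_line = b"From O\xe2\x80\x99Brien at example.org  Mon May  4 10:00:00 2020"

try:
    # Same expression as in mailbox.py line 781 of the traceback.
    from_line[5:].decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    bad_byte = exc.object[exc.start]  # integer value of the offending byte

# Decoding with errors='replace' never raises; bad bytes become U+FFFD.
lenient = from_line[5:].decode('ascii', errors='replace')
```

With this input the strict ASCII decode fails on the first non-ASCII byte, while the lenient variant returns a string with replacement characters in place of the undecodable bytes.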

Kind regards

   Andreas.

-- 
http://fam-tille.de

#!/bin/sh

wget https://alioth-lists.debian.net/pipermail/pkg-java-maintainers/2020-May.txt.gz
gunzip 2020-May.txt.gz

python3 test_mbox.py

#!/usr/bin/python3

import mailbox

mbox_file = mailbox.mbox('2020-May.txt')
if mbox_file.items() != []:
    print("OK")