On 23/11/2016 08:58, Francesco Chicchiriccò wrote:
Hi Sebb,
thanks to the latest modifications, I was able to complete the mbox
import successfully, despite the failure on that specific message.
Forgot to report the error message:
Error ''ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)' processing id a01fd40ed66aeaa9580a5becb12296b612e51cbc41efa2da9e8bd2d8@1442210618@<syncope.tirasa.net> msg <[email protected]>
Hence, I was able to isolate that message and put it in the attached
mbox: I hope this helps you fix it.
Thanks for your support.
Regards.
On 22/11/2016 17:23, sebb wrote:
It looks like the problem is that the message header says that the
charset=us-ascii but the text is actually in a different encoding.
The same messages do not cause problems for the archiver.
I think that's because the archiver reads the entire message into a
string first using UTF-8, and the mail is parsed from the string, not
directly from a file.
The string has been cleansed of encoding issues.
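A minimal sketch of the "decode first, then parse the string" route described above. It assumes that reading "into a string first using UTF-8" amounts to decoding the raw bytes with undecodable bytes replaced; that detail, the sample message and the helper name `parse_via_string` are assumptions for illustration, not taken from the archiver source.

```python
import email

# Hypothetical sample: the header claims us-ascii, but the body contains
# a byte (0xE9) that is neither valid ASCII nor valid UTF-8 here.
RAW = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=us-ascii\n"
       b"\n"
       b"caf\xe9\n")


def parse_via_string(raw):
    """Decode to str first (bad bytes become U+FFFD), then parse the str."""
    text = raw.decode("utf-8", errors="replace")
    return email.message_from_string(text)


msg = parse_via_string(RAW)
source = msg.as_string()  # serialises without error for this sample
```

For this sample the payload holds only ordinary characters (including U+FFFD), so there are no raw bytes left for as_string() to re-encode. The original byte is lost, though.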
A workaround for import-mbox is to invoke:
message.set_charset(None)
just before the as_string(), because that skips any encoding of the
payload.
But I don't think that's a long-term solution.
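A rough sketch of that workaround in isolation. The helper name `as_string_without_charset` and the sample message are hypothetical, and working on a copy is my own choice so the caller's message keeps its headers; import-mbox itself may apply the call differently.

```python
import copy
import email

# Hypothetical sample: declared us-ascii, but with a non-ASCII byte in the body.
RAW = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=us-ascii\n"
       b"\n"
       b"caf\xe9\n")


def as_string_without_charset(message):
    """Serialise a copy of `message` after clearing its declared charset."""
    clone = copy.deepcopy(message)  # leave the caller's message untouched
    clone.set_charset(None)         # removes the charset parameter from Content-Type
    return clone.as_string()        # payload is written out without re-encoding


msg = email.message_from_bytes(RAW)
source = as_string_without_charset(msg)
```

Note that this only clears the charset on the top-level message; the parts of a multipart message carry their own Content-Type headers and are not touched by this sketch.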
On 22 November 2016 at 15:47, sebb <[email protected]> wrote:
These are the file names:
00439.982a2ff6189badfe70c2fe3c972466a2
02472.5c879dd55c3d4171e1787e8529bbd7e1
On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
OK, I've added a basic error report.
Note: I've since found the spamassassin e-mail corpus, and a couple of
the easy_ham mails look as though they have the same problem.
I'm about to start investigations.
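For illustration only, a basic error report of this kind could look roughly like the sketch below. The function names and the tuple layout are hypothetical and are not the code that was committed to import-mbox.py.

```python
import email.message
import mailbox


def report_unserialisable(messages):
    """Return (index, message_id, error) for every message whose as_string() fails.

    `messages` is any iterable of email.message.Message objects.
    """
    failures = []
    for index, message in enumerate(messages):
        try:
            message.as_string()
        except Exception as err:  # broad on purpose: this is only a debugging aid
            message_id = message.get("Message-ID", "<unknown>")
            failures.append((index, message_id, repr(err)))
    return failures


def report_mbox(path):
    """Run the report over an existing mbox file (path supplied by the caller)."""
    return report_unserialisable(mailbox.mbox(path, create=False))
```

Printing the returned tuples would show which message to extract from a private mbox without exposing the rest of the list.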
On 22 November 2016 at 12:46, Francesco Chicchiriccò
<[email protected]> wrote:
On 22/11/2016 10:16, sebb wrote:
Sorry about that, I decided to change the thread id to its name and
did not change all the references.
Should be OK now.
Yes, I confirm it is (getting the original exception).
Going back to the original encoding issue: I have tried and failed to
reproduce it.
Can you find out which mbox caused the problem so I can take a look?
I know which mbox is causing the problem, but it's a private mailing list,
so I'd rather extract the troublesome message into a separate mbox,
possibly changing some bits to avoid unwanted disclosures.
Is there an easy way to add a debug statement showing which message is
actually the one causing trouble?
FYI at the moment the stacktrace is
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "import-mbox.py", line 295, in run
    'source': message.as_string()
  File "/usr/lib/python3.5/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
    self._write(msg)
  File "/usr/lib/python3.5/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
    msg.set_payload(payload, charset)
  File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
    payload = payload.encode(charset.output_charset)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
All done! 0 records inserted/updated after 19 seconds. 0 records were bad and ignored
Regards.
On 22 November 2016 at 07:23, Francesco Chicchiriccò <[email protected]> wrote:
Hi all,
after the latest commits, I now get the following error when importing
from mbox:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "import-mbox.py", line 314, in run
    bulk.assign(self.id, ja, es, 'mbox')
AttributeError: 'SlurpThread' object has no attribute 'id'
Regards.
On 21/11/2016 17:19, sebb wrote:
On 21 November 2016 at 11:52, Daniel Gruno
<[email protected]> wrote:
On 11/21/2016 12:50 PM, sebb wrote:
On 21 November 2016 at 11:40, Francesco Chicchiriccò
<[email protected]> wrote:
Hi all,
I'm not sure, but it seems that the commit below broke my scheduled import
from mbox:
It won't be that commit, most likely the fix for #251
https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
This presumably means the archiver would have fallen over with the same
e-mail.
Or there is an encoding problem with writing the mail to the mbox - or
reading it - so the importer is not seeing the same input as the archiver.
The importer usually sees things as ASCII, whereas the archiver _can_
get fed input as unicode by postfix (I don't know why, but there it is).
This may explain why. I think as_bytes is a safer way to archive, as
it's binary.
That all depends how the binary is generated.
As far as I can tell, the parsed message is not stored as binary, so
it has to be encoded to create the bytes.
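As a small illustration of the as_bytes route being discussed, here is a sketch using only the standard library. The sample message is hypothetical, and how import-mbox would actually store the result is not shown or assumed here.

```python
import email

# Hypothetical sample: declared us-ascii, but with a non-ASCII byte in the body.
RAW = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=us-ascii\n"
       b"\n"
       b"caf\xe9\n")

# Parsing from bytes keeps the message internally as str; undecodable
# bytes are held as escaped placeholders rather than as raw binary.
msg = email.message_from_bytes(RAW)

# as_bytes() has to encode that internal form again; for this sample the
# placeholder is turned back into the original byte.
rebuilt = msg.as_bytes()
```

So the bytes are indeed generated by encoding the parsed form, but in this simple case the original byte survives the round trip.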
It would be useful to know what the message is that causes the issue.
If you can find it I can take a look later.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "import-mbox.py", line 297, in run
    'source': message.as_string()
  File "/usr/lib/python3.5/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
    self._write(msg)
  File "/usr/lib/python3.5/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
    msg.set_payload(payload, charset)
  File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
    payload = payload.encode(charset.output_charset)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
Any hint / workaround?
--
Francesco Chicchiriccò
Tirasa - Open Source Excellence
http://www.tirasa.net/
Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/