It looks like the problem is that the message header declares
charset=us-ascii, but the body is actually in a different encoding.

The same messages do not cause problems for the archiver.
I think that's because the archiver reads the entire message into a
string first using UTF-8, and the mail is parsed from the string, not
directly from a file.
By that point the string has already been cleansed of encoding issues.
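
To illustrate the difference, here is a rough sketch of the string-based path. It is not the archiver's actual code: the sample message, the function name and the use of errors="replace" are my assumptions, chosen only to show why a pre-decoded string flattens without trouble.

```python
import email

# Hypothetical sample: the header claims us-ascii, but the body
# contains a Latin-1 byte (0xE9).
RAW_MESSAGE = (
    b"From: sender@example.invalid\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"caf\xe9 au lait\n"
)


def flatten_via_string(raw_bytes):
    """Decode the whole message first, then parse it from the string."""
    # Assumption: undecodable bytes are replaced with U+FFFD here, so the
    # parser never sees raw 8-bit data.
    text = raw_bytes.decode("utf-8", errors="replace")
    message = email.message_from_string(text)
    return message.as_string()


flattened = flatten_via_string(RAW_MESSAGE)
```

The bad byte has already become U+FFFD before parsing, so as_string() has nothing left to re-encode.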

A workaround for import-mbox is to invoke:

message.set_charset(None)

just before the as_string() call, because that skips any re-encoding of the payload.
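
A minimal standalone sketch of that workaround follows. The sample message and the function name are made up for illustration; this is not the code in import-mbox.py.

```python
import email

# Hypothetical sample: the header claims us-ascii, but the body
# contains a Latin-1 byte (0xE9).
RAW_MESSAGE = (
    b"From: sender@example.invalid\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"caf\xe9 au lait\n"
)


def flatten_without_reencoding(raw_bytes):
    """Parse from bytes, drop the declared charset, then flatten."""
    message = email.message_from_bytes(raw_bytes)
    # set_charset(None) removes the charset parameter from Content-Type,
    # so the generator does not try to re-encode the payload as us-ascii.
    message.set_charset(None)
    return message.as_string()


source = flatten_without_reencoding(RAW_MESSAGE)
```

Note that this also drops the charset parameter from the stored headers, so the flattened source is no longer a faithful copy of the original message.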

But I don't think that's a long-term solution.



On 22 November 2016 at 15:47, sebb <[email protected]> wrote:
> These are the file names:
>
> 00439.982a2ff6189badfe70c2fe3c972466a2
> 02472.5c879dd55c3d4171e1787e8529bbd7e1
>
>
>
> On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
>> OK, I've added a basic error report.
>>
>> Note: I've since found the spamassassin e-mail corpus, and a couple of
>> the easy_ham mails look as though they have the same problem.
>>
>> I'm about to start investigations.
>>
>>
>> On 22 November 2016 at 12:46, Francesco Chicchiriccò
>> <[email protected]> wrote:
>>> On 22/11/2016 10:16, sebb wrote:
>>>>
>>>> Sorry about that, I decided to change the thread id to its name and
>>>> did not change all the references.
>>>> Should be OK now.
>>>
>>>
>>> Yes, I confirm it is (getting the original exception).
>>>
>>>> Going back to the original encoding issue: I have tried and failed to
>>>> reproduce it.
>>>>
>>>> Can you find out which mbox caused the problem so I can take a look?
>>>
>>>
>>> I know which mbox is causing the problem, but it's a private mailing list,
>>> so it would be safer for me to extract the troublesome message into a
>>> separate mbox, possibly changing some bits to avoid unwanted disclosures.
>>>
>>> Is there an easy way to add some debug statement about which message is
>>> actually the one causing troubles?
>>>
>>> FYI at the moment the stacktrace is
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>     self.run()
>>>   File "import-mbox.py", line 295, in run
>>>     'source': message.as_string()
>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>     g.flatten(self, unixfrom=unixfrom)
>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>     self._write(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>     self._dispatch(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>     meth(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>     msg.set_payload(payload, charset)
>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>     payload = payload.encode(charset.output_charset)
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>>> position 3657: ordinal not in range(128)
>>>
>>> All done! 0 records inserted/updated after 19 seconds. 0 records were bad
>>> and ignored
>>>
>>> Regards.
>>>
>>>
>>>> On 22 November 2016 at 07:23, Francesco Chicchiriccò<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>> after latest commits, I get now the following error when importing from
>>>>> mbox:
>>>>>
>>>>> Exception in thread Thread-1:
>>>>> Traceback (most recent call last):
>>>>>    File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>      self.run()
>>>>>    File "import-mbox.py", line 314, in run
>>>>>      bulk.assign(self.id, ja, es, 'mbox')
>>>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>> On 21/11/2016 17:19, sebb wrote:
>>>>>>
>>>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]> wrote:
>>>>>>>
>>>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>>>
>>>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> not sure but it seems that the commit below broke my scheduled import
>>>>>>>>> from mbox:
>>>>>>>>
>>>>>>>> It won't be that commit, most likely the fix for #251
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>>>
>>>>>>>> This presumably means the archiver would have fallen over with the
>>>>>>>> same
>>>>>>>> e-mail.
>>>>>>>> Or there is an encoding problem with writing the mail to the mbox - or
>>>>>>>> reading it - so the importer is not seeing the same input as the
>>>>>>>> archiver.
>>>>>>>
>>>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>>>> get fed input as unicode by postfix (I don't know why, but there it
>>>>>>> is).
>>>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>>>> it's binary.
>>>>>>
>>>>>> That all depends how the binary is generated.
>>>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>>>> it has to be encoded to create the bytes.
>>>>>>
>>>>>>>> It would be useful to know what the message is that causes the issue.
>>>>>>>>
>>>>>>>> If you can find it I can take a look later.
>>>>>>>>
>>>>>>>>> Exception in thread Thread-1:
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>     File "/usr/lib/python3.5/threading.py", line 914, in
>>>>>>>>> _bootstrap_inner
>>>>>>>>>       self.run()
>>>>>>>>>     File "import-mbox.py", line 297, in run
>>>>>>>>>       'source': message.as_string()
>>>>>>>>>     File "/usr/lib/python3.5/email/message.py", line 159, in
>>>>>>>>> as_string
>>>>>>>>>       g.flatten(self, unixfrom=unixfrom)
>>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 115, in
>>>>>>>>> flatten
>>>>>>>>>       self._write(msg)
>>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>>>       self._dispatch(msg)
>>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 214, in
>>>>>>>>> _dispatch
>>>>>>>>>       meth(msg)
>>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 243, in
>>>>>>>>> _handle_text
>>>>>>>>>       msg.set_payload(payload, charset)
>>>>>>>>>     File "/usr/lib/python3.5/email/message.py", line 316, in
>>>>>>>>> set_payload
>>>>>>>>>       payload = payload.encode(charset.output_charset)
>>>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>>>>>>>>> position 3657: ordinal not in range(128)
>>>>>>>>>
>>>>>>>>> Any hint / workaround?
>>>
>>>
>>> --
>>> Francesco Chicchiriccò
>>>
>>> Tirasa - Open Source Excellence
>>> http://www.tirasa.net/
>>>
>>> Member at The Apache Software Foundation
>>> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
>>> http://home.apache.org/~ilgrosso/
>>>
