Two issues prevent that message from being parsed correctly:
1) encoding
The header says:
    Content-Type: text/html;
        charset="us-ascii"
The HTML itself says:
    <meta charset="utf-8">
But AFAICT the HTML is actually encoded in Windows CP1252.
2) HTML-only content
Even without the encoding issue, the content will be rejected unless
the importer/archiver is using html2text.
[Note that the message itself is spam.]
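
For illustration, the encoding mismatch can be reproduced without the original message. This is only a sketch, not the import-mbox code: the sample text and the helper name decode_with_fallback are made up, and the fallback order (declared charset, then UTF-8, then CP1252) is an assumption, just one plausible choice.

```python
# Made-up sample: curly quotes are bytes 0x93/0x94 in Windows CP1252.
raw = "\u201cHello\u201d".encode("cp1252")   # b'\x93Hello\x94'

# The header claims us-ascii, so a strict decode fails.
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("ascii decode failed:", e)

# A lenient decode replaces each bad byte with U+FFFD ...
lossy = raw.decode("ascii", errors="replace")

# ... and encoding that text as ASCII gives the same kind of error
# as in the report ("can't encode character '\ufffd'").
try:
    lossy.encode("ascii")
except UnicodeEncodeError as e:
    print("ascii encode failed:", e)


def decode_with_fallback(data, declared="us-ascii"):
    """Hypothetical helper: try the declared charset, then UTF-8, then CP1252."""
    for charset in (declared, "utf-8", "cp1252"):
        try:
            return data.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: never raise, but lose the undecodable bytes.
    return data.decode("ascii", errors="replace")


# ascii() keeps the output printable on any terminal encoding.
print(ascii(decode_with_fallback(raw)))
```

Guessing the charset like this can pick the wrong one for other inputs, so it is a diagnostic aid rather than a fix.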
On 23 November 2016 at 07:59, Francesco Chicchiriccò
<[email protected]> wrote:
> On 23/11/2016 08:58, Francesco Chicchiriccò wrote:
>>
>> Hi Sebb,
>> thanks to the latest modifications, I was able to successfully complete
>> the mbox import, despite the failure on that specific message's attachment.
>
>
> Forgot to report the error message:
>
> Error ''ascii' codec can't encode character '\ufffd' in position 3657:
> ordinal not in range(128)' processing id
> a01fd40ed66aeaa9580a5becb12296b612e51cbc41efa2da9e8bd2d8@1442210618@<syncope.tirasa.net>
> msg <[email protected]>
>
>
>> Hence, I was able to isolate that message and put it in the attached mbox:
>> I hope this helps you fix it.
>>
>> Thanks for your support.
>> Regards.
>>
>> On 22/11/2016 17:23, sebb wrote:
>>>
>>> It looks like the problem is that the message header says that the
>>> charset=us-ascii but the text is actually in a different encoding.
>>>
>>> The same messages do not cause problems for the archiver.
>>> I think that's because the archiver reads the entire message into a
>>> string first using UTF-8, and the mail is parsed from the string, not
>>> directly from a file.
>>> The string has been cleansed of encoding issues.
>>>
>>> A workaround for import-mbox is to invoke:
>>>
>>> message.set_charset(None)
>>>
>>> just before the as_string(), because that skips any encoding of the
>>> payload.
>>>
>>> But I don't think that's a long-term solution.
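
A minimal sketch of that set_charset(None) call, using a made-up message rather than the real one (the subject and body here are invented for illustration). As far as I can tell from the traceback, the failure comes from re-encoding the payload with the declared charset, and removing the charset skips that step:

```python
import email

# Made-up message: the header claims us-ascii, but the body contains
# CP1252 curly quotes (bytes 0x93 and 0x94).
raw = (
    b"Subject: charset mismatch\n"
    b'Content-Type: text/html; charset="us-ascii"\n'
    b"\n"
    b"<p>\x93Hello\x94</p>\n"
)

message = email.message_from_bytes(raw)

# Drop the declared charset so the generator has nothing to re-encode
# the payload with when flattening the message.
message.set_charset(None)
source = message.as_string()

print(message["Content-Type"])   # the charset parameter has been removed
print(ascii(source))             # ascii() avoids terminal encoding problems
```

Note that this does not recover the original characters; the non-ASCII bytes typically end up as replacement characters in the string.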
>>>
>>>
>>>
>>> On 22 November 2016 at 15:47, sebb <[email protected]> wrote:
>>>>
>>>> These are the file names:
>>>>
>>>> 00439.982a2ff6189badfe70c2fe3c972466a2
>>>> 02472.5c879dd55c3d4171e1787e8529bbd7e1
>>>>
>>>>
>>>>
>>>> On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
>>>>>
>>>>> OK, I've added a basic error report.
>>>>>
>>>>> Note: I've since found the spamassassin e-mail corpus, and a couple of
>>>>> the easy_ham mails look as though they have the same problem.
>>>>>
>>>>> I'm about to start investigations.
>>>>>
>>>>>
>>>>> On 22 November 2016 at 12:46, Francesco Chicchiriccò
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 22/11/2016 10:16, sebb wrote:
>>>>>>>
>>>>>>> Sorry about that, I decided to change the thread id to its name and
>>>>>>> did not change all the references.
>>>>>>> Should be OK now.
>>>>>>
>>>>>>
>>>>>> Yes, I confirm it is (getting the original exception).
>>>>>>
>>>>>>> Going back to the original encoding issue: I have tried and failed to
>>>>>>> reproduce it.
>>>>>>>
>>>>>>> Can you find out which mbox caused the problem so I can take a look?
>>>>>>
>>>>>>
>>>>>> I know which mbox is causing the problem, but it's a private mailing
>>>>>> list, so I'd rather play safe and extract the troublesome message into
>>>>>> a separate mbox, possibly changing some bits to avoid unwanted
>>>>>> disclosures.
>>>>>>
>>>>>> Is there an easy way to add a debug statement showing which message
>>>>>> is actually causing the trouble?
>>>>>>
>>>>>> FYI at the moment the stacktrace is
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>     self.run()
>>>>>>   File "import-mbox.py", line 295, in run
>>>>>>     'source': message.as_string()
>>>>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>>>>     g.flatten(self, unixfrom=unixfrom)
>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>>>>     self._write(msg)
>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>     self._dispatch(msg)
>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>>>>     meth(msg)
>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>>>>     msg.set_payload(payload, charset)
>>>>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>>>>     payload = payload.encode(charset.output_charset)
>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>>>>>>
>>>>>> All done! 0 records inserted/updated after 19 seconds. 0 records were bad and ignored
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>>> On 22 November 2016 at 07:23, Francesco Chicchiriccò
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> after the latest commits, I now get the following error when
>>>>>>>> importing from mbox:
>>>>>>>>
>>>>>>>> Exception in thread Thread-1:
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>>>     self.run()
>>>>>>>>   File "import-mbox.py", line 314, in run
>>>>>>>>     bulk.assign(self.id, ja, es, 'mbox')
>>>>>>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21/11/2016 17:19, sebb wrote:
>>>>>>>>>
>>>>>>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> not sure, but it seems that the commit below broke my scheduled
>>>>>>>>>>>> import from mbox:
>>>>>>>>>>>
>>>>>>>>>>> It won't be that commit; most likely it is the fix for #251:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>>>>>>
>>>>>>>>>>> This presumably means the archiver would have fallen over with the
>>>>>>>>>>> same e-mail.
>>>>>>>>>>> Or there is an encoding problem with writing the mail to the mbox,
>>>>>>>>>>> or with reading it, so the importer is not seeing the same input as
>>>>>>>>>>> the archiver.
>>>>>>>>>>
>>>>>>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>>>>>>> get fed input as unicode by postfix (I don't know why, but there it is).
>>>>>>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>>>>>>> it's binary.
>>>>>>>>>
>>>>>>>>> That all depends on how the binary is generated.
>>>>>>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>>>>>>> it has to be encoded to create the bytes.
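
A rough illustration of the as_string()/as_bytes() point, using a made-up message. My understanding of CPython's email package is that a message parsed from bytes keeps undecodable bytes in a str as surrogate escapes, and as_bytes() maps them back to the original bytes, so no charset-based encoding of the body is needed:

```python
import email

# Made-up message: the header claims us-ascii, but the body contains
# CP1252 curly quotes (bytes 0x93 and 0x94).
raw = (
    b"Subject: charset mismatch\n"
    b'Content-Type: text/html; charset="us-ascii"\n'
    b"\n"
    b"<p>\x93Hello\x94</p>\n"
)

message = email.message_from_bytes(raw)

# decode=True returns the body as bytes, as they were in the input.
body = message.get_payload(decode=True)

# as_bytes() flattens the whole message without encoding the body
# through the (wrong) declared charset.
flat = message.as_bytes()

print(body)
print(b"\x93Hello\x94" in flat)
```

This only shows that the original bytes survive; whoever reads the archive later still has to decide which charset to decode them with.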
>>>>>>>>>
>>>>>>>>>>> It would be useful to know what the message is that causes the
>>>>>>>>>>> issue.
>>>>>>>>>>>
>>>>>>>>>>> If you can find it I can take a look later.
>>>>>>>>>>>
>>>>>>>>>>>> Exception in thread Thread-1:
>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>>>>>>>     self.run()
>>>>>>>>>>>>   File "import-mbox.py", line 297, in run
>>>>>>>>>>>>     'source': message.as_string()
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>>>>>>>>>>     g.flatten(self, unixfrom=unixfrom)
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>>>>>>>>>>     self._write(msg)
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>>>>>>     self._dispatch(msg)
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>>>>>>>>>>     meth(msg)
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>>>>>>>>>>     msg.set_payload(payload, charset)
>>>>>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>>>>>>>>>>     payload = payload.encode(charset.output_charset)
>>>>>>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>>>>>>>>>>>>
>>>>>>>>>>>> Any hint / workaround?
>>
>>
> --
> Francesco Chicchiriccò
>
> Tirasa - Open Source Excellence
> http://www.tirasa.net/
>
> Member at The Apache Software Foundation
> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
> http://home.apache.org/~ilgrosso/
>