These are the file names:

00439.982a2ff6189badfe70c2fe3c972466a2
02472.5c879dd55c3d4171e1787e8529bbd7e1
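
For tracking down which message in an mbox triggers the as_string() failure quoted below, something along these lines might help. This is only a rough sketch, not what import-mbox.py does: the names find_bad_messages and check_mbox are made up for illustration, and it assumes the failure shows up as a UnicodeError when the parsed message is flattened.

```python
import mailbox


def find_bad_messages(messages):
    """Return (index, message_id, error) for every message that fails as_string()."""
    bad = []
    for index, message in enumerate(messages):
        try:
            # Same call that fails in the traceback from import-mbox.py
            message.as_string()
        except UnicodeError as err:  # UnicodeEncodeError is a subclass
            msgid = message.get('Message-ID', '(no Message-ID)')
            bad.append((index, msgid, str(err)))
    return bad


def check_mbox(path):
    """Hypothetical helper: scan one mbox file and return the failing messages."""
    # create=False so a mistyped path raises instead of creating an empty mbox
    return find_bad_messages(mailbox.mbox(path, create=False))
```

Calling check_mbox() on the suspect mbox and printing the result should give the position and Message-ID of each troublesome mail without disclosing its content.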



On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
> OK, I've added a basic error report.
>
> Note: I've since found the spamassassin e-mail corpus, and a couple of
> the easy_ham mails look as though they have the same problem.
>
> I'm about to start investigations.
>
>
> On 22 November 2016 at 12:46, Francesco Chicchiriccò
> <[email protected]> wrote:
>> On 22/11/2016 10:16, sebb wrote:
>>>
>>> Sorry about that, I decided to change the thread id to its name and
>>> did not change all the references.
>>> Should be OK now.
>>
>>
>> Yes, I confirm it is (getting the original exception).
>>
>>> Going back to the original encoding issue: I have tried and failed to
>>> reproduce it.
>>>
>>> Can you find out which mbox caused the problem so I can take a look?
>>
>>
>> I know which mbox is causing the problem, but it's a private mailing list,
>> so I'd rather be safer to extract the troublesome message into a separate
>> mbox, possibly by changing some bits to avoid unwanted disclosures.
>>
>> Is there an easy way to add some debug statement about which message is
>> actually the one causing troubles?
>>
>> FYI at the moment the stacktrace is
>>
>> Traceback (most recent call last):
>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>     self.run()
>>   File "import-mbox.py", line 295, in run
>>     'source': message.as_string()
>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>     g.flatten(self, unixfrom=unixfrom)
>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>     self._write(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>     self._dispatch(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>     meth(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>     msg.set_payload(payload, charset)
>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>     payload = payload.encode(charset.output_charset)
>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>> position 3657: ordinal not in range(128)
>>
>> All done! 0 records inserted/updated after 19 seconds. 0 records were bad
>> and ignored
>>
>> Regards.
>>
>>
>>> On 22 November 2016 at 07:23, Francesco Chicchiriccò<[email protected]>
>>> wrote:
>>>>
>>>> Hi all,
>>>> after latest commits, I get now the following error when importing from
>>>> mbox:
>>>>
>>>> Exception in thread Thread-1:
>>>> Traceback (most recent call last):
>>>>    File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>      self.run()
>>>>    File "import-mbox.py", line 314, in run
>>>>      bulk.assign(self.id, ja, es, 'mbox')
>>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>>
>>>> Regards.
>>>>
>>>>
>>>> On 21/11/2016 17:19, sebb wrote:
>>>>>
>>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]> wrote:
>>>>>>
>>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>>
>>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> not sure but it seems that the commit below broke my scheduled import
>>>>>>>> from mbox:
>>>>>>>
>>>>>>> It won't be that commit, most likely the fix for #251
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>>
>>>>>>> This presumably means the archiver would have fallen over with the
>>>>>>> same
>>>>>>> e-mail.
>>>>>>> Or there is an encoding problem with writing the mail to the mbox - or
>>>>>>> reading it - so the importer is not seeing the same input as the
>>>>>>> archiver.
>>>>>>
>>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>>> get fed input as unicode by postfix (I don't know why, but there it
>>>>>> is).
>>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>>> it's binary.
>>>>>
>>>>> That all depends how the binary is generated.
>>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>>> it has to be encoded to create the bytes.
>>>>>
>>>>>>> It would be useful to know what the message is that causes the issue.
>>>>>>>
>>>>>>> If you can find it I can take a look later.
>>>>>>>
>>>>>>>> Exception in thread Thread-1:
>>>>>>>> Traceback (most recent call last):
>>>>>>>>     File "/usr/lib/python3.5/threading.py", line 914, in
>>>>>>>> _bootstrap_inner
>>>>>>>>       self.run()
>>>>>>>>     File "import-mbox.py", line 297, in run
>>>>>>>>       'source': message.as_string()
>>>>>>>>     File "/usr/lib/python3.5/email/message.py", line 159, in
>>>>>>>> as_string
>>>>>>>>       g.flatten(self, unixfrom=unixfrom)
>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 115, in
>>>>>>>> flatten
>>>>>>>>       self._write(msg)
>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>>       self._dispatch(msg)
>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 214, in
>>>>>>>> _dispatch
>>>>>>>>       meth(msg)
>>>>>>>>     File "/usr/lib/python3.5/email/generator.py", line 243, in
>>>>>>>> _handle_text
>>>>>>>>       msg.set_payload(payload, charset)
>>>>>>>>     File "/usr/lib/python3.5/email/message.py", line 316, in
>>>>>>>> set_payload
>>>>>>>>       payload = payload.encode(charset.output_charset)
>>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>>>>>>>> position 3657: ordinal not in range(128)
>>>>>>>>
>>>>>>>> Any hint / workaround?
>>
>>
>> --
>> Francesco Chicchiriccò
>>
>> Tirasa - Open Source Excellence
>> http://www.tirasa.net/
>>
>> Member at The Apache Software Foundation
>> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
>> http://home.apache.org/~ilgrosso/
>>
