It looks like the problem is that the message header declares charset=us-ascii, but the body text is actually in a different encoding.
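A minimal sketch of that kind of mismatch, and of a check that could serve as a debug aid for spotting the offending message. The helper name charset_mismatch and the two sample messages are made up for illustration; none of this is taken from import-mbox.py, and it only looks at single-part messages.

```python
import email

# Hypothetical sample: the header claims us-ascii, the body holds a Latin-1 byte.
BAD = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=us-ascii\n"
       b"\n"
       b"caf\xe9\n")
# Same body, but with a header that matches the actual encoding.
GOOD = BAD.replace(b"us-ascii", b"iso-8859-1")


def charset_mismatch(raw_message):
    """Return a description of the mismatch, or None if the body bytes
    decode cleanly with the charset declared in the header."""
    msg = email.message_from_bytes(raw_message)
    if msg.is_multipart():
        return None  # this sketch only handles single-part messages
    declared = msg.get_content_charset("us-ascii")
    body = msg.get_payload(decode=True)  # raw body bytes
    try:
        body.decode(declared)
    except (UnicodeDecodeError, LookupError) as exc:
        return "declared charset %r does not fit the body: %s" % (declared, exc)
    return None


# Decoding such bytes leniently as ASCII yields U+FFFD, which cannot be
# encoded back to ASCII; that is the character named in the traceback.
LOSSY = b"caf\xe9".decode("ascii", errors="replace")
try:
    LOSSY.encode("ascii")
    REENCODE_FAILED = False
except UnicodeEncodeError:
    REENCODE_FAILED = True

print(charset_mismatch(BAD))
print(charset_mismatch(GOOD))
```

Calling such a check on each message before as_string() and logging the Message-ID on a mismatch would be one way to find out which mail is at fault.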
The same messages do not cause problems for the archiver. I think that's because the archiver first reads the entire message into a string using UTF-8, and the mail is then parsed from that string, not directly from a file. By then the string has already been cleansed of encoding issues.

A workaround for import-mbox is to invoke message.set_charset(None) just before the as_string() call, because that skips any encoding of the payload. But I don't think that's a long-term solution.

On 22 November 2016 at 15:47, sebb <[email protected]> wrote:
> These are the file names:
>
> 00439.982a2ff6189badfe70c2fe3c972466a2
> 02472.5c879dd55c3d4171e1787e8529bbd7e1
>
>
> On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
>> OK, I've added a basic error report.
>>
>> Note: I've since found the spamassassin e-mail corpus, and a couple of
>> the easy_ham mails look as though they have the same problem.
>>
>> I'm about to start investigations.
>>
>>
>> On 22 November 2016 at 12:46, Francesco Chicchiriccò
>> <[email protected]> wrote:
>>> On 22/11/2016 10:16, sebb wrote:
>>>>
>>>> Sorry about that, I decided to change the thread id to its name and
>>>> did not change all the references.
>>>> Should be OK now.
>>>
>>>
>>> Yes, I confirm it is (getting the original exception).
>>>
>>>> Going back to the original encoding issue: I have tried and failed to
>>>> reproduce it.
>>>>
>>>> Can you find out which mbox caused the problem so I can take a look?
>>>
>>>
>>> I know which mbox is causing the problem, but it's a private mailing list,
>>> so I'd rather be safer to extract the troublesome message into a separate
>>> mbox, possibly by changing some bits to avoid unwanted disclosures.
>>>
>>> Is there an easy way to add some debug statement about which message is
>>> actually the one causing troubles?
>>>
>>> FYI at the moment the stacktrace is
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>     self.run()
>>>   File "import-mbox.py", line 295, in run
>>>     'source': message.as_string()
>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>     g.flatten(self, unixfrom=unixfrom)
>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>     self._write(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>     self._dispatch(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>     meth(msg)
>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>     msg.set_payload(payload, charset)
>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>     payload = payload.encode(charset.output_charset)
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>>>
>>> All done! 0 records inserted/updated after 19 seconds. 0 records were bad
>>> and ignored
>>>
>>> Regards.
>>>
>>>
>>>> On 22 November 2016 at 07:23, Francesco Chicchiriccò<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>> after latest commits, I get now the following error when importing from
>>>>> mbox:
>>>>>
>>>>> Exception in thread Thread-1:
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>     self.run()
>>>>>   File "import-mbox.py", line 314, in run
>>>>>     bulk.assign(self.id, ja, es, 'mbox')
>>>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>> On 21/11/2016 17:19, sebb wrote:
>>>>>>
>>>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]> wrote:
>>>>>>>
>>>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>>>
>>>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> not sure but it seems that the commit below broke my scheduled import
>>>>>>>>> from mbox:
>>>>>>>>
>>>>>>>> It won't be that commit, most likely the fix for #251
>>>>>>>>
>>>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>>>
>>>>>>>> This presumably means the archiver would have fallen over with the
>>>>>>>> same e-mail.
>>>>>>>> Or there is an encoding problem with writing the mail to the mbox - or
>>>>>>>> reading it - so the importer is not seeing the same input as the
>>>>>>>> archiver.
>>>>>>>
>>>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>>>> get fed input as unicode by postfix (I don't know why, but there it is).
>>>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>>>> it's binary.
>>>>>>
>>>>>> That all depends how the binary is generated.
>>>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>>>> it has to be encoded to create the bytes.
>>>>>>
>>>>>>>> It would be useful to know what the message is that causes the issue.
>>>>>>>>
>>>>>>>> If you can find it I can take a look later.
>>>>>>>>
>>>>>>>>> Exception in thread Thread-1:
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>>>>     self.run()
>>>>>>>>>   File "import-mbox.py", line 297, in run
>>>>>>>>>     'source': message.as_string()
>>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>>>>>>>     g.flatten(self, unixfrom=unixfrom)
>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>>>>>>>     self._write(msg)
>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>>>     self._dispatch(msg)
>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>>>>>>>     meth(msg)
>>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>>>>>>>     msg.set_payload(payload, charset)
>>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>>>>>>>     payload = payload.encode(charset.output_charset)
>>>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>>>>>>>>>
>>>>>>>>> Any hint / workaround?
>>>
>>>
>>> --
>>> Francesco Chicchiriccò
>>>
>>> Tirasa - Open Source Excellence
>>> http://www.tirasa.net/
>>>
>>> Member at The Apache Software Foundation
>>> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
>>> http://home.apache.org/~ilgrosso/
>>>
