These are the file names:

00439.982a2ff6189badfe70c2fe3c972466a2
02472.5c879dd55c3d4171e1787e8529bbd7e1

On 22 November 2016 at 15:42, sebb <[email protected]> wrote:
> OK, I've added a basic error report.
>
> Note: I've since found the spamassassin e-mail corpus, and a couple of
> the easy_ham mails look as though they have the same problem.
>
> I'm about to start investigations.
>
>
> On 22 November 2016 at 12:46, Francesco Chicchiriccò
> <[email protected]> wrote:
>> On 22/11/2016 10:16, sebb wrote:
>>>
>>> Sorry about that, I decided to change the thread id to its name and
>>> did not change all the references.
>>> Should be OK now.
>>
>>
>> Yes, I confirm it is (getting the original exception).
>>
>>> Going back to the original encoding issue: I have tried and failed to
>>> reproduce it.
>>>
>>> Can you find out which mbox caused the problem so I can take a look?
>>
>>
>> I know which mbox is causing the problem, but it's a private mailing list,
>> so I'd rather be safer to extract the troublesome message into a separate
>> mbox, possibly by changing some bits to avoid unwanted disclosures.
>>
>> Is there an easy way to add some debug statement about which message is
>> actually the one causing troubles?
>>
>> FYI at the moment the stacktrace is
>>
>> Traceback (most recent call last):
>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>     self.run()
>>   File "import-mbox.py", line 295, in run
>>     'source': message.as_string()
>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>     g.flatten(self, unixfrom=unixfrom)
>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>     self._write(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>     self._dispatch(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>     meth(msg)
>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>     msg.set_payload(payload, charset)
>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>     payload = payload.encode(charset.output_charset)
>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>> position 3657: ordinal not in range(128)
>>
>> All done! 0 records inserted/updated after 19 seconds. 0 records were bad
>> and ignored
>>
>> Regards.
>>
>>
>>> On 22 November 2016 at 07:23, Francesco Chicchiriccò<[email protected]>
>>> wrote:
>>>>
>>>> Hi all,
>>>> after latest commits, I get now the following error when importing from
>>>> mbox:
>>>>
>>>> Exception in thread Thread-1:
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>     self.run()
>>>>   File "import-mbox.py", line 314, in run
>>>>     bulk.assign(self.id, ja, es, 'mbox')
>>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>>
>>>> Regards.
>>>>
>>>>
>>>> On 21/11/2016 17:19, sebb wrote:
>>>>>
>>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]> wrote:
>>>>>>
>>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>>
>>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> not sure but it seems that the commit below broke my scheduled import
>>>>>>>> from mbox:
>>>>>>>
>>>>>>> It won't be that commit, most likely the fix for #251
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>>
>>>>>>> This presumably means the archiver would have fallen over with the
>>>>>>> same e-mail.
>>>>>>> Or there is an encoding problem with writing the mail to the mbox - or
>>>>>>> reading it - so the importer is not seeing the same input as the
>>>>>>> archiver.
>>>>>>
>>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>>> get fed input as unicode by postfix (I don't know why, but there it is).
>>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>>> it's binary.
>>>>>
>>>>> That all depends how the binary is generated.
>>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>>> it has to be encoded to create the bytes.
>>>>>
>>>>>>> It would be useful to know what the message is that causes the issue.
>>>>>>>
>>>>>>> If you can find it I can take a look later.
>>>>>>>
>>>>>>>> Exception in thread Thread-1:
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>>>     self.run()
>>>>>>>>   File "import-mbox.py", line 297, in run
>>>>>>>>     'source': message.as_string()
>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>>>>>>     g.flatten(self, unixfrom=unixfrom)
>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>>>>>>     self._write(msg)
>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>>     self._dispatch(msg)
>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>>>>>>     meth(msg)
>>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>>>>>>     msg.set_payload(payload, charset)
>>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>>>>>>     payload = payload.encode(charset.output_charset)
>>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in
>>>>>>>> position 3657: ordinal not in range(128)
>>>>>>>>
>>>>>>>> Any hint / workaround?
>>
>>
>> --
>> Francesco Chicchiriccò
>>
>> Tirasa - Open Source Excellence
>> http://www.tirasa.net/
>>
>> Member at The Apache Software Foundation
>> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
>> http://home.apache.org/~ilgrosso/
>>
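
On the question quoted above about finding which message causes the trouble, and the suggestion that as_bytes may be safer: here is a minimal standalone sketch, not taken from import-mbox.py. The function names find_problem_messages and flatten_message are hypothetical. It assumes the failure is the UnicodeEncodeError raised by as_string() shown in the traceback, and that as_bytes() succeeds on the same message, which is only what the thread suggests and has not been verified against the private mbox.

```python
import mailbox


def flatten_message(message):
    """Return the message source as text, falling back to bytes on failure."""
    try:
        return message.as_string()
    except UnicodeEncodeError:
        # Assumption: as_bytes() works where as_string() fails, as suggested
        # in the thread. Undecodable bytes become U+FFFD in the result.
        return message.as_bytes().decode("utf-8", errors="replace")


def find_problem_messages(mbox_path):
    """List (index, Message-ID, error) for messages as_string() cannot flatten."""
    problems = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            try:
                message.as_string()
            except UnicodeEncodeError as err:
                message_id = str(message.get("Message-ID", "<no Message-ID>"))
                problems.append((index, message_id, str(err)))
    finally:
        box.close()
    return problems
```

Running find_problem_messages on the private mbox locally would give the position and Message-ID of each affected mail without disclosing its content, so only that one message needs to be extracted and scrubbed.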
