OK, I've added a basic error report. Note: I've since found the spamassassin e-mail corpus, and a couple of the easy_ham mails look as though they have the same problem.
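For anyone who wants to check an mbox independently of the importer, here is a minimal sketch of the same idea. It is not the code that went into import-mbox.py; the names find_unrenderable and report are made up for illustration. It assumes the failure is the as_string() call shown in the traceback below and simply records which messages raise a Unicode error there. Whether a given mail fails may depend on the Python version.

```python
import mailbox


def find_unrenderable(messages):
    """Return (index, Message-ID, error text) for each message whose
    as_string() call raises a Unicode error.

    `messages` is any iterable of email message objects, for example
    a mailbox.mbox instance.
    """
    failures = []
    for index, message in enumerate(messages):
        try:
            message.as_string()
        except UnicodeError as err:  # covers UnicodeEncodeError
            failures.append((index, message.get('Message-ID'), str(err)))
    return failures


def report(mbox_path):
    """Print one line per failing message in the mbox at mbox_path."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message_id, error in find_unrenderable(box):
            print('message #%d (%s): %s' % (index, message_id, error))
    finally:
        box.close()
```

Calling report() with the path to the mbox file lists the position and Message-ID of each affected mail, which should be enough to extract it into a separate mbox.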
I'm about to start investigations.

On 22 November 2016 at 12:46, Francesco Chicchiriccò <[email protected]> wrote:
> On 22/11/2016 10:16, sebb wrote:
>>
>> Sorry about that, I decided to change the thread id to its name and
>> did not change all the references.
>> Should be OK now.
>
>
> Yes, I confirm it is (getting the original exception).
>
>> Going back to the original encoding issue: I have tried and failed to
>> reproduce it.
>>
>> Can you find out which mbox caused the problem so I can take a look?
>
>
> I know which mbox is causing the problem, but it's a private mailing list,
> so I'd rather be safer to extract the troublesome message into a separate
> mbox, possibly by changing some bits to avoid unwanted disclosures.
>
> Is there an easy way to add some debug statement about which message is
> actually the one causing troubles?
>
> FYI at the moment the stacktrace is
>
> Traceback (most recent call last):
>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>     self.run()
>   File "import-mbox.py", line 295, in run
>     'source': message.as_string()
>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>     g.flatten(self, unixfrom=unixfrom)
>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>     self._write(msg)
>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>     self._dispatch(msg)
>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>     meth(msg)
>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>     msg.set_payload(payload, charset)
>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>     payload = payload.encode(charset.output_charset)
> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>
> All done! 0 records inserted/updated after 19 seconds. 0 records were bad and ignored
>
> Regards.
>
>
>> On 22 November 2016 at 07:23, Francesco Chicchiriccò<[email protected]>
>> wrote:
>>>
>>> Hi all,
>>> after latest commits, I get now the following error when importing from
>>> mbox:
>>>
>>> Exception in thread Thread-1:
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>     self.run()
>>>   File "import-mbox.py", line 314, in run
>>>     bulk.assign(self.id, ja, es, 'mbox')
>>> AttributeError: 'SlurpThread' object has no attribute 'id'
>>>
>>> Regards.
>>>
>>>
>>> On 21/11/2016 17:19, sebb wrote:
>>>>
>>>> On 21 November 2016 at 11:52, Daniel Gruno <[email protected]> wrote:
>>>>>
>>>>> On 11/21/2016 12:50 PM, sebb wrote:
>>>>>>
>>>>>> On 21 November 2016 at 11:40, Francesco Chicchiriccò
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>> not sure but it seems that the commit below broke my scheduled import
>>>>>>> from mbox:
>>>>>>
>>>>>> It won't be that commit, most likely the fix for #251
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/incubator-ponymail/commit/1a3bff403166c917738fd02acefc988b909d4eae#diff-0102373f79eaa72ffaff3ce7675b6a43
>>>>>>
>>>>>> This presumably means the archiver would have fallen over with the
>>>>>> same e-mail.
>>>>>> Or there is an encoding problem with writing the mail to the mbox - or
>>>>>> reading it - so the importer is not seeing the same input as the
>>>>>> archiver.
>>>>>
>>>>> The importer usually sees things as ASCII, whereas the archiver _can_
>>>>> get fed input as unicode by postfix (I don't know why, but there it is).
>>>>> This may explain why. I think as_bytes is a safer way to archive, as
>>>>> it's binary.
>>>>
>>>> That all depends how the binary is generated.
>>>> As far as I can tell, the parsed message is not stored as binary, so
>>>> it has to be encoded to create the bytes.
>>>>
>>>>>> It would be useful to know what the message is that causes the issue.
>>>>>>
>>>>>> If you can find it I can take a look later.
>>>>>>
>>>>>>> Exception in thread Thread-1:
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>>>>>>>     self.run()
>>>>>>>   File "import-mbox.py", line 297, in run
>>>>>>>     'source': message.as_string()
>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 159, in as_string
>>>>>>>     g.flatten(self, unixfrom=unixfrom)
>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
>>>>>>>     self._write(msg)
>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 181, in _write
>>>>>>>     self._dispatch(msg)
>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
>>>>>>>     meth(msg)
>>>>>>>   File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
>>>>>>>     msg.set_payload(payload, charset)
>>>>>>>   File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
>>>>>>>     payload = payload.encode(charset.output_charset)
>>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 3657: ordinal not in range(128)
>>>>>>>
>>>>>>> Any hint / workaround?
>
>
> --
> Francesco Chicchiriccò
>
> Tirasa - Open Source Excellence
> http://www.tirasa.net/
>
> Member at The Apache Software Foundation
> Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
> http://home.apache.org/~ilgrosso/
>
