Thank you all! I've added 'export LC_ALL=en_US.UTF-8' to my launch bash script and everything works correctly.
Best,
Martin

On Sat, 17 June 2017 at 11:15, Merlijn van Deen (valhallasw) <[email protected]> wrote:

> Hi all,
>
> This is a combination of a Python 3 design choice (PEP 383 [1]) and T60784
> [2]. What happens is the following:
>
> 1) The locale is set to an encoding that cannot decode certain bytes -- for
>    example, ASCII, which can only decode bytes < 128.
> 2) Python is started with a command line parameter that contains a byte
>    with value 128 (\x80) or higher. For example, "ř", when UTF-8 encoded,
>    is represented by two bytes: \xc5\x99. Both of these are > \x80, and
>    can therefore not be interpreted as ASCII.
> 3) Python 3 needs to somehow decode these bytes into a text string. But
>    there is no valid way to do so! Instead of complaining loudly with a
>    UnicodeDecodeError, Python 3 embeds the bytes as 'fake characters'
>    (lone surrogates) in the string -- as described in PEP 383. \xc5\x99 is
>    therefore now suddenly decoded as "\udcc5\udc99" instead of "ř".
> 4) Pywikibot tries to encode these characters using utf-8, but they are
>    fake characters, and the .encode step blows up.
>
> A simple way to reproduce this is the following:
>
> valhallasw@tools-bastion-03:~/ucm$ cat test.py
> import sys
> encoded = sys.argv[1].encode('utf-8')
>
> valhallasw@tools-bastion-03:~/ucm$ LC_ALL=C python3 test.py řeklad
> Traceback (most recent call last):
>   File "test.py", line 2, in <module>
>     encoded = sys.argv[1].encode('utf-8')
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in
> position 0: surrogates not allowed
>
> This should be fixed in future Python versions (likely 3.7), when PEP 540
> [3] is implemented.
>
> As for the current situation, the simplest solution is to add 'export
> LC_ALL=en_US.UTF-8' to your script, before the 'python ...' line.
>
> Best,
> Merlijn
>
> [1] https://www.python.org/dev/peps/pep-0383/
> [2] https://phabricator.wikimedia.org/T60784
> [3] https://www.python.org/dev/peps/pep-0540/
>
> On 16 June 2017 at 23:58, Bryan Davis <[email protected]> wrote:
>
>> On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > I have a script which should add a template to articles which are
>> > created by the ContentTranslation tool (the template has parameters
>> > which depend on the language and revision which were used as the
>> > source; this is the reason why I use a separate script). It may be
>> > found at
>> > https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py.
>> > The script works perfectly on my local PC and on the bastion host,
>> > but I can't get it to work on the grid.
>> >
>> > The script itself is run by
>> > python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
>> > and requires the pages.txt file and the preklads.txt file at
>> > https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first
>> > contains the pages that should be processed and acts as the
>> > generator; the second one is something like a database with the
>> > exact templates which should be inserted. Both files are attached as
>> > examples.
>> >
>> > When I try to run it on the toollabs bastion, all works as it
>> > should. When I send the script to the grid, it does not work (see
>> > sample output below). Why? Can somebody help me with it?
>> >
>> > Thank you in advance,
>> > Martin Urbanec / Urbanecm
>> >
>> > ; Output
>> >
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ cat test.sh
>> > python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ jsub bash test.sh
>> > Your job 6201363 ("bash") has been submitted
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ qstat
>> > job-ID  prior   name user     state submit/start at     queue              slots ja-task-ID
>> > ---------------------------------------------------------------------------------------------
>> > 6201363 0.30000 bash urbanecm r     06/16/2017 18:14:42 [email protected] 1
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ ls ~/bash.*
>> > /home/urbanecm/bash.err /home/urbanecm/bash.out
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ cat ~/bash.*
>> > Traceback (most recent call last):
>> >   File "addmissing.py", line 223, in <module>
>> >     main()
>> >   File "addmissing.py", line 183, in main
>> >     local_args = pywikibot.handle_args(args)
>> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
>> >     writeToCommandLogFile()
>> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
>> >     command_log_file.write(s + os.linesep)
>> >   File "/usr/lib/python3.4/codecs.py", line 711, in write
>> >     return self.writer.write(data)
>> >   File "/usr/lib/python3.4/codecs.py", line 368, in write
>> >     data, consumed = self.encode(object, self.errors)
>> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in
>> > position 67: surrogates not allowed
>> > CRITICAL: Closing network session.
>> > <class 'UnicodeEncodeError'>
>> > urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $
>>
>> Zhuyifei1999 saw your email and noted on irc that it looks to be a
>> case of the known bug that I just retitled as "Shell LOCALE neither
>> consistent nor sane across grid engine nodes"
>> (<https://phabricator.wikimedia.org/T60784>). The current best
>> workaround for that bug is to launch the job as a shell script that
>> sets either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
>>
>> If setting the job to run with the same locale you are using in your
>> interactive tests does not fix the problem, you may also be hitting a
>> deeper Python 3 unicode issue related to surrogate codepoints
>> (<https://bugs.python.org/issue12892>). This is hinted at by the
>> "position 67: surrogates not allowed" error message.
>>
>> I can actually reproduce your error message in an interactive python
>> session on tools-dev from a starting state of LANG=en_US.UTF-8:
>>
>> $ python3
>> Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>> [GCC 4.8.2] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> print('\udcc5')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in
>> position 0: surrogates not allowed
>> >>>
>>
>> Explicitly encoding using 'surrogateescape' does work:
>> >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>> b'\xc5'
>>
>> It looks like the error could be dealt with in pywikibot by patching
>> writeToCommandLogFile() to open the codec used for output with any
>> value other than the default errors='strict'
>> (<https://docs.python.org/3/library/codecs.html#error-handlers>).
>>
>> $ python3
>> Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>> [GCC 4.8.2] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> print('\udcc5'.encode('utf-8', 'ignore'))
>> b''
>> >>> print('\udcc5'.encode('utf-8', 'replace'))
>> b'?'
>> >>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
>> b'&#56517;'
>> >>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
>> b'\\udcc5'
>> >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>> b'\xc5'
>> >>> print('\udcc5'.encode('utf-8', 'surrogatepass'))
>> b'\xed\xb3\x85'
>> >>>
>>
>> Bryan
>> --
>> Bryan Davis              Wikimedia Foundation    <[email protected]>
>> [[m:User:BDavis_(WMF)]]  Manager, Cloud Services          Boise, ID USA
>> irc: bd808                                        v:415.839.6885 x6855
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
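Bryan's suggestion of opening the command log with an error handler other than 'strict' could look roughly like the sketch below. This is not pywikibot's actual code: write_command_log and read_bytes are hypothetical helpers, the built-in open() stands in for whatever codec wrapper pywikibot uses, and "\n" replaces os.linesep only to keep the output predictable.

```python
import os
import tempfile


def write_command_log(path, line, errors="backslashreplace"):
    """Append one line to a UTF-8 log file (hypothetical helper).

    Any handler other than the default 'strict' keeps lone surrogates
    from raising UnicodeEncodeError.
    """
    with open(path, "a", encoding="utf-8", errors=errors, newline="") as log:
        log.write(line + "\n")


def read_bytes(path):
    """Return the raw bytes of a file so the encoding result is visible."""
    with open(path, "rb") as handle:
        return handle.read()


# An argument as Python 3 sees it under an ASCII locale (PEP 383 surrogates).
argument = "-search:\udcc5\udc99eklad"

with tempfile.TemporaryDirectory() as tmp:
    # 'backslashreplace' keeps the log readable as plain ASCII escapes.
    escaped = os.path.join(tmp, "escaped.log")
    write_command_log(escaped, argument)
    print(read_bytes(escaped))   # b'-search:\\udcc5\\udc99eklad\n'

    # 'surrogateescape' writes back the original bytes from the command line.
    raw = os.path.join(tmp, "raw.log")
    write_command_log(raw, argument, errors="surrogateescape")
    print(read_bytes(raw))       # b'-search:\xc5\x99eklad\n'
```

Which handler is preferable depends on what the log is for: 'surrogateescape' restores the bytes that were typed, while 'backslashreplace' makes it obvious that the argument was decoded under the wrong locale.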
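The four steps in Merlijn's explanation can also be reproduced inside a single Python session, without touching the locale. The sketch below assumes that decoding with the 'ascii' codec and the 'surrogateescape' handler is a fair stand-in for what Python 3 does to sys.argv under LC_ALL=C; the function names are made up for illustration.

```python
def decode_like_ascii_locale(raw):
    """Decode command-line bytes the way PEP 383 does under an ASCII locale.

    Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    """
    return raw.decode("ascii", "surrogateescape")


def recover_text(arg):
    """Turn the surrogates back into the original bytes, then decode as UTF-8."""
    return arg.encode("ascii", "surrogateescape").decode("utf-8")


raw = "řeklad".encode("utf-8")       # what the shell passes: b'\xc5\x99eklad'
arg = decode_like_ascii_locale(raw)
print(ascii(arg))                    # '\udcc5\udc99eklad'

# Step 4: a strict UTF-8 encode rejects the lone surrogates.
try:
    arg.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)                # surrogates not allowed

# The original text is still recoverable, because no bytes were lost.
print(recover_text(arg) == "řeklad")  # True
```

Setting LC_ALL=en_US.UTF-8 in the launch script avoids the problem at the source, so a recovery step like recover_text is only of interest when the environment cannot be changed.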
