I have modified the Python config to work with UTF-8 by default, which is
why I had not noticed any conversion errors. Thanks for the tip, Tim.

David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora, Arduino
Hey, watch out!!! There's a linux in your pocket!!!

2009/11/30 Tim Besard <[email protected]>:
> Hi,
>
> It seems that the Dutch Wikipedia contains some UTF-8-only characters,
> which crash the parser after all because of the "system echo" in the
> exception handler. Changing the offending line to
>     os.system('echo \"%s\" >> fault_articles.txt' % title.encode("utf8"))
> fixes the issue.
>
> Tim
>
> On Monday 30-11-2009 at 14:49 [timezone +0100], David Reyes
> Samblas Martinez wrote:
>> Here you go :)
>>
>> David Reyes Samblas Martinez
>>
>> 2009/11/30 Tilman Baumann <[email protected]>:
>> > Hi,
>> >
>> > could you maybe release this as a patch?
>> > I would like to integrate this into GitHub, but I fear I might miss
>> > something if I try to pick out the changes by hand.
>> >
>> > Thanks
>> >
>> > David Reyes Samblas Martinez wrote:
>> >> Sorry for the wait Thomas,
>> >> I was working on the broken pipe issue that stops the parser
>> >> when it finds an error.
>> >> I have applied a quick and dirty workaround using a try/except
>> >> (try-catch) block, so now the process does not stop: it just skips
>> >> the faulty article and keeps going :) It logs the faulty ones in a
>> >> text file (title and position) for later forensics, but my first
>> >> guess is that this is not a UTF-8 encoding issue but rather an
>> >> unexpected formatting tag that the PHP parser doesn't know how to
>> >> deal with.
>> >> I am currently parsing the German Wikipedia, with more than 1.3
>> >> million articles:
>> >>
>> >> Count: 1043000
>> >> Failing count: 2
>> >>
>> >> and it keeps going. I suppose we can sacrifice two articles to have
>> >> one million available now :)
>> >>
>> >> As you requested, I uploaded my working compiled tools [1], without
>> >> any XML sources; it is about 113 MB. But if you have working tools on
>> >> your system, you just have to replace
>> >> host-tools/offline-renderer/ArticleParser.py with the one attached to
>> >> this mail, and you can forget about crying like a child whose ice
>> >> cream has fallen on the floor when, after more than 24 hours and
>> >> hundreds of thousands of articles parsed, you see an ugly Python
>> >> error backtrace instead of your desired file :)
>> >>
>> >> By the way, faultyarticles.txt is saved in the same
>> >> host-tools/offline-renderer directory. (I was too lazy to add a
>> >> parameter to change that, so I hardcoded the file name. Yes... don't
>> >> waste typing on correcting that bad habit, I know.)
>> >>
>> >> If you are curious about which articles in the German wiki are
>> >> causing trouble in dewiki-latest-pages-articles.xml (dated 2009-11-20):
>> >>
>> >> ~Storck Bicycle
>> >> 832673
>> >> ~Musculus serratus posterior inferior
>> >> 857334
>> >>
>> >> Regards. I hope to upload the German Wikipedia on Sunday, so it will
>> >> be available on Monday. Sorry for the wait, but my Asymmetric DSL is
>> >> very asymmetric, and uploading 1.5-2 GB (expected file size) will
>> >> take a bunch of hours.
>> >>
>> >> For those who want to compile their own, go for it :) The
>> >> Quickreference in the doc directory of the source is all you need to
>> >> start working. Just remember that if you have a 64-bit system you
>> >> will have to follow the 64-bit method to compile the tools.
>> >>
>> >> Regards
>> >> [1] http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
>> >> David Reyes Samblas Martinez
>> >>
>> >> 2009/11/27 Thomas HOCEDEZ <[email protected]>:
>> >>> Thomas HOCEDEZ wrote:
>> >>>>
>> >>>> Hi David,
>> >>>>
>> >>>> Can you share your scripts & configs to do the same in French (and
>> >>>> other languages)?
>> >>>> Thanks
>> >>>>
>> >>>> Thomas
>> >>>
>> >>> As the mailing list seems to be broken (or users have started
>> >>> hibernating for the winter...), I found out by myself how to compile
>> >>> things step by step.
>> >>> I'm now rendering the French Wikipedia. As it started a few minutes
>> >>> ago, the result should be available during the weekend (I hope).
>> >>>
>> >>> I'll also post how I managed to do it! (I'm at the office for now,
>> >>> and I'm leaving...)
>> >>>
>> >>> Regards to you all!
>> >>>
>> >>> Thomas
>> >>>
>> >> _______________________________________________
>> >> Openmoko community mailing list
>> >> [email protected]
>> >> http://lists.openmoko.org/mailman/listinfo/community
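
For anyone reading this thread later, here is a minimal sketch of the skip-and-log workaround discussed above, combined with Tim's point about UTF-8 titles. It is not the code from the attached ArticleParser.py, which is not shown in this thread. It assumes Python 3, and the names `parse_all`, `fake_parse` and `demo`, as well as the tab-separated log format, are made up for illustration. `position` here is simply the index in the input list, not necessarily what the real tool records. The log is written directly from Python as UTF-8 rather than through a shell `echo`, which is a swapped-in approach and not the one-line fix quoted above.

```python
import io
import os
import tempfile


def parse_all(articles, parse, log_path):
    """Parse every (title, text) pair; skip and log the ones that fail.

    Returns a (parsed_count, failed_count) tuple.
    """
    parsed = 0
    failed = 0
    for position, (title, text) in enumerate(articles):
        try:
            parse(title, text)
            parsed += 1
        except Exception:
            failed += 1
            # Append the title as UTF-8 from Python itself, so quotes or
            # non-ASCII characters in the title cannot break the logging.
            with io.open(log_path, "a", encoding="utf-8") as log:
                log.write(u"%s\t%d\n" % (title, position))
    return parsed, failed


def demo():
    # Hypothetical stand-in for the real parser: fails on a marker string.
    def fake_parse(title, text):
        if "{{broken" in text:
            raise ValueError("unexpected formatting tag")

    articles = [
        (u"Storck Bicycle", u"{{broken"),
        (u"België", u"fine"),
        (u"Zoë \"quoted\"", u"{{broken"),
    ]
    log_path = os.path.join(tempfile.mkdtemp(), "fault_articles.txt")
    counts = parse_all(articles, fake_parse, log_path)
    with io.open(log_path, encoding="utf-8") as log:
        return counts, log.read().splitlines()
```

Calling `demo()` runs three made-up articles through the loop; the two that raise are skipped and written to the log, and the run carries on with the rest instead of stopping at the first error.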

