(Reminder: this thread is about encoded section titles that were copied from the URL bar into wikitext rather than from wikitext to wikitext, and are therefore hard for humans to read.)
2011/8/10 Merlijn van Deen <[email protected]>

> So -- yes, the code is already there (as pwb is able to decode the section
> title, as indicated by the representation (returned by Page.__repr__ or
> Page.__str__). However, title() seems to have some bug.

For me they still return encoded titles. :-(

> Oh, and I think the fix is already in cosmetic_changes.py, too. Check def
> cleanUpLinks (line 314).

Fail again. If you go to
http://hu.wikipedia.org/wiki/Mafia_II#T.C3.B6rt.C3.A9net
and edit the section, you will find an example in the 2nd paragraph:

[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]

My code is:

    >>> import wikipedia as p
    >>> site=p.getSite('hu')
    >>> import cosmetic_changes as cc
    >>> bot=cc.CosmeticChangesToolkit(site)
    >>> title=u'Mafia II'
    >>> lap=p.Page(site,title)
    >>> text=lap.get()
    >>> text2=bot.cleanUpLinks(text)
    >>> text==text2
    True

Did I miss something? I could solve the problem by listing the characters most frequently used in huwiki, but I would prefer a nice and general solution.

Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in a section title will appear exactly the same. Is there any way to decide whether .C3.A1 stands for *á* or for a literal .C3.A1? I guess the likelihood of someone writing a literal .C3.A1 into a section title is very small, so this question may be theoretical, but I am a theoretical man. :-)

-- 
Bináris
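
To make the two questions above concrete, here is a minimal standalone sketch. It is not the pywikipedia code and does not use cleanUpLinks; decode_anchor and encode_anchor are hypothetical helper names, the sketch is written for Python 3 (the session above is Python 2), and encode_anchor is only an approximation of how MediaWiki builds these dot-encoded anchors (percent-encode, then swap "%" for "."). The decoder treats every run of .XX pairs as UTF-8 bytes and leaves the run alone if the bytes are not valid UTF-8, so it needs no per-wiki character list.

```python
import re
from urllib.parse import quote

# One or more ".XX" groups, where XX is an upper-case hex byte.
_ESCAPE_RUN = re.compile(r'(?:\.[0-9A-F]{2})+')


def decode_anchor(anchor):
    """Best-effort decoding of a dot-encoded section anchor (hypothetical helper)."""
    def _decode(match):
        # ".C3.A1" -> b'\xc3\xa1'
        raw = bytes.fromhex(match.group(0).replace('.', ''))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8, so this was probably literal text: keep it.
            return match.group(0)
    return _ESCAPE_RUN.sub(_decode, anchor)


def encode_anchor(title):
    """Approximation (an assumption) of the dot-encoding applied to anchors."""
    return quote(title.replace(' ', '_'), safe=':').replace('%', '.')


encoded = ('Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban '
           '.28Huskey hadm.C5.B1velet.29')
print(decode_anchor(encoded))

# The ambiguity: "." itself is never escaped, so both titles
# produce the same anchor and the encoding cannot be inverted.
print(encode_anchor('á') == encode_anchor('.C3.A1'))
```

Under these assumptions the answer to the second question is no: the anchor alone does not say which title it came from. A bot could only settle it by fetching the target page and checking which section title actually exists there.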
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
