Re: [Pywikipedia-l] Urlencoded section titles

Bináris Wed, 28 Mar 2012 01:22:38 -0700

(Reminder: this thread is on encoded section titles that are copied from
URL bar to wikitext rather than from wikitext to wikitext, and thus they
are pretty unreadable for humans.)


2011/8/10 Merlijn van Deen <[email protected]>

>
> So -- yes, the code is already there (as pwb is able to decode the section
> title, as indicated by the representation returned by Page.__repr__ or
> Page.__str__). However, title() seems to have some bug.
>
For me they still return encoded titles. :-(


>
> Oh, and I think the fix is already in cosmetic_changes.py, too. Check def
> cleanUpLinks (line 314).
>

Fail again. If you go to
http://hu.wikipedia.org/wiki/Mafia_II#T.C3.B6rt.C3.A9net and edit the
section, you will find an example in the 2nd paragraph:
[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban
.28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]

My code is:
>>> import wikipedia as p
>>> site=p.getSite('hu')
>>> import cosmetic_changes as cc
>>> bot=cc.CosmeticChangesToolkit(site)
>>> title=u'Mafia II'
>>> lap=p.Page(site,title)
>>> text=lap.get()
>>> text2=bot.cleanUpLinks(text)
>>> text==text2
True

Did I miss something?
I can solve the problem by listing the most frequent characters used in
huwiki, but I would prefer a nice, general solution.
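
A minimal sketch of what such a general solution might look like, written
for Python 3 and independent of pywikipedia. The helper names decode_anchor
and decode_section_links are hypothetical, not part of cosmetic_changes.py.
It assumes that every run of ".XX" pairs (uppercase hex) in a section link
is an escaped UTF-8 sequence, and it leaves a run untouched when it does not
decode as valid UTF-8. Underscores are not converted back to spaces here.

```python
import re
from urllib.parse import unquote

# One or more ".XX" pairs in a row, e.g. ".C3.A1" or ".28".
_HEX_RUN = re.compile(r'(?:\.[0-9A-F]{2})+')
# [[target#section]] or [[target#section|label]] (no nested brackets).
_LINK = re.compile(r'\[\[([^\[\]|#]*)#([^\[\]|]+)((?:\|[^\[\]]*)?)\]\]')


def decode_anchor(anchor):
    """Hypothetical helper: turn '.C3.A1'-style escapes back into text."""
    def repl(match):
        raw = match.group(0).replace('.', '%')
        try:
            # errors='strict' so that invalid UTF-8 raises instead of
            # silently producing replacement characters.
            return unquote(raw, errors='strict')
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8: keep the original
    return _HEX_RUN.sub(repl, anchor)


def decode_section_links(text):
    """Hypothetical helper: decode the section part of every wikilink."""
    def repl(match):
        target, section, label = match.groups()
        return '[[%s#%s%s]]' % (target, decode_anchor(section), label)
    return _LINK.sub(repl, text)
```

Applied to the link quoted above (joined onto one line),
decode_section_links() would rewrite the section part as
"Partraszállás Szicíliában (Huskey hadművelet)" and keep the link target and
the label unchanged.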

Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in a
section title will appear exactly the same. Is there any way to decide
whether .C3.A1 stands for *á* or for a literal .C3.A1? I guess the likelihood
of someone writing a literal .C3.A1 into a section title is very small, so
this question may be theoretical, but I am a theoretical man. :-)
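
To illustrate the ambiguity, here is a small Python 3 sketch. The function
legacy_anchor is a hypothetical approximation of the anchor encoding
discussed in this thread (percent-encode the UTF-8 bytes, then write "." in
place of "%"); it is not the actual MediaWiki code. Because "." itself is
never escaped, both inputs end up as the same anchor, so the encoded form
alone cannot tell them apart.

```python
from urllib.parse import quote


def legacy_anchor(title):
    """Hypothetical approximation of the dot-style anchor encoding."""
    # Spaces become underscores; '.' is left as-is by quote(), which is
    # exactly why a literal ".C3.A1" is indistinguishable from an encoded á.
    return quote(title.replace(' ', '_'), safe=':').replace('%', '.')


encoded_letter = legacy_anchor('á')
encoded_literal = legacy_anchor('.C3.A1')
print(encoded_letter == encoded_literal)
```

Deciding between the two readings would need extra information, for example
the list of real section headings on the target page.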

-- 
Bináris

_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
