If you have the whole page content accessible, you can try using
'Page.getSections()'; it might help.

Greetings
DrTrigon

On 28.03.2012 13:20, Bináris wrote:
> 2012/3/28 Bináris <[email protected]>
>
> > Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1
> > in a section title will also appear the same. Is there any way to
> > decide whether .C3.A1 stands for *á* or for .C3.A1? I guess the
> > likelihood of someone writing a literal .C3.A1 into a section
> > title is very small, so this question may be theoretical, but I am
> > a theoretical man. :-)
>
> While this was a theoretical problem, I created a practical one.
> There are characters with a shorter code, such as the quotation mark
> (.22) and parentheses (.28, .29). Have a look at this section
> title:
> http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_2012.03.22
>
> You will see that the first two .22's (marked here with red, excuse me
> if this causes a problem for someone) are encoded quotation marks,
> while the last (blue) one is a literal .22 that is part of a date
> (Hungarian date order is yyyy. mm. dd.). I simply don't see any
> chance of telling the difference by bot, other than searching for all
> section titles in question (as well as anchor templates) and trying
> to make a reverse match. So this is something very easy to spoil and
> almost hopeless to correct.
>
> *:-(*
>
> --
> Bináris
>
> _______________________________________________
> Pywikipedia-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
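
To make the ambiguity from the quoted message concrete, here is a
minimal Python sketch. It only approximates the anchor encoding as it
appears in this thread (UTF-8 bytes written as .XX, spaces as
underscores, ':' left alone); the real MediaWiki rules may differ in
details. The function names and the title list are hypothetical, and
nothing is assumed about what 'Page.getSections()' actually returns:
the titles are just plain strings here.

```python
import re
from urllib.parse import quote, unquote


def encode_anchor(title: str) -> str:
    """Approximate the dot-encoded section anchor seen in the thread."""
    underscored = title.strip().replace(" ", "_")
    # Percent-encode as UTF-8, keep ':' literal, then turn '%' into '.'.
    return quote(underscored, safe=":").replace("%", ".")


def naive_decode(anchor: str) -> str:
    """Treat every '.XX' hex pair as an encoded byte (wrong in general)."""
    percent_form = re.sub(r"\.([0-9A-F]{2})", r"%\1", anchor)
    return unquote(percent_form).replace("_", " ")


def match_section(anchor: str, section_titles: list[str]) -> list[str]:
    """Reverse match: encode each known title and compare with the anchor."""
    return [t for t in section_titles if encode_anchor(t) == anchor]


anchor = ".22D.C3.A1tum.22:_2012.03.22"

# Hypothetical section titles; in a bot they would come from the page text.
titles = ["Bevezetés", '"Dátum": 2012.03.22', "Lásd még"]

if __name__ == "__main__":
    # The literal ".03.22" of the date is mistaken for two encoded bytes.
    print(repr(naive_decode(anchor)))
    # Encoding the real titles and comparing finds the intended section.
    print(match_section(anchor, titles))
    # The theoretical case: both titles collapse to the same anchor.
    print(encode_anchor("á") == encode_anchor(".C3.A1"))
```

Decoding blindly damages the date, while encoding the known titles and
comparing recovers the right section. The last line shows why even the
reverse match cannot separate *á* from a literal .C3.A1: both titles
produce the same anchor.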
