2012/3/28 Bináris <[email protected]>

>
> Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in
> section title will also appear the same. Is there any way to decide if
> .C3.A1 stands for *á* or for .C3.A1? I guess the likelihood of someone
> writing a literal .C3.A1 into the section title is very small, so this
> question may be theoretical, but I am a theoretical man. :-)
>
>
While this was a theoretical problem, I created a practical one. There are
characters with a shorter code, such as the quotation mark (.22) and
parentheses (.28, .29).
Have a look at this section title:
http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_2012.03.22
You will see that the first two .22's are encoded quotation marks, while the
last one is a literal .22 that is part of a date (Hungarian date order is
yyyy. mm. dd.). I simply don't see any way for a bot to tell them apart,
short of looking up all the section titles in question (as well as anchor
templates) and trying to make a reverse match. So this is something very easy
to spoil and almost hopeless to correct.

*:-(*
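
To make the ambiguity concrete, here is a minimal Python sketch. The
function names (legacy_anchor, naive_decode, reverse_match) are made up for
this illustration and are not part of pywikipedia, and the encoder is only
an approximation of the dot-encoding seen in the link above (spaces become
underscores, the text is percent-encoded as UTF-8, then "%" is replaced by
"."). It shows two different titles giving the same anchor, a naive decoder
mangling the date, and the reverse match against known titles.

```python
import re
from urllib.parse import quote, unquote


def legacy_anchor(title: str) -> str:
    """Approximate the dot-encoded anchor of a section title (assumed rules)."""
    # Spaces -> underscores, percent-encode as UTF-8 (keeping ':'), '%' -> '.'
    return quote(title.replace(" ", "_"), safe=":").replace("%", ".")


def naive_decode(anchor: str) -> str:
    """Blindly treat every '.XX' hex pair as an encoded byte (lossy!)."""
    return unquote(re.sub(r"\.([0-9A-F]{2})", r"%\1", anchor))


def reverse_match(anchor: str, known_titles: list[str]) -> list[str]:
    """Return the known section titles whose encoding equals the anchor."""
    return [t for t in known_titles if legacy_anchor(t) == anchor]


anchor = ".22D.C3.A1tum.22:_2012.03.22"

# Two different titles collapse to the same anchor:
quoted_title = '"Dátum": 2012.03.22'       # real quotation marks
literal_title = ".22Dátum.22: 2012.03.22"  # literal ".22" in the text

# The only safe direction is encoding candidate titles and comparing:
candidates = reverse_match(anchor, ["Bevezetés", quoted_title])
```

Here naive_decode(anchor) turns the ".03" and the trailing ".22" of the date
into a control character and a quotation mark, which is exactly the kind of
damage a bot would do. reverse_match only helps when the real section titles
of the target page are available, and it still cannot choose if both
colliding titles exist on that page.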

-- 
Bináris
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
