On 28 March 2012 13:20, Bináris <[email protected]> wrote:
> While this was a theoretical problem, I created a practical one. There are
> characters with a shorter code, such as quotation mark (.22) and parentheses
> (.28, .29).
> Have a look at this section title:
> http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_2012.03.22
> You will see that the first two .22's (marked here with red, excuse me if
> this causes a problem for someone) are encoded quotation marks, while the
> last (blue) one is a literal .22 as part of a date (Hungarian date order is
> yyyy. mm. dd.). I simply don't see any chance to tell the difference by bot
> unless searching for all section titles in question (as well as anchor
> templates) and trying to make a reverse match. So this is something very easy
> to spoil and almost hopeless to correct.

Another example is listed in the bug report at:
http://sourceforge.net/tracker/?func=detail&group_id=93107&atid=603138&aid=2989218
 - #802.11n becomes Page{[[Page#IEEE 802n]]} (because \x11 is a
non-printable character).

However, I think we might be able to work around these two
problems, as only characters outside of ASCII are escaped /and/ the
result has to be a correct UTF-8 string. Again: check the MediaWiki source.
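
A rough sketch of that heuristic, in case it helps the discussion. It is
not the pywikipedia implementation: decode_anchor and _decode_run are
hypothetical names, and it assumes escapes are written as a dot followed
by two uppercase hex digits. It decodes a run of .XX escapes only where
the bytes form a valid non-ASCII UTF-8 sequence, and leaves everything
else (such as .22 or .11) exactly as it was, so the ambiguous ASCII
cases are not resolved, only left unharmed.

```python
import re

# Runs of dot escapes such as ".C3.A1" (uppercase hex is an assumption).
ESCAPE_RUN = re.compile(r'(?:\.[0-9A-F]{2})+')


def _decode_run(match):
    """Decode only the parts of one escape run that are valid non-ASCII UTF-8."""
    run = match.group(0)
    # ".C3.A1" -> ["C3", "A1"] -> [0xC3, 0xA1]
    values = [int(h, 16) for h in run[1:].split('.')]
    out = []
    i = 0
    while i < len(values):
        lead = values[i]
        # Sequence length implied by the UTF-8 lead byte; 0 means "not a lead".
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0
        chunk = bytes(values[i:i + length])
        if length and len(chunk) == length:
            try:
                out.append(chunk.decode('utf-8'))
                i += length
                continue
            except UnicodeDecodeError:
                pass
        # ASCII or invalid byte: keep the original ".XX" text untouched.
        out.append('.%02X' % lead)
        i += 1
    return ''.join(out)


def decode_anchor(anchor):
    """Hypothetical helper: undo dot escapes only for valid non-ASCII UTF-8."""
    return ESCAPE_RUN.sub(_decode_run, anchor)
```

With Bináris's section title, decode_anchor turns .C3.A1 into "á" but
keeps all three .22's, and "802.11n" passes through unchanged.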

Best,
Merlijn

_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
