Re: [mwlib] mw-render surrogate pair problem

Ralf Schmitt Mon, 04 Jan 2010 01:57:08 -0800

Jeremy <[email protected]> writes:

> Notice that the link text consists of a number of surrogate pairs
> (i.e. &#55296;&#57145;) which have been broken into 2 separate
> characters. According to this link: http://unicode.org/faq/utf_bom.html#utf8-4
> I believe this output may be in error. For example, the two characters
> &#55296;&#57145; should actually have been a single character &#66361;


I think you need a wide-unicode build of Python to fix this issue. Python
can be configured to use either UCS-2 or UCS-4 internally to store unicode
strings, and a UCS-2 (narrow) build represents every character above
U+FFFF as a surrogate pair. Passing "--enable-unicode=ucs4" to Python's
configure script should do the trick. Unfortunately, you'll also have to
recompile all your C extension modules.
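
To make the arithmetic concrete, here is a minimal sketch (not mwlib code;
the names combine_surrogates and is_narrow_build are made up for this
example) that joins a surrogate pair into one code point and checks which
kind of build is running. Note that since Python 3.3 there is no
narrow/wide distinction any more, so sys.maxunicode is always 0x10FFFF
there.

```python
import sys

# UTF-16 surrogate ranges as defined by the Unicode standard.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def combine_surrogates(high, low):
    """Return the code point encoded by a high/low surrogate pair."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("not a high surrogate: %#x" % high)
    if not LOW_START <= low <= LOW_END:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def is_narrow_build():
    """True on a UCS-2 (narrow) build, where sys.maxunicode is 0xFFFF."""
    return sys.maxunicode == 0xFFFF


# The pair from the report: &#55296;&#57145; -> &#66361;
print(combine_surrogates(55296, 57145))
print("narrow build:", is_narrow_build())
```

On a narrow build a single character such as U+10339 has length 2, which
is why the two halves end up being written out separately.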

>
> A related question about mw-render is why this link appears in the
> output at all. It appears to be a link to the gothic language wiki
> page about Jesus. Is there an option that would exclude wiki pages in
> other languages from being processed? Also, where does mw-render get
> this list of links? Is it possible that wikipedia is serving them
> incorrectly, rather than the parser writing them incorrectly?
>

Yes, the interwikimap is somehow cached by Wikipedia, so new entries take
some time until they are returned by api.php. (I think
https://bugzilla.wikimedia.org/show_bug.cgi?id=19838 describes that
issue.)

In your case Wikipedia tries to return the "got" language (see
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=interwikimap),
but unfortunately pads those values with garbage (see
https://bugzilla.wikimedia.org/show_bug.cgi?id=21818).
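
If you want to look at what the API hands back, something along these
lines may help. It is only a sketch: the helper names and the SAMPLE
response are invented for illustration, format=json is added to the query
above, and I am assuming the usual siteinfo layout in which
language-wiki entries carry a "language" key. It does not show how mwlib
itself reads the map.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the URL quoted above.
API_URL = "http://en.wikipedia.org/w/api.php"


def interwikimap_url(api_url=API_URL):
    """Build the siteinfo query for the interwiki map, asking for JSON."""
    params = [
        ("action", "query"),
        ("meta", "siteinfo"),
        ("siprop", "interwikimap"),
        ("format", "json"),
    ]
    return api_url + "?" + urlencode(params)


def language_prefixes(response_text):
    """Return the sorted prefixes of entries that carry a language name."""
    data = json.loads(response_text)
    entries = data.get("query", {}).get("interwikimap", [])
    return sorted(e["prefix"] for e in entries if "language" in e)


# Hypothetical, hand-written response; real output has many more entries.
SAMPLE = json.dumps({
    "query": {
        "interwikimap": [
            {"prefix": "got", "url": "http://got.wikipedia.org/wiki/$1",
             "language": "Gothic"},
            {"prefix": "commons",
             "url": "http://commons.wikimedia.org/wiki/$1"},
        ]
    }
})

print(interwikimap_url())
print(language_prefixes(SAMPLE))
```

Fetching interwikimap_url() and feeding the body to language_prefixes()
would show which prefixes the server currently treats as language links.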

- ralf
