Jeremy <[email protected]> writes:

> Notice that the link text consists of a number of surrogate pairs
> (i.e. ��) which have been broken into 2 separate
> characters. According to this link:
> http://unicode.org/faq/utf_bom.html#utf8-4
> I believe this output may be in error. For example, the two characters
> �� should actually have been a single character 𐌹
I think you need a wide-unicode build to fix this issue. Python can be configured to use either UCS-2 or UCS-4 internally to store unicode strings. Passing "--enable-unicode=ucs4" to Python's configure script should do the trick. Unfortunately, you'll also have to recompile all your C extension modules.

> A related question about mw-render is why this link appears in the
> output at all. It appears to be a link to the gothic language wiki
> page about Jesus. Is there an option that would exclude wiki pages in
> other languages from being processed? Also, where does mw-render get
> this list of links? Is it possible that wikipedia is serving them
> incorrectly, rather than the parser writing them incorrectly?

Yes, the interwikimap is somehow cached by Wikipedia. New entries take some time until they are returned by api.php. (I think https://bugzilla.wikimedia.org/show_bug.cgi?id=19838 describes that issue.) In your case they try to return the "got" language (see http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=interwikimap), but unfortunately pad those values with garbage (see https://bugzilla.wikimedia.org/show_bug.cgi?id=21818).

- ralf
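PS: a rough sketch for checking which kind of build you have, and for seeing how a surrogate pair maps to a single code point. The helper names is_wide_build and combine_surrogates are made up for illustration and are not part of mwlib. The check relies on sys.maxunicode, which is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a wide (UCS-4) build. The example pair 0xD800/0xDF39 is what I would expect for 𐌹 (U+10339), going by the UTF-16 rules in the Unicode FAQ linked above.

```python
import sys


def is_wide_build():
    # Narrow (UCS-2) builds report 0xFFFF here, wide (UCS-4) builds 0x10FFFF.
    return sys.maxunicode > 0xFFFF


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two integers) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print("wide build: %s" % is_wide_build())
    # Assumed pair for U+10339 (GOTHIC LETTER EIS).
    print(hex(combine_surrogates(0xD800, 0xDF39)))
```

If is_wide_build() returns False, characters outside the Basic Multilingual Plane are stored as two code units, which would explain the split characters in the output.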
