Hi all, I am using mwlib 0.12.12, and I am running into a situation where the xhtml writer is producing invalid characters. An example of a command which produces invalid output is:

  mw-render -x -c :en -w xhtml --output output.xml Jesus

In output.xml you will find the following:

  <a class="mwx.link.article" href="http://en.wikipedia.org/w/index.php?title=Got:%F0%90%8C%B9%F0%90%8C%B4%F0%90%8D%83%F0%90%8C%BF%F0%90%8D%83_%F0%90%8D%87%F0%90%8D%82%F0%90%8C%B9%F0%90%8D%83%F0%90%8D%84%F0%90%8C%BF%F0%90%8D%83">got:���������� ��������������</a>

Notice that the link text consists of a number of surrogate pairs (i.e. ��), each of which has been split into 2 separate characters. According to this link:

  http://unicode.org/faq/utf_bom.html#utf8-4

I believe this output may be in error. For example, the two characters �� should actually have been the single character 𐌹.

A related question about mw-render is why this link appears in the output at all. It appears to be a link to the Gothic-language wiki page about Jesus. Is there an option that would exclude wiki pages in other languages from being processed?

Also, where does mw-render get this list of links? Is it possible that Wikipedia is serving them incorrectly, rather than the parser writing them incorrectly?

Thanks for any help you can provide,
Jeremy
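
For anyone who wants to check the encoding claim above, here is a minimal Python 3 sketch. It decodes the first percent-encoded letter from the href and shows that it is a single code point (U+10339), then shows what the bytes look like if each half of the surrogate pair is encoded to UTF-8 on its own. The helper names are hypothetical, and the "broken" bytes are only an assumption about what the invalid output might contain; this has not been checked against what mwlib actually writes.

```python
from urllib.parse import unquote_to_bytes

# First percent-encoded letter from the href in the message above.
encoded = "%F0%90%8C%B9"

raw = unquote_to_bytes(encoded)   # b'\xf0\x90\x8c\xb9', one 4-byte UTF-8 sequence
char = raw.decode("utf-8")        # a single code point, U+10339

# The same code point expressed as a UTF-16 surrogate pair.
utf16 = char.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")   # 0xD800
low = int.from_bytes(utf16[2:], "big")    # 0xDF39


def combine_surrogates(high, low):
    """Hypothetical helper: map a high/low surrogate pair to its code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_well_formed(text):
    """Hypothetical helper: True if text contains no lone surrogate code points."""
    return not any(0xD800 <= ord(c) <= 0xDFFF for c in text)


# ASSUMPTION: the invalid output encodes each surrogate half separately,
# giving two 3-byte sequences instead of one 4-byte sequence.
broken = "\ud800\udf39".encode("utf-8", "surrogatepass")

try:
    broken.decode("utf-8")        # a strict UTF-8 decoder rejects these bytes
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# One way to repair such bytes: read the halves back, then let UTF-16 pair them.
halves = broken.decode("utf-8", "surrogatepass")
repaired = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

If the bytes in output.xml fail a strict UTF-8 decode in the same way, that would suggest the split happens when the text is written out rather than in the data Wikipedia serves, although this sketch alone cannot confirm that.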
