Hi all,

I am using mwlib 0.12.12, and I am running into a situation where the
xhtml writer is producing invalid characters. An example of a command
which produces invalid output is:

mw-render -x -c :en -w xhtml --output output.xml Jesus

In output.xml you will find the following:

<a class="mwx.link.article" href="http://en.wikipedia.org/w/index.php?
title=Got:%F0%90%8C%B9%F0%90%8C%B4%F0%90%8D%83%F0%90%8C%BF%F0%90%8D%83_
%F0%90%8D%87%F0%90%8D%82%F0%90%8C%B9%F0%90%8D%83%F0%90%8D%84%F0%90%8C
%BF%F0%90%8D
%83">got:&#55296;&#57145;&#55296;&#57140;&#55296;&#57155;&#55296;&#57151;&#55296;&#57155;
&#55296;&#57159;&#55296;&#57154;&#55296;&#57145;&#55296;&#57155;&#55296;&#57156;&#55296;&#57151;&#55296;&#57155;</a>

Notice that the link text consists of a number of surrogate pairs
(e.g. &#55296;&#57145;), each of which has been written out as two
separate character references. According to this link:
http://unicode.org/faq/utf_bom.html#utf8-4
I believe this output may be in error. For example, the two references
&#55296;&#57145; should actually have been the single character &#66361;.
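
To show how I arrived at &#66361;, here is a minimal sketch of the surrogate-pair arithmetic. It assumes Python 3, and the function name combine_surrogates is my own, not anything from mwlib. It also cross-checks the result against the first percent-encoded character in the href above.

```python
from urllib.parse import unquote


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %d" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %d" % low)
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(55296, 57145)
print("&#%d;" % code_point)  # prints &#66361;

# Cross-check: the href encodes the same character as UTF-8 bytes.
assert unquote("%F0%90%8C%B9") == chr(code_point)
```

So the href and the link text seem to describe the same character (U+10339), and only the link text has it split into two references.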

A related question about mw-render is why this link appears in the
output at all. It appears to be a link to the Gothic-language wiki
page about Jesus. Is there an option that would exclude wiki pages in
other languages from being processed? Also, where does mw-render get
this list of links? Is it possible that Wikipedia is serving them
incorrectly, rather than the xhtml writer writing them incorrectly?

Thanks for any help you can provide,
Jeremy
