Thanks. The wide-unicode build is only half the story, I think. I was
previously using Python 2.5.1, which shipped with Mac OS X 10.5. After
compiling Python 2.6.4 with --enable-unicode=ucs4, I encountered this
error:

Traceback (most recent call last):
  File "/usr/local/bin/mw-render", line 8, in <module>
    load_entry_point('mwlib==0.12.12', 'console_scripts', 'mw-render')()
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/apps/render.py", line 210, in main
    return Main()()
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/apps/render.py", line 173, in __call__
    writer(env, output=tmpout, status_callback=self.status, **writer_options)
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 713, in xhtmlwriter
    writer(env, status_callback=scb).writeBook(book, output=output)
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 258, in writeBook
    output.write(self.asstring())
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 178, in asstring
    res = self.header + ET.tostring(self.root)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 1009, in tostring
    ElementTree(element).write(file, encoding)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 663, in write
    self._write(file, self._root, encoding, {})
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 698, in _write
    _escape_attrib(v, encoding)))
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 822, in _escape_attrib
    return _encode_entity(text)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 792, in _encode_entity
    return _encode(pattern.sub(escape_entities, text), "ascii")
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 751, in _encode
    return s.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-8: ordinal not in range(128)

I was able to fix this problem by editing the asstring method in
xhtmlwriter.py to default to "utf-8" encoding as suggested (but not
committed) in this ticket: http://code.pediapress.com/wiki/ticket/583
I re-opened the ticket to discuss getting the fix into the codebase.
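
A minimal sketch of that kind of change, assuming Python 3's
xml.etree.ElementTree rather than the 2.6 module in the traceback. The
asstring function and HEADER constant below are hypothetical stand-ins
for illustration; they are not the actual mwlib code or the patch from
the ticket:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the header that xhtmlwriter prepends.
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def asstring(root, encoding="utf-8"):
    """Serialize root with an explicit encoding instead of the ASCII default."""
    return HEADER.encode(encoding) + ET.tostring(root, encoding=encoding)


# An attribute value containing a character from the Gothic block
# (U+10339), which lies outside the Basic Multilingual Plane.
root = ET.Element("a", {"title": "\U00010339"})
output = asstring(root)
```

With an explicit "utf-8" encoding the serializer writes the character
as UTF-8 bytes and never has to encode it as ASCII.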

- Jeremy


On Jan 4, 1:56 am, Ralf Schmitt <[email protected]> wrote:
> Jeremy <[email protected]> writes:
> > Notice that the link text consists of a number of surrogate pairs
> > (i.e. &#55296;&#57145;) which have been broken into 2 separate
> > characters. According to this link:
> > http://unicode.org/faq/utf_bom.html#utf8-4
> > I believe this output may be in error. For example, the two characters
> > &#55296;&#57145; should actually have been a single character &#66361;
>
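
The arithmetic behind the quoted claim can be checked directly. The
sketch below combines a UTF-16 surrogate pair into a single code point
using the standard UTF-16 decoding formula; combine_surrogates is a
name made up for illustration and is not part of mwlib:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair quoted above: &#55296;&#57145;
code_point = combine_surrogates(55296, 57145)
print(code_point)  # 66361, i.e. U+10339
```
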
> I think you need to use a wide-unicode build to fix this issue. Python
> can be configured to use UCS-2 or UCS-4 internally to store unicode
> strings. Passing "--enable-unicode=ucs4" to Python's configure script
> should do the trick. Unfortunately you'll also have to recompile all
> your C extension modules.
>
>
>
> > A related question about mw-render is why this link appears in the
> > output at all. It appears to be a link to the gothic language wiki
> > page about Jesus. Is there an option that would exclude wiki pages in
> > other languages from being processed? Also, where does mw-render get
> > this list of links? Is it possible that wikipedia is serving them
> > incorrectly, rather than the parser writing them incorrectly?
>
> Yes, the interwikimap is somehow cached by Wikipedia. New entries take
> some time until they are returned by api.php. (I think
> https://bugzilla.wikimedia.org/show_bug.cgi?id=19838 describes that
> issue.)
>
> In your case they try to return the "got" language (see
> http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=i...),
> but unfortunately pad those values with garbage (see
> https://bugzilla.wikimedia.org/show_bug.cgi?id=21818).
>
> - ralf
