Thanks. The wide-unicode build is only half the story, I think. I was
previously using Python 2.5.1, which shipped with Mac OS X 10.5. After
compiling Python 2.6.4 with --enable-unicode=ucs4, I encountered this
error:
Traceback (most recent call last):
  File "/usr/local/bin/mw-render", line 8, in <module>
    load_entry_point('mwlib==0.12.12', 'console_scripts', 'mw-render')()
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/apps/render.py", line 210, in main
    return Main()()
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/apps/render.py", line 173, in __call__
    writer(env, output=tmpout, status_callback=self.status, **writer_options)
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 713, in xhtmlwriter
    writer(env, status_callback=scb).writeBook(book, output=output)
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 258, in writeBook
    output.write(self.asstring())
  File "/usr/local/lib/python2.6/site-packages/mwlib-0.12.12-py2.6-macosx-10.4-i386.egg/mwlib/xhtmlwriter.py", line 178, in asstring
    res = self.header + ET.tostring(self.root)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 1009, in tostring
    ElementTree(element).write(file, encoding)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 663, in write
    self._write(file, self._root, encoding, {})
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
    self._write(file, n, encoding, namespaces)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 698, in _write
    _escape_attrib(v, encoding)))
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 822, in _escape_attrib
    return _encode_entity(text)
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 792, in _encode_entity
    return _encode(pattern.sub(escape_entities, text), "ascii")
  File "/usr/local/lib/python2.6/xml/etree/ElementTree.py", line 751, in _encode
    return s.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-8: ordinal not in range(128)
I was able to fix this problem by editing the asstring method in
xhtmlwriter.py to default to "utf-8" encoding as suggested (but not
committed) in this ticket: http://code.pediapress.com/wiki/ticket/583
I re-opened the ticket to discuss getting the fix into the codebase.
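
For anyone hitting the same traceback, here is a minimal sketch of the
idea behind that change. It is not the actual patch from the ticket. It
assumes a current Python, and asstring below is a hypothetical stand-in
(the real method lives in xhtmlwriter.py and also prepends a header). The
only point is that passing an explicit encoding to ET.tostring makes it
emit UTF-8 bytes instead of falling back to its ASCII default.

```python
import xml.etree.ElementTree as ET

# U+10339 GOTHIC LETTER EIS, a character outside the Basic Multilingual Plane.
GOTHIC_EIS = u"\U00010339"


def asstring(root, encoding="utf-8"):
    """Hypothetical stand-in for the xhtmlwriter method: serialize with an
    explicit encoding instead of relying on ET.tostring's ASCII default."""
    return ET.tostring(root, encoding=encoding)


# Build a tiny tree with a non-BMP character in an attribute value, the
# same place (_escape_attrib) where the traceback above ended up.
root = ET.Element("a", {"title": GOTHIC_EIS})
data = asstring(root)

# Parsing the serialized bytes back should give the original attribute.
roundtrip = ET.fromstring(data).get("title")
```

If the round trip returns the same string that went in, the character
survived serialization without being split or dropped.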
- Jeremy
On Jan 4, 1:56 am, Ralf Schmitt wrote:
> Jeremy writes:
> > Notice that the link text consists of a number of surrogate pairs
> > (i.e. ��) which have been broken into 2 separate
> > characters. According to this
> > link: http://unicode.org/faq/utf_bom.html#utf8-4
> > I believe this output may be in error. For example, the two characters
> > �� should actually have been a single character 𐌹
>
> I think you need to use a wide-unicode build to fix this issue. Python
> can be configured to use ucs-2 or ucs-4 internally to store unicode
> strings. Passing "--enable-unicode=ucs4" to python's configure script
> should do the trick. Unfortunately you'll also have to recompile all
> your c extension modules.
>
>
>
> > A related question about mw-render is why this link appears in the
> > output at all. It appears to be a link to the gothic language wiki
> > page about Jesus. Is there an option that would exclude wiki pages in
> > other languages from being processed? Also, where does mw-render get
> > this list of links? Is it possible that wikipedia is serving them
> > incorrectly, rather than the parser writing them incorrectly?
>
> Yes, the interwikimap is somehow cached by Wikipedia. New entries take
> some time until they are returned by api.php. (I think
> https://bugzilla.wikimedia.org/show_bug.cgi?id=19838 describes that
> issue.)
>
> In your case they try to return the "got" language
> (see http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=i...),
> but unfortunately pad those values with garbage
> (see https://bugzilla.wikimedia.org/show_bug.cgi?id=21818).
>
> - ralf
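
To make the narrow versus wide build distinction in the quoted reply
concrete, here is a small sketch. sys.maxunicode is the standard way to
tell the two builds apart. The helper names (is_wide_build,
to_surrogate_pair, from_surrogate_pair) are made up for illustration and
are not part of mwlib. The arithmetic is the UTF-16 surrogate mapping,
which is how a narrow (UCS-2) build stores code points above U+FFFF.

```python
import sys


def is_wide_build():
    """True when this interpreter stores code points above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10339 (the Gothic letter from the example) splits into D800 and DF39.
pair = to_surrogate_pair(0x10339)
restored = from_surrogate_pair(*pair)
```

On a narrow build, len(u"\U00010339") is 2 because the string holds the
two surrogates separately; on a wide build it is 1. Python 3.3 and later
no longer have this distinction, so is_wide_build() is always true there.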