Bryan Murdock wrote:
All right, let's see if this list can really talk software :-)

I'm writing a little python app to parse the lesson manuals that the
church provides online, in particular this one:

http://library.lds.org/nxt/gateway.dll/Curriculum/aaronic%20priesthood.htm/ap3.htm?f=templates$fn=document-frame.htm$3.0$q=$x

I parse it with python's HTMLParser and then hand off some text from
the page to another module that inserts it between some xml tags to
send to backpackit.com's cool API service.  It then parses the xml
with xml.dom.minidom and spits out an error saying the xml is not well
formed.

Looking closer, it turns out that the html is utf-8 encoded, and
lesson 5 uses some fancy quotation marks that are causing the problem.

I don't know much about all this unicode stuff, other than I've heard
that xml is supposed to be real strict about this kind of stuff.  I've
tried using the built-in unicode function to convert the string to
utf-8 at different points in my code, but with no luck.  Any ideas?

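The HTML declares that it's encoded using UTF-8, but the document is actually encoded using a Windows character set (cp1252, also known as Windows-1252). Decoding the raw bytes as cp1252 and then re-encoding them as UTF-8 gives you what the XML parser expects. I'll paste what I typed into the Python interpreter: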

>>> url = \
'http://library.lds.org/nxt/gateway.dll/Curriculum/aaronic%20priesthood.htm/ap3.htm?f=templates$fn=document-frame.htm$3.0$q=$x'
>>> import urllib2
>>> f = urllib2.urlopen(url)
>>> raw = f.read()
>>> data = raw.decode('cp1252')
>>> data[3035]
u'\u201c'
>>> fixed_raw = data.encode('utf8')

Character U+201C (left double quotation mark) is the character you want, according to a handy reference I just found:

http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
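
If it helps, here is a rough sketch of the same repair wrapped in a function, so that pages that really are UTF-8 still pass through untouched. It is written in Python 3 syntax rather than the Python 2 session above, and the function name to_text, the <note> tag, and the sample bytes are made up for illustration; they are not from your code or from the backpackit.com API.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


def to_text(raw, declared="utf-8", fallback="cp1252"):
    """Decode bytes with the declared encoding; fall back if they are not valid.

    to_text is a hypothetical helper, not part of any library.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # The page lied about its encoding; assume the Windows code page.
        return raw.decode(fallback)


# Made-up sample: bytes 0x93 and 0x94 are the curly double quotes in cp1252,
# and they are not valid on their own in UTF-8.
raw = b"He said, \x93Follow me.\x94"
text = to_text(raw)

# Escape markup characters, wrap the text in a tag, and encode as UTF-8
# before handing it to minidom.
xml_bytes = ("<note>%s</note>" % escape(text)).encode("utf-8")
doc = minidom.parseString(xml_bytes)
```

The fallback only triggers when the bytes fail to decode as UTF-8, so it is a guess about the real encoding, not a detection of it. If some pages use yet another character set, this would decode them without error but produce the wrong characters.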

Shane