Re: [Ldsoss] parsing lds.org utf-8 encoded website

Shane Hathaway Sun, 25 Sep 2005 23:18:35 -0700

Bryan Murdock wrote:

All right, let's see if this list can really talk software :-)


I'm writing a little python app to parse the lesson manuals that the
church provides online, in particular this one:

http://library.lds.org/nxt/gateway.dll/Curriculum/aaronic%20priesthood.htm/ap3.htm?f=templates$fn=document-frame.htm$3.0$q=$x

I parse it with python's HTMLParser and then hand off some text from
the page to another module that inserts it between some xml tags to
send to backpackit.com's cool API service.  It then parses the xml
with xml.dom.minidom and spits out an error saying the xml is not well
formed.

Looking closer, it turns out that the html is utf-8 encoded, and
lesson 5 uses some fancy quotation marks that are causing the problem.

I don't know much about all this unicode stuff, other than I've heard
that xml is supposed to be real strict about this kind of stuff.  I've
tried using the built-in unicode function to convert the string to
utf-8 at different points in my code, but with no luck.  Any ideas?

The HTML declares that it's encoded using UTF-8, but the document is actually encoded using a Windows character set. I'll paste what I typed into the Python interpreter:


>>> url = \
... 'http://library.lds.org/nxt/gateway.dll/Curriculum/aaronic%20priesthood.htm/ap3.htm?f=templates$fn=document-frame.htm$3.0$q=$x'
>>> import urllib2
>>> f = urllib2.urlopen(url)
>>> raw = f.read()
>>> data = raw.decode('cp1252')
>>> data[3035]
u'\u201c'
>>> fixed_raw = data.encode('utf8')

Character U+201C is the character you want, according to a handy reference I just found:


http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
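
For anyone who wants to try the same idea without fetching the page, here is a minimal sketch in current Python 3 syntax (the session above is Python 2). It assumes the bytes really are cp1252; the helper name cp1252_to_utf8 and the sample <note> snippet are made up for illustration and are not taken from the lesson manual or the backpackit.com API. It shows that xml.dom.minidom rejects the raw bytes and accepts them once they have been decoded as cp1252 and re-encoded as UTF-8.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def cp1252_to_utf8(raw):
    """Decode bytes as Windows-1252 (cp1252) and re-encode them as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


# Hypothetical sample: 0x93 and 0x94 are the cp1252 bytes for the curly
# double quotes. They are not valid UTF-8 on their own.
raw = b"<note>\x93quoted text\x94</note>"

# With no encoding declaration the parser assumes UTF-8, so the raw
# cp1252 bytes are reported as not well-formed.
try:
    minidom.parseString(raw)
    failed_before = False
except ExpatError:
    failed_before = True

# After re-encoding, the same markup parses cleanly.
fixed = cp1252_to_utf8(raw)
doc = minidom.parseString(fixed)
text = doc.documentElement.firstChild.data

print(failed_before)
print(repr(text))
```

If the source really were UTF-8, decoding it as cp1252 would produce the wrong characters instead, so this only makes sense once you have confirmed the declared encoding is wrong, as with data[3035] above.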

Shane
_______________________________________________
Ldsoss mailing list
[email protected]
http://lists.ldsoss.org/mailman/listinfo/ldsoss
