So, I'm using lxml to screen-scrape a site that uses the Cyrillic 
alphabet (windows-1251 encoding). The site's HTML doesn't have a 
<meta http-equiv="Content-Type" ... charset=...> tag, but the HTTP 
response does carry a Content-Type header that specifies the 
charset... so they are standards-compliant enough.

Now when I run this code:

from lxml import html
doc = html.parse('http://a1.com.mk/')
root = doc.getroot()
title = root.cssselect('head title')[0]
print title.text

title.text is a unicode string, but it has been wrongly decoded as 
latin-1 instead of windows-1251.
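
For what it's worth, the mangling is easy to reproduce without lxml at all: 
encode some Cyrillic text as windows-1251 (what the server sends) and decode 
it as latin-1 (what the parser apparently assumes). The sample string here is 
just my name, not anything from the actual page:

```python
# -*- coding: utf-8 -*-
# Every byte value maps to *some* latin-1 character, so the wrong
# decode raises no error -- the text is just silently garbled.
raw = u'дамјан'.encode('windows-1251')

wrong = raw.decode('latin-1')       # garbled, but no exception
right = raw.decode('windows-1251')  # round-trips correctly

print(repr(wrong))
print(repr(right))
```

The silent part is what makes this nasty: nothing fails, the text is just wrong.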

So... is this a deficiency/bug in lxml, or am I doing something wrong?
Also, what are my other options here?
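
One workaround I've been looking at is forcing the encoding on the parser 
myself, since I already know it from the HTTP header. A minimal sketch, 
using a made-up page string rather than the real site:

```python
# -*- coding: utf-8 -*-
from lxml import html

# Hypothetical document, served as windows-1251 with no <meta> charset,
# standing in for the real page body.
raw = (u'<html><head><title>дамјан</title></head>'
       u'<body></body></html>').encode('windows-1251')

# Tell the parser the encoding explicitly (taken from the HTTP header),
# instead of letting it guess.
parser = html.HTMLParser(encoding='windows-1251')
doc = html.fromstring(raw, parser=parser)

title = doc.findtext('.//title')
print(title)
```

This sidesteps the guessing entirely, though it does mean fetching the page 
and reading the Content-Type header by hand rather than passing the URL 
straight to html.parse().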


I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.

-- 
дамјан ( http://softver.org.mk/damjan/ )

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

