So, I'm using lxml to screen-scrape a site that uses the Cyrillic alphabet (windows-1251 encoding). The site's HTML doesn't have a <meta http-equiv="Content-Type" content="..charset=.."> tag, but the server does send an HTTP header that specifies the charset... so they are standards-compliant enough.
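For reference, the charset a server advertises can be pulled out of the Content-Type header with the standard library alone. A minimal sketch, assuming the header value below matches what the site actually sends:

```python
from email.message import Message

# Carry the Content-Type header as received over HTTP.
# The exact value here is an assumption for illustration.
msg = Message()
msg['Content-Type'] = 'text/html; charset=windows-1251'

# get_content_charset() parses out the charset parameter (lowercased).
charset = msg.get_content_charset()
```

The resulting charset string is what the decoding step below should be using instead of a latin-1 fallback.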
Now when I run this code:

    from lxml import html

    doc = html.parse('http://a1.com.mk/')
    root = doc.getroot()
    title = root.cssselect('head title')[0]
    print title.text

title.text is a unicode string, but it has been wrongly decoded as latin-1 -> unicode. So... is this a deficiency/bug in lxml, or am I doing something wrong? Also, what are my other options here?

I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.

--
дамјан ( http://softver.org.mk/damjan/ )
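The mis-decoding described above can be reproduced without lxml or the network: bytes that are windows-1251-encoded Cyrillic, run through a latin-1 decode by mistake. A minimal sketch, using the word 'наслов' ('title' in Macedonian) as assumed sample text:

```python
# -*- coding: utf-8 -*-
# Reproduce the reported mojibake with the stdlib alone.
word = u'наслов'  # sample text, an assumption for illustration

raw = word.encode('windows-1251')   # the bytes the server sends
wrong = raw.decode('latin-1')       # what a latin-1 fallback produces
right = raw.decode('windows-1251')  # what the HTTP header calls for

# wrong comes out as u'íàñëîâ': each windows-1251 byte mapped
# one-to-one onto a Latin-1 codepoint instead of a Cyrillic one.
```

If the encoding is known up front, lxml can be told about it explicitly, e.g. html.parse(url, parser=html.HTMLParser(encoding='windows-1251')) -- though that keyword should be checked against the lxml version in use.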