Tim Arnold wrote:
> "?????? ???????????" <gdam...@gmail.com> wrote in message 
> news:ciqh56-ses....@archaeopteryx.softver.org.mk...
>> So, I'm using lxml to screen-scrape a site that uses the Cyrillic
>> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
>> ..content-type.. charset=..> header, but does have an HTTP header that
>> specifies the charset... so they are standards compliant enough.
>>
>> Now when I run this code:
>>
>> from lxml import html
>> doc = html.parse('http://a1.com.mk/')
>> root = doc.getroot()
>> title = root.cssselect('head title')[0]
>> print title.text
>>
>> the title.text is a unicode string, but it has been wrongly decoded as
>> latin1 -> unicode
> 
> The way I do that is to open the file with codecs, encoding=cp1251, read it
> into a variable and feed that to the parser.

Yes, if you know the encoding through an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in

        tree = lxml.html.parse( codecs.open(..., encoding='...') )
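
A minimal sketch of the decoded-string variant against the original URL (the
hard-coded windows-1251 charset is an assumption taken from the HTTP header
mentioned above; a more careful script would read it out of the Content-Type
response header instead):

        import urllib2
        from lxml import html

        # Fetch the raw bytes and decode them ourselves, using the charset
        # the server announces in its HTTP header (hard-coded for brevity).
        raw = urllib2.urlopen('http://a1.com.mk/').read()
        doc = html.fromstring(raw.decode('windows-1251'))
        print doc.cssselect('head title')[0].text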

You can also create a parser with an encoding override:

        parser = etree.HTMLParser(encoding='...', **other_options)
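
For the windows-1251 case from the original post, that might look roughly like
this (findtext() is used instead of cssselect() only to keep the example
independent of which element class the parser produces):

        from lxml import etree, html

        # Tell libxml2 up front which encoding to assume, overriding its
        # own (wrong) guess for pages without a <meta> charset.
        parser = etree.HTMLParser(encoding='windows-1251')
        doc = html.parse('http://a1.com.mk/', parser=parser)
        print doc.getroot().findtext('.//title')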

Stefan