Retrieving a URL based on the text inside a tag, when the page uses a different encoding

Raf Roger Sun, 18 Sep 2016 11:02:18 -0700

Hi,

For testing purposes, I wrote a small Scrapy script that should find the next 
page based on the text of an <a> element.
It works on UTF-8 pages, but I run into a problem with a web page encoded in 
"windows-1250", while my Scrapy script (and its text, by default) is written 
in UTF-8.


Let's have a look at: https://www.vsetkyfirmy.sk/autoskoly

The pagination at the bottom displays "Next page" in the local language, 
"Ďalšie >>" (linking to https://www.vsetkyfirmy.sk/autoskoly/strana_2.html), 
and I would like to retrieve the complete URL of this <a>. On the first page 
that is https://www.vsetkyfirmy.sk/autoskoly/strana_2.html, on page 2 it is 
https://www.vsetkyfirmy.sk/autoskoly/strana_3.html, and so on.

However, this web page is encoded in "windows-1250" while my Scrapy script 
uses UTF-8, and I'm not sure how to handle that. I use the following code to 
retrieve the <a> URL:
t = Selector(response).xpath('//*[text()[contains(., "Ďalšie")]]/@href').extract()

But when I run it, Scrapy says:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

So what should I do to achieve what I want?

thx
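
A minimal sketch of the underlying idea, assuming Python 3 and only the standard library (so it is not the Scrapy code above): decode the windows-1250 bytes to Unicode first, then match the link text as a Unicode string. The LinkByText class, the sample HTML and its href are hypothetical and made up for illustration. The ValueError quoted above is the message lxml gives when it receives a byte string containing non-ASCII characters, so under Python 2 passing the XPath as a Unicode literal (u'...') is probably the relevant change, though that is an assumption about the setup here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkByText(HTMLParser):
    """Collect the href of every <a> whose text contains a given string."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []    # text chunks seen inside that <a>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            if self.needle in "".join(self.current_text):
                self.matches.append(self.current_href)
            self.current_href = None


# Hypothetical sample: the bytes a windows-1250 server might send.
sample = '<p><a href="/autoskoly/strana_2.html">Ďalšie &gt;&gt;</a></p>'
raw = sample.encode("windows-1250")

# Decode with the page's own encoding; from here on everything is Unicode,
# so the encoding of the script's source file no longer matters.
html = raw.decode("windows-1250")

parser = LinkByText("Ďalšie")
parser.feed(html)
parser.close()

# Turn the (possibly relative) hrefs into complete URLs.
base_url = "https://www.vsetkyfirmy.sk/autoskoly"
next_urls = [urljoin(base_url, href) for href in parser.matches]
print(next_urls)
```

Comparing raw bytes would not work here, because the UTF-8 bytes of "Ďalšie" differ from its windows-1250 bytes; comparing decoded text avoids that mismatch.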

