Hi Travis,

Thanks for the advice, it worked. Now I am able to scrape the page.
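
For anyone who finds this thread later, here is a minimal sketch of the kind of user agent middleware Travis suggested. It is not necessarily the exact change made for this site. The class name BrowserUserAgentMiddleware, the BROWSER_USER_AGENTS list, and the UA strings themselves are placeholders chosen for illustration. It relies only on Scrapy's documented downloader middleware hook, process_request(request, spider).

```python
import random

# Hypothetical browser-like User-Agent strings, used only for illustration.
# Replace them with strings taken from a real, current browser.
BROWSER_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
]


class BrowserUserAgentMiddleware(object):
    """Downloader middleware that sends a browser-like User-Agent header."""

    def __init__(self, user_agents=None):
        # Fall back to the illustrative list above when none is supplied.
        self.user_agents = list(user_agents or BROWSER_USER_AGENTS)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request as usual.
        return None
```

To try it, the class would be listed under DOWNLOADER_MIDDLEWARES in settings.py, for example {"myproject.middlewares.BrowserUserAgentMiddleware": 543}, where the module path and the priority 543 are assumptions for this example. If a single fixed UA string is enough, setting USER_AGENT in settings.py is the simpler route.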

I have posted questions on this forum before without getting any helpful replies, so I assumed it was inactive and did not expect an answer this time either. Thanks to you, my problem is resolved. Thanks a lot.

Gaurang Shah
Blog: qtp-help.blogspot.com
Mobile: +91 738756556

On Thu, Mar 5, 2015 at 9:33 PM, Travis Leleu <[email protected]> wrote:

> Sounds like the site is detecting that you're scraping and trying to
> prevent it. I'd suggest looking into user agent middlewares to mimic a
> browser UA string.
>
> On Mar 5, 2015 1:41 AM, "Gaurang shah" <[email protected]> wrote:
>
>> Hi Guys,
>>
>> I am trying to scrape a website with Scrapy. The problem is that whenever
>> I request the page I need to scrape, I am redirected to another page. If I
>> visit the same page manually in the browser, it is not redirected, and the
>> response code is 200.
>>
>> With Scrapy, however, the request is redirected and I see the response
>> code 302.
>>
>> This is the page I am trying to scrape:
>> http://www.lonmark.org/membership/directory/partners
>>
>> The Scrapy log shows the following entries:
>> 2015-03-05 15:08:36+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/membership/directory/partners>
>> 2015-03-05 15:08:37+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>
>> 2015-03-05 15:08:37+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>
>> 2015-03-05 15:08:41+0530 [lonamrk] DEBUG: Redirecting (302) to <GET http://www.lonmark.org/sitemap> from <GET http://www.lonmark.org/sitemap>
>>
>> Following is the code.
>>
>> import scrapy
>>
>>
>> class Spider(scrapy.Spider):
>>     name = "lonamrk"
>>     allowed_domains = ["lonmark.org"]
>>     # Request.meta = {'dont_redirect': True,
>>     #                 'handle_httpstatus_list': [302]}
>>
>>     start_urls = ["http://www.lonmark.org/membership/directory/partners"]
>>
>>     def parse(self, response):
>>         print(response.url)
>>         # Extract the href of each company link as a plain string.
>>         company_links = response.xpath(
>>             "//*[@id='page_content']/table/tbody/tr[1]/td[1]/a/@href").extract()
>>         for link in company_links:
>>             # parse_company_info is the callback for company pages (not shown).
>>             yield scrapy.Request(
>>                 "http://www.lonmark.org/membership/directory/" + link,
>>                 callback=self.parse_company_info)
>>
>> If I uncomment the commented-out lines to stop the redirection, I get
>> nothing in the response body.
>>
>> Would someone please help me with what to do?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
