You're welcome! Happy scraping.

On Friday, March 21, 2014 12:20:55 PM UTC+1, James Ford wrote:
>
> Seems to be the issue m8 :).
>
> Thanks, you saved my day!
>
> On Friday, March 21, 2014 12:14:23 PM UTC+1, Paul Tremberth wrote:
>>
>> Could it be that readlines() leaves the \n at the end of each line?
>> Try with
>>
>> self.agents = [a.strip() for a in f.readlines()]
>>
>> or similar.
>>
>> On Friday, March 21, 2014 12:09:11 PM UTC+1, James Ford wrote:
>>>
>>> Sure,
>>>
>>> Below you will find a crawl of http://doc.scrapy.org with a depth of 1
>>> and extraction of inlinks only.
>>>
>>> http://pastebin.com/wE292pQe
>>>
>>> As you can see from the stats, the status 200 count is only 13. This is
>>> not the case if I put my agent list directly in my module or if I
>>> disable my middleware.
>>>
>>> Thanks
>>>
>>> On Friday, March 21, 2014 10:55:26 AM UTC+1, Paul Tremberth wrote:
>>>>
>>>> Can you share logs?
>>>>
>>>> On Fri, Mar 21, 2014 at 10:53 AM, James Ford <[email protected]>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > I'm having an odd issue with one of my projects.
>>>> >
>>>> > I have implemented a custom middleware that rotates the user agent
>>>> > for each request.
>>>> >
>>>> > The middleware works by reading a file when the middleware is
>>>> > initialized and putting the contents of the file into a list (in
>>>> > memory).
>>>> >
>>>> > As far as I can tell this should work fine, but I am getting a large
>>>> > number of 400 Bad Request responses in my crawls. The odd thing is
>>>> > that it works fine if I just put the agents in a list directly
>>>> > instead of reading them from the file.
>>>> >
>>>> > What can cause this error?
>>>> > Here is my middleware:
>>>> >
>>>> > class UserAgentPool():
>>>> >     def __init__(self):
>>>> >         basepath = os.path.dirname(__file__)
>>>> >         filepath = os.path.abspath(os.path.join(basepath, "agents.txt"))
>>>> >         with open(filepath, 'r') as f:
>>>> >             self.agents = f.readlines()
>>>> >
>>>> >     def rotate(self):
>>>> >         log.msg("Rotating user agent", level=log.DEBUG)
>>>> >         agent = self.agents.pop(0)
>>>> >         log.msg("Agent popped %s" % agent, level=log.DEBUG)
>>>> >         log.msg("[%s]" % ", ".join(map(str, self.agents)), level=log.DEBUG)
>>>> >         self.agents.append(agent)
>>>> >         return agent
>>>> >
>>>> > class UserAgentRotationMiddleware(object):
>>>> >     def __init__(self):
>>>> >         self.pool = UserAgentPool()
>>>> >
>>>> >     def process_request(self, request, spider):
>>>> >         if getattr(spider, 'agent_rotation', None):
>>>> >             agent = self.pool.rotate()
>>>> >             request.headers.setdefault('User-Agent', agent)
>>>> >             log.msg("Setting User-Agent to %s" % request.headers["User-Agent"])
>>>> >
>>>> >
>>>> > --
>>>> > You received this message because you are subscribed to the Google
>>>> > Groups "scrapy-users" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> > send an email to [email protected].
>>>> > To post to this group, send email to [email protected].
>>>> > Visit this group at http://groups.google.com/group/scrapy-users.
>>>> > For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
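
For anyone finding this thread later: the fix suggested above comes down to the fact that readlines() keeps the trailing "\n" on every line, so each agent string read from agents.txt carried a newline into the User-Agent header. Below is a minimal sketch of a loader that strips it. The helper name load_agents and the sample agent strings are made up for illustration and are not part of the original middleware; skipping blank lines is an extra assumption on my part.

```python
import os
import tempfile


def load_agents(filepath):
    """Read one user agent per line, dropping newlines and blank lines."""
    with open(filepath, "r") as f:
        # strip() removes the trailing "\n" that readlines() would keep;
        # the "if" skips empty lines (an assumption, not in the original).
        return [line.strip() for line in f if line.strip()]


# Demonstration with a throwaway file; the contents are hypothetical.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Mozilla/5.0 (X11; Linux x86_64)\nExampleBot/1.0\n\n")
    demo_path = tmp.name

with open(demo_path, "r") as f:
    raw_agents = f.readlines()  # what the original __init__ stored

clean_agents = load_agents(demo_path)
os.remove(demo_path)

print(repr(raw_agents[0]))    # repr() makes the trailing newline visible
print(repr(clean_agents[0]))
```

In the middleware this would probably replace the `self.agents = f.readlines()` line, for example with `self.agents = load_agents(filepath)`. Logging the agent with repr() or %r instead of %s is a cheap way to spot this kind of stray whitespace in the debug output.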
