You're welcome!
Happy scraping
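
For the archives, here is a minimal sketch of the problem and the fix from the quoted messages below. readlines() keeps the line ending on every entry, so each User-Agent value ended in "\n", which most likely made the request headers malformed and triggered the 400 responses. The helper name load_agents, the temporary file standing in for agents.txt, and the sample agent strings are all made up for illustration. Skipping blank lines is an extra assumption on top of the suggested a.strip().

```python
import os
import tempfile


def load_agents(path):
    """Read one user agent per line, dropping line endings and blank lines."""
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical sample data standing in for agents.txt
sample = "Mozilla/5.0 (X11; Linux x86_64)\nMozilla/5.0 (Windows NT 6.1)\n\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "r") as f:
        raw_agents = f.readlines()  # every entry still ends with "\n"
    clean_agents = load_agents(path)  # line endings and blank lines removed
finally:
    os.remove(path)

print(repr(raw_agents[0]))
print(repr(clean_agents[0]))
```

Comparing the two repr() lines shows the trailing newline that never appears when the agents are written as a list directly in the module.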

On Friday, March 21, 2014 12:20:55 PM UTC+1, James Ford wrote:
>
> Seems to be the issue m8 :).
>
> Thanks, you saved my day!
>
> On Friday, March 21, 2014 12:14:23 PM UTC+1, Paul Tremberth wrote:
>>
>> Could it be that readlines() leaves the \n at the end of each line?
>> Try with:
>>
>> self.agents = [a.strip() for a in f.readlines()]
>>
>> or similar
>>
>> On Friday, March 21, 2014 12:09:11 PM UTC+1, James Ford wrote:
>>>
>>> Sure,
>>>
>>> Below you will find a crawl of http://doc.scrapy.org with a depth of 1 
>>> and extraction of inlinks only.
>>>
>>> http://pastebin.com/wE292pQe
>>>
>>> As you can see from the stats the status 200 count is only 13. This is 
>>> not the case if I put my agent-list directly in my module or if I disable 
>>> my middleware.
>>>
>>> Thanks
>>>
>>> On Friday, March 21, 2014 10:55:26 AM UTC+1, Paul Tremberth wrote:
>>>>
>>>> Can you share logs? 
>>>>
>>>> On Fri, Mar 21, 2014 at 10:53 AM, James Ford <[email protected]> wrote:
>>>> > Hello, 
>>>> > 
>>>> > I'm having an odd issue with one of my projects. 
>>>> > 
>>>> > I have implemented a custom middleware that rotates the user agent
>>>> > for each request.
>>>> > 
>>>> > When the middleware is initialized, it reads a file and puts the
>>>> > contents of the file into a list (in memory).
>>>> > 
>>>> > As far as I can tell this should work fine, but I am getting a large
>>>> > number of 400 Bad Request responses in my crawls. The odd thing is
>>>> > that it works fine if I just put the agents in a list directly
>>>> > instead of reading them from the file.
>>>> > 
>>>> > What can cause this error? Here is my middleware: 
>>>> > 
>>>> > import os
>>>> > from scrapy import log
>>>> >
>>>> > class UserAgentPool():
>>>> >     def __init__(self): 
>>>> >         basepath = os.path.dirname(__file__) 
>>>> >         filepath = os.path.abspath(os.path.join(basepath, "agents.txt"))
>>>> >         with open(filepath, 'r') as f: 
>>>> >             self.agents = f.readlines() 
>>>> > 
>>>> >     def rotate(self): 
>>>> >         log.msg("Rotating user agent", level=log.DEBUG) 
>>>> >         agent = self.agents.pop(0) 
>>>> >         log.msg("Agent popped %s" %agent, level=log.DEBUG) 
>>>> >         log.msg("[%s]" % ", ".join(map(str, self.agents)), level=log.DEBUG)
>>>> >         self.agents.append(agent) 
>>>> >         return agent 
>>>> > 
>>>> > class UserAgentRotationMiddleware(object): 
>>>> >     def __init__(self): 
>>>> >         self.pool = UserAgentPool() 
>>>> > 
>>>> >     def process_request(self, request, spider): 
>>>> >         if getattr(spider, 'agent_rotation', None): 
>>>> >             agent = self.pool.rotate() 
>>>> >             request.headers.setdefault('User-Agent', agent) 
>>>> >             log.msg("Setting User-Agent to %s" % request.headers["User-Agent"])
>>>> > 
>>>> > 
>>>> > -- 
>>>> > You received this message because you are subscribed to the Google 
>>>> Groups 
>>>> > "scrapy-users" group. 
>>>> > To unsubscribe from this group and stop receiving emails from it, 
>>>> send an 
>>>> > email to [email protected]. 
>>>> > To post to this group, send email to [email protected]. 
>>>> > Visit this group at http://groups.google.com/group/scrapy-users. 
>>>> > For more options, visit https://groups.google.com/d/optout. 
>>>>
>>>

