Re: Custom ScrapyAgent which allow to filter response by content length

Gheorghe Chirica Fri, 07 Mar 2014 05:48:28 -0800

Now, my questions are:

Is this approach OK? If not, what is the best way to achieve this?


How can I pass a custom *reason* to loseConnection(reason??)? I tried to 
send something like reason = failure.Failure(ConnectionAborted()), but I do 
not receive it in 
connectionLost<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L145>.

How can we change the chunk size when receiving data (this may be a question 
related to Twisted)?


Thx.


On Friday, March 7, 2014 3:42:15 PM UTC+2, Gheorghe Chirica wrote:
>
> Hi.
>
> I have recently been working on a small crawler which checks a site and 
> reports the status (and other info) of all its URLs.
>
> My initial idea was to make a GET request to all HTML resources and a HEAD 
> request to resources other than HTML.
>
> The problem in this case is that some servers do not implement the HEAD 
> request (I noticed this on URLs to Facebook and Twitter), and I get a 
> TimeoutError.
>
> Note that I can have the same issue not only with plain html pages, but 
> also with other assets.
>
> My next idea was to make a GET request instead of HEAD. But in this case I 
> don't need to get the resource body for assets (images, JS, CSS).
>
> So I need to make a GET request, but receive only a small chunk of data 
> that includes the headers, and then close the connection. There is no need 
> to download a 10 MB file if I only need its status (200, 301).
>
> Now, from theory to code. I checked the Scrapy code related to downloading 
> requests. 
> ScrapyAgent<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L41>
> is responsible for downloading pages via 
> download_request<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L64>.
>
> The code responsible for receiving data from the socket is 
> dataReceived<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L142>.
>
> Here I plugged in some logic which closes the connection after the first 
> received chunk:
>
> if allowed_mimetype:
>
>     self._txresponse._transport._producer.loseConnection() 
>
>
>
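
For illustration, the idea in the quoted message (send a GET, read only the 
status line and headers, then drop the connection without downloading the 
body) can be sketched outside Scrapy and Twisted with the Python standard 
library. This is a swapped-in technique, not the ScrapyAgent code discussed 
above; the function name probe and its return shape are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, timeout=10.0):
    """Send a GET, read only the status line and headers, then close.

    Returns (status, content_type, content_length). The body is never read
    by this function. `probe` is a hypothetical helper, not a Scrapy API.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    resp = None
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        # getresponse() parses the status line and headers only.
        resp = conn.getresponse()
        length = resp.getheader("Content-Length")
        return (resp.status,
                resp.getheader("Content-Type"),
                int(length) if length is not None else None)
    finally:
        # Close without calling resp.read(), so the body is not downloaded.
        if resp is not None:
            resp.close()
        conn.close()
```

A call such as probe("http://example.com/big.png") would then give the 
status and the advertised type and length. Note that the operating system 
may already have buffered the first part of the body by the time the socket 
is closed, so this limits the transfer rather than reducing it to exactly 
the header bytes.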
