This pull request (to limit response size) got merged today in trunk:
https://github.com/scrapy/scrapy/pull/946

Perhaps you can use that or extend that functionality to truncate
responses to certain sizes (patches welcome!).
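For reference, a minimal sketch of how that limit is configured, assuming
the DOWNLOAD_MAXSIZE / DOWNLOAD_WARNSIZE setting names that PR introduces
(the values below are just example thresholds, not defaults):

```python
# settings.py -- limit how much response body Scrapy will download.
# Setting names taken from the PR above; values are only examples.

# Abort downloads whose body grows past 1 MB.
DOWNLOAD_MAXSIZE = 1024 * 1024

# Log a warning for bodies larger than 256 KB.
DOWNLOAD_WARNSIZE = 256 * 1024
```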

On Fri, Mar 7, 2014 at 11:48 AM, Gheorghe Chirica <[email protected]> wrote:

> Now, my questions are:
>
> Is this approach OK? If not, what is the best way to achieve this?
>
> How can I send a custom *reason* to loseConnection(reason)? I tried
> to send something like reason = failure.Failure(ConnectionAborted())
> but I do not receive it in connectionLost
> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L145>
> .
>
> How can we change the chunk size when receiving data (this may be a
> question related to Twisted)?
>
>
> Thx.
>
>
> On Friday, March 7, 2014 3:42:15 PM UTC+2, Gheorghe Chirica wrote:
>>
>> Hi.
>>
>> Recently I have been working on a small crawler which checks a site
>> and reports the status (and other info) of all its URLs.
>>
>> My initial idea was to make a GET request for all HTML resources and
>> a HEAD request for resources other than HTML.
>>
>> The problem is that some servers do not implement the HEAD method (I
>> noticed this on URLs to Facebook and Twitter), so I get a
>> TimeoutError.
>>
>> Note that I can have the same issue not only with plain HTML pages,
>> but also with other assets.
>>
>> My next idea was to make a GET request instead of HEAD. But in that
>> case I don't need the resource body for assets (images, JS, CSS).
>>
>> So I need a way to make a GET request but fetch only a small chunk
>> of data, which includes the headers, and then close the connection.
>> There is no need to download a 10 MB file if I only need its status
>> (200, 301, ...).
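As a side note, the method-selection part of this (GET for HTML, HEAD for
assets, falling back to GET when a server ignores HEAD) can be sketched in
plain Python; the extension list and helper names below are mine, purely
for illustration:

```python
from urllib.parse import urlparse
import posixpath

# Illustrative guess at which extensions are "assets" we only
# want a status code for; tune to your crawl.
ASSET_EXTENSIONS = {".png", ".jpg", ".gif", ".js", ".css", ".pdf"}

def pick_method(url):
    """Return 'HEAD' for asset-like URLs, 'GET' otherwise."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return "HEAD" if ext in ASSET_EXTENSIONS else "GET"

def fallback_method(method, got_timeout):
    """If a HEAD request timed out (server ignores HEAD), retry with GET."""
    if method == "HEAD" and got_timeout:
        return "GET"
    return method
```

The fallback keeps the happy path cheap (HEAD) while still recovering a
status code from servers like the ones mentioned above.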
>>
>> Now, from theory to code. I checked the Scrapy code related to
>> downloading requests. ScrapyAgent
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L41>
>> is responsible for downloading pages via download_request
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L64>
>> .
>>
>> The code responsible for receiving data from the socket is
>> dataReceived
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L142>.
>> Here I plugged in some logic which closes the connection after the
>> first received chunk:
>>
>> if allowed_mimetype:
>>     self._txresponse._transport._producer.loseConnection()
>>
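A simplified, standalone sketch of the byte-counting approach the merged PR
takes, which could replace the hack above: count the bytes seen in
dataReceived() and drop the connection once a limit is exceeded. The class
and attribute names below are illustrative, not Scrapy's actual internals.

```python
class TruncatingReader:
    """Collects response body chunks, closing the connection past a limit.

    ``transport`` only needs a ``loseConnection()`` method; ``maxsize``
    is the byte budget for the body.  Illustrative sketch, not Scrapy code.
    """

    def __init__(self, transport, maxsize):
        self._transport = transport
        self._maxsize = maxsize
        self._bytes_received = 0
        self.body = []

    def dataReceived(self, chunk):
        self._bytes_received += len(chunk)
        if self._bytes_received > self._maxsize:
            # Over budget: stop buffering and close the connection.
            self._transport.loseConnection()
            return
        self.body.append(chunk)
```

With maxsize set very low, this degenerates into "close after the first
chunk", which is the status-only behaviour wanted above.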
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
