Hi.

I've recently been working on a small crawler that checks a site and 
reports the status (and other info) of all its URLs.

My initial idea was to make a GET request for all HTML resources and a HEAD 
request for everything other than HTML.
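
As a rough illustration of that split, here is a minimal sketch that picks 
the method from the URL alone. The helper name choose_method and the idea of 
guessing the type from the file extension are assumptions made for the 
example, not something the crawler necessarily does; a URL with no 
recognizable extension is treated as a page.

```python
import mimetypes
from urllib.parse import urlsplit

# MIME types treated as "HTML pages" in this sketch (an assumption).
HTML_TYPES = ("text/html", "application/xhtml+xml")


def choose_method(url):
    """Return "GET" for URLs that look like HTML pages, "HEAD" otherwise."""
    # Guess from the path only, so a query string does not hide the extension.
    path = urlsplit(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    # No recognizable extension (e.g. "/" or "/about") is treated as a page.
    if guessed is None or guessed in HTML_TYPES:
        return "GET"
    return "HEAD"
```

Guessing from the extension is only a heuristic: the real type is known 
only once the Content-Type header of the response arrives.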

The problem is that some servers do not implement HEAD requests (I noticed 
this with URLs pointing to Facebook and Twitter), and I get a TimeoutError.

Note that I can run into the same issue not only with plain HTML pages, but 
also with other assets.

My next idea was to make a GET request instead of HEAD. But in that case I 
don't need the response body for assets (images, JS, CSS).

So I need to make a GET request but somehow receive only a small chunk of 
data, enough to include the headers, and then close the connection. There is 
no need to download a 10 MB file if I only need its status (200, 301).
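
Outside of Scrapy, the same idea can be sketched with the standard library 
only. This is not how Scrapy does it, just a minimal illustration under my 
own assumptions: the function name fetch_status and the 10 second timeout 
are made up for the example. http.client parses the status line and headers 
in getresponse() and leaves the body unread, so closing right after that 
skips the download (apart from whatever bytes the OS has already buffered).

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """GET `url`, return (status, headers), and close without reading the body."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target)
        # getresponse() reads the status line and headers only;
        # the body stays unread unless resp.read() is called.
        resp = conn.getresponse()
        try:
            # dict() keeps one value per header name, enough for a status check.
            return resp.status, dict(resp.getheaders())
        finally:
            resp.close()  # drop the response without draining the body
    finally:
        conn.close()
```

Usage would look like status, headers = fetch_status("http://example.com/big.zip"), 
where the URL is a placeholder. Redirects are not followed, so a 301 is 
reported as 301.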

Now, from theory to code. I checked the Scrapy code related to downloading 
requests. ScrapyAgent 
<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L41> 
is responsible for downloading pages via download_request 
<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L64>.

The code responsible for receiving data from the socket is dataReceived 
<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L142>. 
Here I plugged in some logic that closes the connection after the first 
received chunk:

    if allowed_mimetype:
        self._txresponse._transport._producer.loseConnection()

