Now, my questions are: Is this approach OK? If not, what is the best way to achieve this?
How can I pass a custom *reason* to loseConnection(reason)? I tried to send something like reason = failure.Failure(ConnectionAborted()), but I do not receive it in connectionLost<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L145>.
How can we change the chunk size when receiving data? (This may be a question related to Twisted.)

Thanks.

On Friday, March 7, 2014 3:42:15 PM UTC+2, Gheorghe Chirica wrote:
>
> Hi.
>
> Recently I have been working on a small crawler which checks a site and
> reports the status (and other info) of all its URLs.
>
> My initial idea was to make a GET request for all HTML resources and a HEAD
> request for resources other than HTML.
>
> The problem in this case is that some servers do not implement HEAD
> requests (I noticed this on URLs to Facebook and Twitter), and I get a
> TimeoutError.
>
> Note that I can have the same issue not only with plain HTML pages, but
> also with other assets.
>
> My next idea was to make a GET request instead of HEAD. But in this case I
> don't need the resource body for assets (images, JS, CSS).
>
> So I need to make a GET request but ask only for a small chunk of data,
> which will include the headers, and then close the connection. There is
> no need to download a 10 MB file if I only need its status (200, 301).
>
> Now, from theory to code. I checked the Scrapy code related to downloading
> requests.
> ScrapyAgent<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L41>
> is responsible for downloading pages via
> download_request<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L64>.
>
> The code responsible for receiving data from the socket is
> dataReceived<https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L142>.
>
> Here I plugged in some logic which closes the connection after the first
> received chunk:
>
>     if allowed_mimetype:
>         self._txresponse._transport._producer.loseConnection()
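
P.S. To make the second question more concrete, here is a minimal sketch of a fallback I am considering: instead of relying on the reason passed to loseConnection() reaching connectionLost(), the receiver remembers its own reason before closing. This is plain Python with no Twisted or Scrapy imports. FakeTransport, HeadersOnlyReceiver, BodyNotNeeded and wants_body are hypothetical names of mine, and the fake transport only imitates the behaviour I observed (a generic reason arriving in connectionLost); it is not how the real transport is implemented.

```python
class BodyNotNeeded(Exception):
    """Hypothetical marker: we closed the connection on purpose."""


class FakeTransport:
    """Stand-in for the real transport (assumption, not Twisted code)."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.closed = False

    def loseConnection(self):
        self.closed = True
        # Imitates what I observe: a generic reason arrives, not my own.
        self.protocol.connectionLost(Exception("connection closed"))


class HeadersOnlyReceiver:
    """Receiver that closes after the first chunk when the body is unwanted."""

    def __init__(self, wants_body):
        self.wants_body = wants_body
        self.transport = None
        self.chunks = []
        self.abort_reason = None  # set by us just before closing
        self.final_reason = None  # what we finally treat as the reason

    def makeConnection(self, transport):
        self.transport = transport

    def dataReceived(self, data):
        self.chunks.append(data)
        if not self.wants_body and self.abort_reason is None:
            # Remember why we are closing, then ask the transport to close.
            self.abort_reason = BodyNotNeeded("status and headers are enough")
            self.transport.loseConnection()

    def connectionLost(self, reason):
        # Prefer our stored reason over whatever the transport reports.
        if self.abort_reason is not None:
            self.final_reason = self.abort_reason
        else:
            self.final_reason = reason


receiver = HeadersOnlyReceiver(wants_body=False)
receiver.makeConnection(FakeTransport(receiver))
receiver.dataReceived(b"first chunk")
print(type(receiver.final_reason).__name__)
```

With this, connectionLost() can tell a deliberate close from a real failure without depending on what the transport passes in. I do not know whether this is better than getting the custom reason through loseConnection() itself.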
