Hello Malik,

On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>
> In the docs, it says:
>
>> TextResponse objects adds encoding capabilities to the base Response
>> class, which is meant to be used
>> only for binary data, such as images, sounds or any media file.
>
> I understood this to mean that the base Response class is meant to be used
> only for binary data. However, I also read:
>

Correct. This line is taken from the official docs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse).

>> TextResponse Objects are used for binary data such as images, sounds etc
>> which has the ability to encode the base Response class.
>
> https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>
> which is of course exactly the opposite of how I interpreted it. Would
> someone here please clarify? Thanks.
>

This line is not from the official docs, and I believe it is neither correct nor clear.

> As additional background, I am scraping text, not photos or media files.
> So it makes sense to me that something called TextResponse would be
> intended for use with text, but I didn't write it, so I don't know. That's
> why I am asking for clarification.
>
> Ordinarily, when I download, it is a bytes object which I then have to
> convert to unicode. If I can set it up to come to me as unicode in the
> first place,
> that would save me a step and be great. But that leads me to my second
> question: How exactly are we supposed to implement TextResponse?
>

The Scrapy framework instantiates the correct Response class or subclass and passes it as the argument to your spider callbacks. If the framework receives an HTML or XML response, it creates an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.

Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)

The distinction between a plain, raw Response and a TextResponse is really that, on a TextResponse, you can call .xpath() and .css() directly, without needing to create a Selector explicitly. XPath and CSS selectors only make sense for markup such as HTML or XML. That's why .xpath() and .css() work on HtmlResponse and XmlResponse instances (they inherit them from TextResponse) but not on a plain Response.

ALL Responses, TextResponse or not, come with the raw body received from the server, which is accessible via the .body attribute. response.body gives you raw bytes.

What TextResponse adds here is a .text attribute that contains the raw body as a Unicode string, decoded with the detected encoding of the page. response.text is a Unicode string. response.text is NOT available on a non-TextResponse.

> I am in 100% agreement with the OP here:
> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>
> and I don't think he (or I) got a sufficient answer.
>
>> HTML pages are the most common response types spiders deal with, and their
>> class is HtmlResponse, which inherits from TextResponse, so you can use all
>> its features.
>
> Well, if that's so, then TextResponse would be the default and we'd get
> back unicode strings, right? But that's not what happens. We get byte
> strings.
>

As Pablo said in https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users , you usually (but not always) expect HTML back from a Request, and one usually writes callbacks with this assumption. With this assumption, you rarely need to bother about the raw bytes or the encoding: you trust Scrapy and use response.css() or response.xpath(). And if you need to access the (decoded) Unicode content, you use response.text.

If your callbacks can, for some (maybe valid) reason, receive responses of mixed types, that is, responses that are NOT always text (an image, a zip file, etc.), then you can test the response type with isinstance() and use response.body to get the raw bytes if you need them.

> And despite the answer found there, it is not at all clear how we can use
> these response subclasses if we are told the middleware does it all
> automatically, as if we aren't
> supposed to worry about it. If that were so, why tell us about, or even
> have - the subclass at all?
>

As I mentioned above, one usually writes spider callbacks for a specific type of Response, and it's usually HtmlResponse.

You can still work with non-TextResponse objects in Scrapy if you need to.

One area where the type is more important is middlewares. These are generic components and may need to handle different types of responses (or skip processing if the type is not the one they are supposed to work on). You may not need to write your own middlewares, but if you do, you can have a look at Scrapy's source code, for example AjaxCrawlMiddleware:
https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38

> Here's an error I got: TypeError: TextResponse url must be str, got list:
> The list the error is referring to is my start_urls variable that I've
> been using without issue until I tried to use TextResponse. So if we can't
> use a list, are we supposed to only feed it
> one url at a time? Manually?
>
> Your patient, thorough, and detailed explanation of these issues is
> greatly appreciated.
>

I hope this explained the different response types clearly enough. If not, feel free to ask.

Cheers,
/Paul.