Dear Paul, thank you for the explanation. I'm not sure I understand, to be honest, but let me try a few things and see if it gets clearer. If not, I'll be back.
On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>
> Hello Malik,
>
> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>
>> In the docs, it says:
>>
>>> TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
>>
>> I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:
>
> Correct. This line is taken from the official docs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse)
>
>>> TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class.
>>
>> https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>
>> which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.
>
> This line is not from the official docs. And I believe it is neither correct nor clear.
>
>> As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.
>>
>> Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse?
>
> The Scrapy framework will instantiate the correct Response class or subclass and pass it as argument to your spider callbacks.
>
> If the framework receives an HTML or XML response, it will create an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.
>
> Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)
>
> The distinction between a plain, raw Response and TextResponse is really that, on TextResponses, you can call .xpath() and .css() on them directly, without the need to create Selector explicitly.
>
> XPath and CSS selectors only make sense for HTML or XML. That's why .xpath() and .css() are only available on HtmlResponse and XmlResponse instances.
>
> ALL Responses, TextResponse or not, come with the raw body received from the server, and which is accessible via the .body attribute.
> response.body gives you raw bytes.
>
> What TextResponse adds here is a .text attribute that contains the Unicode string of the raw body, as decoded with the detected encoding of the page.
> response.text is a Unicode string.
>
> response.text is NOT available on non-TextResponse.
>
>> I am in 100% agreement with the OP here:
>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>
>> and I don't think he (or I) got a sufficient answer.
>>
>>> HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.
>>
>> Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.
>
> As Pablo said in https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users, usually (but not always), you expect HTML back from Request.
>
> And one usually writes callbacks with this assumption.
> And with this assumption, you rarely need to bother about the raw bytes or encoding: you trust scrapy and use response.css() or response.xpath()
>
> And if you need to access the (decoded) unicode content, you use response.text
>
> If your callbacks can, for some (maybe valid) reason, receive responses that are of mixed type, that is that they are NOT always text (such as image, zip file etc.), then you can test the response type with isinstance() and you use response.body to get the raw bytes if you need.
>
>> And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't supposed to worry about it. If that were so, why tell us about, or even have - the subclass at all?
>
> As I mention above, one usually writes spider callbacks for a specific type of Response, and it's usually for HtmlResponse.
> But you can totally work with non-TextResponse in Scrapy, if you need it.
>
> One area where the type is more important is middlewares.
> These are generic components and may need to handle different types of responses (or skip processing if the type is not the one it's supposed to work on).
>
> You may not need to write your own middlewares, but if you do, you can have a look at scrapy's source code;
> For example AjaxCrawlMiddleware
> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>
>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>> The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it one url at a time? Manually?
>>
>> Your patient, thorough, and detailed explanation of these issues is greatly appreciated.
>
> I hope this explained the different response type clearly enough.
> If not, feel free to ask.
>
> Cheers,
> /Paul.