Re: TextReponse

Malik Rumi Sat, 27 May 2017 11:44:17 -0700

Dear Paul,
thank you for the explanation. I'm not sure I understand, to be honest, but 
let me try a few things and see if it gets clearer. If not, I'll be back.




On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>
> Hello Malik,
>
> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>
>> In the docs, it says:
>>
>>> TextResponse objects adds encoding capabilities to the base Response 
>>> class, which is meant to be used
>>> only for binary data, such as images, sounds or any media file.
>>
>>
>> I understood this to mean that the base Response class is meant to be 
>> used only for binary data. However, I also read:
>>
>>
> Correct. This line is taken from the official docs (
> https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse
> )
>  
>
>>> TextResponse Objects are used for binary data such as images, sounds etc 
>>> which has the ability to encode the base Response class. 
>>
>>
>>  https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>
>> which is of course exactly the opposite of how I interpreted it. Would 
>> someone here please clarify? Thanks.
>>
>>
> This line is not from the official docs. And I believe it is neither 
> correct nor clear.
>
>  
>
>> As additional background, I am scraping text, not photos or media files. 
>> So it makes sense to me that something called TextResponse would be 
>> intended for use with text, but I didn't write it, so I don't know. 
>> That's why I am asking for clarification.
>>
>> Ordinarily, when I download, it is a bytes object which I then have to 
>> convert to unicode. If I can set it up to come to me as unicode in the 
>> first place, 
>> that would save me a step and be great. But that leads me to my second 
>> question: How exactly are we supposed to implement TextResponse? 
>>
>>
> The Scrapy framework will instantiate the correct Response class or 
> subclass and pass it as argument to your spider callbacks.
>
> If the framework receives an HTML or XML response, it will create an 
> HtmlResponse or XmlResponse respectively, by itself, without you needing to 
> do anything special.
>
> Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See 
> https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects
> )
>
> One practical distinction between a plain, raw Response and a TextResponse 
> is that, on a TextResponse, you can call .xpath() and .css() directly, 
> without needing to create a Selector explicitly.
>
> XPath and CSS selectors only make sense for text content such as HTML or 
> XML. That's why .xpath() and .css() are only available on TextResponse and 
> its subclasses, such as HtmlResponse and XmlResponse.
>
> ALL Responses, TextResponse or not, come with the raw body received from 
> the server, which is accessible via the .body attribute.
> response.body gives you raw bytes.
>
> What TextResponse adds here is a .text attribute that contains the Unicode 
> string of the raw body,
> as decoded with the detected encoding of the page.
> response.text is a Unicode string.
>
> response.text is NOT available on non-TextResponse.
>
>  
>
>> I am in 100% agreement with the OP here: 
>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>
>> and I don't think he (or I) got a sufficient answer.
>>
>>> HTML pages are the most common response types spiders deal with, and 
>>> their class is HtmlResponse, which inherits from TextResponse, so you can 
>>> use all its features.
>>
>>
>> Well, if that's so, then TextResponse would be the default and we'd get 
>> back unicode strings, right? But that's not what happens. We get byte 
>> strings.
>>
>>
> As Pablo said in 
> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
> , usually (but not always), you expect HTML back from a Request.
>
> One usually writes callbacks with this assumption, and under it you rarely 
> need to bother about the raw bytes or encoding: you trust Scrapy and use 
> response.css() or response.xpath().
>
> And if you need to access the (decoded) unicode content, you use 
> response.text.
>
> If your callbacks can, for some (maybe valid) reason, receive responses 
> of mixed types, that is, responses that are NOT always text (such as an 
> image or a zip file), then you can test the response type with 
> isinstance() and use response.body to get the raw bytes if you need them.
>
>  
>
>> And despite the answer found there, it is not at all clear how we can use 
>> these response subclasses if we are told the middleware does it all 
>> automatically, as if we aren't
>> supposed to worry about it. If that were so, why tell us about, or even 
>> have - the subclass at all?
>>
>>
> As I mention above, one usually writes spider callbacks for a specific 
> type of Response, and it's usually for HtmlResponse.
> But you can totally work with non-TextResponse in Scrapy, if you need it.
>
> One area where the type is more important is middlewares.
> These are generic components and may need to handle different types of 
> responses (or skip processing if the type is not the one they are supposed 
> to work on).
>
> You may not need to write your own middlewares, but if you do, you can 
> have a look at Scrapy's source code, for example AjaxCrawlMiddleware:
>
> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>  
>
>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>> The list the error is referring to is my start_urls variable that I've 
>> been using without issue until I tried to use TextResponse. So if we can't 
>> use a list, are we supposed to only feed it
>> one url at a time? Manually? 
>>
>> Your patient, thorough, and detailed explanation of these issues is 
>> greatly appreciated. 
>>
>
> I hope this explained the different response types clearly enough.
> If not, feel free to ask.
>
> Cheers,
> /Paul. 
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
