Re: TextReponse

Paul Tremberth Tue, 23 May 2017 07:43:31 -0700

Hello Malik,

On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>
> In the docs, it says:
>
> TextResponse objects adds encoding capabilities to the base Response 
>> class, which is meant to be used
>> only for binary data, such as images, sounds or any media file.
>
>
> I understood this to mean that the base Response class is meant to be used 
> only for binary data. However, I also read:
>
>
Correct. This line is taken from the official docs 
(https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse)


> TextResponse Objects are used for binary data such as images, sounds etc 
>> which has the ability to encode the base Response class. 
>
>
>  https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>
> which is of course exactly the opposite of how I interpreted it. Would 
> someone here please clarify? Thanks.
>
>
This line is not from the official docs. And I believe it is neither 
correct nor clear.

 

> As additional background, I am scraping text, not photos or media files. 
> So it makes sense to me that something called TextResponse would be 
> intended for use with text, but I didn't write it, so I don't know. That's 
> why I am asking for clarification.
>
> Ordinarily, when I download, it is a bytes object which I then have to 
> convert to unicode. If I can set it up to come to me as unicode in the 
> first place, 
> that would save me a step and be great. But that leads me to my second 
> question: How exactly are we supposed to implement TextResponse? 
>
>
The Scrapy framework will instantiate the correct Response class or 
subclass and pass it as argument to your spider callbacks.

If the framework receives an HTML or XML response, it will create an 
HtmlResponse or XmlResponse respectively, by itself, without you needing to 
do anything special.

Both HtmlResponse and XmlResponse are subclasses of TextResponse. 
(See 
https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)

One practical distinction between a plain, raw Response and a TextResponse is
that, on a TextResponse, you can call .xpath() and .css() directly,
without needing to create a Selector explicitly.

XPath and CSS selectors only make sense for markup such as HTML or XML. That's
why .xpath() and .css() are meant for TextResponse and its subclasses
(HtmlResponse and XmlResponse), and not for a plain Response.

ALL Responses, TextResponse or not, come with the raw body received from
the server, which is accessible via the .body attribute.
response.body gives you raw bytes.

What TextResponse adds here is a .text attribute that contains the Unicode 
string of the raw body,
as decoded with the detected encoding of the page.
response.text is a Unicode string.

response.text is NOT available on non-TextResponse.

 

> I am in 100% agreement with the OP here: 
> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>
> and I don't think he (or I) got a sufficient answer.
>
> HTML pages are the most common response types spiders deal with, and their 
>> class is HtmlResponse, which inherits from TextResponse, so you can use all 
>> its features.
>
>
> Well, if that's so, then TextResponse would be the default and we'd get 
> back unicode strings, right? But that's not what happens. We get byte 
> strings.
>
>
As Pablo said in
https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users ,
you usually (but not always) expect HTML back from a Request.

One usually writes callbacks with this assumption.
With this assumption, you rarely need to bother about the raw bytes or the
encoding: you trust Scrapy and use response.css() or response.xpath().

And if you need to access the (decoded) Unicode content, you use
response.text.

If your callbacks can, for some (maybe valid) reason, receive responses
of mixed types, that is, responses that are NOT always text (an image, a zip
file, etc.), then you can test the response type with isinstance() and use
response.body to get the raw bytes if you need them.

 

> And despite the answer found there, it is not at all clear how we can use 
> these response subclasses if we are told the middleware does it all 
> automatically, as if we aren't
> supposed to worry about it. If that were so, why tell us about, or even 
> have - the subclass at all?
>
>
As I mentioned above, one usually writes spider callbacks for a specific type
of Response, and it's usually for HtmlResponse.
But you can certainly work with non-TextResponse objects in Scrapy if you need to.

One area where the type is more important is middlewares.
These are generic components and may need to handle different types of 
responses (or skip processing if the type is not the one it's supposed to 
work on).

You may not need to write your own middlewares, but if you do, you can have
a look at Scrapy's source code, for example AjaxCrawlMiddleware:
https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
 

> Here's an error I got: TypeError: TextResponse url must be str, got list:
> The list the error is referring to is my start_urls variable that I've 
> been using without issue until I tried to use TextResponse. So if we can't 
> use a list, are we supposed to only feed it
> one url at a time? Manually? 
>
> Your patient, thorough, and detailed explanation of these issues is 
> greatly appreciated. 
>

I hope this explained the different response types clearly enough.
If not, feel free to ask.

Cheers,
/Paul. 

