Hi Malik,

Scrapy callbacks MUST return a Request, an Item, a dict, or a list of those (or be a generator of these types, if you use yield), as the error says. That's part of the Scrapy framework's API contract with spider classes.
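For example, a minimal sketch of a callback that satisfies that contract could look like this (the spider name and item fields are made up for illustration, not taken from your actual spider):

    import scrapy

    class CaseTextSpider(scrapy.Spider):
        # Hypothetical spider, for illustration only.
        name = "case_text"
        start_urls = [
            "https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html",
        ]

        def parse(self, response):
            # Yield a dict (or an Item); the Unicode string goes inside the
            # returned object; the callback never returns a bare string.
            yield {
                "url": response.url,
                "textbody": response.text,
            }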
If you did item['textbody'] = response.text, then item['textbody'] contains a Unicode string, and your callback should return item, not item['textbody']. You can then process your output items to get the "textbody" field of each.

Scrapy is about outputting structured data. Plain strings are less structured than a dict or XML element with a "textbody" field.

Is that clearer? If not, you can post your spider code.

Best,
Paul.

Note that we're moving the community questions and discussion to Reddit. See https://groups.google.com/d/msg/scrapy-users/0ParYGqd5Hg/4z_T-8JpCQAJ

On Sunday, May 28, 2017 at 7:08:24 AM UTC+2, Malik Rumi wrote:
>
> OK, I'm back:
>
> 2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html>
>
> But I *want* a string!
>
> That's why I redefined the items in my spider this way:
>
> item['textbody'] = response.text
>
> And besides, isn't item['textbody'] a dict or dict-like object?
>
> How do I get a string?!
>
> On Saturday, May 27, 2017 at 11:44:09 AM UTC-7, Malik Rumi wrote:
>>
>> Dear Paul,
>> thank you for the explanation. I'm not sure I understand, to be honest, but let me try a few things and see if it gets clearer. If not, I'll be back.
>>
>> On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>>>
>>> Hello Malik,
>>>
>>> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>>>
>>>> In the docs, it says:
>>>>
>>>>> TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
>>>>
>>>> I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:
>>>
>>> Correct. This line is taken from the official docs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse)
>>>
>>>>> TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class.
>>>>
>>>> https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>>>
>>>> which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.
>>>
>>> This line is not from the official docs. And I believe it is neither correct nor clear.
>>>
>>>> As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.
>>>>
>>>> Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, that would save me a step and be great. But that leads me to my second question: how exactly are we supposed to implement TextResponse?
>>>
>>> The Scrapy framework will instantiate the correct Response class or subclass and pass it as an argument to your spider callbacks.
>>>
>>> If the framework receives an HTML or XML response, it will create an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.
>>>
>>> Both HtmlResponse and XmlResponse are subclasses of TextResponse.
>>> (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)
>>>
>>> The distinction between a plain, raw Response and a TextResponse is really that, on TextResponses, you can call .xpath() and .css() directly, without the need to create a Selector explicitly.
>>>
>>> XPath and CSS selectors only make sense for HTML or XML. That's why .xpath() and .css() are only available on HtmlResponse and XmlResponse instances.
>>>
>>> ALL Responses, TextResponse or not, come with the raw body received from the server, which is accessible via the .body attribute. response.body gives you raw bytes.
>>>
>>> What TextResponse adds here is a .text attribute that contains the Unicode string of the raw body, as decoded with the detected encoding of the page. response.text is a Unicode string.
>>>
>>> response.text is NOT available on non-TextResponse responses.
>>>
>>>> I am in 100% agreement with the OP here: https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>>> and I don't think he (or I) got a sufficient answer.
>>>>
>>>>> HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.
>>>>
>>>> Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.
>>>
>>> As Pablo said in https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users, usually (but not always) you expect HTML back from a Request.
>>>
>>> And one usually writes callbacks with this assumption. With this assumption, you rarely need to bother about the raw bytes or encoding: you trust Scrapy and use response.css() or response.xpath(). And if you need to access the (decoded) unicode content, you use response.text.
>>>
>>> If your callbacks can, for some (maybe valid) reason, receive responses of mixed types, that is, they are NOT always text (image, zip file, etc.), then you can test the response type with isinstance() and use response.body to get the raw bytes if you need to.
>>>
>>>> And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?
>>>
>>> As I mention above, one usually writes spider callbacks for a specific type of Response, and it's usually for HtmlResponse. But you can totally work with non-TextResponse responses in Scrapy, if you need to.
>>>
>>> One area where the type is more important is middlewares. These are generic components and may need to handle different types of responses (or skip processing if the type is not the one they're supposed to work on).
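To make the isinstance() advice above concrete, here is a rough, untested sketch of a callback that tolerates mixed response types (the spider name and item fields are invented for illustration):

    import scrapy
    from scrapy.http import TextResponse

    class MixedSpider(scrapy.Spider):
        # Hypothetical spider whose callback may receive non-text responses.
        name = "mixed"

        def parse(self, response):
            if isinstance(response, TextResponse):
                # HTML, XML or plain text: .text is the body decoded to Unicode
                yield {"url": response.url, "textbody": response.text}
            else:
                # Image, zip file, etc.: only the raw bytes in .body are available
                yield {"url": response.url, "size": len(response.body)}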
>>> You may not need to write your own middlewares, but if you do, you can have a look at Scrapy's source code; for example, AjaxCrawlMiddleware:
>>> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>>>
>>>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>>>> The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it one url at a time? Manually?
>>>>
>>>> Your patient, thorough, and detailed explanation of these issues is greatly appreciated.
>>>
>>> I hope this explained the different response types clearly enough. If not, feel free to ask.
>>>
>>> Cheers,
>>> /Paul.