Re: TextResponse

Malik Rumi Sat, 27 May 2017 22:08:54 -0700

OK, I'm back:

2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return 
Request, BaseItem, dict or None, got 'str' in 
<GET https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html>



But I *want* a string!

That's why I redefined the items in my spider this way: 


item['textbody'] = response.text


And besides, isn't item['textbody'] a dict or dict-like object?


How do I get a string?!
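
A likely cause, sketched under assumptions since the callback itself isn't 
shown here: this error usually means the callback returned or yielded the 
string itself (for example, response.text or item['textbody']) rather than 
the item that holds it. The item is the dict-like object; item['textbody'] 
is just a str value stored inside it. In the sketch below, the callback name 
parse and the keys url and textbody are illustrative, and fake_response is a 
hypothetical stand-in (not a Scrapy class) so the snippet runs without Scrapy 
installed.

```python
from types import SimpleNamespace


def parse(response):
    """Spider callback: yield a dict that carries the page text as a str."""
    item = {}
    item["url"] = response.url
    # On a TextResponse, response.text is already a Unicode str.
    item["textbody"] = response.text
    # Yield the dict-like item, not item["textbody"] itself.
    yield item


# Hypothetical stand-in exposing only the two attributes parse() reads.
fake_response = SimpleNamespace(
    url="https://example.com/case.html",
    text="<html><body>Opinion of the Court</body></html>",
)

items = list(parse(fake_response))
```

The string is still there: it travels inside the item and can be read back 
as item['textbody'] wherever the item is consumed.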






On Saturday, May 27, 2017 at 11:44:09 AM UTC-7, Malik Rumi wrote:
>
> Dear Paul,
> thank you for the explanation. I'm not sure I understand, to be honest, 
> but let me try a few things and see if it gets clearer. If not, I'll be 
> back. 
>
>
>
> On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>>
>> Hello Malik,
>>
>> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>>
>>> In the docs, it says:
>>>
>>> TextResponse objects adds encoding capabilities to the base Response 
>>>> class, which is meant to be used
>>>> only for binary data, such as images, sounds or any media file.
>>>
>>>
>>> I understood this to mean that the base Response class is meant to be 
>>> used only for binary data. However, I also read:
>>>
>>>
>> Correct. This line is taken from the official docs (
>> https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse
>> )
>>  
>>
>>> TextResponse Objects are used for binary data such as images, sounds etc 
>>>> which has the ability to encode the base Response class. 
>>>
>>>
>>>  https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>>
>>> which is of course exactly the opposite of how I interpreted it. Would 
>>> someone here please clarify? Thanks.
>>>
>>>
>> This line is not from the official docs. And I believe it is neither 
>> correct nor clear.
>>
>>  
>>
>>> As additional background, I am scraping text, not photos or media files. 
>>> So it makes sense to me that something called TextResponse would be 
>>> intended for use with text, but I didn't write it, so I don't know. 
>>> That's why I am asking for clarification.
>>>
>>> Ordinarily, when I download, it is a bytes object which I then have to 
>>> convert to unicode. If I can set it up to come to me as unicode in the 
>>> first place, 
>>> that would save me a step and be great. But that leads me to my second 
>>> question: How exactly are we supposed to implement TextResponse? 
>>>
>>>
>> The Scrapy framework will instantiate the correct Response class or 
>> subclass and pass it as argument to your spider callbacks.
>>
>> If the framework receives an HTML or XML response, it will create an 
>> HtmlResponse or XmlResponse respectively, by itself, without you needing to 
>> do anything special.
>>
>> Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See 
>> https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects
>> )
>>
>> The distinction between a plain, raw Response and a TextResponse is really 
>> that, on a TextResponse, you can call .xpath() and .css() directly, 
>> without the need to create a Selector explicitly.
>>
>> XPath and CSS selectors only make sense for HTML or XML. That's why 
>> .xpath() and .css() are only available on TextResponse and its 
>> subclasses, such as HtmlResponse and XmlResponse.
>>
>> ALL Responses, TextResponse or not, come with the raw body received from 
>> the server,
>> which is accessible via the .body attribute.
>> response.body gives you raw bytes.
>>
>> What TextResponse adds here is a .text attribute that contains the 
>> Unicode string of the raw body,
>> as decoded with the detected encoding of the page.
>> response.text is a Unicode string.
>>
>> response.text is NOT available on non-TextResponse.
>>
>>  
>>
>>> I am in 100% agreement with the OP here: 
>>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>>
>>> and I don't think he (or I) got a sufficient answer.
>>>
>>> HTML pages are the most common response types spiders deal with, and 
>>>> their class is HtmlResponse, which inherits from TextResponse, so you can 
>>>> use all its features.
>>>
>>>
>>> Well, if that's so, then TextResponse would be the default and we'd get 
>>> back unicode strings, right? But that's not what happens. We get byte 
>>> strings.
>>>
>>>
>> As Pablo said in 
>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>  
>> ,
>> usually (but not always),
>> you expect HTML back from a Request.
>>
>> And one usually writes callbacks with this assumption.
>> And with this assumption, you rarely need to bother about the raw bytes 
>> or encoding: you trust Scrapy and use response.css() or response.xpath().
>>
>> And if you need to access the (decoded) unicode content, you use 
>> response.text
>>
>> If your callbacks can, for some (maybe valid) reason, receive responses 
>> that are of mixed type,
>> that is, they are NOT always text (such as an image, a zip file etc.),
>> then you can test the response type with isinstance() and use 
>> response.body to get the raw bytes if you need them.
>>
>>  
>>
>>> And despite the answer found there, it is not at all clear how we can 
>>> use these response subclasses if we are told the middleware does it all 
>>> automatically, as if we aren't
>>> supposed to worry about it. If that were so, why tell us about, or even 
>>> have - the subclass at all?
>>>
>>>
>> As I mention above, one usually writes spider callbacks for a specific 
>> type of Response, and it's usually for HtmlResponse.
>> But you can totally work with non-TextResponse in Scrapy, if you need it.
>>
>> One area where the type is more important is middlewares.
>> These are generic components and may need to handle different types of 
>> responses (or skip processing if the type is not the one they're supposed 
>> to work on).
>>
>> You may not need to write your own middlewares, but if you do, you can 
>> have a look at Scrapy's source code,
>> for example AjaxCrawlMiddleware:
>>
>> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>>  
>>
>>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>>> The list the error is referring to is my start_urls variable that I've 
>>> been using without issue until I tried to use TextResponse. So if we can't 
>>> use a list, are we supposed to only feed it
>>> one url at a time? Manually? 
>>>
>>> Your patient, thorough, and detailed explanation of these issues is 
>>> greatly appreciated. 
>>>
>>
>> I hope this explained the different response types clearly enough.
>> If not, feel free to ask.
>>
>> Cheers,
>> /Paul. 
>>
>
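
To make the explanation quoted above concrete, here is a minimal sketch of a 
callback that branches on the response type with isinstance(), reading .text 
when the response is textual and falling back to .body (raw bytes) 
otherwise. The two classes are simplified stand-ins defined only so the 
example runs without Scrapy; in a real spider you would import TextResponse 
from scrapy.http, and Scrapy would detect the page encoding itself rather 
than take it as a constructor default. The item keys are illustrative.

```python
class Response:
    """Stand-in for scrapy.http.Response: carries raw bytes only."""

    def __init__(self, url, body=b""):
        self.url = url
        self.body = body


class TextResponse(Response):
    """Stand-in for scrapy.http.TextResponse: adds a decoded .text."""

    def __init__(self, url, body=b"", encoding="utf-8"):
        super().__init__(url, body)
        self.encoding = encoding

    @property
    def text(self):
        # Decode the raw body into a Unicode str.
        return self.body.decode(self.encoding)


def parse(response):
    """Callback that tolerates both textual and binary responses."""
    if isinstance(response, TextResponse):
        # Textual response: .text is a str.
        yield {"url": response.url, "textbody": response.text}
    else:
        # Binary response: only .body (bytes) is available.
        yield {"url": response.url, "size": len(response.body)}


html = TextResponse("https://example.com/page.html", "<p>caf\u00e9</p>".encode("utf-8"))
image = Response("https://example.com/logo.png", b"\x89PNG\r\n")
results = list(parse(html)) + list(parse(image))
```

In both branches the callback yields a dict, so the "Spider must return 
Request, BaseItem, dict or None" check is satisfied either way.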
