OK, I'm back:

2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html>

But I *want* a string! That's why I redefined the items in my spider this way:

    item['textbody'] = response.text

And besides, isn't item['textbody'] a dict or dict-like object? How do I get a string?!

On Saturday, May 27, 2017 at 11:44:09 AM UTC-7, Malik Rumi wrote:
>
> Dear Paul,
> thank you for the explanation. I'm not sure I understand, to be honest,
> but let me try a few things and see if it gets clearer. If not, I'll be
> back.
>
> On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>>
>> Hello Malik,
>>
>> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>>
>>> In the docs, it says:
>>>
>>>> TextResponse objects adds encoding capabilities to the base Response
>>>> class, which is meant to be used
>>>> only for binary data, such as images, sounds or any media file.
>>>
>>> I understood this to mean that the base Response class is meant to be
>>> used only for binary data. However, I also read:
>>>
>> Correct. This line is taken from the official docs
>> (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse)
>>
>>>> TextResponse Objects are used for binary data such as images, sounds etc
>>>> which has the ability to encode the base Response class.
>>>
>>> https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>>
>>> which is of course exactly the opposite of how I interpreted it. Would
>>> someone here please clarify? Thanks.
>>>
>> This line is not from the official docs. And I believe it is neither
>> correct nor clear.
>>
>>> As additional background, I am scraping text, not photos or media files.
>>> So it makes sense to me that something called TextResponse would be
>>> intended for use with text, but I didn't write it, so I don't know.
>>> That's why I am asking for clarification.
>>>
>>> Ordinarily, when I download, it is a bytes object which I then have to
>>> convert to unicode.
>>> If I can set it up to come to me as unicode in the first place,
>>> that would save me a step and be great. But that leads me to my second
>>> question: How exactly are we supposed to implement TextResponse?
>>>
>> The Scrapy framework will instantiate the correct Response class or
>> subclass and pass it as argument to your spider callbacks.
>>
>> If the framework receives an HTML or XML response, it will create an
>> HtmlResponse or XmlResponse respectively, by itself, without you needing to
>> do anything special.
>>
>> Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See
>> https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects
>> )
>>
>> The distinction between a plain, raw Response and TextResponse is really
>> that, on TextResponses, you can call .xpath() and .css() on them directly,
>> without the need to create Selector explicitly.
>>
>> XPath and CSS selectors only make sense for HTML or XML. That's why
>> .xpath() and .css() are only available on HtmlResponse and XmlResponse
>> instances.
>>
>> ALL Responses, TextResponse or not, come with the raw body received from
>> the server, and which is accessible via the .body attribute.
>> response.body gives you raw bytes.
>>
>> What TextResponse adds here is a .text attribute that contains the
>> Unicode string of the raw body, as decoded with the detected encoding
>> of the page.
>> response.text is a Unicode string.
>>
>> response.text is NOT available on non-TextResponse.
>>
>>> I am in 100% agreement with the OP here:
>>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>>
>>> and I don't think he (or I) got a sufficient answer.
>>>
>>>> HTML pages are the most common response types spiders deal with, and
>>>> their class is HtmlResponse, which inherits from TextResponse, so you can
>>>> use all its features.
>>>
>>> Well, if that's so, then TextResponse would be the default and we'd get
>>> back unicode strings, right? But that's not what happens. We get byte
>>> strings.
>>>
>> As Pablo said in
>> https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>> usually (but not always), you expect HTML back from Request.
>>
>> And one usually writes callbacks with this assumption.
>> And with this assumption, you rarely need to bother about the raw bytes
>> or encoding: you trust scrapy and use response.css() or response.xpath()
>>
>> And if you need to access the (decoded) unicode content, you use
>> response.text
>>
>> If your callbacks can, for some (maybe valid) reason, receive responses
>> that are of mixed type,
>> that is that they are NOT always text (such as image, zip file etc.),
>> then you can test the response type with isinstance() and you use
>> response.body to get the raw bytes if you need.
>>
>>> And despite the answer found there, it is not at all clear how we can
>>> use these response subclasses if we are told the middleware does it all
>>> automatically, as if we aren't
>>> supposed to worry about it. If that were so, why tell us about, or even
>>> have - the subclass at all?
>>>
>> As I mention above, one usually writes spider callbacks for a specific
>> type of Response, and it's usually for HtmlResponse.
>> But you can totally work with non-TextResponse in Scrapy, if you need it.
>>
>> One area where the type is more important is middlewares.
>> These are generic components and may need to handle different types of
>> responses (or skip processing if the type is not the one it's supposed to
>> work on).
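
The isinstance() test Paul describes above could look roughly like the sketch below. The two stub classes are hypothetical stand-ins so the snippet runs without Scrapy installed; they only mirror the .body and .text attributes as described in this thread. In a real spider or middleware the check would presumably be made against scrapy.http.TextResponse instead, and the function name describe is made up for illustration.

```python
class StubResponse:
    """Hypothetical stand-in for a plain Response: raw bytes only."""

    def __init__(self, url, body):
        self.url = url
        self.body = body  # raw bytes as received from the server


class StubTextResponse(StubResponse):
    """Hypothetical stand-in for TextResponse: adds a decoded .text."""

    def __init__(self, url, body, encoding="utf-8"):
        super().__init__(url, body)
        self.encoding = encoding

    @property
    def text(self):
        # Decode the raw body with the response's encoding.
        return self.body.decode(self.encoding)


def describe(response):
    """Return decoded text for text responses, raw bytes otherwise."""
    if isinstance(response, StubTextResponse):
        return ("text", response.text)
    # Non-text responses (images, zip files, ...) only have .body.
    return ("binary", response.body)


page = StubTextResponse("https://example.com/a.html", "café".encode("utf-8"))
image = StubResponse("https://example.com/a.png", b"\x89PNG")

page_result = describe(page)
image_result = describe(image)
```

Here page_result carries a str and image_result carries bytes, which is the same split the thread draws between response.text and response.body.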
>>
>> You may not need to write your own middlewares, but if you do, you can
>> have a look at scrapy's source code;
>> For example AjaxCrawlMiddleware
>>
>> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>>
>>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>>> The list the error is referring to is my start_urls variable that I've
>>> been using without issue until I tried to use TextResponse. So if we can't
>>> use a list, are we supposed to only feed it
>>> one url at a time? Manually?
>>>
>>> Your patient, thorough, and detailed explanation of these issues is
>>> greatly appreciated.
>>>
>>
>> I hope this explained the different response type clearly enough.
>> If not, feel free to ask.
>>
>> Cheers,
>> /Paul.
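
On the "got 'str'" error at the top of this thread: one likely cause, assuming the callback returns or yields the string itself (the spider code was not posted, so this is a guess), is that Scrapy only accepts a Request, an item, a dict or None from a callback. The item is the dict-like object; item['textbody'] is just the string stored inside it. A minimal sketch follows, in which the response object is a hypothetical stand-in built with SimpleNamespace so it runs without Scrapy, and the example.com URL is made up:

```python
from types import SimpleNamespace


def parse(response):
    """Callback sketch: yield a dict that holds the decoded page text."""
    # Yielding response.text on its own would hand Scrapy a bare str.
    # Wrapping it in a dict gives the framework an item it accepts.
    yield {
        "url": response.url,
        "textbody": response.text,  # already a decoded (unicode) string
    }


# Hypothetical stand-in for the response Scrapy passes to the callback.
fake_response = SimpleNamespace(
    url="https://example.com/case.html",
    text="<html><body>Opinion text</body></html>",
)

items = list(parse(fake_response))
textbody = items[0]["textbody"]
```

The string is still there to use: it sits under the 'textbody' key of the yielded dict, where an item pipeline or feed export can read it.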