Hi Malik,

Scrapy callbacks MUST return a Request, an Item, a dict, or a list of those (or be a generator of these types, if you use yield), as the error says. That's part of the Scrapy framework's API contract with spider classes.
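For example, a minimal sketch of a callback that satisfies that contract could look like this (the spider name and item fields are made up for illustration, not taken from your actual spider):

    import scrapy

    class CaseTextSpider(scrapy.Spider):
        # Hypothetical spider, for illustration only.
        name = "case_text"
        start_urls = [
            "https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html",
        ]

        def parse(self, response):
            # Yield a dict (or an Item); the Unicode string goes inside the
            # returned object; the callback never returns a bare string.
            yield {
                "url": response.url,
                "textbody": response.text,
            }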
If you did item['textbody'] = response.text, then item['textbody'] contains a Unicode string, and your callback should return item, not item['textbody']. You can then process your output items to get the "textbody" field of each.

Scrapy is about outputting structured data. Plain strings are less structured than a dict or XML element with a "textbody" field.

Is that clearer? If not, you can post your spider code.

Best,
Paul.

Note that we're moving the community questions and discussion to Reddit. See https://groups.google.com/d/msg/scrapy-users/0ParYGqd5Hg/4z_T-8JpCQAJ

On Sunday, May 28, 2017 at 7:08:24 AM UTC+2, Malik Rumi wrote:
>
> OK, I'm back:
>
> 2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET https://law.resource.org/pub/us/case/reporter/US/350/350.US.523.282.html>
>
> But I *want* a string!
>
> That's why I redefined the items in my spider this way:
>
> item['textbody'] = response.text
>
> And besides, isn't item['textbody'] a dict or dict-like object?
>
> How do I get a string?!
>
> On Saturday, May 27, 2017 at 11:44:09 AM UTC-7, Malik Rumi wrote:
>>
>> Dear Paul,
>> thank you for the explanation. I'm not sure I understand, to be honest, but let me try a few things and see if it gets clearer. If not, I'll be back.
>>
>> On Tuesday, May 23, 2017 at 7:43:22 AM UTC-7, Paul Tremberth wrote:
>>>
>>> Hello Malik,
>>>
>>> On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
>>>>
>>>> In the docs, it says:
>>>>
>>>>> TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
>>>>
>>>> I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:
>>>
>>> Correct. This line is taken from the official docs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse)
>>>
>>>>> TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class.
>>>>
>>>> https://www.tutorialspoint.com/scrapy/scrapy_requests_and_responses.htm
>>>>
>>>> which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.
>>>
>>> This line is not from the official docs. And I believe it is neither correct nor clear.
>>>
>>>> As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.
>>>>
>>>> Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, that would save me a step and be great. But that leads me to my second question: how exactly are we supposed to implement TextResponse?
>>>
>>> The Scrapy framework will instantiate the correct Response class or subclass and pass it as an argument to your spider callbacks.
>>>
>>> If the framework receives an HTML or XML response, it will create an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.
>>>
>>> Both HtmlResponse and XmlResponse are subclasses of TextResponse.
>>> (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)
>>>
>>> The distinction between a plain, raw Response and a TextResponse is really that, on TextResponses, you can call .xpath() and .css() directly, without the need to create a Selector explicitly.
>>>
>>> XPath and CSS selectors only make sense for HTML or XML. That's why .xpath() and .css() are only available on HtmlResponse and XmlResponse instances.
>>>
>>> ALL Responses, TextResponse or not, come with the raw body received from the server, which is accessible via the .body attribute. response.body gives you raw bytes.
>>>
>>> What TextResponse adds here is a .text attribute that contains the Unicode string of the raw body, as decoded with the detected encoding of the page. response.text is a Unicode string.
>>>
>>> response.text is NOT available on non-TextResponse responses.
>>>
>>>> I am in 100% agreement with the OP here: https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users
>>>> and I don't think he (or I) got a sufficient answer.
>>>>
>>>>> HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.
>>>>
>>>> Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.
>>>
>>> As Pablo said in https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users, usually (but not always) you expect HTML back from a Request.
>>>
>>> And one usually writes callbacks with this assumption. With this assumption, you rarely need to bother about the raw bytes or encoding: you trust Scrapy and use response.css() or response.xpath(). And if you need to access the (decoded) unicode content, you use response.text.
>>>
>>> If your callbacks can, for some (maybe valid) reason, receive responses of mixed types, that is, they are NOT always text (image, zip file, etc.), then you can test the response type with isinstance() and use response.body to get the raw bytes if you need to.
>>>
>>>> And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?
>>>
>>> As I mention above, one usually writes spider callbacks for a specific type of Response, and it's usually for HtmlResponse. But you can totally work with non-TextResponse responses in Scrapy, if you need to.
>>>
>>> One area where the type is more important is middlewares. These are generic components and may need to handle different types of responses (or skip processing if the type is not the one they're supposed to work on).
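To make the isinstance() advice above concrete, here is a rough, untested sketch of a callback that tolerates mixed response types (the spider name and item fields are invented for illustration):

    import scrapy
    from scrapy.http import TextResponse

    class MixedSpider(scrapy.Spider):
        # Hypothetical spider whose callback may receive non-text responses.
        name = "mixed"

        def parse(self, response):
            if isinstance(response, TextResponse):
                # HTML, XML or plain text: .text is the body decoded to Unicode
                yield {"url": response.url, "textbody": response.text}
            else:
                # Image, zip file, etc.: only the raw bytes in .body are available
                yield {"url": response.url, "size": len(response.body)}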
>>> You may not need to write your own middlewares, but if you do, you can have a look at Scrapy's source code; for example, AjaxCrawlMiddleware:
>>> https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/ajaxcrawl.py#L38
>>>
>>>> Here's an error I got: TypeError: TextResponse url must be str, got list:
>>>> The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it one url at a time? Manually?
>>>>
>>>> Your patient, thorough, and detailed explanation of these issues is greatly appreciated.
>>>
>>> I hope this explained the different response types clearly enough. If not, feel free to ask.
>>>
>>> Cheers,
>>> /Paul.