Hi Malik,

On Sunday, May 28, 2017 at 6:03:03 AM UTC+2, Malik Rumi wrote:
>
> This is my third or fourth post in the last 24 hours. I freely admit that 
> I don’t know what I am doing, and that over the last several hours for this 
> particular issue I have been guessing, because I didn’t know what scrapy 
> wanted from me and I couldn’t find an answer. 
>
>
>
I also have a question: what do you want from Scrapy? What are you trying 
to achieve with this pipeline?

Always try to refer to the official docs, not copies of them.
For item pipelines, this is the page: 
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
It says:

Each item pipeline component (sometimes referred as just “Item Pipeline”) 
is a Python class that implements a simple method. They receive an item and 
perform an action over it, also deciding if the item should continue 
through the pipeline or be dropped and no longer processed.



Item pipelines need to implement one or more of the 4 methods described 
in 
https://docs.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline
(You don't need to implement them all.)

The most important one is `process_item`:
the Scrapy framework calls this method on the pipeline instance it creates,
with each item that your spider callbacks return.

This method MUST have the signature described there: it must expect 2 
parameters, an item object and the running spider, in addition to the 
conventional "self" as the first argument.

process_item(self, item, spider) MUST either return an item or a dict 
(usually transformed from the input item) -- you can return the item 
unchanged -- or raise DropItem to tell Scrapy to drop the item.
"self", "item" and "spider" are argument names, chosen following Scrapy's 
conventions, that you use in your method implementation:
- "item" points to an instance of your Item or a dict, depending on what 
your callback returns
- "spider" points to your spider instance

You do not need to rename the arguments in your method signature to 
"testerapp2".
You could do that in theory, because an argument name is just a label you 
give to a value so you can work with it inside your method.
But I highly discourage this.

The most basic implementation of `process_item` would be returning the item 
as-is:

    def process_item(self, item, spider):
        # do something with the item
        # item['processed'] = True
        return item
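
And if you want to discard some items, you raise DropItem instead of 
returning. A sketch, assuming a made-up "price" field (adapt it to your 
own items):

    from scrapy.exceptions import DropItem

    class DropMissingPricePipeline(object):
        def process_item(self, item, spider):
            # "price" is a hypothetical example field
            if not item.get('price'):
                # tell Scrapy to stop processing this item
                raise DropItem("Missing price in %s" % item)
            # otherwise pass the item on unchanged
            return item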



The 2nd important method is the classmethod `from_crawler`, which is 
usually used to initialize the pipeline object with settings or other info 
from the crawler object.

You've written

    @classmethod
    def from_crawler(cls, testerapp2):
        return cls(name = crawler.settings.get('ITEM_PIPELINES'),)


You could use some setting to initialize the pipeline, but the 
ITEM_PIPELINES setting is a dict, so assigning a dict to the pipeline's 
name field does not make much sense.
Also, you're seeing: NameError: name 'crawler' is not defined
That makes sense because your `from_crawler` signature uses "testerapp2" 
as the name of the 2nd argument,
so the name "crawler" inside the method body means nothing to the Python 
interpreter.

One usually writes this:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(name=crawler.settings.get('somesetting'))
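
Putting the pieces together, a minimal complete pipeline could look like 
the sketch below. 'SAVE_DIR' is a made-up setting name here, just to show 
the mechanics -- adapt it to whatever you actually need:

    class SavePipeline(object):

        def __init__(self, save_dir):
            # keep the setting value around for later use
            self.save_dir = save_dir

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this with the crawler object;
            # read what you need from crawler.settings
            # and pass it on to __init__
            return cls(save_dir=crawler.settings.get('SAVE_DIR'))

        def process_item(self, item, spider):
            # do your saving/transforming here, e.g. using self.save_dir,
            # then pass the item on to the next pipeline
            return item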


I think you'll find it useful to read further about Python classes 
<https://docs.python.org/3.6/tutorial/classes.html> and defining functions 
<https://docs.python.org/3.6/tutorial/controlflow.html#defining-functions>, 
especially regarding formal parameters, and the convention of using "self" 
as the first argument of instance methods.


Regards,
/Paul.


> Here are just a few lines from my log today. It runs over 100 pages when 
> pasted into my word processor. I was just trying to make this work with the 
> pipeline. It started with this error:
>
>
> SavePipeline(item)
>> TypeError: object() takes no parameters
>
>
> and never got better.
>
>
> I read on SO that this was because my pipeline class did not have its own 
> __init__ method, and so python was searching in the parent object for one. 
> I thought that made sense, so I put an __init__ in there, and hell ensued. 
> It was the usual ‘how many arguments’ problem, but when I tried giving it 
> only self, and leaving the rest blank or with ‘pass’, I got indentation 
> errors.
>
>
> So I tried putting something innocuous like self.name = name, and we were 
> back to the how many arguments error. I tried giving it process_item as an 
> attribute, and after many go rounds and variations, that worked, but then 
> it wouldn’t take my call to the process_item method – back to the number of 
> arguments again. I imported my spider, and that helped, but still the 
> errors kept coming. It’s been about 6 hours. I have Googled all over the 
> place. I give up. I don’t get it. I need help. 
>
>
> Here is one full traceback, typical of most but hardly the only one, 
> followed by an abbreviated version of some others, including the last:
>
>
> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
>>     result = g.send(result)
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 72, in crawl
>>     self.engine = self._create_engine()
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 97, in _create_engine
>>     return ExecutionEngine(self, lambda _: self.stop())
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/engine.py", line 70, in __init__
>>     self.scraper = Scraper(crawler)
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
>>     self.itemproc = itemproc_cls.from_crawler(crawler)
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
>>     return cls.from_settings(crawler.settings, crawler)
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 34, in from_settings
>>     mwcls = load_object(clspath)
>>   File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/utils/misc.py", line 44, in load_object
>>     mod = import_module(module)
>>   File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
>>     return _bootstrap._gcd_import(name[level:], package, level)
>>   File "<frozen importlib._bootstrap>", line 986, in _gcd_import
>>   File "<frozen importlib._bootstrap>", line 969, in _find_and_load
>>   File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
>>   File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
>>   File "<frozen importlib._bootstrap_external>", line 665, in exec_module
>>   File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
>>     class SavePipeline(object):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
>>     SavePipeline(process_item)
>> NameError: name 'SavePipeline' is not defined
>> 2017-05-28 02:43:30,386:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
>>     class SavePipeline(object):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
>>     SavePipeline(process_item)
>> NameError: name 'SavePipeline' is not defined
>> 2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
>> 2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
>> 2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
>>     SavePipeline(process_item)
>> NameError: name 'process_item' is not defined
>> 2017-05-28 02:44:46,862:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
>>     SavePipeline(process_item)
>> NameError: name 'process_item' is not defined
>> 2017-05-28 03:10:29,174:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100
>>     return cls(name = =crawler.settings.get('ITEM_PIPELINES'),)
>>                       ^
>> SyntaxError: invalid syntax
>> 2017-05-28 03:10:51,021:middleware.py:53:from_settings:INFO:Enabled downloader middlewares:
>> 2017-05-28 03:10:51,024:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
>> 2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
>> 2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
>>     return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
>> NameError: name 'crawler' is not defined
>> 2017-05-28 03:10:51,026:_legacy.py:154:publishToNewObserver:CRITICAL:
>> Traceback (most recent call last):
>>   File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
>>     return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
>> NameError: name 'crawler' is not defined
>
>
>
> PIPELINE.PY
>> from items import Acquire2Item
>> item = Acquire2Item()
>> from acquire2.spiders import testerapp2
>> class SavePipeline(object):
>>     def __init__(self, name):
>>         self.name = name
>>     def process_item(self, item, testerapp2):
>>         item.save()
>>         return
>>     process_item(self, item, testerapp2)
>>     @classmethod
>>     def from_crawler(cls, testerapp2):
>>         return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
>
>
> I notice there is something in there about crawler settings. I read this 
> http://mengyangyang.org/scrapy/topics/item-pipeline.html#from_crawler 
> among many other things. Obviously I don’t get it. Perhaps this is related 
> to my other question about settings earlier today?
>
>
> I just noticed that url. This must be a Chinese copy of the docs. Don’t 
> think that makes a difference here. 
>
>
> Any help at all will be appreciated.  
>
