Parse, callback, and Not Implemented Error following examples in the docs.

Malik Rumi Mon, 12 Oct 2015 18:53:06 -0700

I posted a related question to Stack Overflow at
http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback,
but so far it has no answers.


I am not able to get a spider to crawl past the first page of any site I 
have tried, despite many iterations and many re-reads of the docs. I 
decided to test it against the example code from the docs.
The only change I made was to the name, so I could tell it apart.

'''
Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
Run this, as is, on Dmoz.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "tutfollinks"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

And here is what I got:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)
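
To make the traceback concrete, here is a minimal sketch of the mechanism as
I understand it. It does not use Scrapy at all: BaseSpider, GoodSpider,
BrokenSpider and run_default_callback are hypothetical stand-ins. The base
class's parse() raises NotImplementedError, as line 76 of
scrapy/spiders/__init__.py does in the traceback above, and the error only
shows up when the subclass does not end up defining its own parse(), for
example because the method sits at module level instead of inside the class.
Whether that is what happened in my project is an assumption, not something
the traceback proves.

```python
class BaseSpider(object):
    """Hypothetical stand-in for a spider base class (not Scrapy's code)."""

    def parse(self, response):
        # Default callback: subclasses are expected to override this.
        raise NotImplementedError


class GoodSpider(BaseSpider):
    # parse is indented inside the class body, so it overrides the default.
    def parse(self, response):
        return "parsed %s" % response


class BrokenSpider(BaseSpider):
    name = "broken"


# Defined at module level (for example after an indentation slip), so it is
# a plain function and NOT a method of BrokenSpider.
def parse(self, response):
    return "never used as a callback"


def run_default_callback(spider, response):
    """Call spider.parse and report whether the base default was hit."""
    try:
        return spider.parse(response)
    except NotImplementedError:
        return "NotImplementedError"


good_result = run_default_callback(GoodSpider(), "page")
broken_result = run_default_callback(BrokenSpider(), "page")
```

In this sketch only BrokenSpider falls through to the base class's parse(),
which is the same frame the traceback ends in.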

When I googled the error, my first hit was:
http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site

The answer, according to the OP, was to change from BaseSpider to 
CrawlSpider. But, I repeat, this is copied verbatim from the example in 
the docs, so how can it throw an error? In fact, the whole point of the 
example in the docs is to show how to crawl a site WITHOUT CrawlSpider, 
which is introduced for the first time in a note at the end of section 2.3.4.

Another SO post had a similar issue, but in that case the original code was 
subclassed from CrawlSpider, and the OP was told he had accidentally 
overwritten parse(). But I see parse() being used in various examples in 
the docs, including this one. What, exactly, constitutes 'overwriting 
parse()'? Is it adding variables, as the example in the docs does? How can 
that be?

Furthermore, the callback in this case is explicitly not parse, but 
parse_dir_contents.

What is going on here? Please, I'd like an explanation of why, as well as 
the (hopefully simple) answer. Thanks.
