Well, I figured it out, but now I have a different problem.

The original problem was a hidden character in my code, probably from being 
lazy and cutting and pasting directly into my text editor. However, when I 
retyped the spider and ran it, scrapy ran a different spider, one I had NOT 
named on the command line. That was a head scratcher, and then I realized I 
may have copied things *too* closely; the docs say "The spider name is how 
the spider is located (and instantiated) by Scrapy, so it must be unique". 
Ok, fine, no problem. I changed the name. I also ran cat on both files; you 
can see the offending character at the start of the bad one, and that it is 
not there in the new one. But when I ran runspider again, I still got the 
same result. I am at a loss as to why this is happening. Any expert 
explanations out there?
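
(By the way, cat only renders the junk as ��. To see the actual bytes, 
something like this works; a minimal sketch, and the filename is just mine:)

# print the first raw bytes of the file; a UTF-16 byte-order mark
# would show up here as '\xff\xfe'
with open('dmoz_debug2.py', 'rb') as f:
    print(repr(f.read(16)))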

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat dmoz_debug2.py
��''' #this is the offending hidden character - by the way, 'delete' does not work to get rid of it
Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
Run this, as is, on Dmoz. It is dmoz_debug2, with the name of the spider 'dmoz'.
I changed this to iso-8859 per http://stackoverflow.com/questions/1067742/clean-source-code-files-of-invisible-characters.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/";,
    ]

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
        
def parse_dir_contents(self, response):
    for sel in response.xpath('//ul/li'):
        item = DmozItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        item['desc'] = sel.xpath('text()').extract()
        yield item
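
(Side note as I post this: in the cat output above, parse and 
parse_dir_contents sit at module level, outside the class, whereas the docs 
version quoted at the bottom of this message indents them inside DmozSpider. 
I can't tell whether that's the actual file or just my mail client eating 
whitespace; if it's the file, the intended shape would be roughly:)

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # ... attributes as above ...

    def parse(self, response):            # indented: a method of the class
        pass  # ... body as above ...

    def parse_dir_contents(self, response):   # likewise a method
        pass  # ... body as above ...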

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat tutfollinksc.py
''' # this is my retyped copy, as you can see, without the offending character
This is tutfollinksc, the retyped spider, in hopes of getting rid of the hidden character
and the NotImplementedError. It is in all respects identical to tutfollinks.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "tutlinkC_dmoz"  # this is where i changed the spider name so it would not be identical to the spider in dmoz_debug2
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/"
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider tutfollinksc.py -o tutfollinks_dmoz_c.json
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.3.post6-g2d688cd', 'console_scripts', 'scrapy')()
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py", line 1
SyntaxError: Non-ASCII character '\xff' in file /home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
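
Reading that SyntaxError again: '\xff' is the first byte of a UTF-16 
byte-order mark (FF FE), which would explain both the �� that cat shows and 
why pressing delete in the editor never seemed to remove it. As far as I can 
tell, a PEP 263 coding line can't rescue this, because the stray bytes sit 
in front of where the declaration would have to be; the bytes themselves 
have to go. A minimal sketch for stripping them (the filename is just my 
file, and I'm assuming the junk is one of the common BOMs):

# strip a leading byte-order mark, if any, and rewrite the file in place
with open('dmoz_debug2.py', 'rb') as f:
    raw = f.read()
for bom in (b'\xef\xbb\xbf', b'\xff\xfe', b'\xfe\xff'):  # UTF-8, UTF-16 LE/BE
    if raw.startswith(bom):
        raw = raw[len(bom):]
        break
with open('dmoz_debug2.py', 'wb') as f:
    f.write(raw)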


Assuming the identical spider name was the issue, why does scrapy keep 
loading dmoz_debug2 when that is NOT the spider I named on the command line? 
How do I fix it?
If that's not the issue, what is? Thanks.
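
P.S. From the traceback, my best guess at the "why": scrapy never actually 
ran dmoz_debug2. The walk_modules / import_module frames in spiderloader.py 
suggest it imports every module in the spiders package just to build its 
registry of spider names, so a file that fails to import kills the run 
before the spider named on the command line is ever looked up. Roughly like 
this, if I'm reading it right (a simplified sketch of my understanding, not 
scrapy's actual code):

# simplified sketch of what the traceback suggests the spider loader does;
# my reading of it, not scrapy's actual code
import importlib
import pkgutil

def load_all_spider_modules(package_name):
    package = importlib.import_module(package_name)
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        # every module in the package gets imported, wanted or not, so one
        # broken file (like dmoz_debug2.py) raises before any spider runs
        importlib.import_module(package_name + '.' + name)

If that's right, the fix is just to repair or remove the broken file.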



On Monday, October 12, 2015 at 8:52:13 PM UTC-5, Malik Rumi wrote:
>
> I posted a related question to Stack Overflow at 
> http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback, 
> but so far it has no answers.
>
> I am not able to get a spider to crawl past the first page of any site I 
> have tried, despite many iterations and many re-reads of the docs. I 
> decided to test it against the example code from the docs.
> The only change I made was to the name, so I could tell it apart.
>
> '''
> Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
> Run this, as is, on Dmoz.
> '''
>
> import scrapy
> from tutorial.items import DmozItem
>
> class DmozSpider(scrapy.Spider):
>     name = "tutfollinks"
>     allowed_domains = ["dmoz.org"]
>     start_urls = [
>         "http://www.dmoz.org/Computers/Programming/Languages/Python/";,
>     ]
>
>     def parse(self, response):
>         for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_dir_contents)
>
>     def parse_dir_contents(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = DmozItem()
>             item['title'] = sel.xpath('a/text()').extract()
>             item['link'] = sel.xpath('a/@href').extract()
>             item['desc'] = sel.xpath('text()').extract()
>             yield item
>
> And here is what I got:
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, in parse
>     raise NotImplementedError
> NotImplementedError
> 2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)
>
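> Looking at that traceback, the raise happens in scrapy's own 
> spiders/__init__.py, line 76, which I take to be the base class default, 
> roughly this (my paraphrase of what the traceback implies, not the actual 
> source):
>
> class Spider(object):
>     def parse(self, response):
>         # base class placeholder; a subclass is expected to override it
>         raise NotImplementedError
>
> Does that mean scrapy thinks my class never supplied its own parse? But 
> the code above clearly defines one, so I don't see how.
>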
> When I googled the error, my first hit was:
>
> http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site
>
> The answer, according to the OP, was to change from BaseSpider to 
> CrawlSpider. But, I repeat, this is copied verbatim from the example in 
> the docs. How, then, can it throw an error? In fact, the whole point of 
> the example in the docs is to show how to crawl a site WITHOUT 
> CrawlSpider, which is introduced for the first time in a note at the end 
> of section 2.3.4.
>
> Another SO post had a similar issue, but in that case the original code 
> was subclassed from CrawlSpider, and the OP was told he had accidentally 
> overridden parse(). But I see parse() being used in various examples in 
> the docs, including this one. What, exactly, constitutes 'overriding 
> parse()'? Is it adding variables, as the example in the docs does? How 
> can that be?
>
> Furthermore, the callback in this case is explicitly not parse, but 
> parse_dir_contents.
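>
> In case it helps anyone answer, here is what I *think* 'overriding 
> parse()' means (a toy sketch of my own, not from the docs; the spider 
> names are made up):
>
> import scrapy
>
> class PlainSpider(scrapy.Spider):
>     # no parse defined here: scrapy falls back to the base class
>     # version, which just raises NotImplementedError
>     name = "plain"  # hypothetical name
>
> class OverridingSpider(scrapy.Spider):
>     name = "overriding"  # hypothetical name
>
>     def parse(self, response):
>         # defining parse in the subclass replaces (overrides) the
>         # base class placeholder, so no NotImplementedError
>         yield {'url': response.url}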
>
> What is going on here? Please, I'd like an explanation of *why*, as well 
> as the (hopefully simple) fix. Thanks.
>
