Well, I figured it out, but now I have a different problem. The original problem was that I had a hidden character in my code, probably from being lazy and cutting and pasting it directly into my text editor. However, when I retyped the spider and ran it, scrapy ran a different spider, one I had NOT named in the command line. That was a head scratcher, and then I realized I may have copied things *too* closely; the docs say "The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique". Ok, fine, no problem. I changed the name. I also ran cat on both files, and you can see the offending character at the start of the bad one, and it is not there in the new one. But when I ran runspider again, I still got the same result. I am at a loss as to why this is happening. Any expert explanations out there?
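For what it's worth, cat only hints at the character; reading the raw bytes makes it unmistakable. Here is a quick sketch of how I looked at both files (file names are the ones from my session below):

    # Print the first raw bytes of each spider file, to expose hidden
    # characters that the editor will not display (or delete).
    for path in ('dmoz_debug2.py', 'tutfollinksc.py'):
        with open(path, 'rb') as f:
            head = f.read(16)
        print('%s starts with %r' % (path, head))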
malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat dmoz_debug2.py
��'''  # this is the offending hidden character - by the way, 'delete' does not work to get rid of it
Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
Run this, as is, on Dmoz. It is dmoz_debug2, with the name of the spider 'dmoz'.
I changed this to iso-8859 per
http://stackoverflow.com/questions/1067742/clean-source-code-files-of-invisible-characters.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat tutfollinksc.py
'''  # this is my retyped copy, as you can see, without the offending character
This is tutfollinksc, the retyped spider in hopes of getting rid of hidden
character and not implemented error. It is in all respects identical to tutfollinks.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "tutlinkC_dmoz"  # this is where I changed the spider name so it would not be identical to the spider in dmoz_debug2
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/"
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
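One thing I have not tried yet: stripping the stray bytes programmatically instead of with the editor, since 'delete' clearly does nothing. A minimal sketch of what I mean, assuming the bytes are a byte-order mark and the rest of the file is plain ASCII (the '\xff' in the traceback below at least looks consistent with a UTF-16 BOM):

    import codecs

    # Strip a leading byte-order mark, if that is what the stray bytes are,
    # and rewrite the file in place. Assumes the rest of the file is ASCII.
    path = 'dmoz_debug2.py'
    with open(path, 'rb') as f:
        data = f.read()

    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            with open(path, 'wb') as f:
                f.write(data[len(bom):])
            break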
Anyway, here is what happened when I ran the retyped spider:

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider tutfollinksc.py -o tutfollinks_dmoz_c.json
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.3.post6-g2d688cd', 'console_scripts', 'scrapy')()
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py", line 1
SyntaxError: Non-ASCII character '\xff' in file /home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Assuming the identical spider name was the issue, why does scrapy continue to run dmoz_debug2, when that is NOT what I put on the command line? How do I fix it? And if that's not the issue, what is? Thanks.

On Monday, October 12, 2015 at 8:52:13 PM UTC-5, Malik Rumi wrote:
>
> I posted a related question to Stack Overflow at
> http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback,
> but so far it has no answers.
>
> I am not able to get a spider to crawl past the first page of any site I
> have tried, despite many iterations and many re-reads of the docs. I
> decided to test it against the example code from the docs.
> The only change I made was to the name, so I could tell it apart.
>
> '''
> Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
> Run this, as is, on Dmoz.
> '''
>
> import scrapy
> from tutorial.items import DmozItem
>
> class DmozSpider(scrapy.Spider):
>     name = "tutfollinks"
>     allowed_domains = ["dmoz.org"]
>     start_urls = [
>         "http://www.dmoz.org/Computers/Programming/Languages/Python/",
>     ]
>
>     def parse(self, response):
>         for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_dir_contents)
>
>     def parse_dir_contents(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = DmozItem()
>             item['title'] = sel.xpath('a/text()').extract()
>             item['link'] = sel.xpath('a/@href').extract()
>             item['desc'] = sel.xpath('text()').extract()
>             yield item
>
> And here is what I got:
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, in parse
>     raise NotImplementedError
> NotImplementedError
> 2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)
>
> When I googled the error, my first hit was:
>
> http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site
>
> The answer, according to the OP, was to change from BaseSpider to
> CrawlSpider. But, I repeat, this is copied verbatim from the example in
> the docs. Then how can it throw an error? In fact, the whole point of the
> example in the docs is to show how to crawl a site WITHOUT CrawlSpider,
> which is introduced for the first time in a note at the end of section 2.3.4.
>
> Another SO post had a similar issue, but in that case the original code
> was subclassed from CrawlSpider, and the OP was told he had accidentally
> overwritten parse(). But I see parse() being used in various examples in
> the docs, including this one. What, exactly, constitutes 'overwriting
> parse()'? Is it adding variables like the examples in the docs do? How
> can that be?
>
> Furthermore, the callback in this case is explicitly not parse, but
> parse_dir_contents.
>
> What is going on here? Please, I'd like a 'why' explanation as well as the
> hopefully simple answer. Thanks.
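P.S. One thing I can at least see in the quoted traceback: the NotImplementedError is raised from scrapy/spiders/__init__.py, line 76, in parse, i.e. from the base class's own parse(). So I take it that 'overwriting parse()' just means defining a parse() method on your subclass. A minimal sketch of the distinction as I understand it (class and spider names invented for illustration):

    import scrapy

    class Overridden(scrapy.Spider):
        # Defining parse() on the subclass is what 'overriding parse()'
        # means; Scrapy calls it with each response from start_urls.
        name = "overridden_example"

        def parse(self, response):
            pass  # yield items and/or further Requests here

    class NotOverridden(scrapy.Spider):
        # With no parse() here, a response falls through to the base
        # class's parse(), which only raises NotImplementedError -- the
        # very frame shown in the quoted traceback.
        name = "not_overridden_example"

If that is right, then my spiders above do override parse(), which only deepens the mystery of that error.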