I opened a PR based on this thread. https://github.com/scrapy/scrapy/pull/861
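For context, the `IndexError` in the trace below comes from the last frame in `scrapy/utils/iterators.py`: `xmliter` builds a `Selector` from the extracted node text and immediately indexes the XPath result. A minimal sketch of that failure mode (the empty list here just stands in for a `SelectorList` that matched nothing, e.g. because the node text failed to parse):

```python
# Stand-in for: Selector(text=nodetext, type='xml').xpath('//' + nodename)
# returning no matches -- e.g. when the node text fails to parse because
# of an encoding mismatch. Indexing [0] on the empty result is what
# raises "list index out of range".
matches = []


def first_node(selector_list):
    """Mimic xmliter's unguarded [0] on the XPath result."""
    return selector_list[0]  # raises IndexError when nothing matched


try:
    first_node(matches)
except IndexError as exc:
    message = str(exc)  # "list index out of range", as in the trace
```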
On Thursday, July 10, 2014 at 13:54:36 UTC-3, SlappySquirrel wrote:
>
> I'm having an issue with the XMLFeedSpider that I've been trying to wrap
> my head around for a week. It's got to be something stupid that I'm doing.
> The file at https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml
> declares an encoding of ISO-8859-1. I didn't include it here, but I verified
> that response.body was indeed the full XML file body, with a quick print in
> an adapt_response(self, response) method. I also threw some printouts into
> scrapy/utils/iterators.py and verified that
> nodetext = <cvrfdoc><Vulnerability>.........</Vulnerability></cvrfdoc> and
> that nodename was Vulnerability. However, it bombs in the Selector with
> "exceptions.IndexError: list index out of range".
>
> Obviously it never gets to the parse_node function; it bombs during the
> itertag processing. Any help would be greatly appreciated, because I've
> been floundering. This isn't the only XML file I've come across that
> didn't parse well.
>
> With that said, I can open the XML file in Firefox, select all, copy to
> Notepad, and save it as an XML file. It won't render in a browser, but
> Scrapy then has no problem steamrolling through and parsing it. What a
> puzzle!
>
> Environment: Python 2.7 with Scrapy 0.22
>
> Spider:
>
>     from scrapy.contrib.spiders import XMLFeedSpider
>     from scrapy.http import XmlResponse
>     from vulnerability.items import VulnerabilityItem
>
>     class CveSpider(XMLFeedSpider):
>         name = 'cve'
>         allowed_domains = ['cve.mitre.org']
>         start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>
>         iterator = 'iternodes'
>         itertag = 'Vulnerability'
>
>         def parse_node(self, response, node):
>             item = VulnerabilityItem()
>             vulnerabilityId = node.xpath('CVE/text()').extract()
>             item['id'] = vulnerabilityId
>             return item
>
> Trace:
>
>     2014-07-10 11:19:48-0500 [cve] ERROR: Spider error processing <GET https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml>
>         Traceback (most recent call last):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>             call.func(*call.args, **call.kw)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
>             taskObj._oneWorkUnit()
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
>             result = next(self._iterator)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
>             work = (callable(elem, *args, **named) for elem in iterable)
>         --- <exception caught here> ---
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
>             yield next(it)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
>             for x in result:
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 37, in process_spider_output
>             for item_or_request in self._process_requests(result):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 52, in _process_requests
>             for request in iter(items_or_requests):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
>             return (_set_referer(r) for r in result or ())
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
>             return (r for r in result or () if _filter(r))
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
>             return (r for r in result or () if _filter(r))
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
>             for selector in nodes:
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
>             for node in xmliter(response, self.itertag):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/iterators.py", line 31, in xmliter
>             yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
>         exceptions.IndexError: list index out of range
>
>     2014-07-10 11:19:48-0500 [cve] INFO: Closing spider (finished)
>     2014-07-10 11:19:48-0500 [cve] INFO: Dumping Scrapy stats:
>         {'downloader/request_bytes': 318,
>          'downloader/request_count': 1,
>          'downloader/request_method_count/GET': 1,
>          'downloader/response_bytes': 3880685,
>          'downloader/response_count': 1,
>          'downloader/response_status_count/200': 1,
>          'finish_reason': 'finished',
>          'finish_time': datetime.datetime(2014, 7, 10, 16, 19, 48, 933372),
>          'log_count/DEBUG': 3,
>          'log_count/ERROR': 1,
>          'log_count/INFO': 7,
>          'response_received_count': 1,
>          'scheduler/dequeued': 1,
>          'scheduler/dequeued/memory': 1,
>          'scheduler/enqueued': 1,
>          'scheduler/enqueued/memory': 1,
>          'spider_exceptions/IndexError': 1,
>          'start_time': datetime.datetime(2014, 7, 10, 16, 19, 37, 768318)}
>     2014-07-10 11:19:48-0500 [cve] INFO: Spider closed (finished)
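Since re-saving the file from Firefox/Notepad (which effectively re-encodes it) made the feed parse, one hedged workaround until the PR lands is to re-encode the response to UTF-8 before the `iternodes` iterator sees it. Below is a minimal, self-contained sketch of the re-encoding step only; the `to_utf8` helper name and the literal declaration rewrite are my assumptions, not part of the original spider. In the spider it would be called from `adapt_response` via `return response.replace(body=to_utf8(response.body))`.

```python
def to_utf8(body):
    """Decode an ISO-8859-1 feed and re-emit it as UTF-8 with a
    matching XML declaration, so the declared and actual encodings
    agree when the node text is re-parsed."""
    text = body.decode("iso-8859-1")
    # Assumption: the declaration uses this exact spelling; a robust
    # version would match it with a regex instead.
    text = text.replace('encoding="ISO-8859-1"', 'encoding="utf-8"')
    return text.encode("utf-8")


# Tiny stand-in for the CVRF feed, with one non-ASCII byte (0xe9, "é").
raw = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
       '<cvrfdoc><Vulnerability>caf\xe9</Vulnerability></cvrfdoc>'
       ).encode("iso-8859-1")

fixed = to_utf8(raw)  # valid UTF-8 bytes with a matching declaration
```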