I opened a PR based on this thread. https://github.com/scrapy/scrapy/pull/861
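For context, the `IndexError` in the trace below comes from the last frame in `scrapy/utils/iterators.py`: `xmliter` builds a `Selector` from the extracted node text and immediately indexes the XPath result. A minimal sketch of that failure mode (the empty list here just stands in for a `SelectorList` that matched nothing, e.g. because the node text failed to parse):

```python
# Stand-in for: Selector(text=nodetext, type='xml').xpath('//' + nodename)
# returning no matches -- e.g. when the node text fails to parse because
# of an encoding mismatch. Indexing [0] on the empty result is what
# raises "list index out of range".
matches = []


def first_node(selector_list):
    """Mimic xmliter's unguarded [0] on the XPath result."""
    return selector_list[0]  # raises IndexError when nothing matched


try:
    first_node(matches)
except IndexError as exc:
    message = str(exc)  # "list index out of range", as in the trace
```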
On Thursday, July 10, 2014 at 13:54:36 UTC-3, SlappySquirrel wrote:
>
> I'm having an issue with the XMLFeedSpider that I've been trying to wrap
> my head around for a week. It's got to be something stupid that I'm doing.
> The file at https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml
> declares an encoding of ISO-8859-1. I didn't include it here, but I verified
> that response.body was indeed the full XML file body, with a quick print in
> an adapt_response(self, response) method. I also threw some printouts into
> scrapy/utils/iterators.py and verified that
> nodetext = <cvrfdoc><Vulnerability>.........</Vulnerability></cvrfdoc> and
> that nodename was Vulnerability. However, it bombs in the Selector with
> "exceptions.IndexError: list index out of range".
>
> Obviously it never gets to the parse_node function; it bombs during the
> itertag processing. Any help would be greatly appreciated, because I've
> been floundering. This isn't the only XML file I've come across that
> didn't parse well.
>
> With that said, I can open the XML file in Firefox, select all, copy to
> Notepad, and save it as an XML file. It won't render in a browser, but
> Scrapy then has no problem steamrolling through and parsing it. What a
> puzzle!
>
> Environment: Python 2.7 with Scrapy 0.22
>
> Spider:
>
>     from scrapy.contrib.spiders import XMLFeedSpider
>     from scrapy.http import XmlResponse
>     from vulnerability.items import VulnerabilityItem
>
>     class CveSpider(XMLFeedSpider):
>         name = 'cve'
>         allowed_domains = ['cve.mitre.org']
>         start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>
>         iterator = 'iternodes'
>         itertag = 'Vulnerability'
>
>         def parse_node(self, response, node):
>             item = VulnerabilityItem()
>             vulnerabilityId = node.xpath('CVE/text()').extract()
>             item['id'] = vulnerabilityId
>             return item
>
> Trace:
>
>     2014-07-10 11:19:48-0500 [cve] ERROR: Spider error processing <GET https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml>
>         Traceback (most recent call last):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>             call.func(*call.args, **call.kw)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
>             taskObj._oneWorkUnit()
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
>             result = next(self._iterator)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
>             work = (callable(elem, *args, **named) for elem in iterable)
>         --- <exception caught here> ---
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
>             yield next(it)
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
>             for x in result:
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 37, in process_spider_output
>             for item_or_request in self._process_requests(result):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 52, in _process_requests
>             for request in iter(items_or_requests):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
>             return (_set_referer(r) for r in result or ())
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
>             return (r for r in result or () if _filter(r))
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
>             return (r for r in result or () if _filter(r))
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
>             for selector in nodes:
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
>             for node in xmliter(response, self.itertag):
>           File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/iterators.py", line 31, in xmliter
>             yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
>         exceptions.IndexError: list index out of range
>
>     2014-07-10 11:19:48-0500 [cve] INFO: Closing spider (finished)
>     2014-07-10 11:19:48-0500 [cve] INFO: Dumping Scrapy stats:
>         {'downloader/request_bytes': 318,
>          'downloader/request_count': 1,
>          'downloader/request_method_count/GET': 1,
>          'downloader/response_bytes': 3880685,
>          'downloader/response_count': 1,
>          'downloader/response_status_count/200': 1,
>          'finish_reason': 'finished',
>          'finish_time': datetime.datetime(2014, 7, 10, 16, 19, 48, 933372),
>          'log_count/DEBUG': 3,
>          'log_count/ERROR': 1,
>          'log_count/INFO': 7,
>          'response_received_count': 1,
>          'scheduler/dequeued': 1,
>          'scheduler/dequeued/memory': 1,
>          'scheduler/enqueued': 1,
>          'scheduler/enqueued/memory': 1,
>          'spider_exceptions/IndexError': 1,
>          'start_time': datetime.datetime(2014, 7, 10, 16, 19, 37, 768318)}
>     2014-07-10 11:19:48-0500 [cve] INFO: Spider closed (finished)
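Since re-saving the file from Firefox/Notepad (which effectively re-encodes it) made the feed parse, one hedged workaround until the PR lands is to re-encode the response to UTF-8 before the `iternodes` iterator sees it. Below is a minimal, self-contained sketch of the re-encoding step only; the `to_utf8` helper name and the literal declaration rewrite are my assumptions, not part of the original spider. In the spider it would be called from `adapt_response` via `return response.replace(body=to_utf8(response.body))`.

```python
def to_utf8(body):
    """Decode an ISO-8859-1 feed and re-emit it as UTF-8 with a
    matching XML declaration, so the declared and actual encodings
    agree when the node text is re-parsed."""
    text = body.decode("iso-8859-1")
    # Assumption: the declaration uses this exact spelling; a robust
    # version would match it with a regex instead.
    text = text.replace('encoding="ISO-8859-1"', 'encoding="utf-8"')
    return text.encode("utf-8")


# Tiny stand-in for the CVRF feed, with one non-ASCII byte (0xe9, "é").
raw = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
       '<cvrfdoc><Vulnerability>caf\xe9</Vulnerability></cvrfdoc>'
       ).encode("iso-8859-1")

fixed = to_utf8(raw)  # valid UTF-8 bytes with a matching declaration
```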