I have a problem with Scrapy choking on some HTML comments in pages. The
encoding on the page itself is not great: the comment it chokes on should
open with <!--, but one of the dashes appears to be encoded as a Word-style
em dash. When running the page through a CrawlSpider, Scrapy throws an
exception like:
File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
  self.feed(response_text)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
  self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 174, in goahead
  k = self.parse_declaration(i)
File "/usr/lib/python2.7/markupbase.py", line 98, in parse_declaration
  decltype, j = self._scan_name(j, i)
File "/usr/lib/python2.7/markupbase.py", line 392, in _scan_name
  % rawdata[declstartpos:declstartpos+20])
File "/usr/lib/python2.7/sgmllib.py", line 111, in error
  raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token at '<!\xe2\x80\x94-0QzVNFtk[88X5m'
The page being crawled is:
http://www.jbhifi.com.au/pro-dj/samson/studio-gt-pro-pack-sku-86905/
and the SgmlLinkExtractor rule being used is:
Rule(SgmlLinkExtractor(allow=(r'.*',),
                       deny=(r'/corporate/', r'/stores/', r'/jobs/', r'/factory-scoop/')),
     callback='parse_item', follow=True)
Is there any way to prevent Scrapy from skipping these pages entirely, and
have it continue on tag errors? It's not as if the whole page fails to
parse — the end tags are all there. However, the error seems to be treated
as fatal and the page is skipped.
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.