I have a problem with Scrapy choking on some HTML comments in pages. The
encoding on the page itself is not great: the comment it chokes on should
open with <!--, but one of the dashes appears to be encoded as a Word-style
em dash. When running the page through a CrawlSpider, Scrapy throws an
exception like:
File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
  self.feed(response_text)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
  self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 174, in goahead
  k = self.parse_declaration(i)
File "/usr/lib/python2.7/markupbase.py", line 98, in parse_declaration
  decltype, j = self._scan_name(j, i)
File "/usr/lib/python2.7/markupbase.py", line 392, in _scan_name
  % rawdata[declstartpos:declstartpos+20])
File "/usr/lib/python2.7/sgmllib.py", line 111, in error
  raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token at '<!\xe2\x80\x94-0QzVNFtk[88X5m'
The page being crawled is:
http://www.jbhifi.com.au/pro-dj/samson/studio-gt-pro-pack-sku-86905/
and the SgmlLinkExtractor rule being used is:
Rule(SgmlLinkExtractor(allow=(r'.*',),
                       deny=(r'/corporate/', r'/stores/', r'/jobs/', r'/factory-scoop/')),
     callback='parse_item', follow=True)
Is there any way to prevent Scrapy from skipping these pages entirely, and
have it continue on tag errors? It's not as if the whole page fails to
parse — the end tags are all there. However, the error seems to be treated
as fatal and the page is skipped.
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.