Not all pages are parsed
------------------------
Key: NUTCH-1204
URL: https://issues.apache.org/jira/browse/NUTCH-1204
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: behnam nikbakht
Priority: Critical
When we fetch a site in multiple segments and dump the crawldb with readdb, the
system reports some of the pages as unfetched. When we checked, we found that
these pages had been fetched and stored but had not been parsed.
We tried crawling a site that contains only HTML pages, after editing
suffix-urlfilter.txt and the parser.timeout property, and found that only some
of the HTML pages were parsed.
This is critical for performance: fetching the sites works well, but the
incomplete parsing across iterations causes these sites to be refetched.
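
To put a number on the symptom described above, the status of each record in a
crawldb text dump can be tallied. The following is a minimal sketch, not part
of Nutch. It assumes the dump contains one line per record of the form
"Status: 1 (db_unfetched)"; that line shape, the function name count_statuses,
and the sample lines are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed shape of a status line in the text dump, e.g. "Status: 1 (db_unfetched)".
STATUS_LINE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def count_statuses(lines):
    """Tally the status name found on each status line of a crawldb text dump."""
    counts = Counter()
    for line in lines:
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines standing in for a real dump file.
sample = [
    "http://example.com/a.html\tVersion: 7",
    "Status: 1 (db_unfetched)",
    "http://example.com/b.html\tVersion: 7",
    "Status: 2 (db_fetched)",
    "http://example.com/c.html\tVersion: 7",
    "Status: 1 (db_unfetched)",
]
print(count_statuses(sample))
```

Running the same tally after each iteration would show whether the unfetched
count keeps including pages that are already stored in a segment.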