[ https://issues.apache.org/jira/browse/NUTCH-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1204.
----------------------------------
Resolution: Invalid
This doesn't seem to be a real issue, not least because some documents will
always fail to parse. Please check the user mailing list before opening a ticket.
> not all of pages parsed
> -----------------------
>
> Key: NUTCH-1204
> URL: https://issues.apache.org/jira/browse/NUTCH-1204
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: behnam nikbakht
> Priority: Critical
> Labels: parse
>
> When we fetch a site in multiple segments and dump the crawldb with readdb, the
> system reports some pages as unfetched. When we checked, we found that these
> pages had been fetched and stored but were not parsed.
> We tried crawling a site containing only HTML pages, edited suffix-urlfilter.txt
> and the parser.timeout property, and tested again; only some of the HTML pages
> were parsed.
> This is critical for performance: fetching works well, but the parse failures
> across iterations cause these sites to be refetched.