[
https://issues.apache.org/jira/browse/NUTCH-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149192#comment-13149192
]
Lewis John McGibbney commented on NUTCH-1204:
---------------------------------------------
Hi Behnam. This is hard to assess because the description is so vague. In all
honesty, it is difficult to even begin work on this unless you provide some
output or other supporting detail. Thanks
> not all of pages parsed
> -----------------------
>
> Key: NUTCH-1204
> URL: https://issues.apache.org/jira/browse/NUTCH-1204
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: behnam nikbakht
> Priority: Critical
> Labels: parse
>
> When we fetch a site in multiple segments and dump the crawldb with readdb,
> the system reports that some pages are unfetched. When we checked, we found
> that these pages had been fetched and stored but were not parsed.
> We tried crawling a site with only HTML pages, edited suffix-urlfilter.txt
> and the parser.timeout property, and tested again; only some of the HTML
> pages were parsed.
> This is critical for performance: fetching works well, but the parsing
> failures across iterations cause these sites to be refetched.
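
One way to produce the kind of output requested above is to summarise the
text dump written by readdb. The sketch below is illustrative only and is not
part of the issue. It assumes each dump record starts with a line of the form
"<url><TAB>Version: N" followed by a line such as "Status: 1 (db_unfetched)".
That layout, the function name summarize_dump, and the file path in the usage
note are assumptions and should be checked against a real dump.

```python
import re
from collections import Counter

# Assumed record layout of a readdb text dump (not taken from this issue):
#   http://example.com/page.html<TAB>Version: 7
#   Status: 1 (db_unfetched)
RECORD_START = re.compile(r"^(\S+)\s+Version:\s*\d+")
STATUS_LINE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def summarize_dump(lines):
    """Count records per status and collect URLs marked db_unfetched."""
    counts = Counter()
    unfetched = []
    current_url = None
    for line in lines:
        start = RECORD_START.match(line)
        if start:
            # Remember the URL until its Status line is seen.
            current_url = start.group(1)
            continue
        status = STATUS_LINE.match(line)
        if status and current_url is not None:
            name = status.group(1)
            counts[name] += 1
            if name == "db_unfetched":
                unfetched.append(current_url)
            current_url = None
    return counts, unfetched
```

Possible usage, with a hypothetical dump path: open the dump file (for
example "crawldb-dump/part-00000"), pass the file object to summarize_dump,
and attach the resulting counts and the list of unfetched URLs to the issue.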