not all of pages parsed
-----------------------

                 Key: NUTCH-1204
                 URL: https://issues.apache.org/jira/browse/NUTCH-1204
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3
            Reporter: behnam nikbakht
            Priority: Critical


When we fetch a site in multiple segments and dump the crawldb with readdb, the 
system reports that some pages are unfetched. When we checked, we found that 
these pages had been fetched and stored but were not parsed.
We tried crawling a site that contains only HTML pages, adjusting 
suffix-urlfilter.txt and the parser.timeout property, and found that only some 
of the HTML pages were parsed.
This is critical for performance: fetching works correctly, but because parsing 
is incomplete in each iteration, these sites are refetched.
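
To help narrow this down, the check described above (pages the crawldb reports as unfetched although a segment already holds their content) could be scripted roughly as follows. This is only a sketch: the record layout of the text dump (a URL line containing "Version:" followed by a "Status: N (name)" line) is an assumption and may differ between Nutch versions, the helper names are hypothetical, and the list of fetched URLs has to be collected separately from the segments.

```python
import re

# ASSUMED dump layout: a URL line such as "http://example.com/\tVersion: 7",
# followed by a status line such as "Status: 1 (db_unfetched)".
URL_LINE = re.compile(r"^(https?://\S+)\s+Version:")
STATUS_LINE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")


def urls_by_status(dump_text):
    """Group the URLs found in a crawldb text dump by status name."""
    result = {}
    current_url = None
    for line in dump_text.splitlines():
        url_match = URL_LINE.match(line)
        if url_match:
            current_url = url_match.group(1)
            continue
        status_match = STATUS_LINE.match(line)
        if status_match and current_url is not None:
            result.setdefault(status_match.group(1), []).append(current_url)
            current_url = None
    return result


def fetched_but_unparsed(dump_text, fetched_urls):
    """Return URLs marked db_unfetched even though they appear in fetched_urls.

    fetched_urls is a hypothetical input: URLs whose content was found in
    the segments by some other means.
    """
    unfetched = set(urls_by_status(dump_text).get("db_unfetched", []))
    return sorted(unfetched & set(fetched_urls))
```

Any URL this returns would be a candidate for the behaviour reported here, that is, content stored in a segment while the crawldb still treats the page as unfetched.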


        
