[jira] [Resolved] (NUTCH-1204) not all of pages parsed

Markus Jelsma (Resolved) (JIRA) Mon, 14 Nov 2011 07:41:15 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markus Jelsma resolved NUTCH-1204.
----------------------------------

    Resolution: Invalid

This doesn't seem to be a real issue, partly because some documents will 
always fail. Please check the user mailing list before opening a ticket.
                
> not all of pages parsed
> -----------------------
>
>                 Key: NUTCH-1204
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1204
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: behnam nikbakht
>            Priority: Critical
>              Labels: parse
>
> When we fetch a site in multiple segments and dump the crawldb with readdb, 
> the system reports some pages as unfetched. When we checked, we found that 
> these pages had been fetched and stored but not parsed.
> We tried crawling a site with only HTML pages, edited suffix-urlfilter.txt 
> and the parser.timeout property, and tested again; only some of the HTML 
> pages were parsed.
> This is critical for performance: fetching the sites works well, but the 
> parsing failures cause these sites to be refetched in later iterations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
