OK. I made it work as I wanted to. The following modifications were made in the 
IndexSegment class: 
   
  1. Delete the following lines:
              if (!fetcherOutput.getProtocolStatus().isSuccess()) {
                continue;
              }

  2. Replace 
                    doc = IndexingFilters.filter(doc, parse, fetcherOutput);

  with
   
              if (fetcherOutput.getProtocolStatus().isSuccess())
                doc = IndexingFilters.filter(doc, parse, fetcherOutput);
              else
                doc.add(Field.UnIndexed("url",
                                        fetcherOutput.getUrl().toString()));
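
  Putting the two changes together, the relevant piece of the IndexSegment
loop ends up roughly like this (just a sketch of the surrounding context;
only the lines shown above actually changed):

              // The early "skip on failed fetch" check from step 1 is gone,
              // so failed fetches reach this point too.
              if (fetcherOutput.getProtocolStatus().isSuccess())
                // successful fetch: index the page as usual
                doc = IndexingFilters.filter(doc, parse, fetcherOutput);
              else
                // failed fetch (e.g. 404): store only the URL, unindexed,
                // so the page still shows up in the segment index and
                // DeleteDuplicates can drop the stale copy
                doc.add(Field.UnIndexed("url",
                                        fetcherOutput.getUrl().toString()));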

  It would be great if Nutch had a config file option that allows switching to 
this mode. It is very useful for incremental crawling, when one knows that 
documents were updated/added/deleted.
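
  Such a switch could be wired in roughly like this. This is only a sketch:
the property name indexer.index.failed.pages is made up, and I'm assuming
NutchConf's getBoolean accessor; the default (false) keeps the current
behaviour and could be documented in nutch-default.xml.

              // hypothetical option, defaults to the current behaviour
              boolean indexFailedPages =
                NutchConf.get().getBoolean("indexer.index.failed.pages", false);

              if (fetcherOutput.getProtocolStatus().isSuccess())
                doc = IndexingFilters.filter(doc, parse, fetcherOutput);
              else if (indexFailedPages)
                // keep a stored-only record of the URL for failed fetches
                doc.add(Field.UnIndexed("url",
                                        fetcherOutput.getUrl().toString()));
              else
                // original behaviour: drop failed fetches from the index
                continue;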
  

Mike Alulin <[EMAIL PROTECTED]> wrote:
  Here is some extra info. Tested with Nutch version 0.71

The IndexSegment class checks the HTTP status of the response, and if it is not 
200 it does not save any info to the index. As a result the DeleteDuplicates 
class does not know that the page has recently been deleted and leaves it in the 
index.

Is this how Nutch is supposed to work? If so, how can it remove 404 pages?




Mike Alulin wrote:
What is the best way to delete 404 pages from the index files? It seems that 
the fetcher saves info about 404 pages; however, when Nutch builds the segment 
index it does not include that data. As a result the searcher returns the 
document link from an older version of the index file, although the document 
does not exist anymore.

SegmentMergeTool does handle this; however, it is too slow to use too often.
