You can write your own IndexingFilter: have its filter method return null for
any document you don't want indexed. Try starting from the index-basic filter
that ships with Nutch; it should give you the idea.

If you decide during parsing whether a page is wanted, you can add a value to
your ParseData and read it back inside the filter method with
"parse.getData().getMeta(your own key name)". The code might look like
this:

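public Document filter(Document doc, Parse parse, Text url, CrawlDatum datum,
                       Inlinks inlinks) {
    // "..." is the metadata key you set while parsing
    int score = Integer.parseInt(parse.getData().getMeta("..."));
    if (score < 1)
        return null;   // returning null keeps the page out of the index
    return doc;        // otherwise index the document as usual
}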
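
To show the whole flow outside of Nutch, here is a rough, self-contained
sketch of the same idea. It is not the Nutch API: a plain Map stands in for
the parse metadata, a String stands in for the Document, and the names
IndexFlagSketch, FLAG_KEY, "index.score", flag and the toy keyword test are
all made up for illustration. Unlike the snippet above, it also drops a page
whose flag is missing or not a number, which is only one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

final class IndexFlagSketch {
    // Hypothetical key name; use whatever key your parse filter writes.
    static final String FLAG_KEY = "index.score";

    // Parse side: record a score in the page's metadata.
    // The Map is a stand-in for Nutch's parse metadata.
    static void flag(Map<String, String> parseMeta, String pageText) {
        // Toy criterion, purely illustrative: keep pages that mention "nutch".
        int score = pageText.toLowerCase().contains("nutch") ? 1 : 0;
        parseMeta.put(FLAG_KEY, Integer.toString(score));
    }

    // Indexing side: return the document to keep it, or null to drop it.
    static String filter(String doc, Map<String, String> parseMeta) {
        String value = parseMeta.get(FLAG_KEY);
        if (value == null) {
            return null; // assumption: unflagged pages are dropped
        }
        int score;
        try {
            score = Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: an unreadable flag also drops the page
        }
        return score < 1 ? null : doc;
    }

    public static void main(String[] args) {
        Map<String, String> wanted = new HashMap<>();
        flag(wanted, "A page about Nutch plugins");
        Map<String, String> unwanted = new HashMap<>();
        flag(unwanted, "Some unrelated page");

        System.out.println(filter("doc-a", wanted));
        System.out.println(filter("doc-b", unwanted));
    }
}
```

In a real plugin the flag method would live in your HtmlParseFilter and the
filter method in your IndexingFilter; only the metadata key has to match on
both sides.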


----- Original Message ----- 
From: "John Thompson" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, June 28, 2008 8:41 AM
Subject: Only indexing pages meeting certain criteria


> I'm looking to only index a very small subset of the pages that I fetch -
> where whether or not a page belongs in that small subset is determined by
> the page's content when it is parsed.  Anyone done anything like this / know
> roughly what classes I should modify?  I'm flagging the documents (index /
> don't-index) with an extended HtmlParseFilter class, but I'm not so sure
> about the indexing side.
> 
> Best,
> John
>
