focused crawls -- where to add parse filter

Brian Whitman Sat, 17 Feb 2007 09:49:15 -0800

In doing whole-internet focused crawls we'd like a parse/injector filter.

Say we only want pages in our nutch db and index that have the word "nutch" in them. I'd like to express the rule as a lucene boolean query, contents:nutch, because in our real world scenario the match is more fuzzy and involves many phrases and terms. It's not just a regular expression.

If the query does not match or matches under a threshold score, I don't want to add the fetched/parsed document to the index, nor (more importantly) have the generator find outlinks from that page for future crawls.

This is somewhat like a url filter, but instead of filtering by url content I want to filter by parsed page content.
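
To make the desired behaviour concrete, here is a rough sketch in plain Java. It uses no Nutch or Lucene APIs: the class name ContentGate, the term-fraction score and the threshold are all hypothetical stand-ins for a real lucene query score, and it says nothing about where such a check would hook into nutch.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not a Nutch plugin and not a Lucene query.
public class ContentGate {
    private final List<String> terms;  // wanted terms, assumed to be lowercase
    private final double threshold;    // minimum score needed to keep a page

    public ContentGate(List<String> terms, double threshold) {
        this.terms = terms;
        this.threshold = threshold;
    }

    // Stand-in score: fraction of the page's tokens that equal a wanted term.
    public double score(String parsedText) {
        int total = 0;
        int hits = 0;
        for (String token : parsedText.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            total++;
            if (terms.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // A page is kept (indexed) only if its score reaches the threshold.
    public boolean accept(String parsedText) {
        return score(parsedText) >= threshold;
    }

    // A rejected page contributes no outlinks to future crawls.
    public List<String> outlinksToFollow(String parsedText, List<String> outlinks) {
        return accept(parsedText) ? outlinks : Collections.<String>emptyList();
    }
}
```

With the single term "nutch" and a threshold of 0.1, a page whose parsed text is "Nutch is a crawler" would be kept and its outlinks followed, while "hello world" would be dropped along with its outlinks.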


Where would I add this in nutch?

-Brian
