For focused whole-internet crawls, we'd like a parse/injector
filter.
Say we only want pages in our Nutch db and index that contain the
word "nutch". I'd like to express the rule as a Lucene boolean
query, contents:nutch, because in our real-world scenario the match
is fuzzier and involves many phrases and terms. It's not just a
regular expression.
If the query does not match, or matches below a threshold score, I
don't want to add the fetched/parsed document to the index, nor (more
importantly) have the generator pick up outlinks from that page for
future crawls.
This is somewhat like a URL filter, but instead of filtering on the
URL string I want to filter on the parsed page content.
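
To make the intent concrete, here is a rough sketch in plain Java of the
decision I have in mind. It does not use the Nutch or Lucene APIs:
ContentGate, score, accept, outlinksToFollow and the threshold value are all
made-up names, and the scoring is a crude phrase count standing in for a
real Lucene query score.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only: none of these names exist in Nutch or Lucene.
public class ContentGate {
    private final List<String> phrases;   // terms/phrases that make a page "on topic"
    private final double threshold;       // minimum score needed to keep a page

    public ContentGate(List<String> phrases, double threshold) {
        this.phrases = phrases;
        this.threshold = threshold;
    }

    // Crude stand-in for a query score: counts case-insensitive,
    // non-overlapping occurrences of each phrase in the parsed text.
    public double score(String parsedText) {
        String text = parsedText.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (String phrase : phrases) {
            String p = phrase.toLowerCase(Locale.ROOT);
            if (p.isEmpty()) {
                continue; // an empty phrase would match everywhere
            }
            int from = 0;
            while ((from = text.indexOf(p, from)) >= 0) {
                total += 1.0;
                from += p.length();
            }
        }
        return total;
    }

    // True if the page should go into the index.
    public boolean accept(String parsedText) {
        return score(parsedText) >= threshold;
    }

    // Outlinks of a rejected page are dropped so they never reach future crawls.
    public List<String> outlinksToFollow(String parsedText, List<String> outlinks) {
        return accept(parsedText) ? outlinks : Collections.<String>emptyList();
    }
}
```

So a page whose parsed text fails accept() would be left out of the index,
and outlinksToFollow() would hand back an empty list for it.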
Where would I add this in Nutch?
-Brian