On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:

> You can use an HtmlParseFilter and then set a metadata attribute as
> to whether or not it contains the phrase. Problem with this is
> that all of the content is still stored. You could also change the
> ParseOutputFormat to only write out if the word is contained
> although that is a bit of a hack.

I'm not worried about a hack; our whole setup is very "der lauf der dinge" and one more plank won't matter much :)

But after sending my question out, I realized that I would need to index the document anyway before being able to query it in Lucene for topicality. I don't mind having pages stored that don't match my query, but I really would rather the generator not get more outlinks from those pages.

So a simple fix would be something I can write or run after a crawl/index cycle that marks certain pages so they don't emit more URIs in the generator. It would query each page in an index and update some flag. But what is that flag, and how can I get at it?

And, more advanced and for later on: the generator has smarts to prioritize fetching by inlink counts. Is there something I can hack to "boost" outlink fetches based on the source page's content? For example, I find a page that scores high on my Lucene query after crawl/index gets done. I would want the generator to put all of its outlinks up top, even if there aren't many inlinks to that page. Would this be a "generator plugin"?

-Brian

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
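
To make the two ideas in the message concrete (suppress outlinks from off-topic pages, and push outlinks of high-scoring pages to the top of the fetch order), here is a minimal standalone sketch in Java. It does not use any Nutch or Lucene API: the class TopicalGeneratorSketch, the Page type, the minScore threshold, the boost factor, and the formula "inlink count + boost * topic score" are all assumptions made up for illustration, and the topic score stands in for whatever the Lucene query would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch: none of these types are Nutch classes; all names are hypothetical.
class TopicalGeneratorSketch {

    /** A fetched page: its URL, a topicality score (assumed to come from an index query), and its outlinks. */
    static final class Page {
        final String url;
        final float topicScore;
        final List<String> outlinks;

        Page(String url, float topicScore, List<String> outlinks) {
            this.url = url;
            this.topicScore = topicScore;
            this.outlinks = outlinks;
        }
    }

    /**
     * Returns outlink URLs in the order they should be fetched.
     * Pages scoring below minScore emit no outlinks at all (the "flag").
     * Every other outlink is sorted by inlinkCount + boost * sourceTopicScore,
     * keeping the best value when several pages link to the same URL.
     */
    static List<String> fetchOrder(List<Page> pages, Map<String, Integer> inlinkCounts,
                                   float minScore, float boost) {
        Map<String, Float> sortValues = new HashMap<>();
        for (Page p : pages) {
            if (p.topicScore < minScore) {
                continue; // off-topic source page: suppress its outlinks
            }
            for (String link : p.outlinks) {
                float value = inlinkCounts.getOrDefault(link, 0) + boost * p.topicScore;
                sortValues.merge(link, value, Float::max);
            }
        }
        List<String> order = new ArrayList<>(sortValues.keySet());
        // Highest sort value first.
        order.sort((a, b) -> Float.compare(sortValues.get(b), sortValues.get(a)));
        return order;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("a", 0.9f, Arrays.asList("x", "y")),  // strongly on topic
            new Page("b", 0.1f, Arrays.asList("z")),       // off topic
            new Page("c", 0.5f, Arrays.asList("y", "w"))); // moderately on topic
        Map<String, Integer> inlinks = new HashMap<>();
        inlinks.put("y", 1);
        inlinks.put("w", 3);
        inlinks.put("z", 10);
        // "z" has the most inlinks but is dropped because its only source page is off topic.
        System.out.println(fetchOrder(pages, inlinks, 0.3f, 10f)); // prints [y, x, w]
    }
}
```

In this sketch the boost lets "x" (no inlinks, but linked from a high-scoring page) outrank "w" (three inlinks, linked from a weaker page), which is the behaviour asked about. Where such logic would actually hook into the generator is not settled here.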
