Hi all,

I'm new to Nutch internals. My project, in short, is the following: index web pages, but only those that contain a match for a specific regex (supplied by me). The regex extracts specific attributes, which will be used later for efficient search.

I envisioned the following changes. Do you think this goes in the right direction, or is there a smarter way? The plug-in technique did not seem to fit.

1. Modify the outlink extractor in two ways:
   - return an array of the matches of my regex
   - return outlinks only if my regex matches
2. Modify the indexer to use the regex match attribute:
   - do not index pages with no matches
3. Modify the search engine to use the matches attribute.

Thanks for your answers!
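P.S. To make steps 1 and 2 more concrete, here is a rough sketch of the gating logic I have in mind, in plain Java with only java.util.regex. It does not use any Nutch classes: the class name RegexGate, its method names, and the idea of passing the page text and outlinks as plain strings are all made up for illustration, and the real hook points in Nutch would look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides what to keep for a page
// based on a user-supplied regex.
final class RegexGate {
    private final Pattern pattern;

    RegexGate(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Step 1a: collect every match of the regex in the page text.
    // Returns an empty list when nothing matches.
    List<String> matches(String pageText) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Step 1b: pass the outlinks through only if the page matched;
    // otherwise return no outlinks at all.
    List<String> filterOutlinks(String pageText, List<String> outlinks) {
        if (pattern.matcher(pageText).find()) {
            return new ArrayList<>(outlinks);
        }
        return new ArrayList<>();
    }

    // Step 2: a page is indexed only if it has at least one match.
    boolean shouldIndex(String pageText) {
        return pattern.matcher(pageText).find();
    }
}
```

The list returned by matches() is what I would store as the "matches" attribute for the indexer and, later, the search side (step 3).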
