You could write a new MapReduce job that takes a directory of Nutch segments and outputs a single segment containing only the data that has the features you're looking for. Basically, it would just be a copy of the mergeseg job that only calls output.collect if a feature was found. A rough sketch of the map side is below.
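This is only a sketch of the idea, assuming the old-style org.apache.hadoop.mapred API (generic signatures as in the later 0.1x releases) and Nutch's ParseText record type; the class name, the "segment.filter.feature" property, and the plain substring check standing in for a real feature detector are all illustrative, so adjust them to your Nutch/Hadoop version and your own detection logic.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.parse.ParseText;

public class FeatureFilterMapper extends MapReduceBase
    implements Mapper<Text, ParseText, Text, ParseText> {

  private String feature;

  public void configure(JobConf job) {
    // Hypothetical config property naming the feature to look for.
    feature = job.get("segment.filter.feature", "");
  }

  public void map(Text url, ParseText parseText,
                  OutputCollector<Text, ParseText> output, Reporter reporter)
      throws IOException {
    // Pass the record through only when the feature is present;
    // everything else is dropped from the output segment.
    if (parseText.getText().indexOf(feature) >= 0) {
      output.collect(url, parseText);
    }
  }
}

The job setup around it would look much like SegmentMerger's: point the input at the parse_text (and, if you need them, parse_data/crawl_*) subdirectories of each input segment and the output at a fresh segment directory, then index that.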
Then you would be left with a concise segment of only the data you want, which you could then index.

On 10/24/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:
> Seems that my use case is pretty similar. What I want is to crawl all pages
> in some sites, but index just those which contain features, as you call it.
> In my case probably only 10% of pages contain the feature, so keeping /
> indexing all pages would be a waste of space and time. Though, I need to get
> the pages without features because otherwise I won't be able to reach pages
> with features.
>
> For now, during parsing I write a file with URLs containing features. Then I
> run mergeseg with a URL filter which loads the file from the previous step and
> accepts only URLs from this file. Finally I index the filtered segment.
>
> I don't like this solution because it won't work properly when I add a second
> computer. I think of extending mergeseg so that it would filter pages based on
> metadata. I'll need to ask some questions on the group to do it, but so far
> I don't have time for this. But it seems you have solved some of those
> problems. Maybe you could contribute or share some code / ideas on how to do
> it in the best way.
>
> Regards,
> Marcin
>
>
> On 24 October 2007 at 6:48, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> > Dear nutch-user readers,
> >
> > I have a question for everyone here: Is the current Nutch crawler
> > (Fetcher/Fetcher2) flexible enough for your needs?
> > If not, what would you like to see it do?
> >
> > I'm asking because, last week, I suggested that the Nutch crawler
> > could be much more useful to many people if it was structured more as
> > a "crawler construction toolkit". But I realize that my comments
> > could seem like sour grapes unless there's some plan for moving
> > forward. So, I thought I'd just ask everybody what you think and
> > tally the results.
> >
> > What kind of crawls would you like to do that aren't supported? I'll
> > start with some nonstandard crawls I've done:
> >
> > 1) Outlinks-only crawl: crawl a specific website, keep only the
> > outlinks from articles (, etc)
> > 2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter
> > 3) Plug in a "feature detector" (address, date, brand-name, etc) and
> > use this signal to guide the crawl
> >
> > 4) .... (fill in your own here!)
> >
> > --
> > Matt Kangas / [EMAIL PROTECTED]
