Seems that my use case is pretty similar. What I want is to crawl all pages on some sites, but index only those which contain features, as you call them. In my case probably only 10% of the pages contain the feature, so keeping / indexing all pages would be a waste of space and time. Still, I need to fetch the pages without features, because otherwise I won't be able to reach the pages with features.
For now, during parsing I write a file with the URLs that contain features. Then I run mergeseg with a URL filter which loads the file from the previous step and accepts only the URLs listed in it. Finally I index the filtered segment. I don't like this solution because it won't work properly once I add a second computer. I'm thinking of extending mergeseg so it can filter pages based on metadata. I'll need to ask some questions on the list to do that, but so far I haven't had time for it. But it seems you have solved some of these problems already. Maybe you could contribute or share some code / ideas on how to do it in the best way. (A rough sketch of my URL-filter step is at the bottom of this message.)

Regards,
Marcin

On 24 October 2007 at 6:48, Matt Kangas <[EMAIL PROTECTED]> wrote:

> Dear nutch-user readers,
>
> I have a question for everyone here: Is the current Nutch crawler
> (Fetcher/Fetcher2) flexible enough for your needs?
> If not, what would you like to see it do?
>
> I'm asking because, last week, I suggested that the Nutch crawler
> could be much more useful to many people if it was structured more as
> a "crawler construction toolkit". But I realize that my comments
> could seem like sour grapes unless there's some plan for moving
> forward. So, I thought I'd just ask everybody what you think and
> tally the results.
>
> What kind of crawls would you like to do that aren't supported? I'll
> start with some nonstandard crawls I've done:
>
> 1) Outlinks-only crawl: crawl a specific website, keep only the
> outlinks from articles (, etc)
> 2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter
> 3) Plug in a "feature detector" (address, date, brand-name, etc) and
> use this signal to guide the crawl
>
> 4) .... (fill in your own here!)
>
> --
> Matt Kangas / [EMAIL PROTECTED]
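P.S. To make the URL-filter step above more concrete, here is a minimal sketch of the kind of whitelist filter plugin I mean. It assumes the stock Nutch URLFilter extension point; the class name, package, and the "urlfilter.whitelist.file" property are just names made up for this example, and I haven't tested it against current trunk:

package org.example.nutch;  // illustrative package, not part of Nutch

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/**
 * Accepts only URLs that appear in a whitelist file, e.g. the file of
 * feature-bearing URLs written out during parsing. Everything else is
 * rejected, so only whitelisted pages survive the filtering pass.
 */
public class WhitelistURLFilter implements URLFilter {

  private Configuration conf;
  private final Set<String> allowed = new HashSet<String>();

  // URLFilter contract: return the URL to keep it, null to drop it.
  public String filter(String urlString) {
    return allowed.contains(urlString) ? urlString : null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "urlfilter.whitelist.file" is an assumed property name; point it at
    // the URL list produced during parsing (e.g. via nutch-site.xml).
    String path = conf.get("urlfilter.whitelist.file", "feature-urls.txt");
    try {
      BufferedReader in = new BufferedReader(new FileReader(path));
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) {
          allowed.add(line);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not read whitelist file " + path, e);
    }
  }

  public Configuration getConf() {
    return conf;
  }
}

If I understand the plugin system correctly, the class would also need a plugin.xml declaring it as an implementation of org.apache.nutch.net.URLFilter, and the plugin id would have to be added to plugin.includes in nutch-site.xml so the merge and index steps pick it up. The second-computer problem remains, of course: the whitelist file has to be readable on every node, which is why I'd rather filter on metadata stored in the segment itself.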
