Seems that my use case is pretty similar. What I want is to crawl all pages on 
some sites, but index only those which contain a "feature", as you call it. In 
my case probably only 10% of the pages contain the feature, so keeping / 
indexing all pages would be a waste of space and time. Still, I need to fetch 
the pages without features, because otherwise I won't be able to reach the 
pages with features.

For now, during parsing I write out a file with the URLs that contain features. 
Then I run mergesegs with a URL filter which loads the file from the previous 
step and accepts only the URLs listed in it. Finally I index the filtered 
segment.
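
To be concrete, below is a minimal sketch of that URL filter as a Nutch 
URLFilter plugin. The class name, the property name 
"urlfilter.allowlist.file", and the one-URL-per-line file format are just my 
own choices for illustration, not anything that already exists in Nutch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class AllowListURLFilter implements URLFilter {

  private Configuration conf;
  // URLs that the feature detector marked during parsing (one per line in the file)
  private Set<String> allowed = new HashSet<String>();

  /** Accept a URL only if it is on the allow-list; reject everything else. */
  public String filter(String urlString) {
    return allowed.contains(urlString) ? urlString : null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    String file = conf.get("urlfilter.allowlist.file");  // assumed property name
    if (file == null) {
      return;
    }
    try {
      BufferedReader in = new BufferedReader(new FileReader(file));
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) {
          allowed.add(line);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Cannot read allow-list file " + file, e);
    }
  }

  public Configuration getConf() {
    return conf;
  }
}

The allow-list is loaded once in setConf(), and filter() returns null for 
every URL not in it, so the merged segment ends up containing feature pages 
only.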

I don't like this solution because it won't work properly once I add a second 
machine. I'm thinking of extending mergesegs so that it filters pages based on 
metadata set during parsing. I'll need to ask some questions on the list to do 
it, but so far I haven't had time for this. It seems you have already solved 
some of these problems, though. Maybe you could contribute, or share some 
code / ideas on how to do it best.
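
To make the metadata idea concrete, here is a rough sketch that scans a 
segment's parse_data and prints the URLs whose parse metadata carries a 
feature flag. The key name "hasFeature" (which the parser would have to set) 
and the part-file path are assumptions on my side:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class FeatureUrlExtractor {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/20071024123456/parse_data/part-00000/data
    Path data = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    ParseData parseData = new ParseData();
    while (reader.next(url, parseData)) {
      // "hasFeature" is whatever key the parser wrote into the parse metadata
      if (parseData.getParseMeta().get("hasFeature") != null) {
        System.out.println(url);
      }
    }
    reader.close();
  }
}

The printed list could feed the allow-list filter above, or the same 
parse-metadata check could eventually be moved inside the segment merger 
itself.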

Regards,
Marcin


On 24 October 2007 at 6:48, Matt Kangas <[EMAIL PROTECTED]> wrote:

> Dear nutch-user readers,
> 
> I have a question for everyone here: Is the current Nutch crawler  
> (Fetcher/Fetcher2) flexible enough for your needs?
> If not, what would you like to see it do?
> 
> I'm asking because, last week, I suggested that the Nutch crawler  
> could be much more useful to many people if it was structured more as  
> a "crawler construction toolkit". But I realize that my comments  
> could seem like sour grapes unless there's some plan for moving  
> forward. So, I thought I'd just ask everybody what you think and  
> tally the results.
> 
> What kind of crawls would you like to do that aren't supported? I'll  
> start with some nonstandard crawls I've done:
> 
> 1) Outlinks-only crawl: crawl a specific website, keep only the  
> outlinks from articles (etc.)
> 2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter
> 3) Plug in a "feature detector" (address, date, brand-name, etc) and  
> use this signal to guide the crawl
> 
> 4) .... (fill in your own here!)
> 
> --
> Matt Kangas / [EMAIL PROTECTED]
> 
> 
> 
