You could write a new MapReduce job that takes a directory of Nutch
segments and outputs a single segment containing only the data that has
the features you're looking for.  Basically, it would just be a copy of
the mergeseg job that only calls output.collect if a feature was
found.

Then you would be left with a concise segment of only the data you
want, which you could then index.
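
For illustration, the filtering reduce step might look roughly like this,
using the old (pre-0.20) Hadoop API that Nutch used at the time. It is only
a sketch: the real mergeseg job wraps each record in a MetaWrapper and
tracks segment part names, which is omitted here, and the parse-metadata
key "hasFeature" is just an example of a flag your parse plugin could set
when it detects a feature.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.parse.ParseData;

public class FeatureFilterReducer extends MapReduceBase implements Reducer {

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    // All records for one URL (CrawlDatum, Content, ParseData, ParseText, ...)
    // arrive together, so buffer them and decide once we've seen the ParseData.
    // Note: depending on the Hadoop version the framework may reuse value
    // instances, so a real job should clone each value before buffering it.
    List<Writable> buffered = new ArrayList<Writable>();
    boolean hasFeature = false;
    while (values.hasNext()) {
      Writable value = (Writable) values.next();
      if (value instanceof ParseData) {
        // "hasFeature" is an example key a parse plugin might have written.
        hasFeature = ((ParseData) value).getParseMeta().get("hasFeature") != null;
      }
      buffered.add(value);
    }
    if (!hasFeature) {
      return;                      // drop all records for pages without the feature
    }
    for (Writable value : buffered) {
      output.collect(key, value);  // keep everything for pages that matched
    }
  }
}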

On 10/24/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:
> It seems my use case is pretty similar. What I want is to crawl all pages
> on some sites, but index just those which contain features, as you call it.
> In my case probably only 10% of pages contain the feature, so keeping /
> indexing all pages would be a waste of space and time. However, I still need
> to fetch the pages without features, because otherwise I won't be able to
> reach the pages with features.
>
> For now, during parsing I write a file with the URLs that contain features.
> Then I run mergeseg with a URL filter which loads the file from the previous
> step and accepts only the URLs from that file. Finally I index the filtered segment.
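>
> Roughly, the filter looks like this (simplified -- the class name and the
> hard-coded file name are just for illustration):
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.HashSet;
> import java.util.Set;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.net.URLFilter;
>
> public class FeatureUrlFilter implements URLFilter {
>
>   private Configuration conf;
>   private Set<String> accepted = new HashSet<String>();
>
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     try {
>       // One URL per line, written by the parser when it found a feature.
>       BufferedReader in = new BufferedReader(new FileReader("feature-urls.txt"));
>       String line;
>       while ((line = in.readLine()) != null) {
>         accepted.add(line.trim());
>       }
>       in.close();
>     } catch (IOException e) {
>       throw new RuntimeException("Could not load feature URL list", e);
>     }
>   }
>
>   public Configuration getConf() {
>     return conf;
>   }
>
>   // Return the URL to accept it, or null to reject it.
>   public String filter(String url) {
>     return accepted.contains(url) ? url : null;
>   }
> }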
>
> I don't like this solution because it won't work properly when I add a second
> computer. I'm thinking of extending mergeseg so that it filters pages based on
> metadata. I'll need to ask some questions on the group to do it, but so far
> I haven't had time for this. But it seems you have solved some of those
> problems. Maybe you could contribute or share some code / ideas on how to do
> it in the best way.
>
> Regards,
> Marcin
>
>
> On 24 October 2007 at 6:48, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> > Dear nutch-user readers,
> >
> > I have a question for everyone here: Is the current Nutch crawler
> > (Fetcher/Fetcher2) flexible enough for your needs?
> > If not, what would you like to see it do?
> >
> > I'm asking because, last week, I suggested that the Nutch crawler
> > could be much more useful to many people if it were structured more as
> > a "crawler construction toolkit". But I realize that my comments
> > could seem like sour grapes unless there's some plan for moving
> > forward. So, I thought I'd just ask everybody what you think and
> > tally the results.
> >
> > What kind of crawls would you like to do that aren't supported? I'll
> > start with some nonstandard crawls I've done:
> >
> > 1) Outlinks-only crawl: crawl a specific website, keep only the
> > outlinks from articles, etc.
> > 2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter
> > 3) Plug in a "feature detector" (address, date, brand-name, etc) and
> > use this signal to guide the crawl
> >
> > 4) .... (fill in your own here!)
> >
> > --
> > Matt Kangas / [EMAIL PROTECTED]
> >
> >
> >
>
