Hey there,

I'm quite new to the Nutch scene and have only been reading the list for about two weeks.

At the moment I've got the following problem: we want to crawl all pages in scope, but save only the ones that have a special feature, so I think your third proposal would be really useful for us. Maybe there is an easy way to achieve what we are trying to do, but there is no documentation about it (or I haven't found any).
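
To make it concrete, here is roughly what I imagine -- a rough, untested sketch of an IndexingFilter plugin that lets every page be fetched and its outlinks followed, but only lets pages containing our "feature" into the index (assuming that "save" can mean "end up in the index"; the raw content would still sit in the segments). The class name, the marker string, and the exact interface signature are just my guesses and may differ between Nutch versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter: keep only pages that show our "special feature".
// Pages without it are still fetched and their outlinks followed; they
// just never make it into the index.
public class FeatureOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // crude placeholder "feature detector": look for a marker string
    // in the parsed text (a real detector would be smarter)
    String text = parse.getText();
    if (text != null && text.contains("SPECIAL-FEATURE")) {
      return doc;    // keep this page
    }
    return null;     // returning null drops the page from the index
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

As far as I understand, the plugin would also need the usual plugin.xml/build.xml registration and an entry in plugin.includes in nutch-site.xml. Is that the right direction, or is there a better extension point for this?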

Some people might have had similar problems and perhaps already solved them, so it would be great if you could share your experiences and maybe some code fragments (perhaps even on the wiki, so that newcomers who aren't reading the list yet can find them).

so long,
        Sebastian Steinmetz


On 24.10.2007, at 06:48, Matt Kangas wrote:

Dear nutch-user readers,

I have a question for everyone here: Is the current Nutch crawler (Fetcher/Fetcher2) flexible enough for your needs?
If not, what would you like to see it do?

I'm asking because, last week, I suggested that the Nutch crawler could be much more useful to many people if it were structured more as a "crawler construction toolkit". But I realize that my comments could seem like sour grapes unless there's some plan for moving forward. So I thought I'd just ask everybody what you think and tally the results.

What kinds of crawls would you like to do that aren't supported? I'll start with some nonstandard crawls I've done:

1) Outlinks-only crawl: crawl a specific website, keep only the outlinks from articles, etc.
2) Crawl into CGIs w/o infinite crawl -- via a crawl-depth filter (see the sketch after this list)
3) Plug in a "feature detector" (address, date, brand-name, etc) and use this signal to guide the crawl

4) .... (fill in your own here!)
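
To ground the discussion with a plug-in point that already exists, here is a minimal, untested sketch of a URLFilter along the lines of #2. A real crawl-depth filter needs per-page depth information that a plain URLFilter never sees (it only gets the URL string), so this one just caps the number of CGI query parameters as a crude stand-in; the class name and the config property are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical guard: follow CGI-style URLs (ones with a query string)
// but reject URLs whose query string has too many parameters, as a
// cheap brake on crawling forever through generated links.
public class CgiGuardURLFilter implements URLFilter {

  private Configuration conf;
  private int maxParams = 3;   // arbitrary default

  public String filter(String urlString) {
    int q = urlString.indexOf('?');
    if (q < 0) {
      return urlString;                   // no query string: accept as-is
    }
    int params = urlString.substring(q + 1).split("&").length;
    return params <= maxParams ? urlString : null;   // null = reject
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "cgiguard.max.params" is a made-up property name
    maxParams = conf.getInt("cgiguard.max.params", maxParams);
  }

  public Configuration getConf() { return conf; }
}

That particular heuristic isn't the point; it's only an illustration of the kind of per-crawl logic I mean by "nonstandard". The question is what else people would want to plug in.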

--
Matt Kangas / [EMAIL PROTECTED]


