Hey there,

I'm quite new to the Nutch scene and have only been reading the list for about two weeks.

At the moment I've got the following problem: we want to crawl all pages in scope, but save only the ones that have a special feature, so I think your third proposal would be really useful for us. Maybe there is an easy way to achieve what we are trying to do, but there is no documentation about it (or I haven't found any).
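
To make it concrete, here is roughly what I imagine -- a rough, untested sketch of an IndexingFilter plugin that lets every page be fetched and its outlinks followed, but only lets pages containing our "feature" into the index (assuming that "save" can mean "end up in the index"; the raw content would still sit in the segments). The class name, the marker string, and the exact interface signature are just my guesses and may differ between Nutch versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter: keep only pages that show our "special feature".
// Pages without it are still fetched and their outlinks followed; they
// just never make it into the index.
public class FeatureOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // crude placeholder "feature detector": look for a marker string
    // in the parsed text (a real detector would be smarter)
    String text = parse.getText();
    if (text != null && text.contains("SPECIAL-FEATURE")) {
      return doc;    // keep this page
    }
    return null;     // returning null drops the page from the index
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

As far as I understand, the plugin would also need the usual plugin.xml/build.xml registration and an entry in plugin.includes in nutch-site.xml. Is that the right direction, or is there a better extension point for this?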

Some people might have had similar problems and perhaps already solved them, so it would be great if you could share your experiences and maybe some code fragments (perhaps even on the wiki, so that newcomers who aren't reading the list yet can find them).

so long,
        Sebastian Steinmetz


On 24.10.2007, at 06:48, Matt Kangas wrote:

Dear nutch-user readers,

I have a question for everyone here: Is the current Nutch crawler (Fetcher/Fetcher2) flexible enough for your needs?
If not, what would you like to see it do?

I'm asking because, last week, I suggested that the Nutch crawler could be much more useful to many people if it were structured more as a "crawler construction toolkit". But I realize that my comments could seem like sour grapes unless there's some plan for moving forward. So I thought I'd just ask everybody what you think and tally the results.

What kinds of crawls would you like to do that aren't supported? I'll start with some nonstandard crawls I've done:

1) Outlinks-only crawl: crawl a specific website, keep only the outlinks from articles, etc.
2) Crawl into CGIs w/o infinite crawl -- via a crawl-depth filter (see the sketch after this list)
3) Plug in a "feature detector" (address, date, brand-name, etc) and use this signal to guide the crawl

4) .... (fill in your own here!)
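
To ground the discussion with a plug-in point that already exists, here is a minimal, untested sketch of a URLFilter along the lines of #2. A real crawl-depth filter needs per-page depth information that a plain URLFilter never sees (it only gets the URL string), so this one just caps the number of CGI query parameters as a crude stand-in; the class name and the config property are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical guard: follow CGI-style URLs (ones with a query string)
// but reject URLs whose query string has too many parameters, as a
// cheap brake on crawling forever through generated links.
public class CgiGuardURLFilter implements URLFilter {

  private Configuration conf;
  private int maxParams = 3;   // arbitrary default

  public String filter(String urlString) {
    int q = urlString.indexOf('?');
    if (q < 0) {
      return urlString;                   // no query string: accept as-is
    }
    int params = urlString.substring(q + 1).split("&").length;
    return params <= maxParams ? urlString : null;   // null = reject
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "cgiguard.max.params" is a made-up property name
    maxParams = conf.getInt("cgiguard.max.params", maxParams);
  }

  public Configuration getConf() { return conf; }
}

That particular heuristic isn't the point; it's only an illustration of the kind of per-crawl logic I mean by "nonstandard". The question is what else people would want to plug in.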

--
Matt Kangas / [EMAIL PROTECTED]


