Hi Matt,

I've posted some messages on the nutch-dev and nutch-user mailing lists about downloading content with the fetcher.
If you ask what I'm looking for in the fetcher, it's:

1. A real-time (or very fast) on-demand scan of a given URL (or URLs) to a required depth, for the purpose of extracting data. This differs from wget in that it can handle redirects and all the other issues of web crawling. One option would be to design a daemon that accepts requests on the fly.

2. Configuration that lets Nutch download specific content types without the need to write a separate plugin for each extension.

As I've mentioned before, my purpose is solely fetching pages, not indexing or searching the results, so I'd rather see an optimized fetcher.

Eyal.

On 10/24/07, Matt Kangas <[EMAIL PROTECTED]> wrote:
> Dear nutch-user readers,
>
> I have a question for everyone here: Is the current Nutch crawler
> (Fetcher/Fetcher2) flexible enough for your needs?
> If not, what would you like to see it do?
>
> I'm asking because, last week, I suggested that the Nutch crawler
> could be much more useful to many people if it were structured more as
> a "crawler construction toolkit". But I realize that my comments
> could seem like sour grapes unless there's some plan for moving
> forward. So I thought I'd just ask everybody what you think and
> tally the results.
>
> What kinds of crawls would you like to do that aren't supported? I'll
> start with some nonstandard crawls I've done:
>
> 1) Outlinks-only crawl: crawl a specific website, keep only the
> outlinks from articles (, etc.)
> 2) Crawl into CGIs without infinite crawling, via a crawl-depth filter
> 3) Plug in a "feature detector" (address, date, brand name, etc.) and
> use this signal to guide the crawl
>
> 4) .... (fill in your own here!)
>
> --
> Matt Kangas / [EMAIL PROTECTED]

--
Eyal Edri
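[For context on point 2 above: a partial workaround exists today with Nutch's stock urlfilter-regex plugin, which can restrict which URLs are fetched by extension. This is only a sketch, and it filters on the URL suffix rather than the actual Content-Type header; a true content-type filter would still need fetcher changes. Example rules for conf/regex-urlfilter.txt:]

```
# Sketch: fetch only URLs ending in .html or .pdf.
# Assumes the urlfilter-regex plugin is enabled via the
# plugin.includes property in conf/nutch-site.xml.
# Rules are tried top to bottom; the first matching +/- rule wins.
+\.(html|pdf)$
-.
```

[Note the trailing `-.` rejects everything else, including extension-less directory URLs, so this is stricter than a real content-type filter would be.]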
