Hi Matt,

I've posted some messages on the nutch-dev and nutch-user mailing lists about downloading content with the fetcher.
If you ask what I'm looking for in the fetcher, it's:

1. A real-time (or very fast) on-demand scan of a given URL (or URLs) to a required depth, for the purpose of extracting data. This differs from wget in that it can handle redirects and all the other issues of web crawling. One option would be to design a daemon that accepts requests on the fly.

2. Configuration that lets Nutch download specific content types without the need to write a separate plugin for each extension.

As I've mentioned before, my purpose is solely fetching pages, not indexing or searching the results, so I'd rather see an optimized fetcher.

Eyal.

On 10/24/07, Matt Kangas <[EMAIL PROTECTED]> wrote:
> Dear nutch-user readers,
>
> I have a question for everyone here: Is the current Nutch crawler
> (Fetcher/Fetcher2) flexible enough for your needs?
> If not, what would you like to see it do?
>
> I'm asking because, last week, I suggested that the Nutch crawler
> could be much more useful to many people if it were structured more as
> a "crawler construction toolkit". But I realize that my comments
> could seem like sour grapes unless there's some plan for moving
> forward. So I thought I'd just ask everybody what you think and
> tally the results.
>
> What kinds of crawls would you like to do that aren't supported? I'll
> start with some nonstandard crawls I've done:
>
> 1) Outlinks-only crawl: crawl a specific website, keep only the
> outlinks from articles (, etc.)
> 2) Crawl into CGIs without infinite crawling, via a crawl-depth filter
> 3) Plug in a "feature detector" (address, date, brand name, etc.) and
> use this signal to guide the crawl
>
> 4) .... (fill in your own here!)
>
> --
> Matt Kangas / [EMAIL PROTECTED]

--
Eyal Edri
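[For context on point 2 above: a partial workaround exists today with Nutch's stock urlfilter-regex plugin, which can restrict which URLs are fetched by extension. This is only a sketch, and it filters on the URL suffix rather than the actual Content-Type header; a true content-type filter would still need fetcher changes. Example rules for conf/regex-urlfilter.txt:]

```
# Sketch: fetch only URLs ending in .html or .pdf.
# Assumes the urlfilter-regex plugin is enabled via the
# plugin.includes property in conf/nutch-site.xml.
# Rules are tried top to bottom; the first matching +/- rule wins.
+\.(html|pdf)$
-.
```

[Note the trailing `-.` rejects everything else, including extension-less directory URLs, so this is stricter than a real content-type filter would be.]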
