Actually, it's quite easy to modify the parse-html filter to do this, that is, to save the HTML to a file or to some database; you could then configure Nutch to skip all the unnecessary plugins. Whether Nutch is the right way to go for this task depends a lot on your other requirements. If you can get by with wget -r, then Nutch is probably overkill.
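To give you an idea of what I mean, here is a rough, untested sketch of such a filter, written against the HtmlParseFilter extension point roughly as it looks in Nutch 1.0 (the exact filter() signature differs between Nutch versions, and the package, class name and the saverawhtml.dir property are just placeholders I made up):

package org.example.nutch;                        // placeholder package

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Archives the raw fetched HTML of every parsed page to a local directory. */
public class SaveRawHtmlFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      // "saverawhtml.dir" is a made-up property; set it in nutch-site.xml.
      File dir = new File(conf.get("saverawhtml.dir", "/tmp/raw-html"));
      dir.mkdirs();
      // Derive a file name from the URL; a real implementation would need
      // something more robust (hashing, directory sharding, ...).
      String name = content.getUrl().replaceAll("[^A-Za-z0-9._-]", "_") + ".html";
      FileOutputStream out = new FileOutputStream(new File(dir, name));
      out.write(content.getContent());             // the raw bytes as fetched
      out.close();
    } catch (Exception e) {
      // Don't fail the whole parse just because archiving one page failed.
    }
    return parseResult;                             // pass the parse through unchanged
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Note that parsing runs inside Hadoop map tasks, so writing to a local directory like this only makes sense for a single-machine crawl; on a cluster you would want to write to HDFS or a database instead. You would also need the usual plugin descriptor (plugin.xml) to register the class against the HtmlParseFilter extension point.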
Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <susam....@gmail.com> wrote:
> On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> > Sorry for pushing this topic, but I would like to know if Nutch would
> > help me get the raw HTML in my situation described below.
> >
> > I am sure it would be a simple answer to those who know Nutch. If not,
> > then I guess Nutch is the wrong tool for the job.
> >
> > Thanks,
> > O. O.
> >
> > --- On Thu, 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
> >
> >> From: O. Olson <olson_...@yahoo.it>
> >> Subject: Using Nutch for only retrieving HTML
> >> To: nutch-user@lucene.apache.org
> >> Date: Thursday, 24 September 2009, 20:54
> >>
> >> Hi,
> >> I am new to Nutch. I would like to completely crawl through an
> >> internal website and retrieve all the HTML content. I don't intend
> >> to do further processing using Nutch. The website/content is rather
> >> huge. By crawl, I mean that I would go to a page, download/archive
> >> the HTML, get the links from that page, and then download/archive
> >> those pages. I would keep doing this until I don't have any new
> >> links.
>
> I don't think it is possible to retrieve pages and store them as
> separate files, one per page, without modifications to Nutch. I am not
> sure, though; someone will correct me if I am wrong here. However, it
> is easy to retrieve the HTML contents from the crawl DB using the
> Nutch API. But from your post, it seems you don't want to do this.
>
> >> Is this possible? Is this the right tool for this job, or
> >> are there other tools out there that would be more suited
> >> to my purpose?
>
> I guess 'wget' is the tool you are looking for. You can use it with the
> -r option to recursively download pages and store them as separate
> files on the hard disk, which is exactly what you need. You might want
> to use the -np option too. It is available for Windows as well as
> Linux.
>
> Regards,
> Susam Pal
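P.S. Regarding Susam's point above about the Nutch API: if you do crawl with Nutch, the raw pages actually end up in the segments rather than in the crawl DB, and you can read them back with a few lines of code. Another rough, untested sketch, assuming the Nutch 1.0 segment layout where fetched content is stored under <segment>/content/part-00000/data as Text/Content pairs (again, the class name is made up):

package org.example.nutch;                        // placeholder package

import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/** Dumps the raw HTML stored in one segment, one file per page. */
public class DumpRawHtml {

  public static void main(String[] args) throws Exception {
    // args[0] is a segment's content data file, e.g.
    // crawl/segments/20090930123456/content/part-00000/data
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    Content content = new Content();
    int n = 0;
    while (reader.next(url, content)) {
      String type = content.getContentType();
      if (type != null && type.startsWith("text/html")) {   // skip images, PDFs, ...
        FileOutputStream out = new FileOutputStream("page-" + (n++) + ".html");
        out.write(content.getContent());
        out.close();
      }
    }
    reader.close();
  }
}

If I remember correctly, bin/nutch readseg -dump will also dump a segment for you, but into one big text file rather than one file per page. And as Susam says, if one HTML file per page on disk is really all you need, wget -r -np is far less work than any of this.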