Thanks Magnús and Susam for your responses and for pointing me in the right direction. I will spend some time over the next few weeks trying out Nutch. I only need the HTML – I don't care whether it ends up in a database or in separate files.
Thanks guys,
O.O.

--- Wed 30/9/09, Magnús Skúlason <magg...@gmail.com> wrote:

> From: Magnús Skúlason <magg...@gmail.com>
> Subject: Re: R: Using Nutch for only retrieving HTML
> To: nutch-user@lucene.apache.org
> Date: Wednesday, 30 September 2009, 11:48
>
> Actually, it's quite easy to modify the parse-html filter to do this,
> that is, to save the HTML to a file or to some database. You could then
> configure it to skip all unnecessary plugins. Whether Nutch is the right
> tool for this task depends a lot on your other requirements. If you can
> get by with wget -r, then Nutch is probably overkill.
>
> Best regards,
> Magnus
>
> On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <susam....@gmail.com> wrote:
>
> > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> > > Sorry for pushing this topic, but I would like to know if Nutch
> > > would help me get the raw HTML in my situation described below.
> > >
> > > I am sure it would be a simple answer for those who know Nutch. If
> > > not, then I guess Nutch is the wrong tool for the job.
> > >
> > > Thanks,
> > > O. O.
> > >
> > > --- Thu 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
> > >
> > >> From: O. Olson <olson_...@yahoo.it>
> > >> Subject: Using Nutch for only retrieving HTML
> > >> To: nutch-user@lucene.apache.org
> > >> Date: Thursday, 24 September 2009, 20:54
> > >>
> > >> Hi,
> > >> I am new to Nutch. I would like to completely crawl an internal
> > >> website and retrieve all the HTML content. I don't intend to do any
> > >> further processing with Nutch. The website/content is rather large.
> > >> By crawl, I mean that I would go to a page, download/archive the
> > >> HTML, get the links from that page, and then download/archive those
> > >> pages, and keep doing this until there are no new links.
> >
> > I don't think it is possible to retrieve pages and store them as
> > separate files, one per page, without modifying Nutch. I am not sure,
> > though; someone should correct me if I am wrong here. However, it is
> > easy to retrieve the HTML contents from the crawl DB using the Nutch
> > API. But from your post, it seems you don't want to do this.
> >
> > >> Is this possible? Is this the right tool for the job, or are there
> > >> other tools out there that would be better suited for my purpose?
> >
> > I guess 'wget' is the tool you are looking for. You can use it with
> > the -r option to recursively download pages and store them as separate
> > files on the hard disk, which is exactly what you need. You might want
> > to use the -np option too. It is available for Windows as well as
> > Linux.
> >
> > Regards,
> > Susam Pal
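For anyone finding this thread later, here is a rough sketch of the kind of parse filter Magnús describes. It assumes the Nutch 1.x HtmlParseFilter extension point; the class name, output directory, and file-naming scheme are made up for illustration, and a real plugin would still need its own plugin.xml and an entry in plugin.includes.

```java
// Hypothetical sketch only: a Nutch 1.x HtmlParseFilter that writes the raw
// fetched HTML of every parsed page to a local directory. The class name,
// output path and file naming are illustrative, not part of Nutch itself.
package org.example.nutch;

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class HtmlDumpFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      File dir = new File("/tmp/html-dump");   // assumed output directory
      dir.mkdirs();
      // Use a hash of the URL as the file name to avoid illegal characters.
      String name = Integer.toHexString(content.getUrl().hashCode()) + ".html";
      FileOutputStream out = new FileOutputStream(new File(dir, name));
      out.write(content.getContent());         // raw bytes as fetched
      out.close();
    } catch (Exception e) {
      // A real plugin should log this instead of swallowing it.
    }
    return parseResult;                        // leave normal parsing untouched
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}
```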
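And on Susam's point about getting the HTML back out through the API: after a fetch, the raw content already sits in each segment's content directory as <url, Content> pairs, so it can be read back without touching the parsers at all. The sketch below assumes a Nutch 1.x segment layout and the older Hadoop SequenceFile.Reader constructor; the part-file path and output naming are made up. It also assumes content is being stored in the segments (I believe fetcher.store.content defaults to true), and if memory serves, `bin/nutch readseg -dump` does much the same thing from the command line.

```java
// Hypothetical sketch only: read the raw fetched pages back out of a Nutch
// segment and write each one to its own .html file. Assumes a Nutch 1.x
// segment layout (segments/<timestamp>/content/part-00000/data) and the
// older SequenceFile.Reader constructor; the file naming is illustrative.
import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpSegmentHtml {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: path to one part file of the segment's content data, e.g.
    //   crawl/segments/20090930123456/content/part-00000/data
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);

    Text url = new Text();
    Content content = new Content();
    int i = 0;
    while (reader.next(url, content)) {
      // One file per fetched page; the name below is just a running counter.
      File out = new File("page-" + (i++) + ".html");
      FileOutputStream os = new FileOutputStream(out);
      os.write(content.getContent());   // raw bytes as fetched
      os.close();
      System.out.println(url + " -> " + out.getName());
    }
    reader.close();
  }
}
```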