Actually, it's quite easy to modify the parse-html filter to do this.

That is, you could save the HTML to a file or to some database, and
configure Nutch to skip all the unnecessary plugins. I think it depends a
lot on your other requirements whether using Nutch for this task is the
right way to go or not. If you can get by with wget -r, then Nutch is
probably overkill.
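
For example, a parse filter roughly like the untested sketch below could
dump the raw bytes of every fetched page to disk. The class name and the
dump directory are made up, and the filter() signature assumes the Nutch
1.0 HtmlParseFilter interface, so it may need adjusting for other versions:

import java.io.File;
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Hypothetical plugin: writes the unparsed page content to a flat file
// and passes the parse result through unchanged.
public class RawHtmlDumpFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      // Turn the URL into a safe file name; content.getContent() is the
      // raw fetched byte[] before any parsing.
      String name = content.getUrl().replaceAll("[^A-Za-z0-9._-]", "_")
          + ".html";
      File out = new File("/tmp/rawhtml", name);  // made-up dump directory
      out.getParentFile().mkdirs();
      FileOutputStream fos = new FileOutputStream(out);
      fos.write(content.getContent());
      fos.close();
    } catch (Exception e) {
      // A failed dump should not fail the whole parse.
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

You would still have to give it the usual plugin.xml and add it to
plugin.includes, but that is just the standard Nutch plugin wiring.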

Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <susam....@gmail.com> wrote:

> On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> > Sorry for pushing this topic, but I would like to know if Nutch would
> > help me get the raw HTML in my situation described below.
> >
> > I am sure it would be a simple answer for those who know Nutch. If not,
> > then I guess Nutch is the wrong tool for the job.
> >
> > Thanks,
> > O. O.
> >
> >
> > --- On Thu, 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
> >
> >> From: O. Olson <olson_...@yahoo.it>
> >> Subject: Using Nutch for only retrieving HTML
> >> To: nutch-user@lucene.apache.org
> >> Date: Thursday, 24 September 2009, 20:54
> >> Hi,
> >>     I am new to Nutch. I would like to
> >> completely crawl through an Internal Website and retrieve
> >> all the HTML Content. I don’t intend to do further
> >> processing using Nutch.
> >> The Website/Content is rather huge. By crawl, I mean that I
> >> would go to a page, download/archive the HTML, get the links
> >> from that page, and then download/archive those pages. I
> >> would keep doing this until there are no new links left.
>
> I don't think it is possible to retrieve pages and store them as
> separate files, one per page, without modifying Nutch. I am not sure,
> though; someone will correct me if I am wrong here. However, it is easy
> to retrieve the HTML contents from the crawl segments using the Nutch
> API. But from your post, it seems you don't want to do this.
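>
> If it helps, the untested sketch below is roughly how you could pull the
> raw HTML back out of a segment's content directory with the Java API. It
> assumes the Nutch 1.0 layout, where the fetched content is stored as a
> Hadoop MapFile of Text URLs to org.apache.nutch.protocol.Content records,
> and that the crawl lives on the local file system:
>
> import java.io.File;
> import java.io.FileOutputStream;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.protocol.Content;
>
> // Untested sketch: writes every fetched page in one segment part to its
> // own file. Pass a part directory such as
> // crawl/segments/<timestamp>/content/part-00000 as the only argument.
> public class DumpSegmentContent {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.getLocal(conf);
>     MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
>     Text url = new Text();
>     Content content = new Content();
>     int n = 0;
>     while (reader.next(url, content)) {
>       // content.getContent() is the raw page exactly as fetched
>       FileOutputStream out =
>           new FileOutputStream(new File("page-" + (n++) + ".html"));
>       out.write(content.getContent());
>       out.close();
>     }
>     reader.close();
>   }
> }
>
> The 'bin/nutch readseg -dump' command does something similar, but it
> writes everything into one big dump file instead of one file per page.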
>
> >>
> >> Is this possible? Is this the right tool for this job, or
> >> are there other tools out there that would be more suited
> >> for my purpose?
>
> I guess 'wget' is the tool you are looking for. You can use it with the -r
> option to recursively download pages and store them as separate files
> on the hard disk, which is exactly what you need. You might want to
> use the -np option too. It is available for Windows as well as Linux.
>
> Regards,
> Susam Pal
>
