Thanks Magnús and Susam for your responses and for pointing me in the right direction. I will spend some time over the next few weeks trying out Nutch. I only need the HTML – I don't care whether it ends up in a database or in separate files.
Thanks guys,
O.O.

--- Wed 30/9/09, Magnús Skúlason <magg...@gmail.com> wrote:

> From: Magnús Skúlason <magg...@gmail.com>
> Subject: Re: R: Using Nutch for only retrieving HTML
> To: nutch-user@lucene.apache.org
> Date: Wednesday, 30 September 2009, 11:48
>
> Actually, it's quite easy to modify the parse-html filter to do this,
> that is, to save the HTML to a file or to some database. You could then
> configure it to skip all unnecessary plugins. Whether Nutch is the right
> tool for this task depends a lot on your other requirements. If you can
> get by with wget -r, then Nutch is probably overkill.
>
> Best regards,
> Magnus
>
> On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <susam....@gmail.com> wrote:
>
> > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> > > Sorry for pushing this topic, but I would like to know if Nutch
> > > would help me get the raw HTML in my situation described below.
> > >
> > > I am sure it would be a simple answer for those who know Nutch. If
> > > not, then I guess Nutch is the wrong tool for the job.
> > >
> > > Thanks,
> > > O. O.
> > >
> > > --- Thu 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
> > >
> > >> From: O. Olson <olson_...@yahoo.it>
> > >> Subject: Using Nutch for only retrieving HTML
> > >> To: nutch-user@lucene.apache.org
> > >> Date: Thursday, 24 September 2009, 20:54
> > >>
> > >> Hi,
> > >> I am new to Nutch. I would like to completely crawl an internal
> > >> website and retrieve all the HTML content. I don't intend to do any
> > >> further processing with Nutch. The website/content is rather large.
> > >> By crawl, I mean that I would go to a page, download/archive the
> > >> HTML, get the links from that page, and then download/archive those
> > >> pages, and keep doing this until there are no new links.
> >
> > I don't think it is possible to retrieve pages and store them as
> > separate files, one per page, without modifying Nutch. I am not sure,
> > though; someone should correct me if I am wrong here. However, it is
> > easy to retrieve the HTML contents from the crawl DB using the Nutch
> > API. But from your post, it seems you don't want to do this.
> >
> > >> Is this possible? Is this the right tool for the job, or are there
> > >> other tools out there that would be better suited for my purpose?
> >
> > I guess 'wget' is the tool you are looking for. You can use it with
> > the -r option to recursively download pages and store them as separate
> > files on the hard disk, which is exactly what you need. You might want
> > to use the -np option too. It is available for Windows as well as
> > Linux.
> >
> > Regards,
> > Susam Pal
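For anyone finding this thread later, here is a rough sketch of the kind of parse filter Magnús describes. It assumes the Nutch 1.x HtmlParseFilter extension point; the class name, output directory, and file-naming scheme are made up for illustration, and a real plugin would still need its own plugin.xml and an entry in plugin.includes.

```java
// Hypothetical sketch only: a Nutch 1.x HtmlParseFilter that writes the raw
// fetched HTML of every parsed page to a local directory. The class name,
// output path and file naming are illustrative, not part of Nutch itself.
package org.example.nutch;

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class HtmlDumpFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      File dir = new File("/tmp/html-dump");   // assumed output directory
      dir.mkdirs();
      // Use a hash of the URL as the file name to avoid illegal characters.
      String name = Integer.toHexString(content.getUrl().hashCode()) + ".html";
      FileOutputStream out = new FileOutputStream(new File(dir, name));
      out.write(content.getContent());         // raw bytes as fetched
      out.close();
    } catch (Exception e) {
      // A real plugin should log this instead of swallowing it.
    }
    return parseResult;                        // leave normal parsing untouched
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}
```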
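And on Susam's point about getting the HTML back out through the API: after a fetch, the raw content already sits in each segment's content directory as <url, Content> pairs, so it can be read back without touching the parsers at all. The sketch below assumes a Nutch 1.x segment layout and the older Hadoop SequenceFile.Reader constructor; the part-file path and output naming are made up. It also assumes content is being stored in the segments (I believe fetcher.store.content defaults to true), and if memory serves, `bin/nutch readseg -dump` does much the same thing from the command line.

```java
// Hypothetical sketch only: read the raw fetched pages back out of a Nutch
// segment and write each one to its own .html file. Assumes a Nutch 1.x
// segment layout (segments/<timestamp>/content/part-00000/data) and the
// older SequenceFile.Reader constructor; the file naming is illustrative.
import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpSegmentHtml {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: path to one part file of the segment's content data, e.g.
    //   crawl/segments/20090930123456/content/part-00000/data
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);

    Text url = new Text();
    Content content = new Content();
    int i = 0;
    while (reader.next(url, content)) {
      // One file per fetched page; the name below is just a running counter.
      File out = new File("page-" + (i++) + ".html");
      FileOutputStream os = new FileOutputStream(out);
      os.write(content.getContent());   // raw bytes as fetched
      os.close();
      System.out.println(url + " -> " + out.getName());
    }
    reader.close();
  }
}
```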