On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> Sorry for pushing this topic, but I would like to know if Nutch would help me 
> get the raw HTML in my situation described below.
>
> I am sure it would be a simple answer to those who know Nutch. If not then I 
> guess Nutch is the wrong tool for the job.
>
> Thanks,
> O. O.
>
>
> --- On Thu, 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
>
>> From: O. Olson <olson_...@yahoo.it>
>> Subject: Using Nutch for only retrieving HTML
>> To: nutch-user@lucene.apache.org
>> Date: Thursday, 24 September 2009, 20:54
>> Hi,
>>     I am new to Nutch. I would like to
>> completely crawl through an Internal Website and retrieve
>> all the HTML Content. I don’t intend to do further
>> processing using Nutch.
>> The Website/Content is rather huge. By crawl, I mean that I
>> would go to a page, download/archive the HTML, get the links
>> from that page, and then download/archive those pages. I
>> would keep doing this till I don’t have any new links.

I don't think it is possible to have Nutch fetch pages and store them
as separate files, one per page, without modifying it. I am not sure,
though; someone please correct me if I am wrong. However, it is easy
to retrieve the raw HTML content from the crawl data (the fetched
segments) using the Nutch API. From your post, though, it seems you
don't want to do this.
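
If it helps, here is a rough, untested sketch of what I mean by
reading the content through the API. It assumes the Nutch 1.x segment
layout, where fetched pages end up under <segment>/content as
<Text, Content> records; the part file name below is a guess for a
local, single-reducer crawl, so adjust it for your setup. The
'bin/nutch readseg -dump' command does something similar from the
command line, if I remember correctly.

import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Dumps the raw HTML stored in one segment, one file per page.
public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // The content MapFile's underlying data file is a plain SequenceFile.
    Path data = new Path(args[0], "content/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int n = 0;
    while (reader.next(url, content)) {
      // Keep only HTML pages and write each one to a numbered file.
      if (content.getContentType() != null
          && content.getContentType().startsWith("text/html")) {
        FileOutputStream out = new FileOutputStream("page-" + (n++) + ".html");
        out.write(content.getContent());
        out.close();
      }
    }
    reader.close();
  }
}

You would run it with the segment directory as the only argument,
e.g. java DumpSegmentContent crawl/segments/20090930123456, with the
Nutch and Hadoop jars on the classpath.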

>>
>> Is this possible? Is this the right tool for this job, or
>> are there other tools out there that would be more suited
>> for my purpose?

I guess 'wget' is the tool you are looking for. You can use it with
the -r option to recursively download pages and store them as separate
files on disk, which is exactly what you need. You might want to use
the -np option too, so it does not ascend above the directory you start
from. It is available for Windows as well as Linux.
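
For example, something along these lines (the URL is only a
placeholder for your internal site):

wget -r -np -l inf http://intranet.example.com/

If I remember correctly, the default recursion depth is 5, so -l inf
lifts that limit for a large site.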

Regards,
Susam Pal
