Sorry for bringing this up again, but I would like to know whether Nutch can 
help me retrieve the raw HTML in the situation described below. 

I am sure the answer is simple for those who know Nutch. If it cannot, then I 
guess Nutch is the wrong tool for the job.

Thanks,
O. O. 


--- On Thu 24/9/09, O. Olson <olson_...@yahoo.it> wrote:

> From: O. Olson <olson_...@yahoo.it>
> Subject: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
> Date: Thursday, 24 September 2009, 20:54
> Hi,
>     I am new to Nutch. I would like to
> crawl an entire internal website and retrieve
> all of its HTML content. I don’t intend to do any further
> processing with Nutch. 
> The website is rather large. By crawl, I mean that I
> would go to a page, download/archive the HTML, get the links
> from that page, and then download/archive those pages. I
> would keep doing this until there are no new links.
> 
> Is this possible? Is this the right tool for this job, or
> are there other tools out there that would be more suited
> for my purpose?
> 
> Thanks,
> O.O. 
> 
> 
> 
> 
> 
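
For reference, the crawl loop described in the quoted message (fetch a page, 
keep its HTML, collect its links, repeat until no new links turn up) can be 
sketched roughly as below. This is not Nutch, only a minimal standard-library 
Python illustration. The names LinkExtractor, fetch_html and crawl are 
hypothetical, and it assumes the whole site lives on one host and that holding 
every page in memory is acceptable, which may not be true for a very large site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_html):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each visited URL to its raw HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, so none is fetched twice
    pages = {}
    while queue:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue            # skip pages that fail to download
        pages[url] = html       # "archive" step: here, just kept in memory
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The loop stops on its own once the queue is empty, i.e. when no page yields a 
link that has not been seen before. Writing each page to disk instead of 
keeping it in the dict would be the obvious change for a large site.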


