[Nutch-general] Getting the real data not only the segment files/index

Nils Höller Tue, 07 Nov 2006 06:36:51 -0800

Hi,

I worked with Nutch until last year, and
I am now trying to do something new with it (involving continuous queries).


So far I have only used Nutch to build the index and to search in a
generated site map (with the WebDB).

Now I want to use it to build an archive of a certain number of sites.
So I want Nutch to crawl the sites every day (as I used it
before) but also to download and save the REAL content of the sites (all
HTML and pictures), so that I can work with this real content.

Is there a way to make Nutch also save the content as it is
crawled, instead of only creating the WebDB and the index?

Currently I have a solution built from a Perl script, wget, and Lucene, but
it would be perfect if I could use Nutch from now on.

Thanks for your help.

Nils


