I am new to Linux (as an end user; desktop, SuSE 10.0), and it seems that all
this "free" stuff does not work well.

KWebGet is a GUI for crawling, based on Wget. KWebGet is single-threaded!
It works very fast, but it needs a lot of RAM for the URLs (the fetch-list)
and doesn't keep any database of URLs.
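
For contrast, here is a rough sketch of the design KWebGet lacks (class name
and batch size are invented for illustration): spooling the fetch-list to an
append-only file so RAM stays flat no matter how large the crawl grows.

import java.io.*;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: a fetch-list that keeps only a small batch of
// URLs in RAM and spools the rest to disk.
public class DiskBackedFrontier {
    private final Queue<String> buffer = new ArrayDeque<>();
    private final File spool;
    private final PrintWriter out;
    private long spooled = 0;    // URLs on disk, not yet read back
    private long readOffset = 0; // spooled lines already consumed

    public DiskBackedFrontier(File spool) throws IOException {
        this.spool = spool;
        this.out = new PrintWriter(new BufferedWriter(new FileWriter(spool, true)));
    }

    // Keep at most 1000 URLs in memory; everything else goes to disk.
    public void add(String url) {
        if (buffer.size() < 1000) {
            buffer.add(url);
        } else {
            out.println(url);
            out.flush();
            spooled++;
        }
    }

    // Refill the in-memory buffer from disk when it runs dry.
    public String next() throws IOException {
        if (buffer.isEmpty() && spooled > 0) {
            try (BufferedReader in = new BufferedReader(new FileReader(spool))) {
                for (long i = 0; i < readOffset; i++) in.readLine(); // skip consumed
                String line;
                while (buffer.size() < 1000 && (line = in.readLine()) != null) {
                    buffer.add(line);
                    readOffset++;
                    spooled--;
                }
            }
        }
        return buffer.poll(); // null when the frontier is exhausted
    }
}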

Wget uses a few concurrent sessions to download a single "HTTP response"
(HTML, zip, doc, etc.). A few crawlers have been developed on top of Wget.

Pavuk is multithreaded, but buggy; it works fine only with a single thread.

On Windows, Teleport Ultra has many "netiquette" features, such as dynamic
bandwidth allocation for slow/fast web servers (do we need dynamic
configuration for Nutch?).
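
To make that question concrete, here is a rough sketch (class name and all
numbers invented for illustration) of what per-server dynamic configuration
could mean: scale the politeness delay for each host by how slowly it
answered last time.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: adapt the politeness delay per host based on
// observed response times, so slow servers get hit less often.
public class AdaptiveDelay {
    private final Map<String, Long> delayMs = new HashMap<>();

    // Record how long the last fetch from this host took.
    public void recordResponse(String host, long responseMs) {
        // Wait roughly twice as long as the server took to answer,
        // clamped between 0.5s and 30s (arbitrary illustrative bounds).
        long next = Math.min(30_000, Math.max(500, responseMs * 2));
        // Smooth with the previous value so one slow response
        // doesn't swing the delay wildly.
        long prev = delayMs.getOrDefault(host, next);
        delayMs.put(host, (prev + next) / 2);
    }

    // How long to wait before the next request to this host.
    public long nextDelay(String host) {
        return delayMs.getOrDefault(host, 1_000L); // default 1s
    }
}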

And, of course, we should try http://htmlparser.sourceforge.net (it has a
utility class just for crawling).
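
As a quick test of that library, link extraction (the core step of a
crawler) looks roughly like this; this is sketched from the HTML Parser 1.x
API as I remember it, so class names may differ between releases.

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

// Sketch: pull every <a href> out of a page with HTML Parser.
public class LinkLister {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://htmlparser.sourceforge.net/");
        NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLink()); // URL of the anchor
        }
    }
}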

Thanks

P.S.
Mozilla Firefox is the best debugger (DOM Inspector).


