I am new to Linux (as an end user, desktop, SuSE 10.0), and it seems that all this "free" stuff does not work well.
KWebGet is a GUI for crawling, built on top of Wget. KWebGet is single-threaded! It works very fast, but it keeps all URLs (the fetch list) in RAM, so it needs a lot of memory; it has no database of URLs. Wget uses a few concurrent sessions to download a single HTTP response (HTML, zip, doc, etc.). A few crawlers have been developed on top of Wget. Pavuk is multithreaded but buggy; it works reliably only with a single thread.

On Windows there is Teleport Ultra; it has many "netiquette" features, such as dynamic bandwidth allocation for slow/fast web servers (do we need dynamic configuration like that for Nutch? see the sketch below).

And, of course, we should try http://htmlparser.sourceforge.net (it has a utility class just for crawling; a small link-extraction example follows at the end).

Thanks

P.S. Mozilla Firefox is the best debugger (DOM Inspector).
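On the dynamic-bandwidth question: here is a minimal sketch of what per-host adaptive throttling could look like in Java. Everything in it (the class name HostThrottle, the smoothing rule, the delay formula) is a hypothetical illustration of the idea, not Nutch's actual fetcher code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not Nutch API): adapt the per-host crawl delay
 * to how fast each web server answers, in the spirit of Teleport
 * Ultra's dynamic bandwidth allocation.
 */
public class HostThrottle {
    // smoothed response time per host, in milliseconds
    private final Map<String, Long> avgResponseMs = new ConcurrentHashMap<>();
    private final long minDelayMs;   // politeness floor, e.g. 1000 ms
    private final double factor;     // delay = factor * smoothed response time

    public HostThrottle(long minDelayMs, double factor) {
        this.minDelayMs = minDelayMs;
        this.factor = factor;
    }

    /** Call after each fetch with the measured response time. */
    public void record(String host, long responseMs) {
        // exponential smoothing: new = (7 * old + current) / 8
        avgResponseMs.merge(host, responseMs,
                (old, cur) -> (old * 7 + cur) / 8);
    }

    /** How long to wait before the next request to this host. */
    public long delayFor(String host) {
        long avg = avgResponseMs.getOrDefault(host, minDelayMs);
        return Math.max(minDelayMs, (long) (factor * avg));
    }
}
```

A fetcher thread would call record() after each fetch and sleep delayFor(host) before the next request to the same host, so slow servers automatically get longer gaps while fast servers are polled more often.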

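And for the htmlparser library mentioned above, a small link-extraction sketch, assuming the 1.x API (Parser, NodeClassFilter, LinkTag); a real crawler would wrap this in a frontier queue and a visited set:

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/** Minimal sketch: print the outgoing links of one page with htmlparser. */
public class LinkExtractorDemo {
    public static void main(String[] args) throws ParserException {
        // default URL is just an example taken from the post above
        String url = args.length > 0 ? args[0]
                : "http://htmlparser.sourceforge.net/";
        Parser parser = new Parser(url);  // fetches and parses the page
        NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLink());  // the anchor's URL
        }
    }
}
```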