I am new to Linux (as an end user, desktop, SuSE 10.0), and it seems that all
this "free" stuff does not work well.

KWebGet is a GUI for crawling, based on Wget. KWebGet is single-threaded!
It works very fast, but needs a lot of RAM for the URLs (the fetch list);
it doesn't have any database of URLs.
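For comparison, a crawler can hold only small fingerprints of the URLs in
RAM and keep the full fetch list on disk. A minimal Java sketch of that
idea; the class name and the fingerprint function are just for
illustration, not taken from any of these tools:

  import java.io.BufferedWriter;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;

  /**
   * Sketch: keep only 8-byte URL fingerprints in RAM for duplicate
   * detection and append the full URLs to a file on disk. A full URL
   * string on the Java heap easily costs 100+ bytes, so for millions
   * of URLs this cuts memory use by an order of magnitude.
   */
  public class DiskBackedFetchList {
      private final Set<Long> seen = new HashSet<Long>(); // fingerprints only
      private final BufferedWriter out;                    // full URLs on disk

      public DiskBackedFetchList(String file) throws IOException {
          out = new BufferedWriter(new FileWriter(file, true));
      }

      /** Returns true if the URL was new and has been queued on disk. */
      public boolean add(String url) throws IOException {
          if (!seen.add(fingerprint(url))) {
              return false; // (very probably) already queued
          }
          out.write(url);
          out.newLine();
          return true;
      }

      /** Cheap 64-bit fingerprint; collisions are rare but possible. */
      private static long fingerprint(String url) {
          long h = 1125899906842597L;
          for (int i = 0; i < url.length(); i++) {
              h = 31 * h + url.charAt(i);
          }
          return h;
      }

      public void close() throws IOException {
          out.close();
      }
  }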

Wget uses a few concurrent sessions to download a single HTTP response
(HTML, ZIP, DOC, etc.). A few crawlers have been developed on top of Wget.

Pavuk is multithreaded, but buggy; it only works reliably with a single thread.

On Windows, Teleport Ultra has many "netiquette" features, such as dynamic
bandwidth allocation for slow/fast web servers (do we need dynamic
configuration for Nutch?).
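To make that question concrete, here is a minimal Java sketch of what
dynamic per-server throttling could look like; the class name and the
back-off constants are my assumptions, not Nutch API:

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Sketch of "dynamic bandwidth allocation": the delay between
   * requests to a host grows when the host answers slowly and shrinks
   * when it answers fast.
   */
  public class AdaptiveThrottle {
      private static final long MIN_DELAY_MS = 500;
      private static final long MAX_DELAY_MS = 30000;

      private final Map<String, Long> delayPerHost = new HashMap<String, Long>();

      /** How long to wait before the next request to this host. */
      public synchronized long delayFor(String host) {
          Long d = delayPerHost.get(host);
          return (d == null) ? MIN_DELAY_MS : d.longValue();
      }

      /** Call after each fetch with the observed response time. */
      public synchronized void record(String host, long responseMs) {
          long d = delayFor(host);
          if (responseMs > d) {
              d = Math.min(d * 2, MAX_DELAY_MS); // server is slow: back off
          } else {
              d = Math.max(d / 2, MIN_DELAY_MS); // server is fast: speed up
          }
          delayPerHost.put(host, Long.valueOf(d));
      }
  }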

And, of course, we should try http://htmlparser.sourceforge.net (it has a
utility class just for crawling).
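For basic link extraction the library's core API looks roughly like this
(Parser, NodeClassFilter and LinkTag are HTML Parser classes; the dedicated
crawling utility class itself may have a different interface):

  import org.htmlparser.Parser;
  import org.htmlparser.filters.NodeClassFilter;
  import org.htmlparser.tags.LinkTag;
  import org.htmlparser.util.NodeList;
  import org.htmlparser.util.ParserException;

  /** Prints all outgoing links of one page, using HTML Parser. */
  public class LinkLister {
      public static void main(String[] args) throws ParserException {
          String url = (args.length > 0)
                  ? args[0] : "http://htmlparser.sourceforge.net/";
          Parser parser = new Parser(url);
          NodeList links = parser.extractAllNodesThatMatch(
                  new NodeClassFilter(LinkTag.class));
          for (int i = 0; i < links.size(); i++) {
              LinkTag link = (LinkTag) links.elementAt(i);
              System.out.println(link.getLink()); // absolute URL of the anchor
          }
      }
  }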

Thanks

P.S.
Mozilla Firefox is the best debugger (DOM Inspector).
