Hello, I'm using Nutch to crawl our intranet. I've set the file size limit quite high (2 MB; the default is only about 64 KB), so I've also set the number of fetcher threads very low (between 1 and 4).
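For reference, this is roughly what I changed in conf/nutch-site.xml (quoting from memory, so take the exact values as approximate):

    <!-- excerpt from my conf/nutch-site.xml -->
    <property>
      <name>http.content.limit</name>
      <!-- raised from the 64 KB default (65536) to 2 MB so the large ZIPs are fetched whole -->
      <value>2097152</value>
    </property>
    <property>
      <name>fetcher.threads.fetch</name>
      <!-- lowered from the default of 10 to keep memory usage down -->
      <value>2</value>
    </property>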
But while the fetcher runs, memory usage is too high for my notebook (1 GB RAM, and java.exe needs about 700 MB, after which everything naturally gets very slow). My question: is there a way to fetch/crawl a large number of files (ZIP archives containing PDF, XLS, DOC, and PPT) with less memory? Or have I simply misconfigured Nutch?

I'm running the one-step intranet crawl, i.e.:

    bin/nutch crawl myurls -dir crawldb -depth 1 -threads 2 -topN 50

Since all the URLs are already in my text file, I chose depth 1.

Can I configure Nutch somehow so that it doesn't fetch too many files at once, and starts fetching again after indexing the first part it got? I'd be really happy if someone could give me a hint.

Cheers,
Lam Nguyen
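P.S.: Would the step-by-step commands be the right way to split the fetch into batches? I'm imagining something like the sketch below (I'm not sure I have all the invocations right, so please correct me; the 512 MB heap and the -topN 50 batch size are just guesses):

    # cap the JVM heap, since bin/nutch otherwise defaults to more than my notebook can spare
    export NUTCH_HEAPSIZE=512

    # inject my URL list into the crawl db once
    bin/nutch inject crawldb/crawldb myurls

    # then repeat until nothing is left: generate a small batch, fetch it, update the db
    bin/nutch generate crawldb/crawldb crawldb/segments -topN 50
    s=`ls -d crawldb/segments/* | tail -1`   # the segment just generated
    bin/nutch fetch $s
    bin/nutch updatedb crawldb/crawldb $s

Would looping over generate/fetch/updatedb like that keep only one small batch in flight at a time?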
