Hello, I'm using Nutch to crawl our intranet. I've set the file size limit quite high (2 MB; the default is only about 64 KB), so I've also set the number of fetcher threads very low (between 1 and 4).
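For reference, this is roughly what I changed in conf/nutch-site.xml (quoting from memory, so take the exact values as approximate):

    <!-- excerpt from my conf/nutch-site.xml -->
    <property>
      <name>http.content.limit</name>
      <!-- raised from the 64 KB default (65536) to 2 MB so the large ZIPs are fetched whole -->
      <value>2097152</value>
    </property>
    <property>
      <name>fetcher.threads.fetch</name>
      <!-- lowered from the default of 10 to keep memory usage down -->
      <value>2</value>
    </property>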
But while the fetcher runs, memory usage is too high for my notebook (1 GB RAM, and java.exe needs about 700 MB, after which everything naturally gets very slow). My question: is there a way to fetch/crawl a large number of files (ZIP archives containing PDF, XLS, DOC, and PPT) with less memory? Or have I simply misconfigured Nutch?

I'm running the one-step intranet crawl, i.e.:

    bin/nutch crawl myurls -dir crawldb -depth 1 -threads 2 -topN 50

Since all the URLs are already in my text file, I chose depth 1.

Can I configure Nutch somehow so that it doesn't fetch too many files at once, and starts fetching again after indexing the first part it got? I'd be really happy if someone could give me a hint.

Cheers,
Lam Nguyen
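P.S.: Would the step-by-step commands be the right way to split the fetch into batches? I'm imagining something like the sketch below (I'm not sure I have all the invocations right, so please correct me; the 512 MB heap and the -topN 50 batch size are just guesses):

    # cap the JVM heap, since bin/nutch otherwise defaults to more than my notebook can spare
    export NUTCH_HEAPSIZE=512

    # inject my URL list into the crawl db once
    bin/nutch inject crawldb/crawldb myurls

    # then repeat until nothing is left: generate a small batch, fetch it, update the db
    bin/nutch generate crawldb/crawldb crawldb/segments -topN 50
    s=`ls -d crawldb/segments/* | tail -1`   # the segment just generated
    bin/nutch fetch $s
    bin/nutch updatedb crawldb/crawldb $s

Would looping over generate/fetch/updatedb like that keep only one small batch in flight at a time?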
