I need to index hundreds of GBs of documents that I already have on a
local filesystem at my site. Both the content and the index need to be
distributed on DFS for distributed search.

What is the best way to import these files (all HTML docs) into Nutch
0.8 using DFS and mapred?

I tried putting the files on an HTTP server at my site, then crawling
them from my DFS/mapred Nutch cluster.
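
Roughly, these are the steps I'm running now (the host name, paths, and
seed file are just examples of my setup, not exact):

    # seed list pointing at the HTTP server that hosts the local docs
    mkdir urls
    echo "http://docserver.example.com/docs/" > urls/seed.txt

    # copy the seed list into DFS, then run the standard crawl
    bin/hadoop dfs -put urls urls
    bin/nutch crawl urls -dir crawl -depth 5 -topN 500000

A few observations about this setup: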
- The servers are connected by 1 Gbit/s Ethernet, but I could only get a
crawl bandwidth of 200 kb/s.
- It is not a CPU utilization issue: I checked CPU utilization on the
slaves, and it was low, as expected (5%-10%).
- The crawl doesn't go through a firewall.
- The crawl-urlfilter.txt file is very simple, just a few lines.
- Is it a politeness issue? If so, how do I override the politeness
settings? (My guess at the relevant settings is sketched below.)
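
In case it is politeness, these are the properties I'm guessing I would
override in conf/nutch-site.xml (names taken from nutch-default.xml; the
values are just an aggressive example, since the doc server is my own):

    <configuration>
      <property>
        <name>fetcher.server.delay</name>
        <value>0.0</value>
        <description>Seconds to wait between requests to the same server.</description>
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>10</value>
        <description>Concurrent fetches allowed against a single host.</description>
      </property>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>40</value>
        <description>Total fetcher threads for the job.</description>
      </property>
    </configuration>

Is this the right place to change them, or am I missing something?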

I'd appreciate your help.

Carl
