Hi, everyone. I'm very interested in Nutch's distributed file system and distributed searching. These days I'm focused on setting up nutch-0.9 on several computers.
At first I used just one computer; the namenode, datanode, and jobtracker are all running on the same machine. I put all my data files in the directory /home/hadoop/data, which contains about 20,000 HTML files (total size about 30 MB).

My crawl-urlfilter.txt is like this (I removed the "file" entry from the skip rule):

    -^(ftp|mailto:)
    +^file:///home/hadoop/data/*

My urls.txt is like this:

    file:///home/hadoop/data/

I also modified nutch-default.xml, adding protocol-(http|file|ftp) to the "plugin.includes" property. Then I started Hadoop and ran the command:

    nutch crawl url -dir crawl -depth 10

However, after the crawl was done I got nothing: the crawldb, indexes, linkdb, and segments are all less than 2 KB in size. That can't be right! Can anyone tell me what's wrong with my steps?

Besides, my goal is to set up Nutch on several computers that have no access to the Internet, so I cannot simply crawl the web. However, I have obtained about 100 GB of Internet HTML files from a lab. Crawling and indexing those files is bound to be time-consuming. How can I index such a huge amount of files efficiently?

Thank you all!

--
Feng Xia
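P.S. To make the plugin change concrete, the plugin.includes property in my config now looks roughly like the snippet below. I am quoting the tail of the value from memory, so it may not match the exact nutch-0.9 default; the only change I made was to the protocol part.

    <property>
      <name>plugin.includes</name>
      <!-- protocol part changed from protocol-http to protocol-(http|file|ftp);
           the remaining plugins are the defaults as I remember them -->
      <value>protocol-(http|file|ftp)|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>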
