Hi, everyone.
I'm very interested in Nutch's distributed file system and distributed
searching.
These days I'm focused on setting up nutch-0.9 on several computers.

At first I used just one computer: the namenode, datanode, and jobtracker
all run on the same machine.
I put all my data files in the directory /home/hadoop/data, which consists of
about 20,000 HTML files (total size about 30 MB).
My crawl-urlfilter.txt looks like this (I removed "file" from the default skip rule):
-^(ftp|mailto:)
+^file:///home/hadoop/data/*
and my urls.txt is just:
file:///home/hadoop/data/
I also modified nutch-default.xml, adding protocol-(http|file|ftp) to the
"plugin.includes" property.
Then I started Hadoop and ran the command: nutch crawl url -dir crawl
-depth 10
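To be explicit, the whole run boils down to these two commands from the install
directory (I am assuming here that the seed directory "url" is where urls.txt
lives, and the start script is the stock one shipped in bin/):

bin/start-all.sh                           # start namenode, datanode and jobtracker (all on this machine)
bin/nutch crawl url -dir crawl -depth 10   # "url" = seed directory, output goes to "crawl"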
However, after the crawl finished I got nothing: the crawldb, indexes, linkdb,
and segments directories are all less than 2 KB. That can't be right!
Can anyone tell me what's wrong with my steps?
Besides, my goal is to set up Nutch on several computers that have no access
to the Internet, so I cannot simply crawl the web. However, I have obtained
about 100 GB of Internet HTML files from a lab, and crawling and indexing them
is bound to be time consuming.
How can I index such a huge amount of files efficiently?

Thank you all!
-- 
Feng Xia
