Hello Brian,
You're getting a response from another newbie here, so I could be wrong (do
excuse me if I am).
If you are attempting to serve a search index from the local filesystem, you
need to have the following in your nutch-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
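The fs.default.name property is required in nutch-site.xml at the time you
build your .war file for deployment to Tomcat, since it gets baked into the
webapp. As a rough sketch of the rebuild from the Nutch source root (the
target name and output path here are from memory, so do check them against
your version's build.xml):

ant war
cp build/nutch-*.war $CATALINA_HOME/webapps/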
This should be accompanied by the config below, which points to the directory
your index has been copied to; in my case it looks something like this:
<property>
<name>searcher.dir</name>
<value>/home/nutch/nutch/service/crawl</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
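For reference, my crawl directory came out of a plain bin/nutch crawl run and
looks roughly like this (yours may differ; as the description above says, the
searcher only needs the "index" or "segments" subdirectories):

/home/nutch/nutch/service/crawl/
  crawldb/
  linkdb/
  segments/
  indexes/
  index/    <- the merged index the searcher picks up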
Regarding your second question:
bin/nutch readdb yourcrawldir/crawldb -dump dumpdir -format csv
gives you a nice flat-file serialisation of your crawl database. Note that
-dump takes an output directory (dumpdir here), which it will create for you.
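The dump comes out as plain text part files inside that output directory, so
you can inspect it straight away, e.g.:

cat dumpdir/part-00000 | head

(One caveat: I believe the csv format was only added in Nutch 1.0, so if your
readdb complains about -format, dropping the flag falls back to the default,
more verbose, text format.)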
I hope this helps,
Mischa
On 1 Dec 2009, at 08:44, brian wrote:
> also, I would like to know how to extract flat text files of the crawl data.
___________________________________
Mischa Tuffield
Email: [email protected]
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD