Hello Brian,
You're getting a response from another newbie here, so I could be wrong (do
excuse me if I am).
If you are attempting to serve a search index from the local filesystem, you
need to have the following in your nutch-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
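The fs.default.name property is required in nutch-site.xml at the time you
build your .war file for deployment to Tomcat, since it gets baked into the
webapp. As a rough sketch of the rebuild from the Nutch source root (the
target name and output path here are from memory, so do check them against
your version's build.xml):

ant war
cp build/nutch-*.war $CATALINA_HOME/webapps/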
This should be accompanied by the config below, which points to the directory
your index has been copied to; in my case it looks something like this:
<property>
<name>searcher.dir</name>
<value>/home/nutch/nutch/service/crawl</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
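For reference, my crawl directory came out of a plain bin/nutch crawl run and
looks roughly like this (yours may differ; as the description above says, the
searcher only needs the "index" or "segments" subdirectories):

/home/nutch/nutch/service/crawl/
  crawldb/
  linkdb/
  segments/
  indexes/
  index/    <- the merged index the searcher picks up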
Regarding your second question:
bin/nutch readdb yourcrawldir/crawldb -dump dumpdir -format csv
gives you a nice flat-file serialisation of your crawl database. Note that
-dump takes an output directory (dumpdir here), which it will create for you.
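The dump comes out as plain text part files inside that output directory, so
you can inspect it straight away, e.g.:

cat dumpdir/part-00000 | head

(One caveat: I believe the csv format was only added in Nutch 1.0, so if your
readdb complains about -format, dropping the flag falls back to the default,
more verbose, text format.)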
I hope this helps,
Mischa
On 1 Dec 2009, at 08:44, brian wrote:
> also, I would like to know how to extract flat text files of the crawl data.
___________________________________
Mischa Tuffield
Email: [email protected]
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD