Hi Renaud!
I'm a newbie with shell scripts, and I know that stopping the Tomcat service is
not the best way to do this. The problem is that when I run the re-crawl script
with Tomcat started, I get this error:
060721 132224 merging segment indexes to: crawl-legislacao2\index
Exception in thread "main" java.io.IOException: Cannot delete _0.f0
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
So I would like another way to re-crawl my pages without this error and without
restarting Tomcat. Could you suggest one?
Thanks a lot!
On 7/21/06, Renaud Richardet <[EMAIL PROTECTED]> wrote:
Hi Matt and Lourival,
Matt, thank you for the recrawl script. Any plans to commit it to trunk?
Lourival, here is the part of the script that "reloads Tomcat". It is not the
cleanest approach, but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
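
As far as I know, this works because Tomcat watches WEB-INF/web.xml and reloads the webapp when the file's timestamp changes (this depends on your Tomcat configuration, so treat it as an assumption). A minimal sketch of the same step with a sanity check, where reload_webapp is only a hypothetical helper name and not part of Nutch or Tomcat:

```shell
#!/bin/bash
# Sketch only: bump web.xml's timestamp so Tomcat reloads the webapp.
# reload_webapp is a hypothetical helper, not a Nutch or Tomcat command.
reload_webapp() {
  local webapp_dir=${1%/}   # drop one trailing slash, if present
  local web_xml="$webapp_dir/WEB-INF/web.xml"
  if [ ! -f "$web_xml" ]; then
    # A wrong nutch_dir would otherwise create a stray empty file.
    echo "web.xml not found: $web_xml" >&2
    return 1
  fi
  touch "$web_xml"
}
```

It would be called as reload_webapp "$nutch_dir" in place of the plain touch.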
HTH,
Renaud
Lourival Júnior wrote:
> Hi Matt!
>
> In the article found at
> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> you said the re-crawl script has a problem with updating the live search
> index. In my tests with Nutch version 0.7.2, when I run the script the
> index cannot be updated because Tomcat loads it into memory. Could you
> suggest a modification to this script or to the NutchBean that accepts
> modifications to the index without restarting Tomcat? (Currently, I run
> net stop "Apache Tomcat" before updating the index.)
>
> Thanks
>
> On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>
>> Thanks for putting up with all the messages to the list... Here is the
>> recrawl script for 0.8.0 if anyone is interested.
>> Matt
>> -------------------------------
>>
>> #!/bin/bash
>>
>> # Nutch recrawl script.
>> # Based on 0.7.2 script at
>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> # Modified by Matthew Holt
>>
>> if [ -n "$1" ]
>> then
>>   crawl_dir=$1
>> else
>>   echo "Usage: recrawl crawl_dir [depth] [adddays]"
>>   exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>   depth=$2
>> else
>>   depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>>   adddays=$3
>> else
>>   adddays=0
>> fi
>>
>>
>> # EDIT THIS - Set the location of the nutch webapp in your servlet container.
>> nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/
>>
>> # No need to edit anything past this line #
>> webdb_dir=$crawl_dir/crawldb
>> segments_dir=$crawl_dir/segments
>> linkdb_dir=$crawl_dir/linkdb
>> index_dir=$crawl_dir/index
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>   segment=`ls -d $segments_dir/* | tail -1`
>>   bin/nutch fetch $segment
>>   bin/nutch updatedb $webdb_dir $segment
>> done
>>
>> # Update segments
>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>
>> # Index segments
>> new_indexes=$crawl_dir/newindexes
>> #ls -d $segments_dir/* | tail -$depth | xargs
>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>
>> # De-duplicate indexes
>> bin/nutch dedup $new_indexes
>>
>> # Merge indexes
>> bin/nutch merge $index_dir $new_indexes
>>
>> # Tell Tomcat to reload index
>> touch $nutch_dir/WEB-INF/web.xml
>>
>> # Clean up
>> rm -rf $new_indexes
>>
>>
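
For anyone adapting the script above, the three argument checks at the top can probably be written more compactly with bash's ${var:-default} expansion. This is only a sketch with the same defaults (depth 5, adddays 0); parse_recrawl_args is a hypothetical name, not something from the original script:

```shell
#!/bin/bash
# Sketch only: same argument handling as the original if/else blocks.
# parse_recrawl_args is a hypothetical helper name.
parse_recrawl_args() {
  if [ -z "$1" ]; then
    echo "Usage: recrawl crawl_dir [depth] [adddays]" >&2
    return 1
  fi
  crawl_dir=$1
  depth=${2:-5}      # default depth is 5 when $2 is unset or empty
  adddays=${3:-0}    # default adddays is 0 when $3 is unset or empty
}
```

It would be called as parse_recrawl_args "$@" || exit 1 at the top of the script.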
>
>
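
One more note on the script above: as written, it still touches web.xml and removes the new indexes even when the merge step fails, as it does with the "Cannot delete" exception. A rough sketch of stopping at the first failed step, assuming a hypothetical run_step wrapper (not part of Nutch):

```shell
#!/bin/bash
# Sketch only: run one step, report a failure, and pass its exit status on.
# run_step is a hypothetical helper name.
run_step() {
  local rc
  echo "recrawl: $*"
  "$@"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "recrawl: step failed with status $rc: $*" >&2
  fi
  return "$rc"
}

# Possible use in the last steps of the script (left commented out here):
# run_step bin/nutch merge "$index_dir" "$new_indexes" || exit 1
# run_step touch "$nutch_dir/WEB-INF/web.xml" || exit 1
# run_step rm -rf "$new_indexes"
```

This does not fix the locked-file problem itself; it only keeps the newly built indexes around when the merge does not succeed.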
--
Renaud Richardet
COO America
Wyona Inc. - Open Source Content Management - Apache Lenya
office +1 857 776-3195 mobile +1 617 230 9112
renaud.richardet <at> wyona.com http://www.wyona.com
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general