Lourival Júnior wrote:
> Hi Renaud!
>
> I'm newbie with shell scripts and I know stops tomcat service is not the
> better way to do this. The problem is, when a run the re-crawl script with
> tomcat started I get this error:
>
> 060721 132224 merging segment indexes to: crawl-legislacao2\index
> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>
> So, I want another way to re-crawl my pages without this error and without
> restarting the tomcat. Could you suggest one?
>
> Thanks a lot!
>

Try this updated script, and tell me exactly which command you run to call
it. If it still fails, let me know the error message you get.
Matt

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
# Modified by Matthew Holt

if [ -n "$1" ]
then
  nutch_dir=$1
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist generation."
  exit 1
fi

if [ -n "$2" ]
then
  crawl_dir=$2
else
  echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
  echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Name of the directory the crawl is located in."
  echo "[depth] - The link depth from the root page that should be crawled."
  echo "[adddays] - Advance the clock # of days for fetchlist generation."
  exit 1
fi

if [ -n "$3" ]
then
  depth=$3
else
  depth=5
fi

if [ -n "$4" ]
then
  adddays=$4
else
  adddays=0
fi

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes
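
A side note on the argument handling: the four if/else blocks at the top of the script could probably be folded into one small function. The sketch below is only an illustration, not part of the script I sent. The function name parse_recrawl_args is made up, and it assumes bash's ${var:-default} expansion is acceptable in your environment. It keeps the same defaults as the script (depth=5, adddays=0) and the same rule that the first two arguments are required.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): read the recrawl
# settings from positional arguments, with the script's own defaults.
parse_recrawl_args() {
    # servlet_path and crawl_dir are required and must be non-empty
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]" >&2
        return 1
    fi
    nutch_dir=$1
    crawl_dir=$2
    depth=${3:-5}      # falls back to 5 when unset or empty
    adddays=${4:-0}    # falls back to 0 when unset or empty
}

# Example call using the paths mentioned in this thread
parse_recrawl_args /usr/local/tomcat/webapps/ROOT crawl-legislacao2 &&
    echo "nutch_dir=$nutch_dir crawl_dir=$crawl_dir depth=$depth adddays=$adddays"
```

Printing the parsed values like this before the generate/fetch/update loop starts is also a cheap way to confirm which directories a given command line actually points at.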
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
