Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MatthewHolt:
http://wiki.apache.org/nutch/IntranetRecrawl

The comment on the change is:
Merge the new segments into one, then delete the "new" segments before indexing.

------------------------------------------------------------------------------
{{{
  # Nutch recrawl script.
  # Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
+ #
+ # The script merges the new segments into one segment to prevent redundant
+ # data. However, if your crawl/segments directory is becoming very large, I
+ # would suggest you delete it completely and generate a new crawl. This probably
+ # needs to be done every 6 months.
+ #
  # Modified by Matthew Holt
+ # mholt at elon dot edu

  if [ -n "$1" ]
  then
    tomcat_dir=$1
  else
-   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
-   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
-   echo "crawl_dir - Path of the directory the crawl is located in."
+   echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
-   echo "[depth] - The link depth from the root page that should be crawled."
+   echo "depth - The link depth from the root page that should be crawled."
-   echo "[adddays] - Advance the clock # of days for fetchlist generation."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
+   echo "[topN] - Optional: Selects the top # ranking URLs to be crawled."
    exit 1
  fi

@@ -104, +115 @@
  then
    crawl_dir=$2
  else
-   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
-   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
-   echo "crawl_dir - Path of the directory the crawl is located in."
+   echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
-   echo "[depth] - The link depth from the root page that should be crawled."
+   echo "depth - The link depth from the root page that should be crawled."
-   echo "[adddays] - Advance the clock # of days for fetchlist generation."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
+   echo "[topN] - Optional: Selects the top # ranking URLs to be crawled."
    exit 1
  fi

@@ -116, +131 @@
  then
    depth=$3
  else
-   depth=5
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
+   echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
+   echo "depth - The link depth from the root page that should be crawled."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
+   echo "[topN] - Optional: Selects the top # ranking URLs to be crawled."
+   exit 1
  fi

  if [ -n "$4" ]
  then
    adddays=$4
  else
-   adddays=0
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
+   echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
+   echo "depth - The link depth from the root page that should be crawled."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
+   echo "[topN] - Optional: Selects the top # ranking URLs to be crawled."
+   exit 1
+ fi
+
+ if [ -n "$5" ]
+ then
+   topn="-topN $5"
+ else
+   topn=""
  fi

  #Sets the path to bin

@@ -138, +176 @@
  # The generate/fetch/update cycle
  for ((i=1; i <= depth ; i++))
  do
-   $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
+   $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
    segment=`ls -d $segments_dir/* | tail -1`
    $nutch_dir/nutch fetch $segment
    $nutch_dir/nutch updatedb $webdb_dir $segment
  done

+ # Merge segments and clean up unused segments
+ mergesegs_dir=$crawl_dir/mergesegs_dir
+ $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
+
+ for segment in `ls -d $segments_dir/* | tail -$depth`
+ do
+   echo "Removing Temporary Segment: $segment"
+   rm -rf $segment
+ done
+
+ cp -R $mergesegs_dir/* $segments_dir
+ rm -rf $mergesegs_dir
+
  # Update segments
  $nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
-
- # Merge segments
- mergesegs_dir=$crawl_dir/mergesegs_dir
- $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
- cp -R $mergesegs_dir/* $segments_dir
- rm -rf $mergesegs_dir

  # Index segments
  new_indexes=$crawl_dir/newindexes

@@ -170, +215 @@
  # Clean up
  rm -rf $new_indexes

+ echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
+ echo "  that the crawl directory be deleted once every 6 months (or more"
+ echo "  frequently depending on disk constraints) and a new crawl generated."
-
- # sleeps for 1 minute to make sure tomcat has released its lock on dir's
- # before removing them
- sleep 1m
-
- echo "***Removing old segment directories that are no longer in use. If any of these error out it is not a problem, just used for clean up."
-
- for segment in `ls -dr $segments_dir/* | tail -$depth`
- do
-   echo "Removing Segment: $segment"
-   rm -rf $segment
- done
}}}
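The merge-and-cleanup step added by this change can be sketched in isolation. The snippet below is a minimal illustration using throwaway directories in place of real Nutch segments: the segment names, the depth value, and the stand-in for `nutch mergesegs` (a plain `mkdir`) are all hypothetical, so treat it as a demonstration of the cleanup logic, not a drop-in script.

```shell
#!/bin/sh
# Sketch of the "merge then delete temporary segments" logic, with dummy
# directories standing in for Nutch segments. All names are illustrative.

depth=3
crawl_dir=`mktemp -d`
segments_dir=$crawl_dir/segments
mergesegs_dir=$crawl_dir/mergesegs_dir
mkdir -p "$segments_dir"

# Pretend an earlier crawl left one old segment, and this run's
# generate/fetch/update cycle created $depth new ones.
mkdir "$segments_dir/20060601000000"
for i in 1 2 3
do
  mkdir "$segments_dir/2006070100000$i"
done

# Stand-in for: $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
mkdir -p "$mergesegs_dir/20060702000000"

# Remove the $depth newest (pre-merge) segments, exactly as the script does.
for segment in `ls -d "$segments_dir"/* | tail -$depth`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf "$segment"
done

# Move the merged segment into place.
cp -R "$mergesegs_dir"/* "$segments_dir"
rm -rf "$mergesegs_dir"

# Only the old segment and the merged segment should remain.
ls "$segments_dir"
```

With the new optional fifth argument, an invocation of the updated script might look like `./recrawl /usr/local/tomcat/webapps/ROOT /home/user/nutch/crawl 5 0 1000` (depth 5, no clock advance, top 1000 URLs); the numeric values here are only examples.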
_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs