Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by MatthewHolt:
http://wiki.apache.org/nutch/IntranetRecrawl

The comment on the change is:
Merge new segments into one, then delete the "new" segments before indexing.

------------------------------------------------------------------------------
  
  # Nutch recrawl script.
  # Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
+ #
+ # The script merges all the new segments into one segment to prevent redundant
+ # data. However, if your crawl/segments directory is becoming very large, I
+ # would suggest you delete it completely and generate a new crawl. This probably
+ # needs to be done every 6 months.
+ #
  # Modified by Matthew Holt
+ # mholt at elon dot edu
  
  if [ -n "$1" ]
  then
    tomcat_dir=$1
  else
-   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
-   echo "servlet_path - Path of the nutch servlet (i.e. 
/usr/local/tomcat/webapps/ROOT)"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: 
/usr/local/tomc
+ at/webapps/ROOT)"
-   echo "crawl_dir - Path of the directory the crawl is located in."
+   echo "crawl_dir - Path of the directory the crawl is located in. (full 
path, i
+ e: /home/user/nutch/crawl)"
-   echo "[depth] - The link depth from the root page that should be crawled."
+   echo "depth - The link depth from the root page that should be crawled."
-   echo "[adddays] - Advance the clock # of days for fetchlist generation."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 
for n
+ one]"
+   echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
    exit 1
  fi
  
@@ -104, +115 @@

  then
    crawl_dir=$2
  else
-   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
-   echo "servlet_path - Path of the nutch servlet (i.e. 
/usr/local/tomcat/webapps/ROOT)"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: 
/usr/local/tomc
+ at/webapps/ROOT)"
-   echo "crawl_dir - Path of the directory the crawl is located in."
+   echo "crawl_dir - Path of the directory the crawl is located in. (full 
path, i
+ e: /home/user/nutch/crawl)"
-   echo "[depth] - The link depth from the root page that should be crawled."
+   echo "depth - The link depth from the root page that should be crawled."
-   echo "[adddays] - Advance the clock # of days for fetchlist generation."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 
for n
+ one]"
+   echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
    exit 1
  fi
  
@@ -116, +131 @@

  then
    depth=$3
  else
-   depth=5
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: 
/usr/local/tomc
+ at/webapps/ROOT)"
+   echo "crawl_dir - Path of the directory the crawl is located in. (full 
path, i
+ e: /home/user/nutch/crawl)"
+   echo "depth - The link depth from the root page that should be crawled."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 
for n
+ one]"
+   echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
+   exit 1
  fi
  
  if [ -n "$4" ]
  then
    adddays=$4
  else
-   adddays=0
+   echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
+   echo "servlet_path - Path of the nutch servlet (full path, ie: 
/usr/local/tomcat/webapps/ROOT)"
+   echo "crawl_dir - Path of the directory the crawl is located in. (full 
path, ie: /home/user/nutch/crawl)"
+   echo "depth - The link depth from the root page that should be crawled."
+   echo "adddays - Advance the clock # of days for fetchlist generation. [0 
for n
+ one]"
+   echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
+   exit 1
+ fi
+ 
+ if [ -n "$5" ]
+ then
+   topn="-topN $5"
+ else
+   topn=""
  fi
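+ 
+ # Example invocation (illustrative paths; adjust to your install):
+ #   ./recrawl /usr/local/tomcat/webapps/ROOT /home/user/nutch/crawl 5 0 1000
+ # Here 5 is the depth, 0 adds no days to the clock, and 1000 is the
+ # optional topN cap; leave the last argument off to fetch all eligible URLs.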
  
  #Sets the path to bin
@@ -138, +176 @@

  # The generate/fetch/update cycle
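+ # Each pass generates a fetchlist from the webdb, fetches it into a new
+ # segment, then folds the results back into the webdb. The -adddays value
+ # advances the generator's clock, so pages whose re-fetch interval would
+ # elapse within that many days are re-queued now rather than later.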
  for ((i=1; i <= depth ; i++))
  do
-   $nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
+   $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
    segment=`ls -d $segments_dir/* | tail -1`
    $nutch_dir/nutch fetch $segment
    $nutch_dir/nutch updatedb $webdb_dir $segment
  done
  
+ # Merge segments and cleanup unused segments
+ mergesegs_dir=$crawl_dir/mergesegs_dir
+ $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
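+ # mergesegs writes one combined segment into $mergesegs_dir, so every
+ # original segment left under $segments_dir is now redundant.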
+ 
+ for segment in `ls -d $segments_dir/*`
+ do
+   echo "Removing Temporary Segment: $segment"
+   rm -rf $segment
+ done
+ 
+ cp -R $mergesegs_dir/* $segments_dir
+ rm -rf $mergesegs_dir
+ 
  # Update segments
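+ # (invertlinks rebuilds the linkdb from the merged segments; the indexer
+ # uses it to index incoming anchor text for each page.)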
  $nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
- 
- # Merge segments
- mergesegs_dir=$crawl_dir/mergesegs_dir
- $nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
- cp -R $mergesegs_dir/* $segments_dir
- rm -rf $mergesegs_dir
  
  # Index segments
  new_indexes=$crawl_dir/newindexes
@@ -170, +215 @@

  # Clean up
  rm -rf $new_indexes
  
+ echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
+ echo " that the crawl directory be deleted once every 6 months (or more"
+ echo " frequent depending on disk constraints) and a new crawl generated."
- # sleeps for 1 minute to make sure tomcat has released its lock on dir's
- # before removing them
- sleep 1m
- 
- echo "***Removing old segment directories that are no longer in use. If any 
of these error out it is not a problem, just used for clean up."
- 
- for segment in `ls -dr $segments_dir/* | tail -$depth`
- do
-   echo "Removing Segment: $segment"
-   rm -rf $segment
- done
  }}}
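
A recrawl like this is usually run from cron. A crontab entry along the following lines should work; the script location, schedule, and argument values here are illustrative assumptions, not part of the change above:

{{{
# Recrawl nightly at 2am: depth 5, no clock advance, top 1000 URLs
0 2 * * * /home/user/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /home/user/nutch/crawl 5 0 1000
}}}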
  
