Hi Prabhu,

Below is the script we use for deleting old segments.

Regards,
Glenn


#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is threshold for old dirs
#
# Created by Keren Yu Jan 31, 2006

NUTCH_DIR=/home/glenn/nutch
PERIOD=30

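# collect the dates of dirs which are older than PERIOD in dates.tmp
# (start from an empty file so grep -f below always has a file to read)
: > $NUTCH_DIR/dates.tmp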
ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ];
then
 for dir in `cat $NUTCH_DIR/dirs.tmp`
 do
# get date from dir
   mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
   if [ $threshold_date -ge $mydate ];
   then
     echo $mydate >> $NUTCH_DIR/dates.tmp
   fi
 done
fi

# remove dirs which are older than PERIOD
ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ];
then
 for dir in `cat $NUTCH_DIR/dirs.tmp`
 do
   rm -fr $dir
 done
fi

# clean up both temporary files
rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
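
For what it's worth, the same cleanup can probably be done in a single pass without temporary files. The following is only a sketch, not the script we run: it assumes GNU date (for the -d option) and segment directory names that begin with a YYYYMMDD date. The function names is_old and prune_segments are made up for illustration.

```shell
#!/bin/sh
# Sketch: remove segment dirs whose names start with a date older than PERIOD days.
# Assumptions: GNU date (-d), segment names such as 20060131123456.

# is_old DIR THRESHOLD
# Succeeds if DIR's basename starts with 8 digits (YYYYMMDD) that are
# less than or equal to THRESHOLD (also YYYYMMDD).
is_old() {
  name=`basename "$1"`
  day=`echo "$name" | sed -n 's/^\([0-9]\{8\}\).*/\1/p'`
  [ -n "$day" ] && [ "$day" -le "$2" ]
}

# prune_segments SEGMENTS_DIR PERIOD_DAYS
# Removes every dir under SEGMENTS_DIR that is_old considers too old.
prune_segments() {
  threshold=`date -d "$2 days ago" +%Y%m%d`
  for dir in "$1"/2*
  do
    # skip anything that is not a directory (including an unmatched glob)
    [ -d "$dir" ] || continue
    if is_old "$dir" "$threshold"
    then
      echo "removing $dir"
      rm -fr "$dir"
    fi
  done
}

# Example call (commented out so that nothing is deleted by accident):
# prune_segments /home/glenn/nutch/segments 30
```

To try it, uncomment the last line and point it at your own segments directory. Replacing the rm line with a plain echo first gives a dry run.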


Raghavendra Prabhu wrote:

Hi,

Should we manually delete the old segments in Nutch?

For example, I indexed a site on a particular day, and one week later I
indexed the updated content.

Is there a way to delete the redundant old URL contents in the old
segments?

How can we do this?

Rgds
Prabhu



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general