Hi Prabhu,
Below is the script we use for deleting old segments.
Regards,
Glenn
#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is the age threshold, in days, for old dirs
#
# Created by Keren Yu Jan 31, 2006
NUTCH_DIR=/home/glenn/nutch
PERIOD=30
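# put the dates of dirs which are older than PERIOD into dates.tmp
# (start from an empty file so a stale or missing dates.tmp cannot affect grep -f below)
: > $NUTCH_DIR/dates.tmp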
ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ];
then
  for dir in `cat $NUTCH_DIR/dirs.tmp`
  do
    # get date from dir
    mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
    if [ $threshold_date -ge $mydate ];
    then
      echo $mydate >> $NUTCH_DIR/dates.tmp
    fi
  done
fi
# remove dirs which are older than PERIOD
ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ];
then
  for dir in `cat $NUTCH_DIR/dirs.tmp`
  do
    rm -fr $dir
  done
fi
# clean up both temporary files
rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
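
The same selection step can also be done without temporary files. The following is only a rough sketch, assuming segment directory names begin with an 8-digit YYYYMMDD date, as the script above expects. The helper names segment_date and list_old_segments are made up for illustration and are not part of Nutch. It only prints the candidate directories and deletes nothing.

```shell
#!/bin/sh
# Sketch only: list segment dirs dated at or before a threshold.
# segment_date and list_old_segments are hypothetical helpers, not Nutch commands.

# segment_date DIR
# Print the leading YYYYMMDD of DIR's basename, or nothing if it has none.
segment_date() {
    basename "$1" | sed -n 's/^\([0-9]\{8\}\).*/\1/p'
}

# list_old_segments SEGMENTS_DIR THRESHOLD
# Print every directory under SEGMENTS_DIR whose name starts with "2" and
# whose date is at or before THRESHOLD (YYYYMMDD).
list_old_segments() {
    for dir in "$1"/2*
    do
        # skip the unexpanded pattern and anything that is not a directory
        [ -d "$dir" ] || continue
        d=`segment_date "$dir"`
        # skip names that do not start with an 8-digit date
        [ -n "$d" ] || continue
        if [ "$d" -le "$2" ]
        then
            echo "$dir"
        fi
    done
}
```

For example, list_old_segments "$NUTCH_DIR/segments" "$threshold_date" prints the directories that the script above would remove, so the list can be checked before anything is passed to rm -fr.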
Raghavendra Prabhu wrote:
Hi,
Should we manually delete the old segments in Nutch?
For example, I indexed a site on a particular day,
and one week later I index the updated content.
Is there a way I can delete the redundant old URL contents in the old
segments?
How can we do this?
Rgds
Prabhu
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general