My understanding was that sites are recrawled every 30 days. So if
site #1 was indexed 30 days ago, it would be recrawled and re-indexed
in a new segment with today's crawl. That leaves site #1 in two
segments: the current one and the 30-day-old one. Nutch then uses
only the current segment, which holds the most recent crawl of site #1,
and ignores the 30-day-old segment, leaving it safe to delete with
that script.
That's my understanding of how it works; I stand to be corrected by
the experts.
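A quick way to see why only the newest segment matters (a sketch; the timestamps below are made up): segment directories are named by crawl date, so a plain lexical sort orders them chronologically and the last entry is the current crawl.

```shell
# Segment dirs are named by crawl timestamp (YYYYMMDD...), so a plain
# lexical sort orders them chronologically; the last one is the current crawl.
segments="20060108093000 20060208093000"
newest=$(printf '%s\n' $segments | sort | tail -n 1)
echo "$newest"   # → 20060208093000 (older segments are deletion candidates)
```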
Raghavendra Prabhu wrote:
Hi
Thanks for the info.
My situation is that my list of sites changes every now and then.
So the first time I index sites one and two,
and the next time I index sites two and three.
So there will be new data (so if I use your script, both one and two
will be deleted).
But I still want site one preserved.
So the thing is, I would like to remove some segments based on the URL
of the page alone.
The information you gave was also useful, but I want to do the above.
Rgds
Prabhu
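One way to sketch the URL-based deletion Prabhu is asking for (assumptions clearly labeled: `dump_segment_urls` is a hypothetical stand-in for whatever tool dumps a segment's fetched URLs one per line, e.g. Nutch's segment reader, and the `demo/` layout is a toy fixture so the sketch runs end to end):

```shell
# Sketch: find (and eventually delete) segments that contain a given URL.
# dump_segment_urls is a HYPOTHETICAL helper standing in for whatever
# dumps a segment's fetched URLs one per line (e.g. Nutch's segment reader).
dump_segment_urls() { cat "$1/urls.txt"; }

# Toy fixture so the sketch is runnable as-is.
mkdir -p demo/segments/20060101 demo/segments/20060201
echo "http://one.example/" > demo/segments/20060101/urls.txt
echo "http://two.example/" > demo/segments/20060201/urls.txt

unwanted="http://one.example/"
for seg in demo/segments/*; do
    if dump_segment_urls "$seg" | grep -q "$unwanted"; then
        echo "would delete $seg"   # swap echo for rm -rf once verified
    fi
done
```

Printing the candidates first, and only swapping in `rm -rf` after checking them, avoids deleting a segment that happens to be the only copy of a site you still want.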
On 2/8/06, *Insurance Squared Inc.* <[EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]>> wrote:
Hi Prabhu,
Below is the script we use for deleting old segments.
Regards,
Glenn
#!/bin/sh
# Remove old dirs from the segments dir.
# PERIOD is the age threshold (in days) for old dirs.
#
# Created by Keren Yu, Jan 31, 2006

NUTCH_DIR=/home/glenn/nutch
PERIOD=30

# Put the dates of dirs which are older than PERIOD into dates.tmp
ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ]; then
    for dir in `cat $NUTCH_DIR/dirs.tmp`; do
        # Extract the 8-digit date from the dir name
        mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
        if [ $threshold_date -ge $mydate ]; then
            echo $mydate >> $NUTCH_DIR/dates.tmp
        fi
    done
fi

# Remove dirs whose dates appear in dates.tmp
# (skip entirely if no dir was old enough, so grep -f has a file to read)
if [ -f $NUTCH_DIR/dates.tmp ]; then
    ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
    count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
    if [ $count -gt 0 ]; then
        for dir in `cat $NUTCH_DIR/dirs.tmp`; do
            rm -fr $dir
        done
    fi
fi
rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
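The date extraction inside the script can be checked on its own (a sketch reusing the script's sed expression on a made-up segment path):

```shell
# Pull the 8-digit date out of a segment path, exactly as the script does.
dir=/home/glenn/nutch/segments/20060131123045
mydate=$(echo "$dir" | sed "s/.*\/\([0-9]\{8\}\).*/\1/")
echo "$mydate"   # → 20060131
```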
Raghavendra Prabhu wrote:
>Hi
>
>Should we manually delete the old segments in Nutch?
>
>For example, I have indexed a site on a particular day,
>
>and one week after that I index the updated content.
>
>Is there a way I can delete the redundant old URL contents in the
>old segments?
>
>How can we do this?
>
>Rgds
>Prabhu