Hi

Thanks for the info.

My problem is that every now and then my list of sites changes.

The first time, I index sites one and two.

The next time, I index sites two and three.

So there will be new data for site two (and if I use your script, the old
segments for both one and two will be deleted).

But I still want site one to be preserved.

So what I would like is to remove some segments based on the URL of the
page alone.

The information you gave was useful too, but I want to do the above.
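
To make the idea concrete, here is a rough sketch of what I have in mind. It
assumes I keep my own manifest file (hypothetically named segments.map here)
with one line per segment in the form "<segment-dir-name> <site-url>", written
by me at fetch time. Nutch does not create such a file, and the function name
is made up for illustration, so this is only one possible approach.

```shell
#!/bin/sh
# Sketch: delete only the segments recorded for one site.
# ASSUMPTION: the manifest is a file I maintain myself, one line per
# segment: "<segment-dir-name> <site-url>". It is not produced by Nutch.

# usage: remove_segments_for_site SEGMENTS_DIR MANIFEST SITE_URL
remove_segments_for_site() {
    segments_dir=$1
    manifest=$2
    site=$3

    # Remove every segment whose recorded site matches exactly.
    awk -v site="$site" '$2 == site { print $1 }' "$manifest" |
    while read -r seg; do
        if [ -n "$seg" ]; then
            rm -rf "$segments_dir/$seg"
        fi
    done

    # Drop the removed entries from the manifest.
    awk -v site="$site" '$2 != site' "$manifest" > "$manifest.tmp" &&
        mv "$manifest.tmp" "$manifest"
}
```

With this, calling the function with the segments directory, the manifest, and
the URL of site two would remove only the segments recorded for site two and
leave the ones for site one alone.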


Rgds
Prabhu
On 2/8/06, Insurance Squared Inc. <[EMAIL PROTECTED]> wrote:
>
> Hi Prabhu,
>
> Below is the script we use for deleting old segments.
>
> Regards,
> Glenn
>
>
> #!/bin/sh
> # Remove old dirs from segments dir
> # PERIOD is threshold for old dirs
> #
> # Created by Keren Yu Jan 31, 2006
>
> NUTCH_DIR=/home/glenn/nutch
> PERIOD=30
>
> # put the dates of dirs which are older than PERIOD into dates.tmp
> ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
> threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
> count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
> if [ $count -gt 0 ];
> then
> for dir in `cat $NUTCH_DIR/dirs.tmp`
> do
> # get date from dir
>    mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
>    if [ $threshold_date -ge $mydate ];
>    then
>      echo $mydate >> $NUTCH_DIR/dates.tmp
>    fi
> done
> fi
>
> # remove dirs which are older than PERIOD
> ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
> count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
> if [ $count -gt 0 ];
> then
> for dir in `cat $NUTCH_DIR/dirs.tmp`
> do
>    rm -fr $dir
> done
> fi
>
> rm -f $NUTCH_DIR/dates.tmp
>
>
> Raghavendra Prabhu wrote:
>
> >Hi
> >
> >Should we manually delete the old segments in Nutch?
> >
> >For example, I indexed a site on a particular day,
> >
> >and one week later I index the updated content.
> >
> >Is there a way I can delete the redundant old URL contents in the old
> >segments?
> >
> >How can we do this?
> >
> >Rgds
> >Prabhu
> >
> >
> >
>
