My understanding is that sites are recrawled every 30 days.  So if
site #1 was indexed 30 days ago, it would be recrawled and re-indexed
in a new segment with today's crawl.  That leaves you with site #1 in
two segments: the current one and the 30-day-old one.  Nutch then
uses only the current segment, with the most recent crawl of site #1,
and ignores the 30-day-old segment, leaving it safe to delete using
that script.

That's my understanding of how it works; I stand to be corrected by
the experts.
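The lifecycle described above can be sketched in plain shell (a toy directory layout, not real Nutch data — the timestamps and paths are made up for illustration): segment directories are named by crawl timestamp, so a plain lexical sort orders them by date and the last entry is the current segment.

```shell
#!/bin/sh
# Toy sketch of the segment lifecycle described above.
SEG_DIR=`mktemp -d`                # stands in for $NUTCH_DIR/segments
mkdir "$SEG_DIR/20060109120000"    # 30-day-old crawl of site #1
mkdir "$SEG_DIR/20060208120000"    # today's recrawl of site #1
# Timestamp-named dirs sort lexically into date order, so the
# last one is the current segment; older ones are safe to delete.
current=`ls -d "$SEG_DIR"/2* | sort | tail -1`
echo "current segment: $current"
rm -rf "$SEG_DIR"
```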

Raghavendra Prabhu wrote:

Hi
Thanks for the info. My situation is that my list of sites changes
every now and then. The first time I index sites one and two; the next
time I index sites two and three. So there will be new data (and if I
use your script, both one and two will be deleted), but I still want
to have site one preserved. So what I would like is to remove some
segments based on the URL of the page alone. The information you gave
was also useful, but I want to do the above.

Rgds
Prabhu
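One way to get what Prabhu asks for, sketched in plain shell rather than any Nutch feature: keep a file naming the segment directories to preserve, and delete every segment not on that list. Everything below (paths, timestamps, the keep-file name) is illustrative; the sketch runs against a throwaway temp directory.

```shell
#!/bin/sh
# Sketch only: preserve segments listed in a keep-file, delete the rest.
SEG_DIR=`mktemp -d`                 # stands in for $NUTCH_DIR/segments
mkdir "$SEG_DIR/20060101000000"     # old crawl containing sites one and two
mkdir "$SEG_DIR/20060208000000"     # new crawl containing sites two and three
KEEP_LIST=`mktemp`                  # one segment dir name per line
echo 20060101000000 > "$KEEP_LIST"  # preserve the segment with site one
for dir in "$SEG_DIR"/2*
do
   name=`basename "$dir"`
   if grep -q "^$name\$" "$KEEP_LIST"
   then
      echo "keeping $dir"
   else
      rm -fr "$dir"
   fi
done
ls "$SEG_DIR"
rm -rf "$SEG_DIR" "$KEEP_LIST"
```

Mapping a URL to the segment that holds it still has to be done by hand here; the sketch only shows the keep-list deletion step.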
On 2/8/06, *Insurance Squared Inc.* <[EMAIL PROTECTED]> wrote:

    Hi Prabhu,

    Below is the script we use for deleting old segments.

    Regards,
    Glenn


    #!/bin/sh
    # Remove old dirs from segments dir
    # PERIOD is the age threshold (in days) for old dirs
    #
    # Created by Keren Yu Jan 31, 2006

    NUTCH_DIR=/home/glenn/nutch
    PERIOD=30

    # list all segment dirs, then collect the dates of those older than PERIOD
    ls -d "$NUTCH_DIR"/segments/2* > "$NUTCH_DIR/dirs.tmp"
    threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
    # start with an empty date list so the grep -f below never fails
    > "$NUTCH_DIR/dates.tmp"
    count=`wc -l < "$NUTCH_DIR/dirs.tmp"`
    if [ "$count" -gt 0 ]
    then
       for dir in `cat "$NUTCH_DIR/dirs.tmp"`
       do
          # extract the 8-digit date from the dir name
          mydate=`echo "$dir" | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
          if [ "$threshold_date" -ge "$mydate" ]
          then
             echo "$mydate" >> "$NUTCH_DIR/dates.tmp"
          fi
       done
    fi

    # remove dirs whose dates are on the old-date list
    ls -d "$NUTCH_DIR"/segments/2* | grep -f "$NUTCH_DIR/dates.tmp" > "$NUTCH_DIR/dirs.tmp"
    count=`wc -l < "$NUTCH_DIR/dirs.tmp"`
    if [ "$count" -gt 0 ]
    then
       for dir in `cat "$NUTCH_DIR/dirs.tmp"`
       do
          rm -fr "$dir"
       done
    fi

    rm -f "$NUTCH_DIR/dates.tmp" "$NUTCH_DIR/dirs.tmp"
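The two building blocks of that script can be tried in isolation (the segment path below is just an example, and `date -d "... days ago"` is a GNU date feature): GNU date computes the cutoff, and sed extracts the 8-digit date that begins each segment directory name.

```shell
#!/bin/sh
# GNU date computes the age cutoff as YYYYMMDD.
PERIOD=30
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
echo "threshold: $threshold_date"
# sed pulls the 8-digit date out of a segment dir name (path illustrative).
dir=/home/glenn/nutch/segments/20060208120000
mydate=`echo "$dir" | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
echo "segment date: $mydate"   # the script deletes $dir when threshold >= this
```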


    Raghavendra Prabhu wrote:

    >Hi
    >
    >Should we manually delete the old segments in Nutch?
    >
    >For example, I have indexed a site on a particular day,
    >
    >and one week after that I index the updated content.
    >
    >Is there a way I can delete the redundant old URL contents in the
    >old segments?
    >
    >How can we do this?
    >
    >Rgds
    >Prabhu

