My understanding is that sites are recrawled every 30 days.  So if
site #1 was indexed 30 days ago, it would be recrawled and re-indexed
in a new segment with today's crawl.  That leaves site #1 in
two segments: the current one and the 30-day-old one.  Nutch then
only uses the current segment, which holds the most recent crawl of
site #1, and ignores the 30-day-old segment, leaving it safe to
delete with the script below.
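
If you want to see which segment is current before deleting anything,
the segment directory names are crawl timestamps, so sorting them does
the job.  A minimal sketch, assuming segments live under
$NUTCH_DIR/segments and are named by date as in the script below:

    #!/bin/sh
    NUTCH_DIR=/home/glenn/nutch   # same layout the script below assumes
    # Segments sorted oldest to newest; the last line is the current crawl
    ls -d $NUTCH_DIR/segments/2* | sort
    # Just the newest segment directory
    ls -d $NUTCH_DIR/segments/2* | sort | tail -1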

That's my understanding of how it works; I stand to be corrected by
the experts.

Raghavendra Prabhu wrote:

Hi
Thanks for the info.  My situation is that my list of sites changes
every now and then.  The first time I index sites one and two; the
next time I index sites two and three, so there will be new data.  If
I use your script, the old segment holding both one and two will be
deleted, but I still want site one preserved.  So what I would like is
to remove some segments based on the URL of the page alone.  The
information you gave was also useful, but I want to do the above.
Rgds
Prabhu
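
One rough way to remove segments based on URL alone is to check each
segment for the URLs you want to keep before deleting it.  A minimal
sketch, assuming the segment data files store page URLs as plain,
uncompressed strings (so a recursive grep can find them); NUTCH_DIR
and KEEP_URL are placeholders, and the grep is a heuristic, not an
official Nutch interface:

    #!/bin/sh
    # Sketch: delete segments that do NOT contain a URL we want to keep.
    # The grep looks for the URL bytes inside the binary segment files,
    # which only works if the segment data is uncompressed.
    # Swap rm for echo first to do a dry run.
    NUTCH_DIR=/home/glenn/nutch            # placeholder path
    KEEP_URL="http://www.example.com/"     # placeholder URL to preserve

    for dir in $NUTCH_DIR/segments/2*
    do
       if grep -rq "$KEEP_URL" "$dir"
       then
          echo "keeping $dir (contains $KEEP_URL)"
       else
          rm -fr "$dir"
       fi
    done
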
On 2/8/06, Insurance Squared Inc. <[EMAIL PROTECTED]> wrote:

    Hi Prabhu,

    Below is the script we use for deleting old segments.

    Regards,
    Glenn


    #!/bin/sh
    # Remove old dirs from segments dir
    # PERIOD is the age threshold for old dirs, in days
    #
    # Created by Keren Yu Jan 31, 2006

    NUTCH_DIR=/home/glenn/nutch
    PERIOD=30

    # Segment dirs are named by crawl date (YYYYMMDD...), so list them
    # and compute the cutoff date
    ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
    threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`

    # Put the dates of dirs older than PERIOD into dates.tmp
    # (truncate it first so a previous run can't leave stale entries)
    > $NUTCH_DIR/dates.tmp
    for dir in `cat $NUTCH_DIR/dirs.tmp`
    do
       # get the 8-digit date from the dir name
       mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
       if [ $threshold_date -ge $mydate ]
       then
          echo $mydate >> $NUTCH_DIR/dates.tmp
       fi
    done

    # Remove dirs whose date is in dates.tmp; skip the grep entirely
    # when dates.tmp is empty (nothing old enough to delete), which
    # also avoids grep erroring on a missing pattern file
    if [ -s $NUTCH_DIR/dates.tmp ]
    then
       ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
       for dir in `cat $NUTCH_DIR/dirs.tmp`
       do
          rm -fr $dir
       done
    fi

    # Clean up temp files
    rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
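
    A script like this can be run nightly from cron; a sample crontab
    entry (the script filename here is hypothetical):

    # run the segment cleanup at 3am every day
    0 3 * * * /home/glenn/nutch/delete_old_segments.sh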


    Raghavendra Prabhu wrote:

    >Hi
    >
    >Should we manually delete the old segments in Nutch?
    >
    >For example, I have indexed a site on a particular day,
    >and one week after that I index the updated content.
    >
    >Is there a way I can delete the redundant old URL contents in the
    >old segments?
    >
    >How can we do this?
    >
    >Rgds
    >Prabhu
    >



