My understanding was that sites are recrawled every 30 days. So if
site #1 was indexed 30 days ago, it would be recrawled and re-indexed
in a new segment with today's crawl. That leaves you with site #1 in
two segments: the current one and the 30-day-old segment. Nutch then
only uses the current segment with the most recent crawl of site #1 and
ignores the 30-day-old segment, leaving it safe to delete with that
script.
That's my understanding of how it works; I stand to be corrected by the
experts.
Raghavendra Prabhu wrote:
Hi
Thanks for the info.
My situation is that my list of sites changes from time to time.
The first time I index sites one and two;
the next time I index sites two and three.
So there will be new data (and if I use your script, both one and
two will be deleted).
But I still want site one preserved.
So what I would like is to remove some segments based on the URL of
the page alone.
The information you gave was also useful, but I want to do the above.
Rgds
Prabhu
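One way to approach URL-based deletion: Nutch segments are named by crawl date, not by site, but if you can dump each segment's page URLs to a plain-text file (for example with a segment-dump tool), a small wrapper can keep any segment that still contains a URL you care about and delete the rest. A sketch only, under assumptions: the per-segment `urls.txt` dump and the `keep_urls.txt` list are illustrative names, not files Nutch produces itself.

```shell
#!/bin/sh
# Sketch: delete segments that do NOT contain any URL we want to keep.
# Assumes each segment dir holds a urls.txt dump of its page URLs
# (hypothetical layout -- produce it with your segment-dump tool of choice).
#
# Usage: delete_stale_segments SEGMENTS_DIR KEEP_LIST
delete_stale_segments() {
    seg_root=$1     # root dir holding one subdir per segment
    keep_list=$2    # file of URL patterns to preserve, one per line
    for seg in "$seg_root"/*
    do
        [ -d "$seg" ] || continue
        # grep -q -f: quietly test whether any keep pattern matches
        if grep -q -f "$keep_list" "$seg/urls.txt" 2>/dev/null
        then
            echo "keeping $seg"
        else
            echo "deleting $seg"
            rm -rf "$seg"
        fi
    done
}
```

For example, with a keep list containing `one.example.com`, a segment whose dump lists only pages from site three would be removed, while the segment holding site one survives even if it is old.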
On 2/8/06, *Insurance Squared Inc.* <[EMAIL PROTECTED]> wrote:
Hi Prabhu,
Below is the script we use for deleting old segments.
Regards,
Glenn
#!/bin/sh
# Remove old dirs from the segments dir.
# PERIOD is the age threshold (in days) for old dirs.
#
# Created by Keren Yu, Jan 31, 2006
NUTCH_DIR=/home/glenn/nutch
PERIOD=30

# Put the dates of dirs which are older than PERIOD into dates.tmp
ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
if [ $count -gt 0 ]
then
    for dir in `cat $NUTCH_DIR/dirs.tmp`
    do
        # Extract the YYYYMMDD date from the dir name
        mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
        if [ $threshold_date -ge $mydate ]
        then
            echo $mydate >> $NUTCH_DIR/dates.tmp
        fi
    done
fi

# Remove dirs which are older than PERIOD.
# Guard on dates.tmp existing and being non-empty, so grep -f
# doesn't error out (or match everything) when nothing is old.
if [ -s $NUTCH_DIR/dates.tmp ]
then
    ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
    count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
    if [ $count -gt 0 ]
    then
        for dir in `cat $NUTCH_DIR/dirs.tmp`
        do
            rm -fr $dir
        done
    fi
fi
rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
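For unattended cleanup, the script above can be run nightly from cron; a sketch, assuming it is saved as /home/glenn/nutch/clean_segments.sh (the path and schedule are illustrative):

```shell
# m h dom mon dow  command -- run the segment cleanup at 03:15 each night,
# appending output to a log for later inspection
15 3 * * * /bin/sh /home/glenn/nutch/clean_segments.sh >> /home/glenn/nutch/clean_segments.log 2>&1
```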
Raghavendra Prabhu wrote:
>Hi
>
>Should we manually delete the old segments in nutch.
>
>For example i have indexed a site on a particular day
>
>and one week after that i index the updated content
>
>Is there a way i can delete the redundant old url contents in the
old
>segments
>
>How can we do this?
>
>Rgds
>Prabhu
>
>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general