I'm looking over the Intranet Recrawl script here: http://wiki.apache.org/nutch/IntranetRecrawl and I'm a little confused about segment merging and deleting.
####Start code snip####
# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

for segment in `ls -d $segments_dir/* | tail -$depth`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done

cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir
####End code snip####

As I understand it, this merges ALL segments into a new segment, deletes the NEW segments from the recrawl, and then adds the merged segment to the existing ones.

For example, if I had existing segment1 and segment2, and the recrawl creates segment3 and segment4, then we merge all the segments into mergedsegment1-2-3-4, delete the new segment3 and segment4, and copy mergedsegment1-2-3-4 back, so that the segments dir now contains segment1, segment2, and mergedsegment1-2-3-4.

It seems to me that we should either be merging only the new segments, or deleting all of the existing segments. Can someone confirm this or explain what the script is in fact doing?

-- http://JacobBrunson.com
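
To make the second option concrete (merge everything, then drop every old segment), here is a rough sketch of what I would have expected. It is not taken from the wiki script. It assumes the same mergesegs invocation as above, that the segments live in $crawl_dir/segments, and that every segment in that directory went into the merge. The function name merge_and_replace_segments and the NUTCH_CMD override are hypothetical and exist only so the logic can be tried without a Nutch install.

```shell
#!/bin/sh
# Sketch only: merge ALL segments, then replace every old segment with the
# merged one. NUTCH_CMD is a hypothetical override (not part of Nutch) that
# lets the function be exercised with a stand-in command.

merge_and_replace_segments() {
    crawl_dir=$1
    segments_dir=$crawl_dir/segments        # assumed layout
    mergesegs_dir=$crawl_dir/mergesegs_dir

    # Same merge call as in the wiki script.
    "${NUTCH_CMD:-$nutch_dir/nutch}" mergesegs "$mergesegs_dir" -dir "$segments_dir" || return 1

    # Refuse to delete anything if the merge produced no output.
    [ -n "$(ls -A "$mergesegs_dir" 2>/dev/null)" ] || return 1

    # Remove ALL input segments, not just the newest $depth of them.
    for segment in "$segments_dir"/*
    do
        echo "Removing merged-in segment: $segment"
        rm -rf "$segment"
    done

    # Put the merged segment in place and clean up the temporary directory.
    cp -R "$mergesegs_dir"/* "$segments_dir"
    rm -rf "$mergesegs_dir"
}
```

With this, the segments dir would end up containing only mergedsegment1-2-3-4 in the example above, instead of segment1, segment2, and the merged segment.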
