> ...
> In general, if you inject a set of urls into a webdb and create a new
> segment, the segment should only contain the new urls, plus pages that
> are older than 30 days and due to be refetched anyway.
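(Aside: the 30 days mentioned above is Nutch's default refetch interval.
In the 0.7-era configuration this appears to be the
db.default.fetch.interval property; the snippet below is a sketch for
conf/nutch-site.xml, and the property name should be verified against
the nutch-default.xml shipped with your version:)

  <!-- conf/nutch-site.xml: override the default refetch interval -->
  <!-- property name assumed from Nutch 0.7; check nutch-default.xml -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Number of days between re-fetches of a page.</description>
  </property>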
Actually, it seems to me that generated segments also contain urls that
are in DB_UNFETCHED status from the latest fetching job. I mean, if I
inject a url and set a fetching depth of 1, at the end of the process
the webdb will contain 1 url in DB_FETCHED status and n urls in
DB_UNFETCHED status (where n is the number of outgoing links of the
injected url). If I then inject another url and generate a new segment,
that segment will contain both the newly injected url and the n
unfetched urls from the previous iteration... Is there a way to
instruct nutch to fetch only the injected url?

Thanks,
Enrico

> On 08.02.2006 at 14:56, Scott Owens wrote:
>
> > Hi All,
> >
> > I wanted to check in to see if anyone has found an answer for this
> > issue. I am injecting new URLs on a daily basis, and only need to
> > fetch/index those new ones, but obviously need to maintain a
> > complete webdb.
> >
> > One thing I was thinking was to use a temporary webdb for the
> > initial injection, then updating (updatedb) my primary webdb after
> > the fetch or indexing:
> >
> > # prepare dirs and inject urls into the temporary webdb
> > rm -rf $db/*
> > $nutch admin -local $db -create
> > $nutch inject -local $db -urlfile $urlFile
> >
> > echo -e "\nGenerating next segment to fetch"
> > $nutch generate -local $db $segmentdir $fetchLimit
> > s=`ls -d $segmentdir/* | tail -1`
> > echo -e "\nFetching next segment"
> > $nutch fetch $s
> > echo -e "\nUpdating web database"
> > $nutch updatedb $dbmain $s
> > echo -e "\nAnalyzing links"
> > $nutch analyze $dbmain 5
> >
> > OR after the segment is indexed -- as the above method wouldn't
> > allow a depth greater than 1:
> >
> > # prepare dirs and inject urls into the temporary webdb
> > rm -rf $db/*
> > $nutch admin -local $db -create
> > $nutch inject -local $db -urlfile $urlFile
> >
> > for i in `seq $depth`
> > do
> >   echo -e "\nGenerating next segment to fetch"
> >   $nutch generate -local $db $segmentdir $fetchLimit
> >   s=`ls -d $segmentdir/* | tail -1`
> >   echo -e "\nFetching next segment"
> >   $nutch fetch $s
> >   echo -e "\nUpdating web database"
> >   $nutch updatedb $db $s
> >   echo -e "\nAnalyzing links"
> >   $nutch analyze $db 5
> > done
> >
> > echo -e "\nFetch done"
> > echo "Indexing segments"
> >
> > for s in `ls -1d $segmentdir/*`
> > do
> >   $nutch index $s
> > done
> >
> > # fold every fetched segment into the main webdb, not just the last
> > echo -e "\nUpdating web database"
> > for s in `ls -1d $segmentdir/*`
> > do
> >   $nutch updatedb $dbmain $s
> > done
> >
> > OR maybe I have no idea what I'm talking about : ) - I'm not a
> > developer, just trying to figure things out.
> >
> > If anyone has experience with this and some advice, I'm all ears.
> > Thanks!
> >
> > Scott
> >
> > On 11/10/05, Dean Elwood <[EMAIL PROTECTED]> wrote:
> >> Hi Lawrence,
> >>
> >> I'm stuck in the same position. I haven't yet examined the "merge"
> >> function, which might shed some light on it.
> >>
> >> Have you managed to discover anything so far?
> >>
> >>>> You can use the regular-expression-based url filter. Then only
> >>>> urls that match the pattern will be added to a fetch list. <<
> >>
> >> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
> >>
> >> The trick, and the question, is how you add those to an existing
> >> database, and then re-index, without doing a full re-crawl?
> >>
> >> Thanks,
> >>
> >> Dean
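(Aside: the regular-expression url filter mentioned in the quoted lines
above is, in the 0.7-era layout, configured in conf/regex-urlfilter.txt:
a list of regexes prefixed with + (accept) or - (reject), where the
first matching pattern decides. The hostnames below are made-up
placeholders, not from the thread:)

  # conf/regex-urlfilter.txt -- illustrative sketch
  # accept the newly added sites only...
  +^http://([a-z0-9]*\.)*newsite-one\.com/
  +^http://([a-z0-9]*\.)*newsite-two\.com/
  # ...and reject everything else
  -.

Note, though, that this only limits what gets fetched; Dean's point
stands that the filter alone doesn't fold the results into an existing
database.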
> >> ----- Original Message -----
> >> From: "Lawrence Pitcher" <[EMAIL PROTECTED]>
> >> To: <[email protected]>
> >> Sent: Thursday, November 10, 2005 5:05 PM
> >> Subject: How to add only new urls to DB
> >>
> >> Hi,
> >>
> >> Thanks to all for the best search solution available.
> >>
> >> I have installed the software, indexed 15,000 websites, and tested
> >> the search, and it works great!
> >>
> >> I wanted to add only two more websites, so I made a "newurls.txt"
> >> file, injected it into the WebDB ("bin/nutch inject db/ -urlfile
> >> newurls.txt"), and generated a new segment ("bin/nutch generate db/
> >> segments/"). I then checked for the new segment name in the
> >> "segments/" directory.
> >>
> >> I took that new segment name and placed it in the fetch command:
> >> "bin/nutch fetch segments/20051110103316/"
> >>
> >> However, it appears to re-fetch all 15,000 webpages along with the
> >> newurls.txt webpages.
> >>
> >> Can I not just fetch and index only the new urls and then update
> >> the DB?
> >>
> >> Sorry for such a lame question, but I have just started.
> >>
> >> Many thanks to all.
> >> Lawrence
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum:   http://www.text-mining.org
> blog:    http://www.find23.net
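What emerges from the thread, then, is Scott's temporary-webdb idea:
generate builds its fetchlist from whatever webdb you point it at, so a
scratch db containing only the newly injected urls yields a fetchlist
with only those urls. A minimal end-to-end sketch for Lawrence's
two-new-sites case, assuming the Nutch 0.7-style commands used above
($db, $dbmain, and the file names are placeholders):

  #!/bin/sh
  nutch=bin/nutch
  db=db-scratch          # throwaway webdb, rebuilt on every run
  dbmain=db              # the long-lived webdb behind the index
  segmentdir=segments
  urlFile=newurls.txt    # contains only the new urls

  # 1. rebuild the scratch db and inject only the new urls
  rm -rf $db
  $nutch admin -local $db -create
  $nutch inject -local $db -urlfile $urlFile

  # 2. generate from the scratch db; since it holds nothing else,
  #    the fetchlist contains only the injected urls
  $nutch generate -local $db $segmentdir
  s=`ls -d $segmentdir/* | tail -1`

  # 3. fetch and index just that one segment
  $nutch fetch $s
  $nutch index $s

  # 4. fold the fetched pages and their outlinks into the main db
  $nutch updatedb $dbmain $s
  $nutch analyze $dbmain 5

As Scott notes, this only covers a depth of 1; for deeper crawls the
scratch db itself has to be updated between iterations, as in his
second, loop-based variant.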
