No, because generate looks in the web db (crawldb) for links whose status is db_unfetched; it doesn't know (or care) that a url was just injected...
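If you only want the newly injected urls on the fetch list, one workaround (along the lines of Scott's script quoted below) is to inject them into a throw-away webdb, generate and fetch against that, and only then fold the fetched segment into your main webdb with updatedb. A rough, untested sketch -- the names ($nutch, $tmpdb, $maindb, newurls.txt) are placeholders, and the commands follow the 0.7-style syntax used in this thread:

  nutch=bin/nutch
  tmpdb=tmpdb          # scratch webdb, recreated on every run
  maindb=db            # your long-lived webdb
  segments=segments

  # inject the new urls into a fresh scratch db
  rm -rf $tmpdb
  $nutch admin -local $tmpdb -create
  $nutch inject -local $tmpdb -urlfile newurls.txt

  # generate/fetch against the scratch db: only the injected urls are due
  $nutch generate -local $tmpdb $segments
  s=`ls -d $segments/* | tail -1`
  $nutch fetch $s

  # fold the fetched pages (and their outlinks) into the main webdb
  $nutch updatedb $maindb $s
  $nutch index $s

The point is simply that generate runs against the scratch db, so the old db_unfetched entries in the main webdb never make it onto the fetch list.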
On Mon, 2006-02-13 at 16:52 +0100, Enrico Triolo wrote:
> > ...
> > In general, if you inject a set of urls into a webdb and create a new
> > segment, the segment should only contain the new urls plus pages that
> > are older than 30 days and are fetched anyway.
>
> Actually, it seems to me that generated segments also contain urls that
> are in DB_UNFETCHED status from the latest fetching job.
>
> I mean, if I inject a url and set a fetching depth of 1, at the end of
> the process the webdb will contain 1 url in DB_FETCHED status and n urls
> in DB_UNFETCHED (where n is the number of outgoing links of the injected
> url). If I then inject another url and generate a new segment, it will
> contain that url plus the n urls from the previous iteration...
> Is there a way to instruct nutch to only fetch the injected url?
>
> Thanks,
> Enrico
>
> > On 08.02.2006 at 14:56, Scott Owens wrote:
> >
> > > Hi All,
> > >
> > > I wanted to check in to see if anyone has found an answer for this
> > > issue. I am injecting new URLs on a daily basis and only need to
> > > fetch/index those new ones, but obviously need to maintain a complete
> > > webdb.
> > >
> > > One thing I was thinking was to use a temporary webdb for the initial
> > > injection, then update (updatedb) my primary webdb after the fetch or
> > > indexing:
> > >
> > > # prepare dirs and inject urls
> > > rm -rf $db/*
> > > $nutch admin -local $db -create
> > > $nutch inject -local $db -urlfile $urlFile
> > >
> > > echo -e "\nGenerating next segment to fetch"
> > > $nutch generate -local $db $segmentdir $fetchLimit
> > > s=`ls -d $segmentdir/* | tail -1`
> > > echo -e "\nFetching next segment"
> > > $nutch fetch $s
> > > echo -e "\nUpdating web database"
> > > $nutch updatedb $dbmain $s
> > > echo -e "\nAnalyzing links"
> > > $nutch analyze $dbmain 5
> > >
> > > OR after the segment is indexed -- as the above method wouldn't allow
> > > a depth greater than 1?
> > >
> > > # prepare dirs and inject urls
> > > rm -rf $db/*
> > > $nutch admin -local $db -create
> > > $nutch inject -local $db -urlfile $urlFile
> > >
> > > for i in `seq $depth`
> > > do
> > >   echo -e "\nGenerating next segment to fetch"
> > >   $nutch generate -local $db $segmentdir $fetchLimit
> > >   s=`ls -d $segmentdir/* | tail -1`
> > >   echo -e "\nFetching next segment"
> > >   $nutch fetch $s
> > >   echo -e "\nUpdating web database"
> > >   $nutch updatedb $db $s
> > >   echo -e "\nAnalyzing links"
> > >   $nutch analyze $db 5
> > > done
> > >
> > > echo -e "\nFetch done"
> > > echo "Indexing segments"
> > >
> > > for s in `ls -1d $segmentdir/*`
> > > do
> > >   $nutch index $s
> > > done
> > >
> > > echo -e "\nUpdating web database"
> > > $nutch updatedb $dbmain $s
> > >
> > > OR maybe I have no idea what I'm talking about : ) - I'm not a
> > > developer, just trying to figure things out.
> > >
> > > If anyone has experience with this and some advice, I'm all ears.
> > > Thanks!
> > >
> > > Scott
> > >
> > > On 11/10/05, Dean Elwood <[EMAIL PROTECTED]> wrote:
> > >> Hi Lawrence,
> > >>
> > >> I'm stuck in the same position. I haven't yet examined the "merge"
> > >> function, which might shed some light on it.
> > >>
> > >> Have you managed to discover anything so far?
> > >>
> > >> >>You can use the regular expression based url filter. Then only
> > >> urls that match the pattern will be added to a fetch list.<<
> > >>
> > >> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
> > >>
> > >> The trick, and the question, is how you add that to an existing
> > >> database, and then re-index, without doing a full re-crawl?
> > >>
> > >> Thanks,
> > >>
> > >> Dean
> > >>
> > >> ----- Original Message -----
> > >> From: "Lawrence Pitcher" <[EMAIL PROTECTED]>
> > >> To: <[email protected]>
> > >> Sent: Thursday, November 10, 2005 5:05 PM
> > >> Subject: How to add only new urls to DB
> > >>
> > >> Hi,
> > >>
> > >> Thanks to all for the best search solution available.
> > >>
> > >> I have installed the software, indexed 15,000 websites and tested the
> > >> search, and it works great!
> > >>
> > >> I wanted to add only two more websites, so I made a "newurls.txt"
> > >> file, injected it into the WebDB with "bin/nutch inject db/ -urlfile
> > >> newurls.txt", generated a new segment with "bin/nutch generate db/
> > >> segments/", and then checked for the new segment name in the
> > >> "segments/" directory.
> > >>
> > >> I took that new segment name and placed it in the fetch command
> > >> "bin/nutch fetch segments/20051110103316/".
> > >>
> > >> However, it appears to re-fetch all 15,000 webpages along with the
> > >> newurls.txt webpages.
> > >>
> > >> Can I not just fetch and index only the new urls and then update the
> > >> DB?
> > >>
> > >> Sorry for such a lame question but I have just started.
> > >>
> > >> Many thanks to all.
> > >> Lawrence
> >
> > ---------------------------------------------------------------
> > company: http://www.media-style.com
> > forum: http://www.text-mining.org
> > blog: http://www.find23.net
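For completeness, the url-filter approach Stefan mentions above would look something like the following in Nutch's regex url filter configuration (typically conf/regex-urlfilter.txt or crawl-urlfilter.txt, depending on which filter your setup uses); the host names here are placeholders, and the first matching rule wins:

  # accept only the newly added sites (placeholder hosts)
  +^http://([a-z0-9-]+\.)*newsite-one\.com/
  +^http://([a-z0-9-]+\.)*newsite-two\.com/

  # reject everything else
  -.

As Dean points out, though, this only controls what ends up on a fetch list; it doesn't by itself answer how to merge the newly fetched segment back into an existing webdb and index without a full re-crawl.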
