No, because generate looks in the web db (crawldb) for links whose status is db_unfetched; it doesn't know (or care) that a url was just injected...
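If you only want the newly injected urls on the fetch list, one workaround (along the lines of Scott's script quoted below) is to inject them into a throw-away webdb, generate and fetch against that, and only then fold the fetched segment into your main webdb with updatedb. A rough, untested sketch -- the names ($nutch, $tmpdb, $maindb, newurls.txt) are placeholders, and the commands follow the 0.7-style syntax used in this thread:

  nutch=bin/nutch
  tmpdb=tmpdb          # scratch webdb, recreated on every run
  maindb=db            # your long-lived webdb
  segments=segments

  # inject the new urls into a fresh scratch db
  rm -rf $tmpdb
  $nutch admin -local $tmpdb -create
  $nutch inject -local $tmpdb -urlfile newurls.txt

  # generate/fetch against the scratch db: only the injected urls are due
  $nutch generate -local $tmpdb $segments
  s=`ls -d $segments/* | tail -1`
  $nutch fetch $s

  # fold the fetched pages (and their outlinks) into the main webdb
  $nutch updatedb $maindb $s
  $nutch index $s

The point is simply that generate runs against the scratch db, so the old db_unfetched entries in the main webdb never make it onto the fetch list.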
On Mon, 2006-02-13 at 16:52 +0100, Enrico Triolo wrote:
> > ...
> > In general, if you inject a set of urls into a webdb and create a new
> > segment, the segment should only contain the new urls plus pages that
> > are older than 30 days and are fetched anyway.
>
> Actually, it seems to me that generated segments also contain urls that
> are in DB_UNFETCHED status from the latest fetching job.
>
> I mean, if I inject a url and set a fetching depth of 1, at the end of
> the process the webdb will contain 1 url in DB_FETCHED status and n urls
> in DB_UNFETCHED (where n is the number of outgoing links of the injected
> url). If I then inject another url and generate a new segment, it will
> contain that url plus the n urls from the previous iteration...
> Is there a way to instruct nutch to only fetch the injected url?
>
> Thanks,
> Enrico
>
> > On 08.02.2006 at 14:56, Scott Owens wrote:
> >
> > > Hi All,
> > >
> > > I wanted to check in to see if anyone has found an answer for this
> > > issue. I am injecting new URLs on a daily basis and only need to
> > > fetch/index those new ones, but obviously need to maintain a complete
> > > webdb.
> > >
> > > One thing I was thinking was to use a temporary webdb for the initial
> > > injection, then update (updatedb) my primary webdb after the fetch or
> > > indexing:
> > >
> > > # prepare dirs and inject urls
> > > rm -rf $db/*
> > > $nutch admin -local $db -create
> > > $nutch inject -local $db -urlfile $urlFile
> > >
> > > echo -e "\nGenerating next segment to fetch"
> > > $nutch generate -local $db $segmentdir $fetchLimit
> > > s=`ls -d $segmentdir/* | tail -1`
> > > echo -e "\nFetching next segment"
> > > $nutch fetch $s
> > > echo -e "\nUpdating web database"
> > > $nutch updatedb $dbmain $s
> > > echo -e "\nAnalyzing links"
> > > $nutch analyze $dbmain 5
> > >
> > > OR after the segment is indexed -- as the above method wouldn't allow
> > > a depth greater than 1?
> > >
> > > # prepare dirs and inject urls
> > > rm -rf $db/*
> > > $nutch admin -local $db -create
> > > $nutch inject -local $db -urlfile $urlFile
> > >
> > > for i in `seq $depth`
> > > do
> > >   echo -e "\nGenerating next segment to fetch"
> > >   $nutch generate -local $db $segmentdir $fetchLimit
> > >   s=`ls -d $segmentdir/* | tail -1`
> > >   echo -e "\nFetching next segment"
> > >   $nutch fetch $s
> > >   echo -e "\nUpdating web database"
> > >   $nutch updatedb $db $s
> > >   echo -e "\nAnalyzing links"
> > >   $nutch analyze $db 5
> > > done
> > >
> > > echo -e "\nFetch done"
> > > echo "Indexing segments"
> > >
> > > for s in `ls -1d $segmentdir/*`
> > > do
> > >   $nutch index $s
> > > done
> > >
> > > echo -e "\nUpdating web database"
> > > $nutch updatedb $dbmain $s
> > >
> > > OR maybe I have no idea what I'm talking about : ) - I'm not a
> > > developer, just trying to figure things out.
> > >
> > > If anyone has experience with this and some advice, I'm all ears.
> > > Thanks!
> > >
> > > Scott
> > >
> > > On 11/10/05, Dean Elwood <[EMAIL PROTECTED]> wrote:
> > >> Hi Lawrence,
> > >>
> > >> I'm stuck in the same position. I haven't yet examined the "merge"
> > >> function, which might shed some light on it.
> > >>
> > >> Have you managed to discover anything so far?
> > >>
> > >> >>You can use the regular expression based url filter. Then only
> > >> urls that match the pattern will be added to a fetch list.<<
> > >>
> > >> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
> > >>
> > >> The trick, and the question, is how you add that to an existing
> > >> database, and then re-index, without doing a full re-crawl?
> > >>
> > >> Thanks,
> > >>
> > >> Dean
> > >>
> > >> ----- Original Message -----
> > >> From: "Lawrence Pitcher" <[EMAIL PROTECTED]>
> > >> To: <[email protected]>
> > >> Sent: Thursday, November 10, 2005 5:05 PM
> > >> Subject: How to add only new urls to DB
> > >>
> > >> Hi,
> > >>
> > >> Thanks to all for the best search solution available.
> > >>
> > >> I have installed the software, indexed 15,000 websites and tested the
> > >> search, and it works great!
> > >>
> > >> I wanted to add only two more websites, so I made a "newurls.txt"
> > >> file, injected it into the WebDB with "bin/nutch inject db/ -urlfile
> > >> newurls.txt", generated a new segment with "bin/nutch generate db/
> > >> segments/", and then checked for the new segment name in the
> > >> "segments/" directory.
> > >>
> > >> I took that new segment name and placed it in the fetch command
> > >> "bin/nutch fetch segments/20051110103316/".
> > >>
> > >> However, it appears to re-fetch all 15,000 webpages along with the
> > >> newurls.txt webpages.
> > >>
> > >> Can I not just fetch and index only the new urls and then update the
> > >> DB?
> > >>
> > >> Sorry for such a lame question but I have just started.
> > >>
> > >> Many thanks to all.
> > >> Lawrence
> >
> > ---------------------------------------------------------------
> > company: http://www.media-style.com
> > forum: http://www.text-mining.org
> > blog: http://www.find23.net
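For completeness, the url-filter approach Stefan mentions above would look something like the following in Nutch's regex url filter configuration (typically conf/regex-urlfilter.txt or crawl-urlfilter.txt, depending on which filter your setup uses); the host names here are placeholders, and the first matching rule wins:

  # accept only the newly added sites (placeholder hosts)
  +^http://([a-z0-9-]+\.)*newsite-one\.com/
  +^http://([a-z0-9-]+\.)*newsite-two\.com/

  # reject everything else
  -.

As Dean points out, though, this only controls what ends up on a fetch list; it doesn't by itself answer how to merge the newly fetched segment back into an existing webdb and index without a full re-crawl.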
