> ...
> In general, if you inject a set of urls into a webdb and create a new
> segment, the segment should only contain the new urls plus pages that
> are older than 30 days and due to be refetched anyway.

Actually, it seems to me that generated segments also contain urls that
are still in DB_UNFETCHED status from the latest fetch job.

I mean, if I inject a url and set a fetching depth of 1, at the end
of the process the webdb will contain 1 url in DB_FETCHED status and
n urls in DB_UNFETCHED status (where n is the number of outgoing
links of the injected url).
If I then inject another url and generate a new segment, the segment
will contain the newly injected url plus the n unfetched urls left
over from the previous iteration...
Is there a way to instruct Nutch to fetch only the injected url?
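
One workaround that occurs to me (an untested sketch along the lines
of the temporary-webdb idea quoted below; $tmpdb, $maindb and
$segments are placeholder paths) is to generate the fetch list from
a scratch webdb that contains nothing but the newly injected urls:

      # scratch webdb holds only the new urls
      bin/nutch admin -local $tmpdb -create
      bin/nutch inject -local $tmpdb -urlfile newurls.txt
      # so the generated segment can only contain those urls
      bin/nutch generate -local $tmpdb $segments
      s=`ls -d $segments/* | tail -1`
      bin/nutch fetch $s
      # fold the fetched segment into the main webdb so it stays complete
      bin/nutch updatedb $maindb $s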

Thanks,
Enrico

> On 08.02.2006 at 14:56, Scott Owens wrote:
>
> > Hi All,
> >
> > I wanted to check in to see if anyone has found an answer for this
> > issue.  I am injecting new URLs on a daily basis, and only need to
> > fetch/index those new ones, but obviously need to maintain a complete
> > webdb.
> >
> > One thing I was thinking of was to use a temporary webdb for the
> > initial injection, then update (updatedb) my primary webdb after the
> > fetch or after indexing.
> >
> > # prepare dirs and inject urls into the temporary webdb ($db)
> >       rm -rf $db/*
> >       $nutch admin -local $db -create
> >       $nutch inject -local $db -urlfile $urlFile
> >
> >       echo -e "\nGenerating next segment to fetch"
> >       $nutch generate -local $db $segmentdir $fetchLimit
> >       s=`ls -d $segmentdir/* | tail -1`
> >       echo -e "\nFetching next segment"
> >       $nutch fetch $s
> >       # update the primary webdb ($dbmain), not the temporary one
> >       echo -e "\nUpdating web database"
> >       $nutch updatedb $dbmain $s
> >       echo -e "\nAnalyzing links"
> >       $nutch analyze $dbmain 5
> >
> > OR should I update the primary webdb after the segments are indexed --
> > since the above method wouldn't allow a depth greater than 1?
> >
> > # prepare dirs and inject urls into the temporary webdb ($db)
> >       rm -rf $db/*
> >       $nutch admin -local $db -create
> >       $nutch inject -local $db -urlfile $urlFile
> >
> > # here the temporary webdb is updated inside the loop, so crawling
> > # deeper than 1 level works
> > for i in `seq $depth`
> > do
> >       echo -e "\nGenerating next segment to fetch"
> >       $nutch generate -local $db $segmentdir $fetchLimit
> >       s=`ls -d $segmentdir/* | tail -1`
> >       echo -e "\nFetching next segment"
> >       $nutch fetch $s
> >       echo -e "\nUpdating web database"
> >       $nutch updatedb $db $s
> >       echo -e "\nAnalyzing links"
> >       $nutch analyze $db 5
> > done
> >
> > echo -e "\nFetch done"
> > echo "Indexing segments"
> >
> > for s in `ls -1d $segmentdir/*`
> > do
> >       $nutch index $s
> >       # merge each segment into the primary webdb, not just the last one
> >       echo -e "\nUpdating web database"
> >       $nutch updatedb $dbmain $s
> > done
> >
> >
> > OR maybe I have no idea what I'm talking about :) -- I'm not a
> > developer, just trying to figure things out.
> >
> > If anyone has experience with this and some advice, I'm all ears.
> > Thanks!
> >
> > Scott
> >
> > On 11/10/05, Dean Elwood <[EMAIL PROTECTED]> wrote:
> >> Hi Lawrence,
> >>
> >> I'm stuck in the same position. I haven't yet examined the "merge"
> >> function, which might shed some light on it.
> >>
> >> Have you managed to discover anything so far?
> >>
> >>>> You can use the regular expression based url filter. Then only
> >>>> urls that match the pattern will be added to a fetch list. <<
> >>
> >> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
> >>
> >> The trick -- and the question -- is how to add that to an existing
> >> database and then re-index without doing a full re-crawl.
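> >>
> >> (For completeness, the "easy part" would presumably be a filter
> >> file along these lines -- my guess at the regex url filter
> >> format, with example.com standing in for a new site; rules are
> >> tried in order, "+" includes and "-" excludes:)
> >>
> >> # allow only the newly added site
> >> +^http://(www\.)?example\.com/
> >> # skip everything else
> >> -.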
> >>
> >> Thanks,
> >>
> >> Dean
> >>
> >> ----- Original Message -----
> >> From: "Lawrence Pitcher" <[EMAIL PROTECTED]>
> >> To: <[email protected]>
> >> Sent: Thursday, November 10, 2005 5:05 PM
> >> Subject: How to add only new urls to DB
> >>
> >>
> >> Hi,
> >>
> >> Thanks to all for the best search solution available.
> >>
> >> I have installed the software, indexed 15,000 websites, and tested
> >> the search, and it works great!
> >>
> >> I wanted to add only two more websites, so I made a "newurls.txt"
> >> file, injected it into the WebDB ("bin/nutch inject db/ -urlfile
> >> newurls.txt"), then generated a new segment ("bin/nutch generate
> >> db/ segments/"). I then checked for the new segment name in the
> >> segments/ directory.
> >>
> >> I took that new segment name and used it in the fetch command:
> >> "bin/nutch fetch segments/20051110103316/".
> >>
> >> However, it appears to re-fetch all 15,000 webpages along with the
> >> webpages from newurls.txt.
> >>
> >> Can I not just fetch and index only the new urls and then update
> >> the DB?
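> >>
> >> (What I had in mind is something like the following -- an
> >> untested guess, reusing the segment name from above and the
> >> per-segment updatedb/index commands seen elsewhere in this
> >> thread:)
> >>
> >>       bin/nutch updatedb db/ segments/20051110103316/
> >>       bin/nutch index segments/20051110103316/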
> >>
> >> Sorry for such a lame question, but I have just started.
> >>
> >> Many thanks to all.
> >> Lawrence
> >>
> >>
> >
>
> ---------------------------------------------------------------
> company:  http://www.media-style.com
> forum:    http://www.text-mining.org
> blog:     http://www.find23.net
