Ok, this seems to be the correct behaviour... Let me approach the problem
from another perspective: can I remove the urls that are in DB_UNFETCHED
status before injecting the new url?
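
For reference, here is a minimal sketch of the temporary-webdb workaround
Scott outlines further down in this thread: inject the new urls into a
throwaway webdb, generate and fetch from it, then fold the segment back into
the main webdb. The paths and file names below are only placeholders, and the
commands assume the 0.7-style local tools used elsewhere in the thread:

      # throwaway webdb that only ever sees the new urls (placeholder paths)
      tmpdb=crawl/tmpdb
      dbmain=crawl/db            # the complete, long-lived webdb
      segments=crawl/segments

      rm -rf $tmpdb
      mkdir -p $tmpdb
      bin/nutch admin $tmpdb -create
      bin/nutch inject $tmpdb -urlfile newurls.txt

      # the fetchlist is built from the throwaway db, so it lists only the new urls
      bin/nutch generate $tmpdb $segments
      s=`ls -d $segments/* | tail -1`
      bin/nutch fetch $s
      bin/nutch index $s

      # merge the fetched segment into the main webdb so it stays complete
      bin/nutch updatedb $dbmain $s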

Enrico

On 2/13/06, Gal Nitzan <[EMAIL PROTECTED]> wrote:
> No, because generate looks in the web db (crawldb) for the links whose
> status is DB_UNFETCHED; it doesn't know which of those were just injected...
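>
> (As a quick sanity check you can dump the webdb before generating, to see
> which entries are sitting there waiting to be fetched. A sketch, assuming
> the 0.7-era readdb/WebDBReader tool; the option names differ in newer
> builds that read a crawldb instead:)
>
>       # overall page/link counts for the db (placeholder path)
>       bin/nutch readdb crawl/db -stats
>       # dump every page entry so the leftover unfetched urls are visible
>       bin/nutch readdb crawl/db -dumppageurl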
>
>
> On Mon, 2006-02-13 at 16:52 +0100, Enrico Triolo wrote:
> > > ...
> > > In general, if you inject a set of urls into a webdb and create a new
> > > segment, the segment should contain only the new urls plus pages that
> > > are older than 30 days and therefore due to be fetched again anyway.
> >
> > Actually, it seems to me that generated segments also contain urls that
> > were left in DB_UNFETCHED status by the latest fetching job.
> >
> > I mean, if I inject a url and set a fetching depth of 1, at the end
> > of the process the webdb will contain 1 url in DB_FETCHED status and n
> > urls in DB_UNFETCHED status (where n is the number of outgoing links of
> > the injected url).
> > If I then inject another url and generate a new segment, that segment
> > will contain the new url itself plus the n urls from the previous
> > iteration...
> > Is there a way to instruct nutch to fetch only the injected url?
> >
> > Thanks,
> > Enrico
> >
> > > On 08.02.2006, at 14:56, Scott Owens wrote:
> > >
> > > > Hi All,
> > > >
> > > > I wanted to check in to see if anyone has found an answer for this
> > > > issue.  I am injecting new URLs on a daily basis and only need to
> > > > fetch/index those new ones, but I obviously need to maintain a
> > > > complete webdb.
> > > >
> > > > One thing I was thinking of was to use a temporary webdb for the initial
> > > > injection, then update (updatedb) my primary webdb after the fetch
> > > > or after indexing.
> > > >
> > > > # prepare dirs and inject urls into the temporary webdb ($db);
> > > > # $dbmain below is the long-lived primary webdb
> > > >       rm -rf $db/*
> > > >       $nutch admin -local $db -create
> > > >       $nutch inject -local $db -urlfile $urlFile
> > > >
> > > >       echo -e "\nGenerating next segment to fetch"
> > > >       $nutch generate -local $db $segmentdir $fetchLimit
> > > >       s=`ls -d $segmentdir/* | tail -1`
> > > >       echo -e "\nFetching next segment"
> > > >       $nutch fetch $s
> > > >       # update and analyze the primary webdb with the fetched segment
> > > >       echo -e "\nUpdating web database"
> > > >       $nutch updatedb $dbmain $s
> > > >       echo -e "\nAnalyzing links"
> > > >       $nutch analyze $dbmain 5
> > > >
> > > > OR, after the segments are indexed -- since the above method wouldn't
> > > > allow a depth greater than 1?
> > > >
> > > > # prepare dirs and inject urls; here $db again acts as the temporary
> > > > # webdb that drives the crawl loop
> > > >       rm -rf $db/*
> > > >       $nutch admin -local $db -create
> > > >       $nutch inject -local $db -urlfile $urlFile
> > > >
> > > > # crawl $depth levels deep, expanding only the temporary webdb
> > > > for i in `seq $depth`
> > > > do
> > > >       echo -e "\nGenerating next segment to fetch"
> > > >       $nutch generate -local $db $segmentdir $fetchLimit
> > > >       s=`ls -d $segmentdir/* | tail -1`
> > > >       echo -e "\nFetching next segment"
> > > >       $nutch fetch $s
> > > >       echo -e "\nUpdating web database"
> > > >       $nutch updatedb $db $s
> > > >       echo -e "\nAnalyzing links"
> > > >       $nutch analyze $db 5
> > > > done
> > > >
> > > > echo -e "\nFetch done"
> > > > echo "Indexing segments"
> > > >
> > > > for s in `ls -1d $segmentdir/*`
> > > > do
> > > >       $nutch index $s
> > > >       # merge each fetched segment into the primary webdb
> > > >       echo -e "\nUpdating web database"
> > > >       $nutch updatedb $dbmain $s
> > > > done
> > > >
> > > >
> > > > OR maybe I have no idea what I'm talking about : ) - I'm not a
> > > > developer, just trying to figure things out.
> > > >
> > > > If anyone has experience with this and some advice, I'm all ears.
> > > > Thanks!
> > > >
> > > > Scott
> > > >
> > > > On 11/10/05, Dean Elwood <[EMAIL PROTECTED]> wrote:
> > > >> Hi Lawrence,
> > > >>
> > > >> I'm stuck in the same position. I haven't yet examined the "merge"
> > > >> function, which might shed some light on it.
> > > >>
> > > >> Have you managed to discover anything so far?
> > > >>
> > > >>>> You can use the regular expression based url filter. Then only
> > > >>>> urls that match the pattern will be added to a fetch list.<<
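> > > >>
> > > >> (For what it's worth, a sketch of the kind of filter file Stefan is
> > > >> referring to; the exact file name depends on which filter plugin your
> > > >> configuration uses -- crawl-urlfilter.txt for the crawl tool,
> > > >> regex-urlfilter.txt for the regex-urlfilter plugin -- and the hosts
> > > >> below are only placeholders:)
> > > >>
> > > >> # overwrite the filter so that only the newly added sites pass;
> > > >> # in practice you would edit the existing file rather than replace it
> > > >> cat > conf/crawl-urlfilter.txt <<'EOF'
> > > >> # accept urls from the new hosts (hypothetical examples)
> > > >> +^http://www\.new-site-one\.com/
> > > >> +^http://www\.new-site-two\.com/
> > > >> # reject everything else
> > > >> -.
> > > >> EOF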
> > > >>
> > > >> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
> > > >>
> > > >> The trick, and the question, is how to add that to an existing
> > > >> database and then re-index without doing a full re-crawl.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Dean
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Lawrence Pitcher" <[EMAIL PROTECTED]>
> > > >> To: <[email protected]>
> > > >> Sent: Thursday, November 10, 2005 5:05 PM
> > > >> Subject: How to add only new urls to DB
> > > >>
> > > >>
> > > >> Hi,
> > > >>
> > > >> Thanks to all for the best search solution available.
> > > >>
> > > >> I have installed the software, indexed 15,000 websites, and tested
> > > >> the search, and it works great!
> > > >>
> > > >> I wanted to add only two more websites, so I made a "newurls.txt"
> > > >> file, injected it into the WebDB with "bin/nutch inject db/ -urlfile
> > > >> newurls.txt", generated a new segment with "bin/nutch generate db/
> > > >> segments/", and then checked for the new segment name in the
> > > >> "segments/" directory.
> > > >>
> > > >> I took that new segment name and used it in the fetch command:
> > > >> "bin/nutch fetch segments/20051110103316/"
> > > >>
> > > >> However, it appears to re-fetch all 15,000 webpages along with the
> > > >> pages from newurls.txt.
> > > >>
> > > >> Can I not fetch and index only the new urls and then update the DB?
> > > >>
> > > >> Sorry for such a lame question but I have just started.
> > > >>
> > > >> Many thanks to all.
> > > >> Lawrence
> > > >>
> > > >>
> > > >
> > >
> > > ---------------------------------------------------------------
> > > company:  http://www.media-style.com
> > > forum:    http://www.text-mining.org
> > > blog:     http://www.find23.net
