I normally use a simple trick in such situations.
I create a new, empty db, inject the URLs, generate my segment, and fetch that segment. Then I inject the URLs a second time into my original db and update the original db with the segment.
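The steps above could be scripted roughly as follows. This is only a sketch: it assumes the Nutch 0.8 command-line tools (inject, generate, fetch, updatedb) take the arguments shown, and the directory names, the NUTCH and DRY_RUN variables, and the function name are hypothetical. Because the side db holds nothing but the new URLs, the segment generated from it contains only those URLs. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the side-crawldb workaround. All paths and names are assumptions.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
NUTCH="${NUTCH:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# usage: inject_via_side_db <original_crawldb> <url_dir> <scratch_dir>
inject_via_side_db() {
  main_db="$1"
  url_dir="$2"
  tmp_dir="$3"

  # 1. Inject the new URLs into a fresh, empty crawldb.
  run "$NUTCH" inject "$tmp_dir/crawldb" "$url_dir"

  # 2. Generate a segment from the side db; it holds only the new URLs.
  run "$NUTCH" generate "$tmp_dir/crawldb" "$tmp_dir/segments"

  # 3. Fetch that segment. In a dry run no segment exists yet, so a
  #    placeholder name is used instead of the real timestamped directory.
  if [ "$DRY_RUN" = "1" ]; then
    segment="$tmp_dir/segments/SEGMENT"
  else
    segment=$(ls -d "$tmp_dir"/segments/* | tail -1)
  fi
  run "$NUTCH" fetch "$segment"

  # 4. Inject the same URLs into the original crawldb.
  run "$NUTCH" inject "$main_db" "$url_dir"

  # 5. Update the original crawldb with the fetched segment.
  run "$NUTCH" updatedb "$main_db" "$segment"
}
```

For example, `inject_via_side_db crawl/crawldb new_urls tmp_crawl` would print the five commands in order; setting DRY_RUN=0 would execute them instead.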

Stefan

On 12.02.2006 at 18:11, Chris Schneider wrote:

Nutch colleagues,

I'm wondering how you inject new URLs into an existing MapReduce crawldb in a way that guarantees they'll end up on the next fetch list. The db.score.injected property used to control the score of newly injected URLs, but I don't see that getting loaded anywhere in the Nutch 0.8 code. It looks like the score of the CrawlDatum added by Injector.java will just be 1.0. Since that's the minimum score in the 45M unfetched pages I currently have in my crawldb, it doesn't seem likely that the 21 new URLs I'd like to inject will end up in the topN=500K URLs in the first fetch list.

Of course, I could just modify the code to honor the db.score.injected property, then set it to something like 2.0. However, I'm not sure I want to do this either. I'm guessing that would also bias the minimum score for all of the pages I get to from these new URLs.

What I'd like is to put this injection set on roughly equal footing with my original injection set. Thus, it seems like the proper way to handle this is to mark the injected URLs in some way that ensures that the Generator will put them on the first fetch list.

However, I'm probably missing something important here.

Ideas?

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------


---------------------------------------------
George Orwel was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
