I normally use a simple trick in such situations.
I create a new, empty db, inject the URLs into it, generate a segment
from it, and fetch that segment.
Then I inject the URLs a second time, into my original db, and update
the original db with the fetched segment.
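
The steps above can be sketched as a short script that only assembles
the command lines. This is a minimal sketch, assuming the Nutch 0.8
command-line tools take their arguments as `inject <crawldb> <url_dir>`,
`generate <crawldb> <segments_dir>`, `fetch <segment>`, and
`updatedb <crawldb> <segment>`; check the usage output of `bin/nutch`
before relying on it. The function names and all paths are hypothetical,
and `newest_segment` assumes the segments live on the local filesystem.

```python
from pathlib import Path


def commands_before_fetch(url_dir, tmp_db, segments_dir, nutch="bin/nutch"):
    """Steps 1-2: inject the new URLs into an empty db, then generate a segment."""
    return [
        [nutch, "inject", tmp_db, url_dir],
        [nutch, "generate", tmp_db, segments_dir],
    ]


def newest_segment(segments_dir):
    """Pick the latest segment directory (names are assumed to be timestamps)."""
    segments = [p for p in Path(segments_dir).iterdir() if p.is_dir()]
    return str(max(segments, key=lambda p: p.name))


def commands_after_generate(url_dir, main_db, segment, nutch="bin/nutch"):
    """Steps 3-5: fetch the segment, inject into the original db, update it."""
    return [
        [nutch, "fetch", segment],
        [nutch, "inject", main_db, url_dir],
        [nutch, "updatedb", main_db, segment],
    ]
```

Each inner list can be handed to `subprocess.run(cmd, check=True)`. The
segment path has to be looked up after the generate step, because the
segment directory name is only known once it has been created.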
Stefan
On 12.02.2006 at 18:11, Chris Schneider wrote:
Nutch colleagues,
I'm wondering how you inject new URLs into an existing MapReduce
crawldb in a way that guarantees they'll end up on the next fetch
list. The db.score.injected property used to control the score of
newly injected URLs, but I don't see that getting loaded anywhere
in the Nutch 0.8 code. It looks like the score of the CrawlDatum
added by Injector.java will just be 1.0. Since that's the minimum
score in the 45M unfetched pages I currently have in my crawldb, it
doesn't seem likely that the 21 new URLs I'd like to inject will
end up in the topN=500K URLs in the first fetch list.
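
The concern above can be made concrete with a rough back-of-the-envelope
model. This is only a sketch: it assumes the Generator fills the fetch
list strictly by descending score and that ties at the minimum score are
broken uniformly at random, neither of which is confirmed here. The
function name and its parameters are hypothetical.

```python
def expected_injected_selected(top_n, n_above_min, n_at_min, n_injected):
    """Expected number of injected URLs on the fetch list under the toy model.

    n_at_min counts every URL tied at the minimum score, injected ones included.
    """
    if n_injected > n_at_min:
        raise ValueError("injected URLs must be part of the tied pool")
    if n_at_min == 0:
        return 0.0
    # Slots that remain once all higher-scoring URLs have been taken.
    slots_left = max(0, top_n - n_above_min)
    # Each tied URL is equally likely to get one of the remaining slots.
    fraction = min(1.0, slots_left / n_at_min)
    return n_injected * fraction


# Most favorable case for the new URLs: all 45M unfetched pages tie at 1.0.
best_case = expected_injected_selected(500_000, 0, 45_000_000 + 21, 21)
```

Under these assumptions, even the most favorable case puts well under
one of the 21 new URLs on the first fetch list on average, and any pages
scoring above 1.0 only lower that figure.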
Of course, I could just modify the code to honor the
db.score.injected property, then set it to something like 2.0.
However, I'm not sure I want to do this either. I'm guessing that
would also bias the minimum score for all of the pages I get to
from these new URLs.
What I'd like is to put this injection set on roughly equal footing
with my original injection set. Thus, it seems like the proper way
to handle this is to mark the injected URLs in some way that
ensures that the Generator will put them on the first fetch list.
However, I'm probably missing something important here.
Ideas?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
---------------------------------------------
George Orwell was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general