Nutch colleagues,
I'm wondering how you inject new URLs into an existing MapReduce
crawldb in a way that guarantees they'll end up on the next fetch
list. The db.score.injected property used to control the score of
newly injected URLs, but I don't see it being read anywhere in
the Nutch 0.8 code. It looks like the score of the CrawlDatum added
by Injector.java will just be 1.0. Since that's the minimum score in
the 45M unfetched pages I currently have in my crawldb, it doesn't
seem likely that the 21 new URLs I'd like to inject will end up in
the topN=500K URLs in the first fetch list.
Of course, I could just modify the code to honor the
db.score.injected property, then set it to something like 2.0.
However, I'm not sure I want to do this either. I'm guessing that
would also skew the scores of all of the pages I reach from
these new URLs.
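
To make the first option concrete, here is a minimal sketch of what "honoring the property" might look like. It is not the Injector code: java.util.Properties stands in for the Nutch configuration object, and the class and method names are hypothetical. Only the property name and the 1.0 fallback come from what I describe above.

```java
import java.util.Properties;

// Hypothetical stand-in for the lookup an injector could perform.
// java.util.Properties is used here in place of the real Nutch config (assumption).
public class InjectedScoreSketch {
    // Property name from the message; 1.0 mirrors the score currently observed.
    static final String INJECTED_SCORE_KEY = "db.score.injected";
    static final float DEFAULT_INJECTED_SCORE = 1.0f;

    // Returns the configured score for newly injected URLs,
    // falling back to the default when the property is absent or malformed.
    static float injectedScore(Properties conf) {
        String raw = conf.getProperty(INJECTED_SCORE_KEY);
        if (raw == null) {
            return DEFAULT_INJECTED_SCORE;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_INJECTED_SCORE;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(INJECTED_SCORE_KEY, "2.0");
        System.out.println("injected score = " + injectedScore(conf));
    }
}
```
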
What I'd like is to put this injection set on roughly equal footing
with my original injection set. Thus, it seems like the proper way to
handle this is to mark the injected URLs in some way that ensures
that the Generator will put them on the first fetch list.
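
A rough sketch of the marking idea, just to show the ordering I have in mind. None of this is the actual Generator: the Entry class, the newlyInjected flag, and selectTopN are all hypothetical, and the real selection runs as a MapReduce job rather than an in-memory sort. The point is that marked URLs sort ahead of everything else while their scores stay untouched.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model of fetch-list selection (not Nutch code).
public class MarkedInjectionSketch {
    static final class Entry {
        final String url;
        final float score;
        final boolean newlyInjected; // assumed marker set at injection time

        Entry(String url, float score, boolean newlyInjected) {
            this.url = url;
            this.score = score;
            this.newlyInjected = newlyInjected;
        }
    }

    // Picks up to topN entries: marked entries first, then by descending score.
    static List<Entry> selectTopN(List<Entry> unfetched, int topN) {
        List<Entry> sorted = new ArrayList<>(unfetched);
        sorted.sort(Comparator
            // false sorts before true, so negating the flag puts marked entries first
            .comparing((Entry e) -> !e.newlyInjected)
            .thenComparing(Comparator.comparingDouble((Entry e) -> e.score).reversed()));
        return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
    }

    public static void main(String[] args) {
        List<Entry> db = new ArrayList<>();
        db.add(new Entry("http://old.example/a", 3.0f, false));
        db.add(new Entry("http://new.example/b", 1.0f, true));
        db.add(new Entry("http://old.example/c", 2.0f, false));
        for (Entry e : selectTopN(db, 2)) {
            System.out.println(e.url + " score=" + e.score);
        }
    }
}
```

Because the score itself is never changed here, pages reached from the new URLs would presumably be scored the same way as pages reached from the original injection set.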
However, I'm probably missing something important here.
Ideas?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general