Wasn't there some code to force the expiration? I thought I saw something on the list.
Does anyone have a better archive of this list other than SourceForge? Search on SourceForge really stinks :)

Thanks for the input, Doug. I'm going to start refetching some of this material and make the recommended adjustments. When you refer to link analysis, are you talking about skipping the analyze db step between segment fetches and just using what comes in via the fetch/inject process?

-byron

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
> > I did the entire dmoz (not a subset) and i only ran
> > the link analysis as 1 iteration (couple of times in a
> > row) and when i did new segments i did about 6-m
> > million at a time.
>
> Byron,
>
> When I look at explanations on http://www.mozdex.org/ I see very large
> document boost values, which correspond to link analysis scores. It
> appears to me that the link analysis algorithm has somehow run amok. I
> wonder if you might be better off without it.
>
> One can radically diminish the impact of link analysis scores on
> searches by setting indexer.score.power to a very small value, e.g.
> 0.01. Note that you will then have to re-index, however.
>
> Note that link analysis scores are also used to prioritize pages for
> fetching. So if you don't perform any link analysis then you'll just
> end up doing a breadth-first crawl.
>
> A final note: the pages you fetch initially don't have a good set of
> incoming anchor texts associated with them until you fetch them the
> second time. (We don't know about links we haven't seen yet.) So, when
> you initially inject the DMOZ pages it's a good idea to set
> db.default.fetch.interval to something smaller, like 7, so that these
> pages will be refreshed sooner with more complete anchor texts.
>
> According to research, searching incoming anchor text without link
> analysis provides most of the benefits of both combined. So it really
> improves results to get good anchor texts.
>
> Doug
>
> -------------------------------------------------------
> This SF.Net email is sponsored by Sleepycat Software
> Learn developer strategies Cisco, Motorola, Ericsson & Lucent use to deliver
> higher performing products faster, at low TCO.
> http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3
> _______________________________________________
> Nutch-general mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
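
To make Doug's point about indexer.score.power concrete, here is a rough Python sketch. It assumes the document boost is the link analysis score raised to that power; that is my reading of the setting, not something confirmed in this thread. The function name and the sample scores are made up for illustration and are not taken from Nutch or from the mozdex index.

```python
# Sketch only: assumes boost = link_score ** score_power.
# document_boost and the sample scores below are hypothetical.

def document_boost(link_score: float, score_power: float) -> float:
    """Return the assumed index-time boost for a page."""
    return link_score ** score_power


if __name__ == "__main__":
    # Hypothetical link analysis scores, including a runaway one.
    scores = [1.0, 10.0, 1000.0, 100000.0]
    for power in (0.5, 0.01):
        boosts = [document_boost(s, power) for s in scores]
        # Ratio between the largest and smallest boost: how much
        # link analysis can reorder results at this setting.
        spread = max(boosts) / min(boosts)
        print(f"power={power}: boosts={[round(b, 3) for b in boosts]} "
              f"spread={spread:.3f}")
```

With a power near zero every boost ends up close to 1, so a score that has run amok can no longer dominate ranking. Since the boost would be stored at index time under this assumption, that also fits Doug's note that changing the value means re-indexing.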
