I did the entire DMOZ (not a subset), and I only ran
the link analysis as 1 iteration (a couple of times in a
row). When I did new segments, I did about 6-m
million at a time.
Byron,
When I look at explanations on http://www.mozdex.org/ I see very large document boost values, which correspond to link analysis scores. It appears to me that the link analysis algorithm has somehow run amok. I wonder if you might be better off without it.
One can radically diminish the impact of link analysis scores on searches by setting indexer.score.power to a very small value, e.g. 0.01. Note that you will then have to re-index, however.
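To illustrate why a small power flattens the scores, here is a minimal Java sketch. It assumes the document boost is computed as the link analysis score raised to indexer.score.power, which is how the parameter's description reads. The class name, method name and sample scores are made up for illustration and are not Nutch code.

```java
public class ScorePowerSketch {

    // Assumed relationship: boost = score ^ scorePower.
    // This mirrors the description of indexer.score.power; it is not the actual Nutch source.
    static double boost(double linkScore, double scorePower) {
        return Math.pow(linkScore, scorePower);
    }

    public static void main(String[] args) {
        // Hypothetical link analysis scores, from an ordinary page to a runaway one.
        double[] scores = {1.0, 100.0, 10000.0};
        // Two illustrative settings: a moderate power and the very small one suggested above.
        double[] powers = {0.5, 0.01};

        for (double power : powers) {
            System.out.println("indexer.score.power = " + power);
            for (double score : scores) {
                System.out.printf("  score %.1f -> boost %.4f%n", score, boost(score, power));
            }
        }
    }
}
```

With a power of 0.01, even a score of 10000 yields a boost of only about 1.1, so runaway scores barely affect ranking.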
Note that link analysis scores are also used to prioritize pages for fetching. So if you don't perform any link analysis then you'll just end up doing a breadth-first crawl.
A final note: the pages you fetch initially don't have a good set of incoming anchor texts associated with them until you fetch them the second time. (We don't know about links we haven't seen yet.) So, when you initially inject the DMOZ pages it's a good idea to set db.default.fetch.interval to something smaller, like 7, so that these pages will be refreshed sooner with more complete anchor texts.
According to research, searching incoming anchor text without link analysis provides most of the benefits of both combined. So it really improves results to get good anchor texts.
Doug
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
