Dear All,

Has anybody devised a fix for the "score inflation" problem described at http://wiki.apache.org/nutch/FixingOpicScoring ?

After many "generate/fetch/updatedb" cycles, the max and average scores reported by "bin/nutch readdb crawl/crawldb -stats" have grown to pretty ridiculous values:
min score: 0.0
avg score: 1.07425485E9
max score: 9.2233725E15

...and many seed URLs are ignored as a result, because they lie too low in the pecking order to have any chance of being selected within "-topN" by "bin/nutch generate" (and I refuse to inject them with db.score.injected set to 1E39...).

So my questions are:

1. Is the score inflation issue expected to be fixed soon?
2. In the meantime, is there a way to "normalize" a crawldb and/or rebuild it from the data segments in order to get rid of these scoring aberrations?

Thanks in advance,
Enzo
