Any Paul Volcker for score inflation?

Enzo Michelangeli Wed, 15 Aug 2007 18:28:07 -0700

Dear All,

Has anybody devised a fix for the "score inflation" problem mentioned at
http://wiki.apache.org/nutch/FixingOpicScoring ? After many
"generate/fetch/updatedb" iteration cycles, the max and average scores
reported by "bin/nutch readdb crawl/crawldb -stats" have grown to pretty
ridiculous values:


min score:      0.0
avg score:      1.07425485E9
max score:      9.2233725E15

...and many seed URLs are ignored as a result, because they lie too low in
the pecking order to have any chance of being selected as "-topN" by
"bin/nutch generate" (and I refuse to inject them setting db.score.injected
to 1E39...). So my questions are:

1. Is the score inflation issue expected to be fixed soon?

2. In the meantime, is there a way to "normalize" a crawldb and/or rebuild
it from the data segments in order to get rid of these scoring aberrations?
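
For illustration, one possible reading of "normalize" in question 2 is to
rescale every score by a common factor so that the average lands on the
score given to newly injected URLs. The sketch below is plain Python and
not Nutch code: it works on a hypothetical in-memory dict of URL to score
rather than on a real crawldb, rescale_scores and top_n are made-up helper
names, and the target average of 1.0 is an assumption about the injected
score.

```python
def rescale_scores(scores, target_avg=1.0):
    """Return a copy of `scores` multiplied by one common factor so that
    the mean score equals `target_avg` (assumed injected score)."""
    if not scores:
        return {}
    avg = sum(scores.values()) / len(scores)
    if avg == 0:
        # All scores are zero; nothing to rescale.
        return dict(scores)
    factor = target_avg / avg
    return {url: score * factor for url, score in scores.items()}


def top_n(scores, n):
    """Mimic a '-topN' style selection: the n highest-scoring URLs."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical inflated scores, loosely modelled on the stats above.
    inflated = {
        "http://example.org/a": 0.0,
        "http://example.org/b": 1.0e9,
        "http://example.org/c": 9.0e15,
    }
    normalized = rescale_scores(inflated)
    print(normalized)
    print(top_n(normalized, 2))
```

A uniform rescale like this keeps the relative order of the existing
entries, so it would only help URLs injected afterwards; it would not
by itself lift seeds that are already buried in the db.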

Thanks in advance,

Enzo
