Bill Goffe wrote:
Andrzej said:

Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool would perform a couple of iterations to propagate the scores along links. However, it was a slow and very resource-hungry process, so it was sometimes impossible to complete the analysis step even for moderately sized collections.

Interesting. When I invoked this with "bin/nutch analyze db_dir 3" (three
rounds of analysis), it took about 35 minutes with some 300,000 pages on a
dual Xeon machine with 3 GB of RAM. This is a small share of the time spent
fetching, generating segments, etc.

300,000 is a relatively small database. With DBs of around 10-20 million docs, this analyze step can take literally days and consume hundreds of GB of disk space.

0.7 also offers an option to use a static ranking method, which doesn't
require running the analysis step and which is based on the number of
outlinks and inlinks.

Um, it isn't clear how to do this. I don't see anything in
http://wiki.apache.org/nutch/CommandLineOptions or in nutch-default.xml.

It's not a command-line option. This is documented in nutch-default.xml under "fetchlist.score.by.link.count" and "indexer.boost.by.link.count". There was a discussion about this on the mailing list about a year ago; search the archives for "link analysis".


P.S. Any thoughts on how to downplay repeated instances of a word on a page?


You should implement your own Similarity and override idf(Term term, Searcher searcher); please see the Lucene javadoc for details. If searcher.docFreq(term) > threshold, you cap it at a fixed value, or even reduce the score factor. Be careful not to penalize common words, which may be very frequent for legitimate reasons (e.g. the stopwords).
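
To make the capping idea concrete, here is a minimal sketch of the logic only. It is a standalone class so that it runs without Lucene on the classpath; it does not extend Similarity. The class name, the threshold and the cap value are hypothetical. The formula mirrors the one used by Lucene's DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1. In a real override of idf(Term term, Searcher searcher), the two inputs would presumably come from searcher.docFreq(term) and searcher.maxDoc().

```java
// Standalone sketch of a capped idf; NOT a Lucene Similarity subclass.
// All names and values here are illustrative assumptions.
final class CappedIdfSketch {
    // Hypothetical threshold: document frequency above which the cap applies.
    private final int docFreqThreshold;
    // Hypothetical upper bound for the idf of very frequent terms.
    private final float idfCap;

    CappedIdfSketch(int docFreqThreshold, float idfCap) {
        this.docFreqThreshold = docFreqThreshold;
        this.idfCap = idfCap;
    }

    // Same shape as the DefaultSimilarity formula: log(numDocs / (docFreq + 1)) + 1.
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Returns the default idf, limited to idfCap once docFreq exceeds the threshold.
    float idf(int docFreq, int numDocs) {
        float idf = defaultIdf(docFreq, numDocs);
        if (docFreq > docFreqThreshold) {
            return Math.min(idf, idfCap);
        }
        return idf;
    }

    public static void main(String[] args) {
        CappedIdfSketch sim = new CappedIdfSketch(1000, 1.0f);
        // Term found in few documents: below the threshold, default idf is kept.
        System.out.println(sim.idf(10, 300000));
        // Term found in many documents: above the threshold, idf is capped.
        System.out.println(sim.idf(50000, 300000));
    }
}
```

Note that idf reflects how many documents contain a term. Repetition within a single page is what Lucene's tf(float freq) factor measures, so that method may also be worth a look for this particular question.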

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

