Bill Goffe wrote:
Andrzej said:
Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool
would perform a couple of iterations to propagate the scores along links.
However, it was a slow and very resource-hungry process, so sometimes it
was impossible to complete the analysis step even for moderately sized
collections.
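
For readers unfamiliar with the idea, here is a minimal sketch of
iterative score propagation along links. It is textbook PageRank on an
in-memory graph, not the variant Nutch 0.7 implements; the class name,
the adjacency-array layout and the 0.85 damping factor are all
assumptions made for illustration.

```java
import java.util.Arrays;

public class PageRankSketch {

    /**
     * Runs a fixed number of PageRank iterations.
     * links[i] holds the indices of the pages that page i links to.
     */
    public static double[] rank(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] scores = new double[n];
        Arrays.fill(scores, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // score held by pages with no outlinks

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += scores[i];
                    continue;
                }
                // Each page splits its current score evenly among its outlinks.
                double share = scores[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }

            for (int i = 0; i < n; i++) {
                // Dangling score is spread uniformly so the total stays 1.
                next[i] = (1.0 - damping) / n
                        + damping * (next[i] + dangling / n);
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical 4-page graph: 0 -> 1,2; 1 -> 2; 2 -> 0; 3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        double[] scores = rank(links, 3, 0.85);
        for (int i = 0; i < scores.length; i++) {
            System.out.printf("page %d: %.4f%n", i, scores[i]);
        }
    }
}
```

Every iteration touches every link once, which hints at why the step
becomes expensive when the link graph no longer fits in memory.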
Interesting. When invoked with "bin/nutch analyze db_dir 3" (three
rounds of analysis) it took about 35 minutes with some 300,000 pages on a
dual Xeon machine with 3 gigs of RAM. That is a small share of the time
spent fetching, generating segments, etc.
300,000 is a relatively small database. With DBs of around 10-20 million
docs this analyze step can take literally days, and consume hundreds of
GBs of disk space.
0.7 also offers an option to use a static ranking method, which doesn't
require running the analysis step, and which is based on the number of
outlinks and inlinks.
Um, it isn't clear how to do this. I don't see anything in
http://wiki.apache.org/nutch/CommandLineOptions nor nutch-default.xml.
It's not a command-line option. This is documented in nutch-default.xml
under "fetchlist.score.by.link.count" and "indexer.boost.by.link.count".
There was a discussion about this on the mailing list about a year ago;
search the archives for "link analysis".
P.S. Any thoughts on how to downplay repeated instances of a word on
a page?
You should implement your own Similarity, and override idf(Term term,
Searcher searcher) - please see the Lucene javadoc for details. If
searcher.docFreq(term) > threshold, you cap the result at a fixed value,
or even reduce the score factor. Be careful not to penalize common words,
which may be very frequent for legitimate reasons (e.g. the stopwords).
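
The capping logic could be sketched roughly as below. This is
standalone arithmetic rather than a real Similarity subclass, so it
compiles without Lucene on the classpath. The class name, the
threshold and the capped value are hypothetical, and defaultIdf only
mirrors the formula that, as far as I know, DefaultSimilarity uses
(log(numDocs / (docFreq + 1)) + 1).

```java
public class CappedIdfSketch {

    // Hypothetical tuning values; pick them from your own collection stats.
    public static final int DOC_FREQ_THRESHOLD = 1000;
    public static final float CAPPED_IDF = 1.0f;

    /** Assumed to mirror DefaultSimilarity.idf(int docFreq, int numDocs). */
    public static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * Returns a fixed value once a term appears in more than
     * DOC_FREQ_THRESHOLD documents, and the default idf otherwise.
     */
    public static float cappedIdf(int docFreq, int numDocs) {
        if (docFreq > DOC_FREQ_THRESHOLD) {
            return CAPPED_IDF;
        }
        return defaultIdf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 300000;
        System.out.println("rare term:   " + cappedIdf(10, numDocs));
        System.out.println("common term: " + cappedIdf(5000, numDocs));
    }
}
```

In an actual Similarity subclass, docFreq would come from
searcher.docFreq(term) and numDocs from searcher.maxDoc() inside the
overridden idf(Term, Searcher). A hard cutoff like this makes the score
jump at the threshold, so a smoother reduction may be preferable in
practice.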
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com