Re: Boost for Occurances in a Page / Analyze > Once?

Andrzej Bialecki Fri, 22 Sep 2006 03:16:06 -0700

Bill Goffe wrote:

Andrzej said:
Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool
would perform a couple of iterations to propagate the scores along links.
However, it was a slow and very resource-hungry process, so sometimes it
was impossible to get through the analysis step even for
moderately-sized collections.
Interesting. If this is invoked with "bin/nutch analyze db_dir 3" (three
rounds of analysis) it took about 35 minutes with some 300,000 pages on a
dual Xeon machine with 3 gigs of RAM. This is a small share of time spent
fetching, generating segments, etc.

300,000 is a relatively small database. With DBs of around 10-20 million
docs, this analyze step can take literally days and consume hundreds of
GB of disk space.

0.7 also offers an option to use a static ranking method, which doesn't
require running the analysis step, and which is based on the number of
outlinks and inlinks.


Um, it isn't clear how to do this. I don't see anything in
http://wiki.apache.org/nutch/CommandLineOptions nor nutch-default.xml.

It's not a command-line option. This is documented in nutch-default.xml
under "fetchlist.score.by.link.count" and "indexer.boost.by.link.count".
There was a discussion about this on the mailing list about a year ago;
search the archives for "link analysis".
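
To illustrate what a static, link-count-based score could look like, here is a minimal sketch. The class name, the method name, and the logarithmic formula are assumptions made for illustration only; they are not taken from the Nutch source, so check the property descriptions in nutch-default.xml for what 0.7 actually does.

```java
// Hypothetical sketch of a static boost derived from a page's inlink count.
// It needs no iterative analysis pass: the score depends only on a count
// that is already known for each page.
final class LinkCountBoost {

    // Assumed formula: log(e + inlinks). A page with no inlinks gets a
    // neutral boost of about 1.0, and the boost grows slowly as the count
    // increases, so heavily linked pages do not swamp everything else.
    static float boost(int inlinkCount) {
        if (inlinkCount < 0) {
            throw new IllegalArgumentException("inlinkCount must be >= 0");
        }
        return (float) Math.log(Math.E + inlinkCount);
    }

    public static void main(String[] args) {
        int[] counts = {0, 10, 100, 1000};
        for (int c : counts) {
            System.out.println(c + " inlinks -> boost " + boost(c));
        }
    }
}
```

The logarithm is the design choice worth noting: a linear boost would let a handful of very popular pages dominate every result list.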

P.S. Any thoughts on how to downplay repeated instances of a word on
a page?

You should implement your own Similarity, and override idf(Term term,
Searcher searcher) - please see Lucene javadoc for details. If
searcher.docFreq(term) > threshold you cap it at a fixed value, or even
reduce the score factor. Be careful not to penalize common words, which
may be very frequent for legitimate reasons (e.g. the stopwords).
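
The capping logic can be sketched on its own, without Lucene on the classpath. This is only an illustration under stated assumptions: the threshold and the fixed value are made-up numbers, the class and method names are hypothetical, and the uncapped formula is the log(numDocs / (docFreq + 1)) + 1 form that, as far as I recall, DefaultSimilarity uses. In a real Similarity subclass the same branch would go inside the overridden idf method, with docFreq taken from searcher.docFreq(term).

```java
// Standalone sketch of a capped idf factor. It does not extend Lucene's
// Similarity, so it compiles and runs without any external library.
final class CappedIdf {

    // Hypothetical cutoff: terms found in more documents than this are capped.
    static final int DOC_FREQ_THRESHOLD = 1000;

    // Hypothetical fixed value returned for terms above the cutoff.
    static final float CAPPED_IDF = 1.0f;

    // Uncapped idf, assumed to match the classic Lucene default formula.
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Terms above the threshold get the fixed value; all others keep
    // their normal idf.
    static float idf(int docFreq, int numDocs) {
        if (docFreq > DOC_FREQ_THRESHOLD) {
            return CAPPED_IDF;
        }
        return defaultIdf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 300000;
        System.out.println("rare term:   " + idf(10, numDocs));
        System.out.println("common term: " + idf(5000, numDocs));
    }
}
```

A hard cap like this introduces a jump in the score at the threshold. If that matters, scaling the factor down gradually above the threshold is the gentler variant of "reduce the score factor".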


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Boost for Occurances in a Page / Analyze > Once?
