1. Nutch follows the links inside HTML pages, so starting from a few seed pages it can crawl the whole link graph of a set of web pages.
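To make the link-following idea concrete, here is a minimal sketch of a breadth-first crawl over a link graph. This is not Nutch's code: the `LINKS` dictionary, the page names, and the `crawl` function are all made up for illustration, and a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the links found in it.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
    "orphan.html": ["a.html"],  # links out, but nothing links to it
}

def crawl(seed, links):
    """Return pages in the order a breadth-first crawl from `seed` visits them."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Follow every outgoing link that has not been queued yet.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl("a.html", LINKS))
```

Starting from `a.html`, the sketch reaches `b.html`, `c.html`, and `d.html`, but never `orphan.html`, because no crawled page links to it. Only pages reachable through links from the seeds get crawled.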
In addition, I think Nutch has a PageRank-like scoring function, whereas Lucene/Solr scoring is based on the vector space model.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html