On Wed, Aug 19, 2009 at 12:13 PM,  <[email protected]> wrote:
>
> Hi,
>
> I have read a few tutorials on running Nutch to crawl the web. However, I
> still do not understand the meaning of the topN variable in the crawl
> command. In the tutorials it is suggested to create 3 segments and fetch
> them with topN=1000. What if I create 100 segments, or only one? What would
> be the difference? My goal is to index the urls I have in my seed file and
> nothing more.
>
My understanding of "topN" is that it interacts with the depth to help you
keep crawling "interesting" areas. So suppose I have a depth of 3 and a topN
of, let's say, 100 (just to keep the math easy), every page I go to has 20
outlinks, and I have 10 pages listed in my seed list. This is my
understanding from reading the documentation and watching what happens, not
from reading the code, so I could be all wrong. Hopefully someone corrects
any details I have wrong:

depth 0: 10 pages fetched, 10 * 20 = 200 pending links to be fetched.
depth 1: because I have a topN of 100, of the 200 links I have, it will pick
the 100 most interesting (using whatever scoring algorithm is configured; I
believe it is OPIC by default) and fetch those.
depth 2: 100 pages fetched, 100 + 100 * 20 = 2100 pages to fetch (100
existing, plus 100 pages with 20 outlinks each).
depth 3: 100 pages fetched, 2000 + 100 * 20 = 4000 pages to fetch (2000
existing pages, plus 100 pages with 20 outlinks each).

(Note: this analysis assumes all the links are unique, which is highly
unlikely.)

I believe the point is to not force you to exhaustively fetch every pending
link at each depth. Note that the algorithm might still not have fetched all
of the pending links from depth 0 by depth 3 (or depth 100, for that matter).
If they were deemed less interesting than other links, they could sit in the
queue effectively forever.

I view it as a latency vs. throughput trade-off: how much effort are you
willing to spend to always fetch _the most_ interesting page next? Evaluating
and maintaining that ordering is expensive. So queue the topN most
interesting links you know about now, and process that batch without
re-evaluating "interesting" as new information is gathered that would change
the ordering.

I also believe that topN * depth is an upper bound on the number of pages you
will fetch during a crawl.

However, take all this with a grain of salt. I haven't read the code closely;
this was gleaned while tracking down why some pages I expected to be fetched
were not, reading the documentation, and modifying the topN parameter to fix
my issues.

Thanks,
Kirby

> Thanks.
> Alex.
>
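
P.S. In case it is useful, here is a little back-of-the-envelope sketch (plain
Python, not anything read out of the Nutch code) of the bookkeeping I
described above. The fixed 20 unique outlinks per page, the absence of
duplicate links, and the round counting are all my own simplifying
assumptions; roughly, it models what a one-shot
"bin/nutch crawl urls -dir crawl -depth 3 -topN 100" run would drive, if I
remember the crawl command's flags right.

    # Toy model of generate/fetch rounds with a topN cap. Assumes every
    # fetched page yields exactly 20 unique, never-seen-before outlinks.
    SEEDS = 10      # pages in the seed list
    OUTLINKS = 20   # outlinks per fetched page (assumed uniform)
    TOP_N = 100     # cap on pages generated/fetched per round
    DEPTH = 3       # rounds after the seed round (depth 0)

    pool = SEEDS          # depth 0: only the seed list is available
    fetched_total = 0
    for depth in range(DEPTH + 1):
        fetched = min(TOP_N, pool)   # topN caps each generate/fetch round
        fetched_total += fetched
        print(f"depth {depth}: {fetched} fetched, {pool} available to fetch")
        # links left unpicked, plus outlinks from the pages just fetched
        pool = (pool - fetched) + fetched * OUTLINKS

    print(f"total fetched: {fetched_total}")   # 310 here; roughly topN * depth

It prints 200 / 2100 / 4000 for the "available to fetch" pools at depths 1
through 3, which is where the numbers above come from, and 310 pages fetched
in total (the seed round plus topN per round).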
