Haha, awesome, +1 for being fast!
On 8/13/10 8:27 AM, "Andrzej Bialecki" <[email protected]> wrote:

Hi,

Here's a status line from a Benchmark job that I ran recently:

  0/0 threads spinwaiting 38996 pages, 1 errors, 557.1 pages/s, 16995 kb/s, 0 URLs in 2 queues > reduce

Interested in more details? :) I thought so...

* First, this is a synthetic benchmark. Target pages were produced on the fly by the 'ant proxy' fake handler, with the unlimited bandwidth of a localhost connection (the proxy was running on localhost). Fake pages were generated in a way that guaranteed that all pages and all hosts were unique, i.e. there were no outlinks to the same hosts or to the same pages anywhere in the whole run, so host-level politeness delays never kicked in.

* fetcher.parse=false, i.e. the Fetcher only stored the content. There were 100 threads running.

* There was no DNS resolving - all pages are produced by a proxy, so Nutch doesn't need to resolve names to IPs.

What do these numbers mean? They mean that if there are no pesky obstacles like host-level blocking, DNS resolution, or bandwidth limits, then the Fetcher is insanely fast. ;) That's good to know, actually - I was afraid that there was some inherent limitation in the Fetcher, due to synchronization, that prevented it from working faster than ~100 pages/s (per task). Apparently that's not the case. I'll try to set up other benchmarks that introduce some of the above factors, to see which of them has an undue impact on performance.

Oh, btw - the above benchmark was run on a 1-node Hadoop cluster, with the HBase backend, with the following results:

10/08/13 15:52:41 INFO crawl.WebTableReader: Statistics for WebTable:
10/08/13 15:52:41 INFO crawl.WebTableReader: TOTAL urls: 2588551
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 0: 2588534
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 1: 17
10/08/13 15:52:41 INFO crawl.WebTableReader: min score: 0.0
10/08/13 15:52:41 INFO crawl.WebTableReader: avg score: 1.2037623E-6
10/08/13 15:52:41 INFO crawl.WebTableReader: max score: 1.116
10/08/13 15:52:41 INFO crawl.WebTableReader: status 1 (status_unfetched): 2415981
10/08/13 15:52:41 INFO crawl.WebTableReader: status 2 (status_fetched): 172570
10/08/13 15:52:41 INFO crawl.WebTableReader: WebTable statistics: done

* Plugins: protocol-http|parse-tika|scoring-opic|urlfilter-regex|urlnormalizer-pass
* Seeds: 1
* Depth: 6
* Threads: 100
* TopN: 9223372036854775807 (Long.MAX_VALUE, i.e. effectively unlimited)
* TOTAL ELAPSED: 4362745 ms (~73 minutes)

Per-stage times, in milliseconds:

- stage: inject
    run 0:   23910
- stage: generate
    run 0:   24187
    run 1:   23531
    run 2:   24234
    run 3:   27095
    run 4:   36100
    run 5:  129736
- stage: fetch
    run 0:   51506
    run 1:   60231
    run 2:   72187
    run 3:   90323
    run 4:  383787
    run 5:  516860
- stage: parse
    run 0:   12205
    run 1:   12125
    run 2:   15160
    run 3:   30101
    run 4:  198388
    run 5:  850305
- stage: update
    run 0:   24297
    run 1:   24142
    run 2:   24117
    run 3:   36127
    run 4:  106444
    run 5: 1565639

Injection is nearly a no-op (I use a single seed URL), so this gives us the basic per-job Hadoop overhead of ~24 seconds.

Please note that unlike the previous benchmark results, this one uses depth 6 - for unknown reasons, even at this depth the number of collected URLs is _higher_ than at depth 7 run on branch-1.3... apparently there's something weird going on with URL accounting in trunk...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
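For anyone who wants to reproduce the flavor of this setup without the Benchmark harness, here is a minimal, self-contained sketch of the idea behind the fake page handler: a localhost HTTP server whose pages link only to brand-new, never-repeated hosts, so per-host politeness queues never block the fetcher. To be clear, this is not the actual 'ant proxy' handler from the Nutch Benchmark - the class name, the port (8181), the *.example.test hostnames, and the 10-links-per-page fan-out are all made-up placeholders; only the JDK's built-in com.sun.net.httpserver API is real.

    // Sketch only: serves synthetic HTML pages whose outlinks all point
    // to unique, never-repeated hosts (not the actual Benchmark handler).
    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.atomic.AtomicLong;

    public class FakePageServer {
      private static final AtomicLong NEXT_ID = new AtomicLong();
      private static final int OUTLINKS_PER_PAGE = 10; // assumed fan-out

      public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8181), 0);
        server.createContext("/", exchange -> {
          StringBuilder html = new StringBuilder("<html><body>\n");
          for (int i = 0; i < OUTLINKS_PER_PAGE; i++) {
            // Every outlink gets a fresh id, hence a fresh, unique host.
            long id = NEXT_ID.incrementAndGet();
            html.append("<a href=\"http://host-").append(id)
                .append(".example.test/page-").append(id)
                .append(".html\">link</a>\n");
          }
          html.append("</body></html>\n");
          byte[] body = html.toString().getBytes(StandardCharsets.UTF_8);
          exchange.getResponseHeaders().set("Content-Type", "text/html");
          exchange.sendResponseHeaders(200, body.length);
          try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
          }
        });
        server.start();
        System.out.println("Serving synthetic pages on http://localhost:8181/");
      }
    }

Run it and fetch http://localhost:8181/anything - every page comes back with ten outlinks to hosts that have never been seen before. In a setup like the one described above, such a server would be configured as an HTTP proxy (protocol-http's http.proxy.host / http.proxy.port properties), so every fetch hits it regardless of the hostname in the URL and no DNS resolution is ever needed.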
--
Chris Mattmann, Ph.D.
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
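As a footnote to the numbers quoted above, a back-of-the-envelope sketch (plain Java, nothing Nutch-specific): the inputs are exactly the figures from the status line and the stage report, and subtracting the ~24 s per-job overhead from each fetch run is a rough illustration, not a precise accounting.

    // Back-of-the-envelope arithmetic over the figures quoted above.
    public class BenchmarkMath {
      public static void main(String[] args) {
        // From the fetcher status line: 38996 pages, 557.1 pages/s, 16995 kb/s.
        double pages = 38996, pagesPerSec = 557.1, kbPerSec = 16995;
        System.out.printf("fetch wall time ~ %.0f s%n", pages / pagesPerSec);     // ~70 s
        System.out.printf("avg page size   ~ %.1f kB%n", kbPerSec / pagesPerSec); // ~30.5 kB

        // From the stage report (times in ms). Injection is nearly a no-op,
        // so its ~24 s is treated here as pure per-job Hadoop overhead.
        long overheadMs = 23910;
        long[] fetchRuns = {51506, 60231, 72187, 90323, 383787, 516860};
        for (int i = 0; i < fetchRuns.length; i++) {
          System.out.printf("fetch run %d: %6d ms total, ~%6d ms minus job overhead%n",
              i, fetchRuns[i], fetchRuns[i] - overheadMs);
        }
      }
    }

The point of the last loop: in the earliest fetch run almost half the wall time (~24 s of ~52 s) is job startup, which is why the per-job overhead matters so much at small crawl depths.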

