Hi,
Here's a status line from a Benchmark job that I ran recently:
0/0 threads spinwaiting 38996 pages, 1 errors, 557.1 pages/s, 16995 kb/s, 0 URLs in 2 queues > reduce
Interested in more details? :) I thought so...
* first, this is a synthetic benchmark. Target pages were produced on
the fly by the 'ant proxy' fake handler, over the effectively unlimited
bandwidth of a localhost connection (the proxy was running on
localhost). Fake pages were generated in a way that guaranteed that
every page and every host was unique, i.e. there were no outlinks to
the same hosts or to the same pages anywhere in the run.
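The unique-pages guarantee can be sketched like this. Note this is a hypothetical generator, not the actual 'ant proxy' handler - the counter-derived host and path names are my own invention:

```python
def fake_page(page_id, n_links=10):
    """Generate an HTML page whose outlinks all point to hosts and
    pages that no other generated page links to. Hypothetical scheme:
    every child id is derived only from this page's id, so the id
    ranges of different pages never overlap."""
    links = []
    for i in range(n_links):
        child = page_id * n_links + i + 1   # globally unique child id
        # one unique host per link -> no host is ever seen twice
        links.append('<a href="http://host-%d.example/page-%d.html">%d</a>'
                     % (child, child, child))
    return "<html><body>%s</body></html>" % " ".join(links)
```

With a scheme like this the crawl frontier grows by a factor of n_links per depth level, and per-host politeness never throttles the fetcher because no host repeats.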
* fetcher.parse=false, i.e. the Fetcher only stored the content. There
were 100 threads running.
* there was no DNS resolving - all pages were produced by the proxy, so
Nutch didn't need to resolve names to IPs.
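For reference, the two knobs above correspond to these standard properties (a nutch-site.xml fragment; values as used in this run):

```xml
<!-- nutch-site.xml (fragment) -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>Store raw content only; run parsing as a separate job.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
```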
What do these numbers mean? Well, they mean that if there are no pesky
obstacles like host-level blocking, DNS resolution or bandwidth limits,
then the Fetcher is insanely fast. ;)
That's good to know, actually - I was afraid that there was some
inherent limitation in the Fetcher, due to synchronization, that
prevented it from working faster than ~100 pages/sec (per task).
Apparently that's not the case.
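For scale, a bit of arithmetic on the figures quoted in the status line above:

```python
# Figures straight from the benchmark status line.
pages_per_s = 557.1
kb_per_s    = 16995
threads     = 100

per_thread  = pages_per_s / threads    # ~5.6 pages/s per thread
ms_per_page = 1000.0 / per_thread      # ~180 ms per page, per thread
avg_page_kb = kb_per_s / pages_per_s   # ~30.5 kB average page size
```

So each of the 100 threads turns a page around roughly every 180 ms, with no DNS or politeness delays in the way.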
I'll try to set up other benchmarks that introduce some of the above
factors, to see which of them has an undue impact on performance.
Oh, btw - the above benchmark was run on a 1-node Hadoop cluster with
the HBase backend, with the following results:
10/08/13 15:52:41 INFO crawl.WebTableReader: Statistics for WebTable:
10/08/13 15:52:41 INFO crawl.WebTableReader: TOTAL urls: 2588551
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 0: 2588534
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 1: 17
10/08/13 15:52:41 INFO crawl.WebTableReader: min score: 0.0
10/08/13 15:52:41 INFO crawl.WebTableReader: avg score: 1.2037623E-6
10/08/13 15:52:41 INFO crawl.WebTableReader: max score: 1.116
10/08/13 15:52:41 INFO crawl.WebTableReader: status 1 (status_unfetched): 2415981
10/08/13 15:52:41 INFO crawl.WebTableReader: status 2 (status_fetched): 172570
10/08/13 15:52:41 INFO crawl.WebTableReader: WebTable statistics: done
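As a quick sanity check, the WebTable counts above are internally consistent - both the status breakdown and the retry breakdown sum to the TOTAL urls figure:

```python
# Counts copied from the WebTableReader output above.
total     = 2588551
unfetched = 2415981   # status 1
fetched   = 172570    # status 2
retry0    = 2588534
retry1    = 17

assert unfetched + fetched == total
assert retry0 + retry1 == total
```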
* Plugins:
protocol-http|parse-tika|scoring-opic|urlfilter-regex|urlnormalizer-pass
* Seeds: 1
* Depth: 6
* Threads: 100
* TopN: 9223372036854775807 (Long.MAX_VALUE, i.e. effectively unlimited)
* TOTAL ELAPSED: 4362745
- stage: inject
run 0 23910
- stage: generate
run 0 24187
run 1 23531
run 2 24234
run 3 27095
run 4 36100
run 5 129736
- stage: fetch
run 0 51506
run 1 60231
run 2 72187
run 3 90323
run 4 383787
run 5 516860
- stage: parse
run 0 12205
run 1 12125
run 2 15160
run 3 30101
run 4 198388
run 5 850305
- stage: update
run 0 24297
run 1 24142
run 2 24117
run 3 36127
run 4 106444
run 5 1565639
Injection is nearly a no-op (I use a single seed URL), so its run gives
us the basic Hadoop overhead per job: ~24 seconds (the stage times
above are in milliseconds).
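The per-stage numbers (in milliseconds) do add up to the TOTAL ELAPSED figure, give or take a few ms of rounding, and the deepest round dominates the run:

```python
# Stage timings (ms) copied from the run summary above.
inject   = [23910]
generate = [24187, 23531, 24234, 27095, 36100, 129736]
fetch    = [51506, 60231, 72187, 90323, 383787, 516860]
parse    = [12205, 12125, 15160, 30101, 198388, 850305]
update   = [24297, 24142, 24117, 36127, 106444, 1565639]

total_ms = sum(inject + generate + fetch + parse + update)
# ~4362737 ms, i.e. ~72.7 minutes, vs. the reported 4362745

# the last round alone accounts for about 70% of the whole run
last_round = generate[-1] + fetch[-1] + parse[-1] + update[-1]
```

Parse and update blow up fastest at depth 5, which fits the depth-related accounting oddity mentioned below.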
Please note that unlike the previous benchmark results, this one uses
depth 6 - for unknown reasons, even at this depth the number of
collected URLs is _higher_ than in the depth-7 run on branch-1.3 ...
apparently there's something weird going on with URL accounting in
trunk...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com