Hi,

Here's a status line from a Benchmark job that I ran recently:

0/0 threads spinwaiting 38996 pages, 1 errors, 557.1 pages/s, 16995 kb/s, 0 URLs in 2 queues > reduce
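A quick sanity check on that status line (my own arithmetic, not part of the Benchmark output):

```python
# Derive the implied fetch duration and average page size from the
# figures reported in the status line above.
pages = 38996          # pages fetched
rate_pps = 557.1       # reported pages/s
rate_kbps = 16995      # reported kb/s

elapsed_s = pages / rate_pps          # implied fetch wall-clock time
kb_per_page = rate_kbps / rate_pps    # implied average page size

print(f"elapsed: ~{elapsed_s:.0f} s, avg page: ~{kb_per_page:.1f} kB")
```

So the fetch itself took only about 70 seconds, at an average of roughly 30.5 kB per (synthetic) page.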

Interested in more details? :) I thought so...

* first, this is a synthetic benchmark. Target pages were produced on the fly with the 'ant proxy' fake handler, with the unlimited bandwidth of a localhost connection (the proxy was running on localhost). Fake pages were generated in a way that guaranteed that all pages and all hosts were unique, i.e. there were no outlinks to the same hosts or to the same pages across the whole run.

* fetcher.parse=false, i.e. the Fetcher only stored the content. There were 100 threads running.

* there was no DNS resolving - all pages were produced by the proxy, so Nutch didn't need to resolve names to IPs.
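For reference, the two Fetcher settings mentioned above would look like this in nutch-site.xml (a sketch using the standard Nutch property names; adjust to your own setup):

```xml
<!-- nutch-site.xml (excerpt) - the settings used in this run -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>Fetcher only stores content; parsing runs as a separate job.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>Number of fetcher threads per task.</description>
</property>
```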

What do these numbers mean? Well, it means that if there are no pesky obstacles like host-level blocking, DNS resolution, or bandwidth limits, then the Fetcher is insanely fast. ;)

That's good to know, actually - I was afraid that there was some inherent limitation in the Fetcher, due to synchronization, that prevented it from working faster than ~100 pages/sec (per task). Apparently that's not the case.

I'll try to set up other benchmarks that introduce some of the above factors, to see which of them has an undue impact on performance.

Oh, btw - the above benchmark was run on a 1-node Hadoop cluster with the HBase backend, with the following results:

10/08/13 15:52:41 INFO crawl.WebTableReader: Statistics for WebTable:
10/08/13 15:52:41 INFO crawl.WebTableReader: TOTAL urls:        2588551
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 0:   2588534
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 1:   17
10/08/13 15:52:41 INFO crawl.WebTableReader: min score: 0.0
10/08/13 15:52:41 INFO crawl.WebTableReader: avg score: 1.2037623E-6
10/08/13 15:52:41 INFO crawl.WebTableReader: max score: 1.116
10/08/13 15:52:41 INFO crawl.WebTableReader: status 1 (status_unfetched): 2415981
10/08/13 15:52:41 INFO crawl.WebTableReader: status 2 (status_fetched): 172570
10/08/13 15:52:41 INFO crawl.WebTableReader: WebTable statistics: done
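A quick cross-check of the status counts against the reported total (again, my arithmetic, not WebTableReader output):

```python
# The two status buckets above should account for every URL in the table.
unfetched = 2415981   # status 1 (status_unfetched)
fetched = 172570      # status 2 (status_fetched)
total = 2588551       # TOTAL urls

assert unfetched + fetched == total
print(f"fetched fraction: {fetched / total:.2%}")
```

The counts add up exactly, and only about 6.7% of the discovered URLs were actually fetched - expected, given that every page links out to brand-new hosts.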
* Plugins: protocol-http|parse-tika|scoring-opic|urlfilter-regex|urlnormalizer-pass
* Seeds:        1
* Depth:        6
* Threads:      100
* TopN: 9223372036854775807 (Long.MAX_VALUE, i.e. unlimited)
* TOTAL ELAPSED:        4362745 ms (~72.7 min)
- stage: inject
        run 0   23910
- stage: generate
        run 0   24187
        run 1   23531
        run 2   24234
        run 3   27095
        run 4   36100
        run 5   129736
- stage: fetch
        run 0   51506
        run 1   60231
        run 2   72187
        run 3   90323
        run 4   383787
        run 5   516860
- stage: parse
        run 0   12205
        run 1   12125
        run 2   15160
        run 3   30101
        run 4   198388
        run 5   850305
- stage: update
        run 0   24297
        run 1   24142
        run 2   24117
        run 3   36127
        run 4   106444
        run 5   1565639
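Summing the per-stage times (they are in milliseconds) is a quick way to verify the TOTAL ELAPSED figure and to see where the time went - my own tally, not Benchmark output:

```python
# Per-stage wall-clock times in milliseconds, copied from the report above.
stages = {
    "inject":   [23910],
    "generate": [24187, 23531, 24234, 27095, 36100, 129736],
    "fetch":    [51506, 60231, 72187, 90323, 383787, 516860],
    "parse":    [12205, 12125, 15160, 30101, 198388, 850305],
    "update":   [24297, 24142, 24117, 36127, 106444, 1565639],
}

for name, runs in stages.items():
    print(f"{name:>8}: {sum(runs) / 1000:.0f} s")

total_ms = sum(sum(runs) for runs in stages.values())
print(f"sum of stages: {total_ms} ms (reported TOTAL ELAPSED: 4362745 ms)")
```

The stage sums come to 4362737 ms, within a few milliseconds of the reported total, and the deeper rounds (runs 4-5) clearly dominate every stage - update run 5 alone costs over 26 minutes.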


Injection is nearly a no-op (I used a single seed URL), so this gives us the basic Hadoop overhead per job: ~24 seconds.

Please note that unlike the previous benchmark results, this one uses depth 6 - for unknown reasons, even at this depth the number of collected URLs is _higher_ than in the depth-7 run on branch-1.3 ... apparently there's something weird going on with URL accounting in trunk...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
