Hi,
Here's a status line from a Benchmark job that I ran recently:
0/0 threads spinwaiting 38996 pages, 1 errors, 557.1 pages/s, 16995 kb/s, 0 URLs in 2 queues > reduce
Interested in more details? :) I thought so...
* first, this is a synthetic benchmark. Target pages were produced on
the fly by the 'ant proxy' fake handler, over the effectively unlimited
bandwidth of a localhost connection (the proxy was running on
localhost). Fake pages were generated in a way that guaranteed that
every page and every host was unique, i.e. there were no outlinks to
the same hosts or to the same pages anywhere in the run.
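The unique-pages guarantee can be sketched like this. Note this is a hypothetical generator, not the actual 'ant proxy' handler - the counter-derived host and path names are my own invention:

```python
def fake_page(page_id, n_links=10):
    """Generate an HTML page whose outlinks all point to hosts and
    pages that no other generated page links to. Hypothetical scheme:
    every child id is derived only from this page's id, so the id
    ranges of different pages never overlap."""
    links = []
    for i in range(n_links):
        child = page_id * n_links + i + 1   # globally unique child id
        # one unique host per link -> no host is ever seen twice
        links.append('<a href="http://host-%d.example/page-%d.html">%d</a>'
                     % (child, child, child))
    return "<html><body>%s</body></html>" % " ".join(links)
```

With a scheme like this the crawl frontier grows by a factor of n_links per depth level, and per-host politeness never throttles the fetcher because no host repeats.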
* fetcher.parse=false, i.e. the Fetcher only stored the content. There
were 100 threads running.
* there was no DNS resolving - all pages were produced by the proxy, so
Nutch didn't need to resolve names to IPs.
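For reference, the two knobs above correspond to these standard properties (a nutch-site.xml fragment; values as used in this run):

```xml
<!-- nutch-site.xml (fragment) -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>Store raw content only; run parsing as a separate job.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
```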
What do these numbers mean? Well, they mean that if there are no pesky
obstacles like host-level blocking, DNS resolution or bandwidth limits,
then the Fetcher is insanely fast. ;)
That's good to know, actually - I was afraid that there was some
inherent limitation in the Fetcher, due to synchronization, that
prevented it from working faster than ~100 pages/sec (per task).
Apparently that's not the case.
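For scale, a bit of arithmetic on the figures quoted in the status line above:

```python
# Figures straight from the benchmark status line.
pages_per_s = 557.1
kb_per_s    = 16995
threads     = 100

per_thread  = pages_per_s / threads    # ~5.6 pages/s per thread
ms_per_page = 1000.0 / per_thread      # ~180 ms per page, per thread
avg_page_kb = kb_per_s / pages_per_s   # ~30.5 kB average page size
```

So each of the 100 threads turns a page around roughly every 180 ms, with no DNS or politeness delays in the way.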
I'll try to set up other benchmarks that introduce some of the above
factors, to see which of them has an undue impact on performance.
Oh, btw - the above benchmark was run on a 1-node Hadoop cluster with
the HBase backend, with the following results:
10/08/13 15:52:41 INFO crawl.WebTableReader: Statistics for WebTable:
10/08/13 15:52:41 INFO crawl.WebTableReader: TOTAL urls: 2588551
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 0: 2588534
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 1: 17
10/08/13 15:52:41 INFO crawl.WebTableReader: min score: 0.0
10/08/13 15:52:41 INFO crawl.WebTableReader: avg score: 1.2037623E-6
10/08/13 15:52:41 INFO crawl.WebTableReader: max score: 1.116
10/08/13 15:52:41 INFO crawl.WebTableReader: status 1 (status_unfetched): 2415981
10/08/13 15:52:41 INFO crawl.WebTableReader: status 2 (status_fetched): 172570
10/08/13 15:52:41 INFO crawl.WebTableReader: WebTable statistics: done
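As a quick sanity check, the WebTable counts above are internally consistent - both the status breakdown and the retry breakdown sum to the TOTAL urls figure:

```python
# Counts copied from the WebTableReader output above.
total     = 2588551
unfetched = 2415981   # status 1
fetched   = 172570    # status 2
retry0    = 2588534
retry1    = 17

assert unfetched + fetched == total
assert retry0 + retry1 == total
```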
* Plugins:
protocol-http|parse-tika|scoring-opic|urlfilter-regex|urlnormalizer-pass
* Seeds: 1
* Depth: 6
* Threads: 100
* TopN: 9223372036854775807 (Long.MAX_VALUE, i.e. effectively unlimited)
* TOTAL ELAPSED: 4362745
- stage: inject
run 0 23910
- stage: generate
run 0 24187
run 1 23531
run 2 24234
run 3 27095
run 4 36100
run 5 129736
- stage: fetch
run 0 51506
run 1 60231
run 2 72187
run 3 90323
run 4 383787
run 5 516860
- stage: parse
run 0 12205
run 1 12125
run 2 15160
run 3 30101
run 4 198388
run 5 850305
- stage: update
run 0 24297
run 1 24142
run 2 24117
run 3 36127
run 4 106444
run 5 1565639
Injection is nearly a no-op (I use a single seed URL), so its run gives
us the basic Hadoop overhead per job: ~24 seconds (the stage times
above are in milliseconds).
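The per-stage numbers (in milliseconds) do add up to the TOTAL ELAPSED figure, give or take a few ms of rounding, and the deepest round dominates the run:

```python
# Stage timings (ms) copied from the run summary above.
inject   = [23910]
generate = [24187, 23531, 24234, 27095, 36100, 129736]
fetch    = [51506, 60231, 72187, 90323, 383787, 516860]
parse    = [12205, 12125, 15160, 30101, 198388, 850305]
update   = [24297, 24142, 24117, 36127, 106444, 1565639]

total_ms = sum(inject + generate + fetch + parse + update)
# ~4362737 ms, i.e. ~72.7 minutes, vs. the reported 4362745

# the last round alone accounts for about 70% of the whole run
last_round = generate[-1] + fetch[-1] + parse[-1] + update[-1]
```

Parse and update blow up fastest at depth 5, which fits the depth-related accounting oddity mentioned below.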
Please note that unlike the previous benchmark results, this one uses
depth 6 - for unknown reasons, even at this depth the number of
collected URLs is _higher_ than in the depth-7 run on branch-1.3 ...
apparently there's something weird going on with URL accounting in
trunk...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com