[ https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519281#comment-16519281 ]
Sebastian Nagel commented on NUTCH-2576:
----------------------------------------
Sharing some metrics from testing protocol-okhttp:
- breadth-first crawl started from a list of homepages of top-ranked hosts/domains
- 5 cycles, topN = 40 million, max. per host = 50
- 3 cycles using protocol-http (incl. patches for NUTCH-2549), 2 cycles using
protocol-okhttp
With due caution, there is some evidence that Okhttp is faster (it fetched more
content in less time). To get a better estimate of how much faster, I plan to
run a crawl with the same parameters next month, using okhttp for the first 3
cycles. Here are the details:
1. Time elapsed (hours:minutes) per cycle until 50% of map progress is
reached, that is, until 75% of the map input has been read (75% of URLs have
been fetched or are queued):
{noformat}
3:21 cycle 1 (protocol-http)
2:40 cycle 2 (protocol-http)
3:09 cycle 3 (protocol-http)
2:28 cycle 4 (protocol-okhttp)
2:47 cycle 5 (protocol-okhttp)
{noformat}
Of course, the cycles differ in their fetch lists. This is especially true for
the first cycle, which has only a single URL per host. I did not measure the
time until all URLs were fetched because that is also determined by slow queues.
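To put rough numbers on these timings, the per-protocol means can be computed as in the following minimal sketch. The helper name `to_minutes` is made up for illustration, and with only 3 and 2 cycles per plugin (and differing fetch lists) the result is indicative at best.

```python
from statistics import mean

# Elapsed times (hours:minutes) copied from the table above.
timings = {
    "protocol-http": ["3:21", "2:40", "3:09"],
    "protocol-okhttp": ["2:28", "2:47"],
}

def to_minutes(hhmm: str) -> int:
    """Convert an 'h:mm' string into a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

# Mean elapsed time in minutes per protocol plugin.
means = {
    name: mean(to_minutes(t) for t in values)
    for name, values in timings.items()
}
for name, value in means.items():
    print(f"{name}: {value:.1f} min")

# Ratio below 1.0 means the okhttp cycles reached the mark sooner on average.
ratio = means["protocol-okhttp"] / means["protocol-http"]
print(f"okhttp/http ratio: {ratio:.2f}")
```

On these five data points the okhttp cycles needed about 157.5 minutes on average versus about 183.3 minutes for protocol-http, a ratio of roughly 0.86.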
2. Content fetched per cycle:
{noformat}
0.97 PiB cycle 1 (protocol-http)
1.07 PiB cycle 2 (protocol-http)
1.23 PiB cycle 3 (protocol-http)
1.41 PiB cycle 4 (protocol-okhttp)
1.50 PiB cycle 5 (protocol-okhttp)
{noformat}
The increase can be explained by documents being longer on average with
increasing crawl depth.
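As a quick check of that explanation, the cycle-to-cycle growth can be computed from the figures above. This is only a sketch over the five reported values; the variable names are illustrative.

```python
# Content fetched per cycle, copied from the table above (unit as reported).
content = [0.97, 1.07, 1.23, 1.41, 1.50]
protocols = ["http", "http", "http", "okhttp", "okhttp"]

# Growth factor of each cycle relative to the previous one.
growth = [curr / prev for prev, curr in zip(content, content[1:])]

for cycle, factor in enumerate(growth, start=2):
    change = 100 * (factor - 1)
    print(f"cycle {cycle} ({protocols[cycle - 1]}): {change:+.1f}% vs. previous cycle")
```

The volume already grows by about 10% and 15% between the protocol-http cycles, and the step to the first okhttp cycle (about 15%) is of the same order, so the content figures alone do not separate the plugin effect from the crawl-depth effect.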
3. Thread stack counts:
{noformat}
protocol-http                                | protocol-okhttp
375 at java.net.SocketInputStream.socketRead | 480 at java.net.SocketInputStream.socketRead
181 at java.net.PlainSocketImpl.socketConnec | 124 at java.net.Inet4AddressImpl.lookupAllHo
132 at java.net.Inet4AddressImpl.lookupAllHo | 121 at java.net.PlainSocketImpl.socketConnec
 36 at sun.misc.Unsafe.park(Native Method)   |  53 at sun.misc.Unsafe.park(Native Method)
{noformat}
Okhttp reuses socket connections, which explains why it spends less time
opening connections. Parked threads are waiting for the parser.
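Counts of this kind can be obtained by tallying the top stack frame of every thread in a JVM thread dump. The sketch below assumes a jstack-style text dump; the function name `top_frame_counts` and the sample dump are hypothetical, and this is not necessarily how the numbers above were produced. The 40-character cut-off matches the truncated frame names in the table.

```python
from collections import Counter

def top_frame_counts(dump: str, width: int = 40) -> Counter:
    """Count the first 'at ...' frame of each thread, truncated to `width` chars."""
    counts = Counter()
    expect_top = False
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # A quoted thread name starts a new thread section.
            expect_top = True
        elif expect_top and stripped.startswith("at "):
            counts[stripped[:width]] += 1
            expect_top = False  # ignore the deeper frames of this thread
    return counts

# Made-up sample dump, for illustration only.
sample_dump = '''\
"FetcherThread-1" #31 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

"FetcherThread-2" #32 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)

"FetcherThread-3" #33 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
'''

counts = top_frame_counts(sample_dump)
for frame, count in counts.most_common():
    print(f"{count:4d} {frame}")
```

Taking several dumps at intervals and summing the counters gives a crude sampling profile of where fetcher threads spend their time.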
4. Frequency of error messages reported in task logs. It could be that
protocol-http (with NUTCH-2549) is slightly more tolerant of bad input. But
there is also no evidence of significant regressions in protocol-okhttp:
- 605k errors in cycle 3 (protocol-http):
{noformat}
207832 java.net.UnknownHostException
200772 java.net.SocketTimeoutException: Read timed out
117147 java.net.SocketTimeoutException: connect timed out
29346 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
17273 java.net.ConnectException: Connection refused (Connection refused)
9356 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
8966 java.net.SocketException: Connection reset
5716 java.net.NoRouteToHostException: No route to host (Host unreachable)
1186 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
1178 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
1049 java.net.MalformedURLException: unknown protocol: ...
777 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
608 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Connection reset
546 java.io.IOException: unzipBestEffort returned null
537 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Received fatal alert: internal_error
516 org.apache.nutch.protocol.http.api.HttpException: SSL reconnect to ... failed with: handshake alert: unrecognized_name
332 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
278 org.apache.nutch.protocol.http.api.HttpException: bad chunk length: ...
272 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: DHPublicKey does not comply to algorithm constraints
245 java.net.ConnectException: Invalid argument (connect failed)
195 java.io.IOException: Line exceeds max. buffer size: ...
106 java.net.SocketException: Network is unreachable (connect failed)
106 org.apache.nutch.protocol.http.api.HttpException: chunk eof after...
...
{noformat}
- over 735k errors in cycle 4 (protocol-okhttp):
{noformat}
234326 java.net.UnknownHostException
195600 java.net.SocketTimeoutException: timeout
154595 java.net.SocketTimeoutException: connect timed out
29170 javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
16166 java.net.ConnectException: Failed to connect to ...
15124 javax.net.ssl.SSLPeerUnverifiedException: Hostname ... not verified
12443 java.net.SocketTimeoutException: Read timed out
10965 javax.net.ssl.SSLProtocolException: handshake alert: unrecognized_name
9929 java.net.SocketException: Connection reset
9592 javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
9119 java.io.IOException: unexpected end of stream on Connection...
8876 java.io.IOException: gzip finished without exhausting source
8658 java.io.EOFException: source exhausted prematurely
5908 java.net.NoRouteToHostException: No route to host (Host unreachable)
3428 java.io.EOFException
2763 javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
2646 java.io.IOException: CRC: ...
1567 java.io.IOException: ID1ID2: ...
1556 java.net.ProtocolException: unexpected end of stream
606 java.net.MalformedURLException: unknown protocol: ...
584 javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
569 javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
420 javax.net.ssl.SSLException: Received fatal alert: internal_error
359 java.net.ProtocolException: Unexpected status line: ...
335 javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
99 java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
...
{noformat}
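Because the two plugins word similar failures differently (for example `Read timed out` vs. `timeout`), a line-by-line comparison of the two lists is awkward. One option, sketched here under the assumption that grouping by exception class is meaningful, is to aggregate the counts per class. Only the most frequent entries are copied from the lists above, and the function name `by_exception_class` is hypothetical.

```python
from collections import Counter

def by_exception_class(lines):
    """Sum counts per exception class from lines of the form '<count> <class>[: message]'."""
    totals = Counter()
    for line in lines:
        count, rest = line.split(maxsplit=1)
        exc_class = rest.split(":", 1)[0].strip()
        totals[exc_class] += int(count)
    return totals

# Top entries copied from the cycle 3 (protocol-http) list.
cycle3_http = [
    "207832 java.net.UnknownHostException",
    "200772 java.net.SocketTimeoutException: Read timed out",
    "117147 java.net.SocketTimeoutException: connect timed out",
    "17273 java.net.ConnectException: Connection refused (Connection refused)",
]
# Top entries copied from the cycle 4 (protocol-okhttp) list.
cycle4_okhttp = [
    "234326 java.net.UnknownHostException",
    "195600 java.net.SocketTimeoutException: timeout",
    "154595 java.net.SocketTimeoutException: connect timed out",
    "12443 java.net.SocketTimeoutException: Read timed out",
    "16166 java.net.ConnectException: Failed to connect to ...",
]

http_totals = by_exception_class(cycle3_http)
okhttp_totals = by_exception_class(cycle4_okhttp)

# Print both cycles side by side, one exception class per row.
for exc_class in sorted(set(http_totals) | set(okhttp_totals)):
    print(f"{exc_class}: {http_totals[exc_class]} (http) vs. {okhttp_totals[exc_class]} (okhttp)")
```

The absolute numbers are still not directly comparable, since the two cycles fetched different lists of URLs; dividing by the number of fetch attempts per cycle would be needed for error rates.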
> HTTP protocol plugin based on okhttp
> ------------------------------------
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, protocol
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library
> which supports HTTP/2. [~jnioche]'s implementation
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443]
> proves that it should be straightforward to implement a Nutch protocol plugin
> using okhttp. A recent HTTP protocol implementation should also fix (most of)
> the issues reported in NUTCH-2549.