[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519281#comment-16519281
 ] 

Sebastian Nagel commented on NUTCH-2576:
----------------------------------------

Sharing some metrics from testing protocol-okhttp:
 - breadth-first crawl started from a list of homepages of top-ranked hosts/domains
 - 5 cycles, topN = 40 million, max. per host = 50
 - 3 cycles using protocol-http (incl. patches for NUTCH-2549), 2 cycles using
   protocol-okhttp

With due caution, there is some evidence that Okhttp is faster (it fetched more 
content in less time). To get a better estimate of how much faster, I plan to 
run a crawl with the same parameters next month, using okhttp for the first 3 
cycles. But here are the details:

1. Time elapsed (hours:minutes) per cycle until 50% map progress is reached, 
i.e. until 75% of the map input has been read (75% of the URLs have been 
fetched or are queued):
{noformat}
3:21  cycle 1 (protocol-http)
2:40  cycle 2 (protocol-http)
3:09  cycle 3 (protocol-http)
2:28  cycle 4 (protocol-okhttp)
2:47  cycle 5 (protocol-okhttp)
{noformat}
Of course, the cycles differ in their fetch lists. This is especially true for 
the first cycle, which has only a single URL per host. I did not measure the 
time until all URLs have been fetched because it is also determined by slow 
queues.
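
As a rough cross-check, the timings listed above can be averaged per protocol. The following is only a sketch: the helper names `to_minutes` and `mean_minutes` are made up for illustration, and with 3 and 2 cycles per protocol on different fetch lists the difference is indicative at best.

```python
# Times (hours:minutes) until 50% map progress, copied from the list above.
TIMES = {
    "protocol-http": ["3:21", "2:40", "3:09"],
    "protocol-okhttp": ["2:28", "2:47"],
}


def to_minutes(hhmm):
    """Convert an 'h:mm' string to minutes (hypothetical helper)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def mean_minutes(times):
    """Mean duration in minutes over a list of 'h:mm' strings."""
    return sum(to_minutes(t) for t in times) / len(times)


if __name__ == "__main__":
    for protocol, times in TIMES.items():
        print(f"{protocol}: {mean_minutes(times):.1f} min on average")
```

With the numbers above this gives roughly 183 minutes for protocol-http and 157.5 minutes for protocol-okhttp.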

2. Content fetched per cycle:
{noformat}
0.97 PiB  cycle 1 (protocol-http)
1.07 PiB  cycle 2 (protocol-http)
1.23 PiB  cycle 3 (protocol-http)
1.41 PiB  cycle 4 (protocol-okhttp)
1.50 PiB  cycle 5 (protocol-okhttp)
{noformat}
This can be explained by documents being longer on average with increasing 
crawl depth.

3. Thread stack counts:
{noformat}
protocol-http                                          |   protocol-okhttp
375         at java.net.SocketInputStream.socketRead   |   480         at java.net.SocketInputStream.socketRead
181         at java.net.PlainSocketImpl.socketConnec   |   124         at java.net.Inet4AddressImpl.lookupAllHo
132         at java.net.Inet4AddressImpl.lookupAllHo   |   121         at java.net.PlainSocketImpl.socketConnec
 36         at sun.misc.Unsafe.park(Native Method)     |    53         at sun.misc.Unsafe.park(Native Method)
{noformat}
Okhttp reuses socket connections, which explains why it spends less time 
opening connections. Parked threads are waiting for the parser.
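
Counts like these can be obtained by tallying the topmost stack frame of each thread in a thread dump. Below is a minimal sketch, assuming a dump in the usual `jstack` text format (thread header lines starting with a double quote, frames starting with `at`). The function name `top_frame_counts` is hypothetical, and how the numbers above were actually collected is not stated here.

```python
import re
from collections import Counter

# Matches a stack frame line such as
# "\tat java.net.SocketInputStream.socketRead0(Native Method)"
FRAME = re.compile(r"^\s+at\s+(\S+)")


def top_frame_counts(dump_text):
    """Count the topmost 'at ...' frame of every thread in a thread dump."""
    counts = Counter()
    in_thread = False  # True after a thread header until its first frame
    for line in dump_text.splitlines():
        if line.startswith('"'):  # thread header: "name" #id ...
            in_thread = True
            continue
        match = FRAME.match(line)
        if match and in_thread:
            # Drop the "(Native Method)" / "(File.java:123)" suffix.
            counts[match.group(1).split("(")[0]] += 1
            in_thread = False
    return counts
```

Calling `top_frame_counts(text).most_common()` on the text of a dump lists the most frequent top frames first.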

4. Frequency of error messages reported in the task logs. It could be that 
protocol-http (with NUTCH-2549) is slightly more tolerant of bad input, but 
there is also no evidence of significant regressions in protocol-okhttp:
 - 605k errors in cycle 3 (protocol-http):
{noformat}
207832  java.net.UnknownHostException
200772  java.net.SocketTimeoutException: Read timed out
117147  java.net.SocketTimeoutException: connect timed out
29346   javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
17273   java.net.ConnectException: Connection refused (Connection refused)
9356    javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
8966    java.net.SocketException: Connection reset
5716    java.net.NoRouteToHostException: No route to host (Host unreachable)
1186    javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
1178    javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
1049    java.net.MalformedURLException: unknown protocol: ...
777     javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
608     javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Connection reset
546     java.io.IOException: unzipBestEffort returned null
537     javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Received fatal alert: internal_error
516     org.apache.nutch.protocol.http.api.HttpException: SSL reconnect to ... failed with: handshake alert:  unrecognized_name
332     javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
278     org.apache.nutch.protocol.http.api.HttpException: bad chunk length: ...
272     javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: DHPublicKey does not comply to algorithm constraints
245     java.net.ConnectException: Invalid argument (connect failed)
195     java.io.IOException: Line exceeds max. buffer size: ...
106     java.net.SocketException: Network is unreachable (connect failed)
106     org.apache.nutch.protocol.http.api.HttpException: chunk eof after...
...
{noformat}

 - over 735k errors in cycle 4 (protocol-okhttp):
{noformat}
234326  java.net.UnknownHostException
195600  java.net.SocketTimeoutException: timeout
154595  java.net.SocketTimeoutException: connect timed out
29170   javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
16166   java.net.ConnectException: Failed to connect to ...
15124   javax.net.ssl.SSLPeerUnverifiedException: Hostname ... not verified
12443   java.net.SocketTimeoutException: Read timed out
10965   javax.net.ssl.SSLProtocolException: handshake alert:  unrecognized_name
9929    java.net.SocketException: Connection reset
9592    javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
9119    java.io.IOException: unexpected end of stream on Connection...
8876    java.io.IOException: gzip finished without exhausting source
8658    java.io.EOFException: source exhausted prematurely
5908    java.net.NoRouteToHostException: No route to host (Host unreachable)
3428    java.io.EOFException
2763    javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
2646    java.io.IOException: CRC: ...
1567    java.io.IOException: ID1ID2: ...
1556    java.net.ProtocolException: unexpected end of stream
606     java.net.MalformedURLException: unknown protocol: ...
584     javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
569     javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
420     javax.net.ssl.SSLException: Received fatal alert: internal_error
359     java.net.ProtocolException: Unexpected status line: ...
335     javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
99      java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
...
{noformat}
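
Frequency lists like the two above can be produced by extracting the exception message from each log line, replacing URL- or host-specific parts with `...`, and counting the resulting buckets. The following is a minimal sketch under stated assumptions: the marker string `failed with: `, the normalization rules, and the function names are all hypothetical and would need to be adapted to the actual task logs.

```python
import re
from collections import Counter

# Hypothetical normalization rules: replace URL/host-specific parts so that
# messages differing only in those parts fall into the same bucket.
RULES = [
    (re.compile(r"^(java\.net\.UnknownHostException).*"), r"\1"),
    (re.compile(r"(Failed to connect to) .*"), r"\1 ..."),
    (re.compile(r"(Hostname) \S+ (not verified).*"), r"\1 ... \2"),
]

# Assumed marker that precedes the exception text in a log line.
MARKER = "failed with: "


def normalize(message):
    """Apply all normalization rules to a single error message."""
    for pattern, replacement in RULES:
        message = pattern.sub(replacement, message)
    return message.strip()


def error_counts(log_lines):
    """Count normalized error messages over an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        _, found, message = line.partition(MARKER)
        if found:  # skip lines without the marker
            counts[normalize(message)] += 1
    return counts
```

For a log file, `error_counts(open(path)).most_common()` yields the buckets sorted by frequency, in the same layout as the lists above.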

> HTTP protocol plugin based on okhttp
> ------------------------------------
>
>                 Key: NUTCH-2576
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2576
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
