[ https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519281#comment-16519281 ]
Sebastian Nagel commented on NUTCH-2576:
----------------------------------------

Sharing some metrics from testing protocol-okhttp:
- breadth-first crawl started from a list of homepages of top-ranked hosts/domains
- 5 cycles, topN = 40 million, max. per host = 50
- 3 cycles using protocol-http (incl. patches for NUTCH-2549), 2 cycles using protocol-okhttp

With caution, there is some evidence that Okhttp is faster (it fetched more content in less time). To get a better estimate of how much faster, I plan to run a crawl with the same parameters next month, using okhttp for the first 3 cycles. But here are the details:

1. Time elapsed (hours:minutes) per cycle until 75% of map progress is reached, i.e. until 75% of the map input has been read (75% of URLs have been fetched or are queued):
{noformat}
3:21  cycle 1 (protocol-http)
2:40  cycle 2 (protocol-http)
3:09  cycle 3 (protocol-http)
2:28  cycle 4 (protocol-okhttp)
2:47  cycle 5 (protocol-okhttp)
{noformat}
Of course, the cycles differ in their fetch lists. This is especially true for the first cycle, which has only a single URL per host. I didn't measure the time until all URLs were fetched because it is also determined by slow queues.

2. Content fetched per cycle:
{noformat}
0.97 TiB  cycle 1 (protocol-http)
1.07 TiB  cycle 2 (protocol-http)
1.23 TiB  cycle 3 (protocol-http)
1.41 TiB  cycle 4 (protocol-okhttp)
1.50 TiB  cycle 5 (protocol-okhttp)
{noformat}
This can be explained by documents being longer on average as crawl depth increases.

3. Thread stack counts:
{noformat}
protocol-http                                | protocol-okhttp
375 at java.net.SocketInputStream.socketRead | 480 at java.net.SocketInputStream.socketRead
181 at java.net.PlainSocketImpl.socketConnec | 124 at java.net.Inet4AddressImpl.lookupAllHo
132 at java.net.Inet4AddressImpl.lookupAllHo | 121 at java.net.PlainSocketImpl.socketConnec
 36 at sun.misc.Unsafe.park(Native Method)   |  53 at sun.misc.Unsafe.park(Native Method)
{noformat}
Okhttp reuses socket connections, which explains why it spends less time opening connections. Parked threads are waiting for the parser.

4. Frequency of error messages reported in task logs. It could be that protocol-http (with NUTCH-2549) is a little more tolerant of bad input. But there is also no evidence of significant regressions in protocol-okhttp:
- 605k errors in cycle 3 (protocol-http):
{noformat}
 207832 java.net.UnknownHostException
 200772 java.net.SocketTimeoutException: Read timed out
 117147 java.net.SocketTimeoutException: connect timed out
  29346 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  17273 java.net.ConnectException: Connection refused (Connection refused)
   9356 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
   8966 java.net.SocketException: Connection reset
   5716 java.net.NoRouteToHostException: No route to host (Host unreachable)
   1186 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
   1178 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
   1049 java.net.MalformedURLException: unknown protocol: ...
    777 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
    608 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Connection reset
    546 java.io.IOException: unzipBestEffort returned null
    537 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: Received fatal alert: internal_error
    516 org.apache.nutch.protocol.http.api.HttpException: SSL reconnect to ... failed with: handshake alert: unrecognized_name
    332 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
    278 org.apache.nutch.protocol.http.api.HttpException: bad chunk length: ...
    272 javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLHandshakeException: DHPublicKey does not comply to algorithm constraints
    245 java.net.ConnectException: Invalid argument (connect failed)
    195 java.io.IOException: Line exceeds max. buffer size: ...
    106 java.net.SocketException: Network is unreachable (connect failed)
    106 org.apache.nutch.protocol.http.api.HttpException: chunk eof after...
    ...
{noformat}
- 740k errors in cycle 4 (protocol-okhttp):
{noformat}
 234326 java.net.UnknownHostException
 195600 java.net.SocketTimeoutException: timeout
 154595 java.net.SocketTimeoutException: connect timed out
  29170 javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  16166 java.net.ConnectException: Failed to connect to ...
  15124 javax.net.ssl.SSLPeerUnverifiedException: Hostname ... not verified
  12443 java.net.SocketTimeoutException: Read timed out
  10965 javax.net.ssl.SSLProtocolException: handshake alert: unrecognized_name
   9929 java.net.SocketException: Connection reset
   9592 javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
   9119 java.io.IOException: unexpected end of stream on Connection...
   8876 java.io.IOException: gzip finished without exhausting source
   8658 java.io.EOFException: source exhausted prematurely
   5908 java.net.NoRouteToHostException: No route to host (Host unreachable)
   3428 java.io.EOFException
   2763 javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
   2646 java.io.IOException: CRC: ...
   1567 java.io.IOException: ID1ID2: ...
   1556 java.net.ProtocolException: unexpected end of stream
    606 java.net.MalformedURLException: unknown protocol: ...
    584 javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
    569 javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
    420 javax.net.ssl.SSLException: Received fatal alert: internal_error
    359 java.net.ProtocolException: Unexpected status line: ...
    335 javax.net.ssl.SSLHandshakeException: Received fatal alert: unrecognized_name
     99 java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
    ...
{noformat}

> HTTP protocol plugin based on okhttp
> ------------------------------------
>
>                 Key: NUTCH-2576
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2576
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library
> which supports HTTP/2. [~jnioche]'s implementation
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443]
> proves that it should be straightforward to implement a Nutch protocol plugin
> using okhttp. A recent HTTP protocol implementation should also fix (most of)
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)