[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183490#comment-14183490 ]
Sebastian Nagel commented on NUTCH-1825:
----------------------------------------

Comments and reviews welcome!

The problem is easily reproducible:
* first terminal (with attached proxy.js and minimalistic document, delivered by local Apache):
{noformat}
% cat /var/www/test.html
<html><head><title>test</title></head><body>test</body></html>
% nodejs -v
v0.10.25
% nodejs ./proxy.js
Listening on port 8080
{noformat}
* second terminal:
{noformat}
% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http http://localhost:8080/test.html
Status: exception(16), lastModified=0: java.net.SocketTimeoutException: Read timed out
bin/nutch parsechecker http://localhost:8080/test.html
% less .../hadoop.log
2014-10-24 22:37:13,214 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        ...
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
{noformat}
* also 2.x is affected!

> protocol-http may hang for certain web pages
> --------------------------------------------
>
>                 Key: NUTCH-1825
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1825
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.9
>            Reporter: Phu Kieu
>            Priority: Minor
>         Attachments: HttpResponse.java.patch, NUTCH-1825-trunk-v2.patch,
> NUTCH-1825-trunk-v3.patch, proxy.js
>
>
> There is a rare case where protocol-http will wait for data even when all the
> data has been sent.
> Patch is attached; please test and confirm.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
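
To illustrate the failure mode described in the issue ("wait for data even when all the data has been sent"), here is a minimal, self-contained sketch. It is neither the attached patch nor Nutch code. It rests on an assumption the issue does not state: that the peer sends a complete body with a known Content-Length but never closes the connection, so a reader that waits for EOF blocks until the socket timeout. `ContentLengthReadSketch`, `KeepAliveStream`, `readUntilEof` and `readWithContentLength` are hypothetical names made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class ContentLengthReadSketch {

  /**
   * Hypothetical stand-in for a socket stream whose peer has sent the whole
   * body but keeps the connection open (assumption, see lead-in).
   */
  public static class KeepAliveStream extends InputStream {
    private final byte[] data;
    private int pos = 0;

    public KeepAliveStream(byte[] data) {
      this.data = data;
    }

    @Override
    public int read() throws IOException {
      byte[] one = new byte[1];
      read(one, 0, 1);
      return one[0] & 0xff;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      if (len == 0) {
        return 0;
      }
      if (pos >= data.length) {
        // A real socket would block here until SO_TIMEOUT expires;
        // the sketch throws immediately instead of returning -1 (EOF).
        throw new SocketTimeoutException("Read timed out");
      }
      int n = Math.min(len, data.length - pos);
      System.arraycopy(data, pos, buf, off, n);
      pos += n;
      return n;
    }
  }

  /** Reads until EOF; never finishes normally if the peer does not close. */
  public static byte[] readUntilEof(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[16];
    int n;
    while ((n = in.read(buf, 0, buf.length)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  /** Stops after contentLength bytes, so no read is issued past the body. */
  public static byte[] readWithContentLength(InputStream in, int contentLength)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[16];
    int remaining = contentLength;
    while (remaining > 0) {
      int n = in.read(buf, 0, Math.min(buf.length, remaining));
      if (n == -1) {
        break; // peer closed early; return what was received
      }
      out.write(buf, 0, n);
      remaining -= n;
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] body = "<html><head><title>test</title></head><body>test</body></html>"
        .getBytes(StandardCharsets.US_ASCII);

    byte[] ok = readWithContentLength(new KeepAliveStream(body), body.length);
    System.out.println("bounded read returned " + ok.length + " bytes");

    try {
      readUntilEof(new KeepAliveStream(body));
    } catch (SocketTimeoutException e) {
      System.out.println("read-until-EOF failed: " + e.getMessage());
    }
  }
}
```

Whether this matches what proxy.js does, or what the attached patches change in HttpResponse.readPlainContent, would need to be checked against the attachments themselves.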