[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1086: --- Fix Version/s: (was: 1.17) 1.18 > Rewrite protocol-httpclient > --- > > Key: NUTCH-1086 > URL: https://issues.apache.org/jira/browse/NUTCH-1086 > Project: Nutch > Issue Type: Improvement > Components: protocol >Affects Versions: nutchgora, 1.5 >Reporter: Markus Jelsma >Assignee: Fabio Santagostino >Priority: Major > Fix For: 1.18 > > Attachments: Http.java, HttpResponse.java > > > There are several issues about protocol-httpclient and several comments about > rewriting the plugin with the new http client libraries. There is, however, > not yet an issue for rewriting/reimplementing protocol-httpclient. > http://hc.apache.org/httpcomponents-client-ga/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1086: Fix Version/s: (was: 2.5)
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1086: Fix Version/s: (was: 1.16) 1.17
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1086: Fix Version/s: 1.16
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640872#comment-14640872 ] Nikolai Vasilev commented on NUTCH-1086:

Hello Peter, the deprecation warning you see means that you should no longer create the HttpClient with DefaultHttpClient and should use HttpClientBuilder instead: http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/DefaultHttpClient.html

{code} Deprecated. (4.3) use HttpClientBuilder see also CloseableHttpClient. {code}

There is a flaw in Fabio's implementation. By default, DefaultHttpClient uses [BasicClientConnectionManager|http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/conn/BasicClientConnectionManager.html], which is not meant to manage connections in a multithreaded environment, and multithreading is crucial for Nutch. The [PoolingClientConnectionManager|http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/conn/PoolingClientConnectionManager.html] should be used instead.

In our project we run Nutch on Amazon EMR, and we hit some weird dependency clashes when we tried to rewrite protocol-httpclient for HttpClient 4.x. Unfortunately I have lost the logs with the errors and cannot tell exactly what went wrong.
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614244#comment-14614244 ] Peter Ciuffetti commented on NUTCH-1086:

After an unsuccessful attempt to resolve NUTCH-2059, I thought I'd try this upgrade to httpclient. I placed the attached Java files into a branch based on the v1.11 trunk, but I'm getting a unit test failure and some deprecation compiler warnings.

{code}
compile:
[echo] Compiling plugin: protocol-httpclient
[javac] Compiling 10 source files to /Users/pciuffetti/Documents/Dev/workspace/nutch/build/protocol-httpclient/classes
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:35: warning: [deprecation] ConnRoutePNames in org.apache.http.conn.params has been deprecated
[javac] import org.apache.http.conn.params.ConnRoutePNames;
[javac] ^
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:37: warning: [deprecation] DefaultHttpClient in org.apache.http.impl.client has been deprecated
[javac] import org.apache.http.impl.client.DefaultHttpClient;
[javac] ^
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:68: warning: [deprecation] DefaultHttpClient in org.apache.http.impl.client has been deprecated
[javac] private static DefaultHttpClient client = new DefaultHttpClient();
[javac] ^
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:68: warning: [deprecation] DefaultHttpClient in org.apache.http.impl.client has been deprecated
[javac] private static DefaultHttpClient client = new DefaultHttpClient();
[javac] ^
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:96: warning: [deprecation] DefaultHttpClient in org.apache.http.impl.client has been deprecated
[javac] static synchronized DefaultHttpClient getClient() {
[javac] ^
[javac] /Users/pciuffetti/Documents/Dev/workspace/nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java:201: warning: [deprecation] ConnRoutePNames in org.apache.http.conn.params has been deprecated
[javac] client.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
[javac] ^
[javac] 6 warnings
{code}

{code}
Testcase: testNtlmAuth took 1.791 sec
FAILED
HTTP Status Code for http://127.0.0.1:47501/ntlm.jsp expected:<200> but was:<401>
junit.framework.AssertionFailedError: HTTP Status Code for http://127.0.0.1:47501/ntlm.jsp expected:<200> but was:<401>
at org.apache.nutch.protocol.httpclient.TestProtocolHttpClient.fetchPage(TestProtocolHttpClient.java:200)
at org.apache.nutch.protocol.httpclient.TestProtocolHttpClient.testNtlmAuth(TestProtocolHttpClient.java:162)
{code}

...will investigate if I can resolve these.
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1086: Assignee: Fabio Santagostino
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabio Santagostino updated NUTCH-1086: Attachment: HttpResponse.java Add httpclient 4.4 library
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322166#comment-14322166 ] Fabio Santagostino commented on NUTCH-1086:

Hi, I've made an attempt to rewrite the component using httpclient 4.4. It works for me! My main goal was to use a correct implementation of NTLMv2 authentication for my corporate web sites. It also seems to be backward compatible with the previous implementation. Proxy support is the only part I have not tested yet.

I had to change only 2 classes (attached):
- /src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
- /src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Of course the package dependency files must be modified as well. In /ivy/ivy.xml:
+ added httpclient version 4.4
{code:xml} <dependency org="org.apache.httpcomponents" name="httpclient" rev="4.4" conf="*->master" /> {code}
+ updated the codec version from
{code:xml}<dependency org="commons-codec" name="commons-codec" rev="1.3" conf="*->default" />{code}
to
{code:xml}<dependency org="commons-codec" name="commons-codec" rev="1.4" conf="*->default" />{code}

The attached files were tested on the v1.9 branch; minor changes are probably needed to make them suitable for v2.3.

Regards, Fabio
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabio Santagostino updated NUTCH-1086: Attachment: Http.java Add httpclient 4.4 library
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063643#comment-14063643 ] Simon Zhu commented on NUTCH-1086:

Hi Talat/Julien/Markus, I tested NTCredentials in HttpComponents HttpClient 4.3.4 against a proxy server that requires NTLM authentication, and the response code was 200 OK. However, with NTCredentials from Commons HttpClient 3.1, which protocol-httpclient currently uses, the returned code was 407. This indicates that, as far as the proxy server I am using is concerned, NTCredentials in HttpClient 3.1 does not speak the NTLM protocol correctly. I suppose the reason is that Commons HttpClient 3.1 was EOL in 2007 but the current NTLM version was released in 2008.

Since HttpClient 4.x is not compatible with 3.1, IMHO it is not easy to address the NTLM authentication issue with a patch, but I would be very happy if anyone could help develop one. I appreciate any advice, suggestions, or clues about the proxy server authentication issue and am more than happy to discuss this further.

Regards, Simon
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1086: Priority: Major (was: Critical)
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1086: Component/s: (was: fetcher) protocol
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769338#comment-13769338 ] Talat UYARER commented on NUTCH-1086:

Hi Markus, yes, I know that HttpClient is still in development as part of Apache HttpComponents. Your second comment is very useful information for me. Actually, I asked that question because I found a small bug in protocol-http: even if http.content.limit is set, protocol-http fetches files of all sizes (larger files are fetched up to the limit). But during parsing, the parser skips incomplete files (the parser.skip.truncated setting). It seems like unnecessary effort to partially fetch content larger than the limit if it is not going to be parsed. What do you think about this? I will upload a patch for this issue.
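The check Talat describes can be sketched as follows. This is only an illustration under stated assumptions, not the patch he mentions: the class and method names are hypothetical, a negative limit is assumed to mean "no limit", and the declared length is assumed to come from the response's Content-Length header (-1 when absent). Only JDK classes are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper, not part of Nutch: illustrates the fetch-time check discussed above. */
final class ContentLimitSketch {

    private ContentLimitSketch() {}

    /**
     * Returns true when downloading is wasted effort: the server declares a length
     * above the limit and the parser would skip the truncated result anyway.
     * A negative contentLimit is treated as "no limit" (an assumption of this sketch);
     * an unknown declared length (-1) never triggers a skip.
     */
    static boolean shouldSkipFetch(long declaredLength, int contentLimit, boolean skipTruncated) {
        if (!skipTruncated || contentLimit < 0) {
            return false;
        }
        return declaredLength > contentLimit;
    }

    /** Reads at most contentLimit bytes from the stream; a negative limit reads everything. */
    static byte[] readLimited(InputStream in, int contentLimit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = contentLimit < 0 ? Integer.MAX_VALUE : contentLimit;
        try {
            while (remaining > 0) {
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of stream reached before the limit
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With http.content.limit set to 65536 and parser.skip.truncated enabled, a response declaring 70000 bytes would be skipped before any body is read, while a response without a Content-Length header would still be read up to the limit.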
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768755#comment-13768755 ] Talat UYARER commented on NUTCH-1086:

Markus, I guess httpclient is end of life. Are you doing any development on this issue?
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768814#comment-13768814 ] Markus Jelsma commented on NUTCH-1086:

Hi Talat - what do you mean by EOL of HttpClient? Version 4.3 was just released a few months ago. I assume you mean that Nutch's implementation of it is old; it is indeed! This issue is about completely rewriting Nutch's protocol-httpclient plugin for the most recent version of HttpClient 4.x.
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768819#comment-13768819 ] Markus Jelsma commented on NUTCH-1086:

And to answer your question, no, i'm not working on this issue. We still manage with protocol-http and only use protocol-httpclient for TLS connections. It still works, for now :)
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1086: Fix Version/s: (was: 1.7) 1.8 2.3
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1086: Fix Version/s: (was: 2.1) 2.2
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1086: Affects Version/s: 1.5 nutchgora Fix Version/s: 2.1 1.6
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259288#comment-13259288 ] Ross Judson commented on NUTCH-1086:

The Oracle bug report # is 7129065. HttpUrlConnection-based NTLM auth to Sharepoint succeeds with JDK 6, and crashes the VM on JDK. I am investigating other solutions to this.
[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1086: Priority: Critical (was: Major)
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193007#comment-13193007 ] Ferdy Galema commented on NUTCH-1086:

Seems like a JVM bug; perhaps you could reproduce it using specific URLs? By the way, does anyone have a publicly accessible NTLMv2 example URL? Besides the lack of NTLMv2 support, is there anything else that isn't working properly? Support for HTTPS is not entirely broken, because https://www.iana.org/, for example, can be fetched perfectly fine.
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193031#comment-13193031 ] Oleg Kalnichevski commented on NUTCH-1086:

For what it is worth to you, HttpClient users have been reporting the best NTLMv2 compatibility results when using JCIFS as an NTLM engine. The trouble is that the library is LGPL licensed and therefore may not be directly incorporated into ASF works. However, you might consider giving your users an option of hooking JCIFS up through an extension mechanism of some sort, similar to that used by HttpClient [1].

Oleg

[1] http://hc.apache.org/httpcomponents-client-ga/ntlm.html
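One possible shape for the kind of extension mechanism Oleg suggests is a small interface in the plugin plus an implementation class named in configuration and loaded by reflection, so that an optional, separately licensed jar is only needed at runtime. The sketch below is an assumption-laden illustration: `NtlmEngine`, `BuiltInNtlmEngine`, and `NtlmEngineLoader` are hypothetical names, not Nutch or HttpClient API, and the interface is reduced to a single method to keep the loading logic visible.

```java
/** Hypothetical extension point; the names are illustrative, not Nutch or HttpClient API. */
interface NtlmEngine {
    /** Identifies the engine, e.g. for logging which implementation is active. */
    String name();
}

/** Stand-in for whatever engine the plugin would ship with by default. */
final class BuiltInNtlmEngine implements NtlmEngine {
    @Override
    public String name() {
        return "built-in";
    }
}

final class NtlmEngineLoader {

    private NtlmEngineLoader() {}

    /**
     * Loads the engine class named in configuration. Falls back to the built-in
     * engine when no class is configured, when the class cannot be loaded or
     * instantiated (for example because the optional jar is not on the classpath),
     * or when it does not implement NtlmEngine.
     */
    static NtlmEngine load(String className) {
        if (className == null || className.trim().isEmpty()) {
            return new BuiltInNtlmEngine();
        }
        try {
            Class<?> cls = Class.forName(className.trim());
            return (NtlmEngine) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
            return new BuiltInNtlmEngine();
        }
    }
}
```

A user who wants a JCIFS-backed engine would then put their own adapter class and the JCIFS jar on the classpath and name the adapter in configuration; the plugin itself never references LGPL code at compile time.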
Re: [jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
Thanks for dropping this on Remi. For future reference you might want to check out this online book on Subversion [1]. Here at Nutch we use Subversion for SCM, so it is the program we use to create and apply patches, hopefully improving Nutch in the process ;0) It's straightforward, no-nonsense source code management and is really easy to get to grips with given a little time.

Regarding this issue, unfortunately it has been open for a while, and it doesn't look like there is quite enough demand from those using the plugin to get a new implementation written up yet... I'm not even using it at all...

Thanks again

Lewis

[1] http://svnbook.red-bean.com/en/1.7/index.html
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188505#comment-13188505 ]

Lewis John McGibbney commented on NUTCH-1086:

When trying to access a SharePoint (IIS) website that uses NTLMv2 authentication, Nutch fails and gets an error code 500. HttpClient supports only an early version of NTLM, not NTLMv2. HttpUrlConnection can be used instead.

[1] http://oaklandsoftware.com/papers/ntlm.html
[2] http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188961#comment-13188961 ]

Remi Tassing commented on NUTCH-1086:

For the NTLMv2 issue I used a dirty solution in HttpResponse.java. Inside the constructor, after the getResponseBodyAsStream() attempt:

1. I check whether the result code is 500 (inside finally{...})
2. I use HttpUrlConnection to authenticate and open a connection
3. Then I read the InputStream, get the content, and change the code to 200

The problems with that solution are:

1. The authentication keys are hardcoded
2. It doesn't check whether the content is valid but still sets the return code to 200
3. Error code 500 doesn't necessarily mean that it's an NTLMv2 authentication problem

I have no idea how to write patches to the trunk...

Remi
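The three problems listed in this comment could be approached roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the attached HttpResponse.java: the class `NtlmFallbackSketch` and its methods are hypothetical names, it assumes the caller has recorded the `WWW-Authenticate` values the server sent during the failed attempt, and it relies on the JDK's own NTLM handling in `HttpURLConnection` as suggested earlier in the thread. Credentials are passed in (for example from configuration) rather than hardcoded, the fallback is tried only when the server actually offered NTLM, and a result is returned only when the fallback request really ended with status 200.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch or of the attached HttpResponse.java.
class NtlmFallbackSketch {

  // Fall back only when the first attempt failed AND the server offered NTLM.
  // "challenges" is assumed to hold the WWW-Authenticate values seen by the caller.
  static boolean shouldFallBack(int status, List<String> challenges) {
    if (status < 400 || challenges == null) {
      return false;
    }
    for (String challenge : challenges) {
      if (challenge != null
          && challenge.trim().toUpperCase(Locale.ROOT).startsWith("NTLM")) {
        return true;
      }
    }
    return false;
  }

  // Fetch the page with HttpURLConnection, using credentials supplied by the caller.
  // Throws instead of pretending success when the fallback does not return 200.
  static byte[] fetch(URL url, final String user, final char[] password) throws IOException {
    // Note: the default Authenticator is JVM-wide, which matters in a multi-threaded fetcher.
    Authenticator.setDefault(new Authenticator() {
      @Override
      protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(user, password);
      }
    });
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      int status = conn.getResponseCode();
      if (status != HttpURLConnection.HTTP_OK) {
        throw new IOException("Fallback fetch returned HTTP " + status);
      }
      try (InputStream in = conn.getInputStream();
           ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
        }
        return out.toByteArray();
      }
    } finally {
      conn.disconnect();
    }
  }
}
```

A caller would check `shouldFallBack(code, challenges)` first and keep the original error code whenever it returns false or `fetch` throws, so a 500 that has nothing to do with NTLM is still reported as a 500.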
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089314#comment-13089314 ]

Aravind Srini commented on NUTCH-1086:

Some transitive dependencies:

* Solr 3.1.0 seems to depend on commons-httpclient 3.1. I started an independent email thread with the Solr community ("solr - httpclient from 3.x to 4.1.x") to open it up for discussion.
* Hadoop 0.20.2 depends on commons-httpclient 3.0.1 as well.

Also, httpclient 4.1.2 depends on httpcore 4.1.2, but there seems to have been an emergency release of httpcore 4.1.3 (and httpclient was not republished after it), so both need to be explicitly listed in ivy.xml (or pom.xml).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Rewrite protocol-httpclient
In branch 1.4 at first. It should be easy to port to trunk, however. You're more than welcome to contribute.

> On Tue, Aug 23, 2011 at 12:28 AM, Markus Jelsma markus.jel...@openindex.io wrote:
>
> > Hi,
> >
> > Please see Julien's comment in this recent thread: Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]
> >
> > To be short: no. The bulk of the work is code and manual testing, not building or pushing deps around :)
>
> Agreed. Which branch would this go into? I would like to pitch in and start contributing as well.
>
> > Cheers,
> >
> > > just a thought - while we are talking about package upgrades here, I see that the current build system uses ant/build.xml. Would there be any interest in moving towards a maven-ized build, to make upgrades and test upgrades a bit simpler?
> > >
> > > On Mon, Aug 22, 2011 at 11:39 PM, Markus Jelsma (JIRA) j...@apache.org wrote:
> > >
> > > > [ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088871#comment-13088871 ]
> > > >
> > > > Markus Jelsma commented on NUTCH-1086:
> > > >
> > > > Preferably the 4.1.x version. Nutch still uses the deprecated 3.x and there are a lot of issues to be resolved such as HTTPS support.
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089466#comment-13089466 ]

Oleg Kalnichevski commented on NUTCH-1086:

The 4.1.3 release of HttpCore patched a regression affecting non-blocking (NIO) SSL transports only. There have been no changes between the 4.1.2 and 4.1.3 releases in blocking transport components relevant for HttpClient.

Please let me know if you need any help migrating off HttpClient 3.1 to HttpClient 4.1.x.

Oleg
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089503#comment-13089503 ]

Aravind Srini commented on NUTCH-1086:

Thanks, Oleg, for pitching in and confirming. Meanwhile, SOLR-2727 has been logged independently to upgrade Solr to the httpclient 4.x codeline.
[jira] [Created] (NUTCH-1086) Rewrite protocol-httpclient
Rewrite protocol-httpclient
---

Key: NUTCH-1086
URL: https://issues.apache.org/jira/browse/NUTCH-1086
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Markus Jelsma

There are several issues about protocol-httpclient and several comments about rewriting the plugin with the new http client libraries. There is, however, not yet an issue for rewriting/reimplementing protocol-httpclient.

http://hc.apache.org/httpcomponents-client-ga/
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088850#comment-13088850 ]

Aravind Srini commented on NUTCH-1086:

Are we talking about httpclient 4.0.1?
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088871#comment-13088871 ]

Markus Jelsma commented on NUTCH-1086:

Preferably the 4.1.x version. Nutch still uses the deprecated 3.x and there are a lot of issues to be resolved such as HTTPS support.
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088875#comment-13088875 ]

Ken Krugler commented on NUTCH-1086:

For what it's worth, there's a SimpleHttpFetcher in crawler-commons that uses HttpClient 4.1.
Re: Rewrite protocol-httpclient
Hi,

Please see Julien's comment in this recent thread: Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

To be short: no. The bulk of the work is code and manual testing, not building or pushing deps around :)

Cheers,

> just a thought - while we are talking about package upgrades here, I see that the current build system uses ant/build.xml. Would there be any interest in moving towards a maven-ized build, to make upgrades and test upgrades a bit simpler?
>
> On Mon, Aug 22, 2011 at 11:39 PM, Markus Jelsma (JIRA) j...@apache.org wrote:
>
> > [ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088871#comment-13088871 ]
> >
> > Markus Jelsma commented on NUTCH-1086:
> >
> > Preferably the 4.1.x version. Nutch still uses the deprecated 3.x and there are a lot of issues to be resolved such as HTTPS support.
Re: Rewrite protocol-httpclient
On Tue, Aug 23, 2011 at 12:28 AM, Markus Jelsma markus.jel...@openindex.io wrote:

> Hi,
>
> Please see Julien's comment in this recent thread: Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]
>
> To be short: no. The bulk of the work is code and manual testing, not building or pushing deps around :)

Agreed. Which branch would this go into? I would like to pitch in and start contributing as well.

> Cheers,
>
> > just a thought - while we are talking about package upgrades here, I see that the current build system uses ant/build.xml. Would there be any interest in moving towards a maven-ized build, to make upgrades and test upgrades a bit simpler?
> >
> > On Mon, Aug 22, 2011 at 11:39 PM, Markus Jelsma (JIRA) j...@apache.org wrote:
> >
> > > [ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088871#comment-13088871 ]
> > >
> > > Markus Jelsma commented on NUTCH-1086:
> > >
> > > Preferably the 4.1.x version. Nutch still uses the deprecated 3.x and there are a lot of issues to be resolved such as HTTPS support.