On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal <susam....@gmail.com> wrote:
> On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
> <graziano.alibe...@eng.it> wrote:
>> On 13/03/2010 22:55, Susam Pal wrote:
>>> On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal<susam....@gmail.com> wrote:
>>>> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
>>>> <graziano.alibe...@eng.it> wrote:
>>>>> On 11/03/2010 16:20, Susam Pal wrote:
>>>>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>>>>> <graziano.alibe...@eng.it> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm trying to use nutch ver. 1.0 on a system under squid proxy
>>>>>>> control. When I try to fetch my website list, into the log file
>>>>>>> I see that the authentication was failed...
>>>>>>>
>>>>>>> I've configured my nutch-site.xml file with all that properties
>>>>>>> needed for proxy auth, but my error is
>>>>>>> "httpclient.HttpMethodDirector - No credentials available for
>>>>>>> BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port"
>>>>>>>
>>>>>>
>>>>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>>>>> value for 'plugins.include' property in 'conf/nutch-site.xml'?
>>>>>>
>>>>>> Regards,
>>>>>> Susam Pal
>>>>>>
>>>>>
>>>>> Hi Susam,
>>>>>
>>>>> yes of course!! :) Maybe I can post you the configuration file:
>>>>>
>>>>> <?xml version="1.0"?>
>>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>>
>>>>> <!-- Put site-specific property overrides in this file. -->
>>>>>
>>>>> <configuration>
>>>>>
>>>>>   <property>
>>>>>     <name>http.agent.name</name>
>>>>>     <value>my.agent.name</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>plugin.includes</name>
>>>>>     <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.auth.file</name>
>>>>>     <value>my_file.xml</value>
>>>>>     <description>Authentication configuration file for
>>>>>     'protocol-httpclient' plugin.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.proxy.host</name>
>>>>>     <value>ip.my.proxy</value>
>>>>>     <description>The proxy hostname. If empty, no proxy is
>>>>>     used.</description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.proxy.port</name>
>>>>>     <value>my.port</value>
>>>>>     <description>The proxy port.</description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.proxy.username</name>
>>>>>     <value>my.user</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.proxy.password</name>
>>>>>     <value>my.pwd</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.proxy.realm</name>
>>>>>     <value>my_realm</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.agent.host</name>
>>>>>     <value>my.local.pc</value>
>>>>>     <description>The agent host.</description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.useHttp11</name>
>>>>>     <value>true</value>
>>>>>     <description>
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>> </configuration>
>>>>>
>>>>> Only another question: where i must put the user authentication
>>>>> parameters (user,pwd)? In nutch-site.xml file or in my_file.xml
>>>>> that I use for authentication?
>>>>>
>>>>> Thank you for your attention,
>>>>>
>>>>> --
>>>>> -----------
>>>>> Graziano Aliberti
>>>>> Engineering Ingegneria Informatica S.p.A
>>>>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>>>> *Tel.:* 06.49.201.387
>>>>> *E-Mail:* graziano.alibe...@eng.it
>>>>>
>>>>
>>>> The configuration looks okay to me. Yes, the proxy authentication
>>>> details are set in 'conf/nutch-site.xml'. The file mentioned in
>>>> 'http.auth.file' property is used for configuring authentication
>>>> details for authenticating to a web server.
>>>>
>>>> Unfortunately, there aren't any log statements in the part of the
>>>> code that reads the proxy authentication details. So, I can't
>>>> suggest you to turn on debug logs to get some clues about the
>>>> issue. However, in case you want to troubleshoot it yourself by
>>>> building Nutch from source, I can tell you the code that deals
>>>> with this.
>>>>
>>>> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
>>>>
>>>> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>>>>
>>>> The line number is: 200.
>>>>
>>>> If I get time this weekend, I will try to insert some log
>>>> statements into this code and send a modified JAR file to you
>>>> which might help you to troubleshoot what is going on. But I can't
>>>> promise this since it depends on my weekend plans.
>>>>
>>>> Two questions before I end this mail. Did you set the value of
>>>> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>>>>
>>>> Also, do you see any 'auth.AuthChallengeProcessor' lines in the
>>>> log file? I'm not sure whether this line should appear for proxy
>>>> authentication but it does appear for web server authentication.
>>>>
>>>> Regards,
>>>> Susam Pal
>>>>
>>>
>>> I managed to find some time to insert more logs into
>>> protocol-httpclient and create a JAR. I have attached it with this
>>> email.
>>>
>>> Please replace your
>>> 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
>>> one that I have attached. Also, edit your 'conf/log4j.properties'
>>> file to add these two lines:
>>>
>>> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>>> log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
>>>
>>> When you run a crawl now, you should see more logs in
>>> 'logs/hadoop.log' than before. I hope it helps you in providing
>>> some clues. In case you want to compare the logs with how the
>>> control flows from the source code, I have attached the JAVA file
>>> as well.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>
>> Hi Susam,
>>
>> first of all I want to thank you for your support :). I've tried
>> your solution and I've seen in the log file the that the
>> authentication parameters was correctly read by the application.
>>
>> In the log file I've finded these lines about
>> auth.AuthChallengeProcessor:
>>
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
>> 2010-03-15 09:52:33,140 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
>> 2010-03-15 09:52:33,140 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
>>
>
> 'Squid proxy-caching web server'@my.proxy:my.port - should be the
> authentication details mentioned in the proxy configuration.
>
> It means that the 'http.proxy.realm' should be specified as: Squid
> proxy-caching web server
>
> You can also try omitting the value for 'http.proxy.realm' property.
>
> I was also wanted to confirm whether you got the following line in
> 'logs/hadoop.log':
>
> Custom logs for troubleshooting authentication (set 4)
>
> If you have got this line and your configuration is correct, I don't
> see a reason why AuthChallengeProcessor should complain about missing
> credentials. It could be a bug either in Nutch or in the Jakarta
> Commons HttpClient library which is being used in Nutch to do the
> authentication. It could also be a mistake in the configuration.
>
> In case, you find a way to resolve it, please let us know what the
> problem was and how you resolved it.
>
> Regards,
> Susam Pal
>

Here is an update. It is most likely a configuration problem. I just
tested the proxy authentication feature against a Squid proxy server
that uses the same realm as yours, and it works well. The issue you
are facing seems to be caused by an incorrectly specified realm.

I would suggest that you omit the realm and see if it works. When you
omit the realm, the corresponding XML in the configuration should look
like this:

    <property>
      <name>http.proxy.realm</name>
      <value></value>
      <description></description>
    </property>

In case you do want to specify the realm, your XML should look like
this:

    <property>
      <name>http.proxy.realm</name>
      <value>Squid proxy-caching web server</value>
      <description></description>
    </property>

Note that this is the exact string appearing in the log message:

    INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port

There should be no quotes around the string in the property value.

If everything goes fine, the logs should look like the following.
These logs are from my system.

2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,280 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,281 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,282 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,283 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,286 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,286 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,287 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,288 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,289 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,330 DEBUG httpclient.Http - url: http://en.wikipedia.org/robots.txt; status code: 200; bytes received: 4853; Content-Length: 4853; Content-Encoding: gzip; extracted to 26147 bytes

I hope this helps. If you still face issues, please send me the
complete log file (logs/hadoop.log) and the complete configuration
file (conf/nutch-site.xml). It is easier to spot configuration
mistakes with the complete files. Please remove the existing
hadoop.log file before starting a new crawl so that the log file you
send isn't too large.

Regards,
Susam Pal
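
P.S. In case it helps to check the realm mechanically, here is a
minimal Python sketch. It is not part of Nutch or HttpClient. It pulls
the realm out of a "No credentials available" log line and compares it
with the 'http.proxy.realm' value in the configuration. The function
names (challenged_realm, configured_realm, realm_ok) are hypothetical
ones made up for this example, and treating an empty realm as
"matches any realm" is an assumption based on the suggestion to omit
the realm, not something verified against the Nutch source.

```python
import re
import xml.etree.ElementTree as ET

# Matches the HttpClient message quoted in this thread, for example:
#   No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
CHALLENGE = re.compile(r"No credentials available for \w+ '(?P<realm>[^']*)'@")


def challenged_realm(log_text):
    """Return the realm from the first 'No credentials available' line, or None."""
    match = CHALLENGE.search(log_text)
    return match.group("realm") if match else None


def configured_realm(site_xml):
    """Return the http.proxy.realm value from nutch-site.xml text.

    Returns '' when the value is empty and None when the property is absent.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "http.proxy.realm":
            return prop.findtext("value") or ""
    return None


def realm_ok(log_text, site_xml):
    """True when the configured realm cannot be the cause of the log message."""
    wanted = challenged_realm(log_text)
    have = configured_realm(site_xml)
    # Assumption: an empty or missing realm is treated as "match any realm".
    return wanted is None or not have or have == wanted
```

To use it, read 'logs/hadoop.log' and 'conf/nutch-site.xml' as text and
pass both strings to realm_ok. A result of False means the configured
realm differs from the one the proxy announced, for example because of
stray quotes or a different string.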