On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti <graziano.alibe...@eng.it> wrote:
> On 13/03/2010 22.55, Susam Pal wrote:
>> On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam....@gmail.com> wrote:
>>> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
>>> <graziano.alibe...@eng.it> wrote:
>>>> On 11/03/2010 16.20, Susam Pal wrote:
>>>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>>>> <graziano.alibe...@eng.it> wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm trying to use nutch ver. 1.0 on a system under squid proxy
>>>>>> control. When I try to fetch my website list, I see in the log
>>>>>> file that the authentication failed...
>>>>>>
>>>>>> I've configured my nutch-site.xml file with all the properties
>>>>>> needed for proxy auth, but my error is
>>>>>> "httpclient.HttpMethodDirector - No credentials available for
>>>>>> BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port"
>>>>>
>>>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>>>> value for 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>
>>>> Hi Susam,
>>>>
>>>> yes of course!! :) Maybe I can post you the configuration file:
>>>>
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>>
>>>> <configuration>
>>>>
>>>> <property>
>>>>   <name>http.agent.name</name>
>>>>   <value>my.agent.name</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.auth.file</name>
>>>>   <value>my_file.xml</value>
>>>>   <description>Authentication configuration file for
>>>>   'protocol-httpclient' plugin.
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.proxy.host</name>
>>>>   <value>ip.my.proxy</value>
>>>>   <description>The proxy hostname. If empty, no proxy is
>>>>   used.</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.proxy.port</name>
>>>>   <value>my.port</value>
>>>>   <description>The proxy port.</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.proxy.username</name>
>>>>   <value>my.user</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.proxy.password</name>
>>>>   <value>my.pwd</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.proxy.realm</name>
>>>>   <value>my_realm</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.agent.host</name>
>>>>   <value>my.local.pc</value>
>>>>   <description>The agent host.</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>http.useHttp11</name>
>>>>   <value>true</value>
>>>>   <description>
>>>>   </description>
>>>> </property>
>>>>
>>>> </configuration>
>>>>
>>>> Only another question: where must I put the user authentication
>>>> parameters (user, pwd)? In the nutch-site.xml file or in the
>>>> my_file.xml that I use for authentication?
>>>>
>>>> Thank you for your attention,
>>>>
>>>> --
>>>> -----------
>>>> Graziano Aliberti
>>>> Engineering Ingegneria Informatica S.p.A
>>>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>>> *Tel.:* 06.49.201.387
>>>> *E-Mail:* graziano.alibe...@eng.it
>>>
>>> The configuration looks okay to me. Yes, the proxy authentication
>>> details are set in 'conf/nutch-site.xml'. The file mentioned in
>>> 'http.auth.file' property is used for configuring authentication
>>> details for authenticating to a web server.
>>>
>>> Unfortunately, there aren't any log statements in the part of the code
>>> that reads the proxy authentication details. So, I can't suggest
>>> turning on debug logs to get some clues about the issue. However, in
>>> case you want to troubleshoot it yourself by building Nutch from
>>> source, I can tell you the code that deals with this.
>>>
>>> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
>>>
>>> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>>>
>>> The line number is: 200.
>>>
>>> If I get time this weekend, I will try to insert some log statements
>>> into this code and send a modified JAR file to you which might help
>>> you to troubleshoot what is going on. But I can't promise this since
>>> it depends on my weekend plans.
>>>
>>> Two questions before I end this mail. Did you set the value of
>>> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>>>
>>> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
>>> file? I'm not sure whether this line should appear for proxy
>>> authentication but it does appear for web server authentication.
>>>
>>> Regards,
>>> Susam Pal
>>
>> I managed to find some time to insert more logs into
>> protocol-httpclient and create a JAR. I have attached it with this
>> email.
>>
>> Please replace your
>> 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
>> one that I have attached. Also, edit your 'conf/log4j.properties' file
>> to add these two lines:
>>
>> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>> log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
>>
>> When you run a crawl now, you should see more logs in
>> 'logs/hadoop.log' than before. I hope it helps you in providing some
>> clues. In case you want to compare the logs with how the control flows
>> from the source code, I have attached the JAVA file as well.
>>
>> Regards,
>> Susam Pal
>
> Hi Susam,
>
> first of all I want to thank you for your support :). I've tried your
> solution and I've seen in the log file that the authentication
> parameters were correctly read by the application.
>
> In the log file I've found these lines about auth.AuthChallengeProcessor:
>
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
> 2010-03-15 09:52:33,140 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> 2010-03-15 09:52:33,140 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
>

In 'Squid proxy-caching web server'@my.proxy:my.port, the quoted part
is the realm announced by the proxy, followed by the proxy host and
port. These should be the authentication details mentioned in the proxy
configuration. It means that the 'http.proxy.realm' should be specified
as: Squid proxy-caching web server

You can also try omitting the value for 'http.proxy.realm' property.

I also wanted to confirm whether you got the following line in
'logs/hadoop.log':

Custom logs for troubleshooting authentication (set 4)

If you have got this line and your configuration is correct, I don't
see a reason why HttpMethodDirector should complain about missing
credentials. It could be a bug either in Nutch or in the Jakarta
Commons HttpClient library which is being used in Nutch to do the
authentication. It could also be a mistake in the configuration.

In case you find a way to resolve it, please let us know what the
problem was and how you resolved it.

Regards,
Susam Pal
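
P.S. For reference, a minimal sketch of how the proxy section of
'conf/nutch-site.xml' might look with the realm set to the text the
proxy reports in the log. The host, port, user name and password are
the placeholders from the configuration quoted above, not real values,
and it is only an assumption here that the realm has to match the
challenge text character for character.

```xml
<!-- Sketch only: placeholder values taken from the quoted configuration. -->
<property>
  <name>http.proxy.host</name>
  <value>ip.my.proxy</value>
</property>

<property>
  <name>http.proxy.port</name>
  <value>my.port</value>
</property>

<property>
  <name>http.proxy.username</name>
  <value>my.user</value>
</property>

<property>
  <name>http.proxy.password</name>
  <value>my.pwd</value>
</property>

<!-- Assumption: the realm is copied verbatim from the log line
     "No credentials available for BASIC 'Squid proxy-caching web server'@..." -->
<property>
  <name>http.proxy.realm</name>
  <value>Squid proxy-caching web server</value>
</property>
```

To try the other suggestion, leave the <value> of 'http.proxy.realm'
empty instead.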