On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
<graziano.alibe...@eng.it> wrote:
> On 13/03/2010 22:55, Susam Pal wrote:
>>
>> On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal<susam....@gmail.com>  wrote:
>>
>>>
>>> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
>>> <graziano.alibe...@eng.it>  wrote:
>>>
>>>>
>>>> On 11/03/2010 16:20, Susam Pal wrote:
>>>>
>>>>>
>>>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>>>> <graziano.alibe...@eng.it>    wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy.
>>>>>> When I try to fetch my website list, the log file shows that
>>>>>> authentication failed.
>>>>>>
>>>>>> I've configured my nutch-site.xml file with all the properties
>>>>>> needed for proxy authentication, but I get this error:
>>>>>> "httpclient.HttpMethodDirector - No credentials available for
>>>>>> BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port"
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>>>> value of the 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Hi Susam,
>>>>
>>>> Yes, of course! :) Here is my configuration file:
>>>>
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>>
>>>> <configuration>
>>>>
>>>> <property>
>>>> <name>http.agent.name</name>
>>>> <value>my.agent.name</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>plugin.includes</name>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.auth.file</name>
>>>> <value>my_file.xml</value>
>>>> <description>Authentication configuration file for
>>>>  'protocol-httpclient' plugin.
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.proxy.host</name>
>>>> <value>ip.my.proxy</value>
>>>> <description>The proxy hostname.  If empty, no proxy is
>>>> used.</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.proxy.port</name>
>>>> <value>my.port</value>
>>>> <description>The proxy port.</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.proxy.username</name>
>>>> <value>my.user</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.proxy.password</name>
>>>> <value>my.pwd</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.proxy.realm</name>
>>>> <value>my_realm</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.agent.host</name>
>>>> <value>my.local.pc</value>
>>>> <description>The agent host.</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.useHttp11</name>
>>>> <value>true</value>
>>>> <description>
>>>> </description>
>>>> </property>
>>>>
>>>> </configuration>
>>>>
>>>> One more question: where should I put the user authentication
>>>> parameters (user, pwd)? In the nutch-site.xml file, or in the
>>>> my_file.xml that I use for authentication?
>>>>
>>>> Thank you for your attention,
>>>>
>>>>
>>>> --
>>>> -----------
>>>>
>>>> Graziano Aliberti
>>>>
>>>> Engineering Ingegneria Informatica S.p.A
>>>>
>>>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>>>
>>>> *Tel.:* 06.49.201.387
>>>>
>>>> *E-Mail:* graziano.alibe...@eng.it
>>>>
>>>>
>>>>
>>>>
>>>
>>> The configuration looks okay to me. Yes, the proxy authentication
>>> details are set in 'conf/nutch-site.xml'. The file mentioned in
>>> 'http.auth.file' property is used for configuring authentication
>>> details for authenticating to a web server.
>>>
>>> Unfortunately, there aren't any log statements in the part of the code
>>> that reads the proxy authentication details, so I can't suggest
>>> turning on debug logs to get some clues about the issue. However, in
>>> case you want to troubleshoot it yourself by building Nutch from
>>> source, I can point you to the code that deals with this.
>>>
>>> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
>>>
>>> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>>>
>>> The line number is: 200.
>>>
>>> If I get time this weekend, I will try to insert some log statements
>>> into this code and send a modified JAR file to you which might help
>>> you to troubleshoot what is going on. But I can't promise this since
>>> it depends on my weekend plans.
>>>
>>> Two questions before I end this mail. Did you set the value of
>>> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>>>
>>> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
>>> file? I'm not sure whether this line should appear for proxy
>>> authentication but it does appear for web server authentication.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>>
>>
>> I managed to find some time to insert more log statements into
>> protocol-httpclient and build a JAR. I have attached it to this
>> email.
>>
>> Please replace your
>> 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
>> one that I have attached. Also, edit your 'conf/log4j.properties' file
>> to add these two lines:
>>
>> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>> log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
>>
>> When you run a crawl now, you should see more log output in
>> 'logs/hadoop.log' than before. I hope it gives you some clues.
>> In case you want to compare the logs against the control flow in
>> the source code, I have attached the Java file as well.
>>
>> Regards,
>> Susam Pal
>>
>
> Hi Susam,
>
> First of all, I want to thank you for your support :). I've tried your
> solution, and I can see in the log file that the authentication
> parameters were correctly read by the application.
>
> In the log file I found these lines about auth.AuthChallengeProcessor:
>
>
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
> 2010-03-15 09:52:33,140 INFO  auth.AuthChallengeProcessor - basic authentication scheme selected
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> 2010-03-15 09:52:33,140 INFO  httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
>

'Squid proxy-caching web server'@my.proxy:my.port gives the realm, host,
and port for which credentials are being looked up. These should match
the proxy authentication details in your configuration.

This means that the 'http.proxy.realm' property should be set to: Squid
proxy-caching web server

You can also try omitting the value for 'http.proxy.realm' property.
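
As a rough sketch, the proxy section of 'conf/nutch-site.xml' with the
realm set this way might look like the fragment below. The host, port,
user, and password values are just the placeholders from your mail, and
I am assuming the realm has to match the string in the log line exactly:

```xml
<configuration>

<!-- Proxy host and port (placeholders from this thread). -->
<property>
  <name>http.proxy.host</name>
  <value>ip.my.proxy</value>
</property>

<property>
  <name>http.proxy.port</name>
  <value>my.port</value>
</property>

<!-- Proxy credentials (placeholders from this thread). -->
<property>
  <name>http.proxy.username</name>
  <value>my.user</value>
</property>

<property>
  <name>http.proxy.password</name>
  <value>my.pwd</value>
</property>

<!-- Realm copied from the "No credentials available for BASIC ..."
     log line; assumed to require an exact match. -->
<property>
  <name>http.proxy.realm</name>
  <value>Squid proxy-caching web server</value>
</property>

</configuration>
```

To try the other option, leave the 'http.proxy.realm' value empty and
keep the rest unchanged.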

I also wanted to confirm whether you got the following line in
'logs/hadoop.log':

Custom logs for troubleshooting authentication (set 4)

If you have got this line and your configuration is correct, I don't
see a reason why AuthChallengeProcessor should complain about missing
credentials. It could be a bug either in Nutch or in the Jakarta
Commons HttpClient library which is being used in Nutch to do the
authentication. It could also be a mistake in the configuration.

If you find a way to resolve it, please let us know what the
problem was and how you resolved it.

Regards,
Susam Pal
