On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal <susam....@gmail.com> wrote:
> On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
> <graziano.alibe...@eng.it> wrote:
>> On 13/03/2010 22:55, Susam Pal wrote:
>>>
>>> On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal<susam....@gmail.com>  wrote:
>>>
>>>>
>>>> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
>>>> <graziano.alibe...@eng.it>  wrote:
>>>>
>>>>>
>>>>> On 11/03/2010 16:20, Susam Pal wrote:
>>>>>
>>>>>>
>>>>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>>>>> <graziano.alibe...@eng.it>    wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When
>>>>>>> I try to fetch my website list, the log file shows that the
>>>>>>> authentication failed.
>>>>>>>
>>>>>>> I've configured my nutch-site.xml file with all the properties needed
>>>>>>> for proxy auth, but my error is "httpclient.HttpMethodDirector - No
>>>>>>> credentials available for BASIC 'Squid proxy-caching web
>>>>>>> server'@proxy.my.host:my.port"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>>>>> value for 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>>>>
>>>>>> Regards,
>>>>>> Susam Pal
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Hi Susam,
>>>>>
>>>>> yes of course!! :) Maybe I can post you the configuration file:
>>>>>
>>>>> <?xml version="1.0"?>
>>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>>
>>>>> <!-- Put site-specific property overrides in this file. -->
>>>>>
>>>>> <configuration>
>>>>>
>>>>> <property>
>>>>> <name>http.agent.name</name>
>>>>> <value>my.agent.name</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>plugin.includes</name>
>>>>>
>>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.auth.file</name>
>>>>> <value>my_file.xml</value>
>>>>> <description>Authentication configuration file for
>>>>>  'protocol-httpclient' plugin.
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.proxy.host</name>
>>>>> <value>ip.my.proxy</value>
>>>>> <description>The proxy hostname.  If empty, no proxy is
>>>>> used.</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.proxy.port</name>
>>>>> <value>my.port</value>
>>>>> <description>The proxy port.</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.proxy.username</name>
>>>>> <value>my.user</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.proxy.password</name>
>>>>> <value>my.pwd</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.proxy.realm</name>
>>>>> <value>my_realm</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.agent.host</name>
>>>>> <value>my.local.pc</value>
>>>>> <description>The agent host.</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>http.useHttp11</name>
>>>>> <value>true</value>
>>>>> <description>
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> </configuration>
>>>>>
>>>>> Just one more question: where must I put the user authentication
>>>>> parameters (user, pwd)? In the nutch-site.xml file or in my_file.xml,
>>>>> which I use for authentication?
>>>>>
>>>>> Thank you for your attention,
>>>>>
>>>>>
>>>>> --
>>>>> -----------
>>>>>
>>>>> Graziano Aliberti
>>>>>
>>>>> Engineering Ingegneria Informatica S.p.A
>>>>>
>>>>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>>>>
>>>>> *Tel.:* 06.49.201.387
>>>>>
>>>>> *E-Mail:* graziano.alibe...@eng.it
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> The configuration looks okay to me. Yes, the proxy authentication
>>>> details are set in 'conf/nutch-site.xml'. The file mentioned in
>>>> 'http.auth.file' property is used for configuring authentication
>>>> details for authenticating to a web server.
>>>>
>>>> Unfortunately, there aren't any log statements in the part of the code
>>>> that reads the proxy authentication details, so turning on debug logs
>>>> won't give any clues about the issue. However, in
>>>> case you want to troubleshoot it yourself by building Nutch from
>>>> source, I can point you to the code that deals with this.
>>>>
>>>> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
>>>>
>>>> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>>>>
>>>> The line number is: 200.
>>>>
>>>> If I get time this weekend, I will try to insert some log statements
>>>> into this code and send a modified JAR file to you which might help
>>>> you to troubleshoot what is going on. But I can't promise this since
>>>> it depends on my weekend plans.
>>>>
>>>> Two questions before I end this mail. Did you set the value of
>>>> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>>>>
>>>> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
>>>> file? I'm not sure whether this line should appear for proxy
>>>> authentication but it does appear for web server authentication.
>>>>
>>>> Regards,
>>>> Susam Pal
>>>>
>>>>
>>>
>>> I managed to find some time to insert more logs into
>>> protocol-httpclient and create a JAR. I have attached it with this
>>> email.
>>>
>>> Please replace your
>>> 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
>>> one that I have attached. Also, edit your 'conf/log4j.properties' file
>>> to add these two lines:
>>>
>>> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>>> log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
>>>
>>> When you run a crawl now, you should see more logs in
>>> 'logs/hadoop.log' than before. I hope it helps you in providing some
>>> clues. In case you want to compare the logs with how the control flows
>>> from the source code, I have attached the JAVA file as well.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>
>> Hi Susam,
>>
>> first of all I want to thank you for your support :). I've tried your
>> solution and I've seen in the log file that the authentication
>> parameters were correctly read by the application.
>>
>> In the log file I found these lines about auth.AuthChallengeProcessor:
>>
>>
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for
>> ntlm authentication scheme not available
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for
>> digest authentication scheme not available
>> 2010-03-15 09:52:33,140 INFO  auth.AuthChallengeProcessor - basic
>> authentication scheme selected
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using
>> authentication scheme: basic
>> 2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization
>> challenge processed
>> 2010-03-15 09:52:33,140 INFO  httpclient.HttpMethodDirector - No credentials
>> available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
>>
>
> 'Squid proxy-caching web server'@my.proxy:my.port - should be the
> authentication details mentioned in the proxy configuration.
>
> It means that the 'http.proxy.realm' should be specified as: Squid
> proxy-caching web server
>
> You can also try omitting the value for 'http.proxy.realm' property.
>
> I also wanted to confirm whether you got the following line in
> 'logs/hadoop.log':
>
> Custom logs for troubleshooting authentication (set 4)
>
> If you have got this line and your configuration is correct, I don't
> see a reason why AuthChallengeProcessor should complain about missing
> credentials. It could be a bug either in Nutch or in the Jakarta
> Commons HttpClient library which is being used in Nutch to do the
> authentication. It could also be a mistake in the configuration.
>
> In case you find a way to resolve it, please let us know what the
> problem was and how you resolved it.
>
> Regards,
> Susam Pal
>

Here is an update. It is most likely a configuration problem. I just
tested the proxy authentication feature with a Squid proxy server with
the same realm as yours. It works well.

The issue you are facing seems to be due to an incorrectly specified
realm. I would suggest that you omit the realm and see if it works
fine. When you omit the realm, the corresponding XML code for the
configuration should look like this:

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description></description>
</property>

In case you do want to specify the realm, your XML code should look like this:

<property>
  <name>http.proxy.realm</name>
  <value>Squid proxy-caching web server</value>
  <description></description>
</property>

Note that this is the exact string appearing in the log message:

INFO  httpclient.HttpMethodDirector - No credentials available for
BASIC 'Squid proxy-caching web server'@my.proxy:my.port

There should be no quotes around the string in the configuration; the
single quotes in the log message are not part of the realm.
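
The realm comparison described above can also be checked mechanically. The
following is only a rough sketch in Python, not part of Nutch: the function
names are made up for illustration, and it assumes (as the suggestion to omit
the realm implies) that an empty 'http.proxy.realm' puts no restriction on
the realm. It pulls the realm out of the "No credentials available" message
and compares it with the value in nutch-site.xml.

```python
import re
import xml.etree.ElementTree as ET

# Matches the HttpMethodDirector message quoted above. The realm is the text
# between the single quotes, followed by @host:port.
NO_CREDENTIALS = re.compile(r"No credentials available for\s+(\w+)\s+'([^']*)'@(\S+)")


def realm_from_log(line):
    """Return the realm named in a 'No credentials available' line, or None."""
    match = NO_CREDENTIALS.search(line)
    return match.group(2) if match else None


def configured_realm(site_xml):
    """Return the http.proxy.realm value from nutch-site.xml text ('' if unset)."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "http.proxy.realm":
            return prop.findtext("value") or ""
    return ""


def realm_ok(site_xml, log_line):
    """True if the configured realm is empty or equals the realm in the log.

    Assumption: an empty realm is treated as "any realm".
    """
    configured = configured_realm(site_xml)
    return configured == "" or configured == realm_from_log(log_line)
```

For example, passing the contents of conf/nutch-site.xml and the log line
above to realm_ok() returns False when the value is wrapped in quotes or
differs from the logged realm in any character.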

If everything goes fine, the logs should appear like the following.
These logs are from my system.

2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest,
basic]
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge
for ntlm authentication scheme not available
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge
for digest authentication scheme not available
2010-03-16 02:45:28,280 INFO  auth.AuthChallengeProcessor - basic
authentication scheme selected
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: basic
2010-03-16 02:45:28,281 DEBUG auth.AuthChallengeProcessor -
Authorization challenge processed
2010-03-16 02:45:28,282 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest,
basic]
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge
for ntlm authentication scheme not available
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge
for digest authentication scheme not available
2010-03-16 02:45:28,283 INFO  auth.AuthChallengeProcessor - basic
authentication scheme selected
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: basic
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor -
Authorization challenge processed
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,286 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,286 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest,
basic]
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge
for ntlm authentication scheme not available
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge
for digest authentication scheme not available
2010-03-16 02:45:28,287 INFO  auth.AuthChallengeProcessor - basic
authentication scheme selected
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: basic
2010-03-16 02:45:28,288 DEBUG auth.AuthChallengeProcessor -
Authorization challenge processed
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,289 DEBUG auth.BasicScheme - enter
BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,330 DEBUG httpclient.Http - url:
http://en.wikipedia.org/robots.txt; status code: 200; bytes received:
4853; Content-Length: 4853; Content-Encoding: gzip; extracted to 26147
bytes
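
To compare your own log against the one above without reading it line by
line, something like the following sketch may help. It is only an
illustration in Python (the function name is made up), and it relies on two
markers taken from the logs in this thread: the "No credentials available
for" message in the failing case, and the "auth.BasicScheme - enter" lines
that show up only when credentials are actually used.

```python
def summarize_auth(log_text):
    """Count credential failures and BasicScheme calls in hadoop.log text.

    Returns a (failures, authenticated) tuple. Both markers are simply the
    strings seen in the logs quoted in this thread.
    """
    failures = 0
    authenticated = 0
    for line in log_text.splitlines():
        if "No credentials available for" in line:
            failures += 1
        elif "auth.BasicScheme - enter" in line:
            authenticated += 1
    return failures, authenticated


if __name__ == "__main__":
    # Path as used in this thread; adjust if your logs live elsewhere.
    with open("logs/hadoop.log", encoding="utf-8", errors="replace") as log:
        failed, ok = summarize_auth(log.read())
    print("missing-credentials lines:", failed)
    print("BasicScheme.authenticate lines:", ok)
```

A working setup should report zero missing-credentials lines and at least
one BasicScheme.authenticate line.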

I hope this helps.

If you still face issues, please send me the complete log file
(logs/hadoop.log) and the complete configuration file
(conf/nutch-site.xml). It is easier to spot configuration mistakes if
you send the complete files. Please remove the existing hadoop.log
file before starting a new crawl so that the log file you send
isn't too large.

Regards,
Susam Pal
