On 13/03/2010 22:55, Susam Pal wrote:
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal<susam....@gmail.com> wrote:
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
<graziano.alibe...@eng.it> wrote:
On 11/03/2010 16:20, Susam Pal wrote:
On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
<graziano.alibe...@eng.it> wrote:
Hi everyone,
I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
try to fetch my website list, the log file shows that authentication
failed.
I've configured my nutch-site.xml file with all the properties needed
for proxy authentication, but I get this error:
"httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port"
Did you replace 'protocol-http' with 'protocol-httpclient' in the
value for 'plugin.includes' property in 'conf/nutch-site.xml'?
Regards,
Susam Pal
Hi Susam,
yes, of course! :) Here is my configuration file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my.agent.name</value>
    <description></description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description></description>
  </property>
  <property>
    <name>http.auth.file</name>
    <value>my_file.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>ip.my.proxy</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>my.port</value>
    <description>The proxy port.</description>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>my.user</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>my.pwd</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.realm</name>
    <value>my_realm</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>my.local.pc</value>
    <description>The agent host.</description>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description></description>
  </property>
</configuration>
One more question: where must I put the user authentication parameters
(user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I
use for authentication?
Thank you for your attention,
--
-----------
Graziano Aliberti
Engineering Ingegneria Informatica S.p.A
Via S. Martino della Battaglia, 56 - 00185 ROMA
*Tel.:* 06.49.201.387
*E-Mail:* graziano.alibe...@eng.it
The configuration looks okay to me. Yes, the proxy authentication
details are set in 'conf/nutch-site.xml'. The file mentioned in
'http.auth.file' property is used for configuring authentication
details for authenticating to a web server.
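For reference, here is a minimal sketch of what the file named in
'http.auth.file' might contain for web server authentication. The
element names follow the sample 'conf/httpclient-auth.xml' as I
remember it from the Nutch distribution, so please check them against
your own copy. The user name, password, host, port and realm are
placeholders, not values from this thread.

```xml
<?xml version="1.0"?>
<!-- Sketch only: credentials for authenticating to a WEB SERVER, not to the proxy. -->
<auth-configuration>
  <!-- Placeholder user name and password. -->
  <credentials username="web.user" password="web.pwd">
    <!-- Placeholder scope: the host, port and realm these credentials apply to. -->
    <authscope host="www.example.com" port="80" realm="example_realm"/>
  </credentials>
</auth-configuration>
```

The proxy user name and password do not go in this file; they stay in
'conf/nutch-site.xml' as 'http.proxy.username' and
'http.proxy.password'.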
Unfortunately, there aren't any log statements in the part of the code
that reads the proxy authentication details, so I can't suggest turning
on debug logs to get some clues about the issue. However, in case you
want to troubleshoot it yourself by building Nutch from source, I can
point you to the code that deals with this.
The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
The line number is: 200.
If I get time this weekend, I will try to insert some log statements
into this code and send a modified JAR file to you which might help
you to troubleshoot what is going on. But I can't promise this since
it depends on my weekend plans.
Two questions before I end this mail. Did you set the value of
'http.proxy.realm' property as: Squid proxy-caching web server ?
Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
file? I'm not sure whether this line should appear for proxy
authentication but it does appear for web server authentication.
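To make the first question concrete, here is a sketch of how the realm
property would look if it were set to the realm string from your error
message. It assumes that the realm configured in Nutch has to match the
realm in the proxy's challenge. That is a guess to be verified, not
something confirmed so far.

```xml
<property>
  <name>http.proxy.realm</name>
  <!-- Realm string copied from the "No credentials available for BASIC ..." log line. -->
  <value>Squid proxy-caching web server</value>
  <description>Realm of the proxy authentication challenge (assumed to need an exact match).</description>
</property>
```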
Regards,
Susam Pal
I managed to find some time to insert more logs into
protocol-httpclient and create a JAR. I have attached it to this
email.
Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
When you run a crawl now, you should see more logs in
'logs/hadoop.log' than before. I hope they give you some clues. In
case you want to compare the logs with the control flow in the source
code, I have attached the Java file as well.
Regards,
Susam Pal
Hi Susam,
first of all, I want to thank you for your support :). I tried your
solution, and the log file shows that the authentication parameters
were read correctly by the application.
In the log file I found these lines about auth.AuthChallengeProcessor:
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-15 09:52:33,140 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-15 09:52:33,140 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
Best regards,
--
-----------
Graziano Aliberti