I am surprised that #1 (default authentication scope) did not work.
You also mention that the user id needs additional permissions to
fetch the resource. In that case, it should require these permissions
even if you try to access the resource with a browser.

It would help if you could provide the logs for #1 as well as #2. It
would be interesting to see why the fetch fails for #1 but succeeds
for #2.

Regards,
Susam Pal

On Tue, Mar 31, 2009 at 11:01 PM, Austin, David <[email protected]> wrote:
> Hello again,
>
>
> Did you set the 'http.agent.host' in 'conf/nutch-site.xml' ?
> I didn't have it set, but have now set it.
> <property>
>  <name>http.agent.host</name>
>  <value>serverB.domain.com</value>
> </property>
>
> #1 didn't work.
>
> #2 ended up working. The user id still needs additional permissions, as
> we're seeing, but it's working nonetheless.
>
> -----Original Message-----
> From: Susam Pal [mailto:[email protected]]
> Sent: Tuesday, March 31, 2009 10:44 AM
> To: [email protected]
> Subject: Re: Nutch 1.0 - NTLM question
>
> Hi Austin,
>
> I read the logs and I went back to the code too
> <http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?revision=749247&view=markup>.
>
> However, I don't find anything unusual that could cause this
> authentication problem. I just want to check another point even though
> it is not very important. Did you set the 'http.agent.host' in
> 'conf/nutch-site.xml' ?
>
> I would like to know the following:
>
> 1. Whether this works:
>
> <credentials username="user" password="pass">
>  <default/>
> </credentials>
>
> 2. Whether this works:
>
>  <credentials username="user" password="pass">
>   <authscope host="server.domain.com" port="80"/>
>  </credentials>
>
> 3. Whether this works:
>
>  <credentials username="user" password="pass">
>   <authscope host="server.domain.com" port="80" scheme="NTLM"/>
>  </credentials>
>
> If possible, please provide me the relevant logs for each of these three 
> cases.
>
> Regards,
> Susam Pal
>
> On Tue, Mar 31, 2009 at 9:44 PM, Austin, David <[email protected]> 
> wrote:
>> Hi Susam,
>>
>> Thanks for your quick response.  I've gone through the "Need Help" section.  
>> Modified a few things accordingly.
>>
>> Turned on the debugging using:
>> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>>
>> I had missed the following in nutch-site.xml. I've since added it, and I 
>> now see it trying to authenticate.
>>  <property>
>>  <name>plugin.includes</name>
>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>  </property>
>>
>> In my logs, whether or not I have NTLM selected in httpclient-auth.xml, I 
>> see the following (note: I've tried domain\user and just user with the realm 
>> as the domain, and neither works):
>>
>> 2009-03-31 10:07:03,601 DEBUG httpclient.Http - Credentials - username: user; set for AuthScope - host: server.domain.com; port: 80; realm: domain; scheme:
>> 2009-03-31 10:07:03,648 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
>> 2009-03-31 10:07:03,648 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
>> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
>> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
>> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
>> 2009-03-31 10:07:03,680 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
>> 2009-03-31 10:07:03,680 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
>> 2009-03-31 10:07:03,695 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2009-03-31 10:07:03,758 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
>> 2009-03-31 10:07:04,070 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@server.domain.com:80
>> 2009-03-31 10:07:04,070 DEBUG httpclient.Http - url: http://server.domain.com/secured; status code: 401; bytes received: 24; Content-Length: 24
>> 2009-03-31 10:07:04,117 DEBUG httpclient.Http - 401 Authentication Required
>>
>> -----Original Message-----
>> From: Susam Pal [mailto:[email protected]]
>> Sent: Tuesday, March 31, 2009 9:58 AM
>> To: [email protected]
>> Subject: Re: Nutch 1.0 - NTLM question
>>
>> On Tue, Mar 31, 2009 at 9:03 PM, Austin, David <[email protected]> 
>> wrote:
>>> Got Nutch 1.0 set up fairly easily and even did a couple of crawls. Very
>>> pleased with the results so far. However, now I am trying to get the
>>> NTLM portion to work.
>>>
>>> Following the instructions here:
>>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>>
>>> My httpclient-auth.xml looks as follows:
>>>
>>> <auth-configuration>
>>>  <credentials username="user" password="pass">
>>>    <authscope host="server.domain.com" port="80" realm="domain.com" scheme="NTLM"/>
>>>  </credentials>
>>> </auth-configuration>
>>
>> Hi David,
>>
>> For troubleshooting, I would suggest that you start with the simplest
>> configuration for authentication. The simplest configuration contains
>> only the default authentication scope.
>>
>> <credentials username="susam" password="masus">
>>  <default/>
>> </credentials>
>>
>> This is discussed in
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes in section
>> "Crawling an Intranet with Default Authentication Scope". If this
>> doesn't work, please go to the "Need Help?" section in the same wiki
>> article, follow the checklist, and send us the relevant log files.
>> If this works, the authentication scope in your configuration is
>> probably not set up properly. You should also check that the server
>> indeed requires NTLM authentication and not Basic or Digest
>> authentication. The realm value is another thing that could go wrong.
>>
>>>
>>> Is this the correct setup for NTLM?  At present I'm only receiving 401s
>>> so it doesn't appear to be working in this setup.  Basic auth would look
>>> like "domain\user" if we were to login that way in case you're curious.
>>
>> I doubt that the value you have put in realm is correct. If you visit
>> the page you are trying to crawl using a browser, what credentials do
>> you enter? If you enter "domain\user" as the user name then only
>> "domain" should go as the value of realm. However, before configuring
>> authentication scope for NTLM scheme, I would suggest that you first
>> get the default authentication scope working and then proceed with the
>> configuration for NTLM authentication scheme.
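>>
>> As a rough sketch only, assuming the browser login is "domain\user"
>> and reusing the placeholder host, port, user name and password from
>> your message, the NTLM scope would then look something like this
>> (not tested against your server):
>>
>> <auth-configuration>
>>  <credentials username="user" password="pass">
>>   <authscope host="server.domain.com" port="80" realm="domain" scheme="NTLM"/>
>>  </credentials>
>> </auth-configuration>
>>
>> The only change from your original file is the realm value: "domain"
>> instead of "domain.com".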
>>
>>>
>>> I noticed that for 0.9 there were properties that had to be set up in
>>> nutch-site.xml; is that still the case?  Referring to this link:
>>> http://www.mail-archive.com/[email protected]/msg02102.html
>>> Here it looks like several http.auth.username and http.auth.password
>>> have to be set.  Based on what I read, though, that's not needed anymore
>>> in 1.0 based upon [NUTCH-559:
>>> https://issues.apache.org/jira/browse/NUTCH-559], correct?
>>
>> Yes, you are right. http.auth.username and http.auth.password are not
>> required. They were present during the development of this feature but
>> they were removed as the development progressed.
>>
>> Regards,
>> Susam Pal
>>
>
