The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

------------------------------------------------------------------------------
  === Important Points ===
   1. For the <authscope> tag, the 'host' and 'port' attributes should always 
be specified. The 'realm' and 'scheme' attributes may or may not be specified, 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That is what 
the 'default' tag is meant for.
   1. One authentication scope should not be defined twice, as separate 
<authscope> tags under different <credentials> tags. If this is done by 
mistake, the credentials for the last defined <authscope> tag are used. 
This is because the XML parsing code reads the file from top to bottom and 
sets the credentials for each authentication scope as it goes. If the same 
authentication scope is encountered again, it is overwritten with the new 
credentials. However, do not rely on this behavior, as it might change 
with further development.
-  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This can means 
there should not be multiple tags with same host, port, scheme="NTLM" but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with same 
host, port but different realms. This is discussed more in the next section.
+  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This means there 
should not be multiple tags with same host, port, scheme="NTLM" but different 
realms. If you are omitting the scheme attribute and the server requires NTLM 
authentication, then there should not be multiple tags with same host, port but 
different realms. This is discussed more in the next section.
   1. If you are using NTLM scheme, you should also set the 'http.agent.host' 
property in conf/nutch-site.xml
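
  The points above can be illustrated with a minimal sketch of 
'conf/httpclient-auth.xml'. It is only an illustration: the root element name 
is assumed to match the template shipped with the plugin, and every host, 
port, realm, user name and password in it is a made-up placeholder.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/httpclient-auth.xml. All hosts, ports, realms, user names
     and passwords below are hypothetical placeholders. -->
<auth-configuration>

  <!-- Credentials tied to specific scopes: 'host' and 'port' are always
       given; 'realm' and 'scheme' are optional. -->
  <credentials username="crawler" password="secret">
    <authscope host="intranet.example.com" port="80" realm="docs"/>
    <authscope host="intranet.example.com" port="8080" scheme="basic"/>
  </credentials>

  <!-- Credentials for any host and any port: use <default/> rather than
       an <authscope> without 'host' and 'port'. -->
  <credentials username="fallback" password="secret2">
    <default/>
  </credentials>

</auth-configuration>
```

  Each authentication scope appears only once in this sketch, so the result 
does not depend on the order in which the file is parsed.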
  
  === A note on NTLM domains ===
@@ -104, +104 @@

   1. Do you see Nutch trying to fetch the pages you were expecting in 
'logs/hadoop.log'? You should see log lines like "fetching 
http://www.example.com/expectedpage.html", where the URL is the page you were 
expecting to be fetched. If you don't see such lines for the pages you were 
expecting, the problem lies outside the scope of this feature, which comes 
into play only when the crawler fetches a page that requires 
authentication.
   1. With debug logs enabled, check whether there are log lines beginning 
with "Credentials" in 'logs/hadoop.log'. The lines look like "Credentials - 
username someuser; set ...". For every entry in 'conf/httpclient-auth.xml' you 
should find a corresponding log line. If they are absent, you probably haven't 
included 'plugin.includes'. If you patched the Nutch 0.9 source code manually, 
this can also happen when the project has not been rebuilt after patching.
   1. Do you see logs like this: "auth.!AuthChallengeProcessor - basic 
authentication scheme selected"? Instead of the word 'basic', you might see 
'digest' or 'NTLM', depending on the scheme supported by the page being 
fetched. If you do not see such a line at all, the web server or the page 
being fetched probably does not require authentication, in which case the 
crawler does not try to authenticate. If you were expecting authentication for 
the page, something probably needs to be fixed on the server side.
-  1. You should also see some logs that begin with: "Pre-configured 
credentials with scope". It is very unlikely that this should happen after you 
have ensured all the above points. If it happens, please let us know in the 
mailing list.
  
  If you have checked the items listed above and are still unable to fix 
the problem, or are confused about any of the points, please mail the issue 
with the following information:
  
   1. Version of Nutch you are running.
-  1. Did you get this feature directly from subversion or did you download the 
patch separately and apply?
+  1. Complete contents of the 'conf/httpclient-auth.xml' file.
   1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send 
the complete file.
  
