On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam....@gmail.com> wrote:
> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
> <graziano.alibe...@eng.it> wrote:
>> On 11/03/2010 16:20, Susam Pal wrote:
>>>
>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>> <graziano.alibe...@eng.it>  wrote:
>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
>>>> try to fetch my website list, the log file shows that authentication
>>>> failed.
>>>>
>>>> I've configured my nutch-site.xml file with all the properties needed
>>>> for proxy auth, but my error is "httpclient.HttpMethodDirector - No
>>>> credentials available for BASIC 'Squid proxy-caching web
>>>> server'@proxy.my.host:my.port"
>>>>
>>>>
>>>
>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>> value for 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>
>>> Regards,
>>> Susam Pal
>>>
>>>
>>>
>>
>> Hi Susam,
>>
>> Yes, of course!! :) Here is my configuration file:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>> <name>http.agent.name</name>
>> <value>my.agent.name</value>
>> <description>
>> </description>
>> </property>
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> <description>
>> </description>
>> </property>
>>
>> <property>
>> <name>http.auth.file</name>
>> <value>my_file.xml</value>
>> <description>Authentication configuration file for
>>  'protocol-httpclient' plugin.
>> </description>
>> </property>
>>
>> <property>
>> <name>http.proxy.host</name>
>> <value>ip.my.proxy</value>
>> <description>The proxy hostname.  If empty, no proxy is used.</description>
>> </property>
>>
>> <property>
>> <name>http.proxy.port</name>
>> <value>my.port</value>
>> <description>The proxy port.</description>
>> </property>
>>
>> <property>
>> <name>http.proxy.username</name>
>> <value>my.user</value>
>> <description>
>> </description>
>> </property>
>>
>> <property>
>> <name>http.proxy.password</name>
>> <value>my.pwd</value>
>> <description>
>> </description>
>> </property>
>>
>> <property>
>> <name>http.proxy.realm</name>
>> <value>my_realm</value>
>> <description>
>> </description>
>> </property>
>>
>> <property>
>> <name>http.agent.host</name>
>> <value>my.local.pc</value>
>> <description>The agent host.</description>
>> </property>
>>
>> <property>
>> <name>http.useHttp11</name>
>> <value>true</value>
>> <description>
>> </description>
>> </property>
>>
>> </configuration>
>>
>> One more question: where must I put the user authentication parameters
>> (user, pwd)? In the nutch-site.xml file or in my_file.xml, which I use for
>> authentication?
>>
>> Thank you for your attention,
>>
>>
>> --
>> -----------
>>
>> Graziano Aliberti
>>
>> Engineering Ingegneria Informatica S.p.A
>>
>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>
>> *Tel.:* 06.49.201.387
>>
>> *E-Mail:* graziano.alibe...@eng.it
>>
>>
>>
>
> The configuration looks okay to me. Yes, the proxy authentication
> details are set in 'conf/nutch-site.xml'. The file mentioned in
> 'http.auth.file' property is used for configuring authentication
> details for authenticating to a web server.
>
> Unfortunately, there aren't any log statements in the part of the code
> that reads the proxy authentication details, so I can't suggest turning
> on debug logs to get some clues about the issue. However, in case you
> want to troubleshoot it yourself by building Nutch from source, I can
> point you to the code that deals with this.
>
> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>
> The line number is: 200.
>
> If I get time this weekend, I will try to insert some log statements
> into this code and send a modified JAR file to you which might help
> you to troubleshoot what is going on. But I can't promise this since
> it depends on my weekend plans.
>
> Two questions before I end this mail. Did you set the value of
> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>
> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
> file? I'm not sure whether this line should appear for proxy
> authentication but it does appear for web server authentication.
>
> Regards,
> Susam Pal
>

I managed to find some time to insert more log statements into
protocol-httpclient and build a JAR. I have attached it to this
email.

Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in
'logs/hadoop.log' than before. I hope they give you some clues. In
case you want to compare the logs with the control flow in the source
code, I have attached the Java file as well.
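
As an aside on the 'http.proxy.realm' question above, here is a rough
stand-alone sketch of why a mismatched realm could lead to "No
credentials available". It is only a model: ScopeSketch and its
matches() method are hypothetical names, not Commons HttpClient
classes, and the host name and port are made up. It imitates the
normalisation done by getAuthScope() in the attached file (an empty
value means "any") and then compares the scope with the realm that
Squid announces in its challenge.

```java
// Hypothetical model for illustration only; not Commons HttpClient code.
class ScopeSketch {
  final String host;   // null means "any host"
  final int port;      // -1 means "any port"
  final String realm;  // null means "any realm"

  ScopeSketch(String host, int port, String realm) {
    // Same idea as getAuthScope() in the attached file:
    // empty strings and negative ports are treated as wildcards.
    this.host = (host == null || host.isEmpty()) ? null : host;
    this.port = port < 0 ? -1 : port;
    this.realm = (realm == null || realm.isEmpty()) ? null : realm;
  }

  // True if credentials registered under this scope would apply
  // to a challenge coming from the given host, port and realm.
  boolean matches(String challengeHost, int challengePort,
      String challengeRealm) {
    return (host == null || host.equalsIgnoreCase(challengeHost))
        && (port == -1 || port == challengePort)
        && (realm == null || realm.equals(challengeRealm));
  }

  public static void main(String[] args) {
    String squidRealm = "Squid proxy-caching web server";

    // A realm that differs from the one in the challenge: no match.
    System.out.println(new ScopeSketch("proxy.example", 3128, "my_realm")
        .matches("proxy.example", 3128, squidRealm)); // false

    // An empty realm acts as a wildcard: match.
    System.out.println(new ScopeSketch("proxy.example", 3128, "")
        .matches("proxy.example", 3128, squidRealm)); // true

    // The exact realm string from the challenge: match.
    System.out.println(new ScopeSketch("proxy.example", 3128, squidRealm)
        .matches("proxy.example", 3128, squidRealm)); // true
  }
}
```

If this model reflects what is happening, either setting
'http.proxy.realm' to the exact realm string from the error message or
leaving it empty should let the credentials match. That is an
assumption to check against the new logs, not a confirmed diagnosis.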

Regards,
Susam Pal

Attachment: protocol-httpclient.jar
Description: application/java-archive

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.protocol.httpclient;

// JDK imports
import java.io.InputStream;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;

// Commons Logging imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// HTTP Client imports
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HostConfiguration;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.NTCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;
import org.apache.commons.httpclient.protocol.Protocol;

// Nutch imports
import org.apache.nutch.util.LogUtil;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.net.protocols.Response;
import org.apache.nutch.protocol.ProtocolException;
import org.apache.nutch.protocol.http.api.HttpBase;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

/**
 * This class is a protocol plugin that configures an HTTP client for
 * Basic, Digest and NTLM authentication schemes for web server as well
 * as proxy server. It takes care of HTTPS protocol as well as cookies
 * in a single fetch session.
 *
 * @author Susam Pal
 */
public class Http extends HttpBase {

  public static final Log LOG = LogFactory.getLog(Http.class);

  private static MultiThreadedHttpConnectionManager connectionManager =
          new MultiThreadedHttpConnectionManager();

  // The Configuration has not been set yet at this point,
  // so this client starts out unconfigured.
  private static HttpClient client = new HttpClient(connectionManager);
  private static String defaultUsername;
  private static String defaultPassword;
  private static String defaultRealm;
  private static String defaultScheme;
  private static String authFile;
  private static String agentHost;
  private static boolean authRulesRead = false;
  private static Configuration conf;

  int maxThreadsTotal = 10;

  private String proxyUsername;
  private String proxyPassword;
  private String proxyRealm;


  /**
   * Returns the configured HTTP client.
   *
   * @return HTTP client
   */
  static synchronized HttpClient getClient() {
    return client;
  }

  /**
   * Constructs this plugin.
   */
  public Http() {
    super(LOG);
  }

  /**
   * Reads the configuration from the Nutch configuration files and sets
   * the configuration.
   *
   * @param conf Configuration
   */
  public void setConf(Configuration conf) {
    super.setConf(conf);
    Http.conf = conf; // static field, shared by all instances
    this.maxThreadsTotal = conf.getInt("fetcher.threads.fetch", 10);
    this.proxyUsername = conf.get("http.proxy.username", "");
    this.proxyPassword = conf.get("http.proxy.password", "");
    this.proxyRealm = conf.get("http.proxy.realm", "");

    if (LOG.isTraceEnabled()) {
        LOG.trace("------------------------------------------------------");
        LOG.trace("Custom logs for troubleshooting authentication (set 1)");
        LOG.trace("Proxy host: " + this.proxyHost);
        LOG.trace("Proxy port: " + this.proxyPort);
        LOG.trace("useProxy: " + this.useProxy);
        LOG.trace("Proxy username: " + this.proxyUsername);
        LOG.trace("Proxy password: " + (this.proxyPassword.length() > 0 ?
                                        "password is present" :
                                        "password is absent"));
        LOG.trace("Proxy realm: " + this.proxyRealm);
        LOG.trace("------------------------------------------------------");
    }

    agentHost = conf.get("http.agent.host", "");
    authFile = conf.get("http.auth.file", "");
    configureClient();
    try {
      setCredentials();
    } catch (Exception ex) {
      if (LOG.isFatalEnabled()) {
        LOG.fatal("Could not read " + authFile + " : " + ex.getMessage());
        ex.printStackTrace(LogUtil.getErrorStream(LOG));
      }
    }
  }

  /**
   * Main method.
   *
   * @param args Command line arguments
   */
  public static void main(String[] args) throws Exception {
    Http http = new Http();
    http.setConf(NutchConfiguration.create());
    main(http, args);
  }

  /**
   * Fetches the <code>url</code> with a configured HTTP client and
   * gets the response.
   *
   * @param url       URL to be fetched
   * @param datum     Crawl data
   * @param redirect  Follow redirects if and only if true
   * @return          HTTP response
   */
  protected Response getResponse(URL url, CrawlDatum datum, boolean redirect)
    throws ProtocolException, IOException {
    resolveCredentials(url);
    return new HttpResponse(this, url, datum, redirect);
  }

  /**
   * Configures the HTTP client
   */
  private void configureClient() {

    // Set up an HTTPS socket factory that accepts self-signed certs.
    Protocol https = new Protocol("https",
        new DummySSLProtocolSocketFactory(), 443);
    Protocol.registerProtocol("https", https);

    HttpConnectionManagerParams params = connectionManager.getParams();
    params.setConnectionTimeout(timeout);
    params.setSoTimeout(timeout);
    params.setSendBufferSize(BUFFER_SIZE);
    params.setReceiveBufferSize(BUFFER_SIZE);
    params.setMaxTotalConnections(maxThreadsTotal);
    if (maxThreadsTotal > maxThreadsPerHost) {
      params.setDefaultMaxConnectionsPerHost(maxThreadsPerHost);
    } else {
      params.setDefaultMaxConnectionsPerHost(maxThreadsTotal);
    }

    // executeMethod(HttpMethod) seems to ignore the connection timeout on the connection manager.
    // set it explicitly on the HttpClient.
    client.getParams().setConnectionManagerTimeout(timeout);

    HostConfiguration hostConf = client.getHostConfiguration();
    ArrayList<Header> headers = new ArrayList<Header>();
    // Set the User Agent in the header
    headers.add(new Header("User-Agent", userAgent));
    // prefer English
    headers.add(new Header("Accept-Language", "en-us,en-gb,en;q=0.7,*;q=0.3"));
    // prefer UTF-8
    headers.add(new Header("Accept-Charset", "utf-8,ISO-8859-1;q=0.7,*;q=0.7"));
    // prefer understandable formats
    headers.add(new Header("Accept",
            "text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"));
    // accept gzipped content
    headers.add(new Header("Accept-Encoding", "x-gzip, gzip, deflate"));
    hostConf.getParams().setParameter("http.default-headers", headers);

    // HTTP proxy server details
    if (useProxy) {
      hostConf.setProxy(proxyHost, proxyPort);

      if (LOG.isTraceEnabled()) {
          LOG.trace("------------------------------------------------------");
          LOG.trace("Custom logs for troubleshooting authentication (set 2)");
          LOG.trace("Proxy host set in host configuration: " +
                    hostConf.getProxyHost());
          LOG.trace("Proxy port set in host configuration: " +
                    hostConf.getProxyPort());
          LOG.trace("------------------------------------------------------");
      }

      if (proxyUsername.length() > 0) {

        AuthScope proxyAuthScope = getAuthScope(
            this.proxyHost, this.proxyPort, this.proxyRealm);

        if (LOG.isTraceEnabled()) {
            LOG.trace("------------------------------------------------------");
            LOG.trace("Custom logs for troubleshooting authentication (set 3)");
            LOG.trace("Proxy host set in authentication scope: " +
                       proxyAuthScope.getHost());
            LOG.trace("Proxy port set in authentication scope: " +
                       proxyAuthScope.getPort());
            LOG.trace("Proxy realm set in authentication scope: " +
                       proxyAuthScope.getRealm());
            LOG.trace("------------------------------------------------------");
        }

        NTCredentials proxyCredentials = new NTCredentials(
            this.proxyUsername, this.proxyPassword,
            this.agentHost, this.proxyRealm);

        if (LOG.isTraceEnabled()) {
            LOG.trace("------------------------------------------------------");
            LOG.trace("Custom logs for troubleshooting authentication (set 4)");
            LOG.trace("Proxy user name set in proxy credentials: " +
                      proxyCredentials.getUserName());
            LOG.trace("Proxy password set in proxy credentials: " +
                      (proxyCredentials.getPassword().length() > 0 ?
                       "password is present" : "password is absent"));
            LOG.trace("Host set in proxy credentials: " +
                      proxyCredentials.getHost());
            LOG.trace("Domain set in proxy credentials: " +
                      proxyCredentials.getDomain());
            LOG.trace("------------------------------------------------------");
        }
        client.getState().setProxyCredentials(
            proxyAuthScope, proxyCredentials);

        if (LOG.isTraceEnabled()) {
            LOG.trace("Client set with authentication scope and credentials.");
        }
      }
    }

  }

  /**
   * Reads authentication configuration file (defined as
   * 'http.auth.file' in Nutch configuration file) and sets the
   * credentials for the configured authentication scopes in the HTTP
   * client object.
   *
   * @throws ParserConfigurationException  If a document builder can not
   *                                       be created.
   * @throws SAXException                  If any parsing error occurs.
   * @throws IOException                   If any I/O error occurs.
   */
  private static synchronized void setCredentials() throws 
      ParserConfigurationException, SAXException, IOException {

    if (authRulesRead)
      return;

    authRulesRead = true; // Avoid re-attempting to read

    InputStream is = conf.getConfResourceAsInputStream(authFile);    
    if (is != null) {
      Document doc = DocumentBuilderFactory.newInstance()
                     .newDocumentBuilder().parse(is);

      Element rootElement = doc.getDocumentElement();
      if (!"auth-configuration".equals(rootElement.getTagName())) {
        if (LOG.isWarnEnabled())
          LOG.warn("Bad auth conf file: root element <"
              + rootElement.getTagName() + "> found in " + authFile
              + " - must be <auth-configuration>");
      }

      // For each set of credentials
      NodeList credList = rootElement.getChildNodes();
      for (int i = 0; i < credList.getLength(); i++) {
        Node credNode = credList.item(i);
        if (!(credNode instanceof Element))
          continue;    

        Element credElement = (Element) credNode;
        if (!"credentials".equals(credElement.getTagName())) {
          if (LOG.isWarnEnabled())
            LOG.warn("Bad auth conf file: Element <"
            + credElement.getTagName() + "> not recognized in "
            + authFile + " - expected <credentials>");
          continue;
        }

        String username = credElement.getAttribute("username");
        String password = credElement.getAttribute("password");

        // For each authentication scope
        NodeList scopeList = credElement.getChildNodes();
        for (int j = 0; j < scopeList.getLength(); j++) {
          Node scopeNode = scopeList.item(j);
          if (!(scopeNode instanceof Element))
            continue;
          
          Element scopeElement = (Element) scopeNode;

          if ("default".equals(scopeElement.getTagName())) {

            // Determine realm and scheme, if any
            String realm = scopeElement.getAttribute("realm");
            String scheme = scopeElement.getAttribute("scheme");

            // Set default credentials
            defaultUsername = username;
            defaultPassword = password;
            defaultRealm = realm;
            defaultScheme = scheme;

            if (LOG.isTraceEnabled()) {
              LOG.trace("Credentials - username: " + username 
                  + "; set as default"
                  + " for realm: " + realm + "; scheme: " + scheme);
            }

          } else if ("authscope".equals(scopeElement.getTagName())) {

            // Determine authentication scope details
            String host = scopeElement.getAttribute("host");
            int port = -1; // For setting port to AuthScope.ANY_PORT
            try {
              port = Integer.parseInt(
                  scopeElement.getAttribute("port"));
            } catch (Exception ex) {
              // do nothing, port is already set to any port
            }
            String realm = scopeElement.getAttribute("realm");
            String scheme = scopeElement.getAttribute("scheme");

            // Set credentials for the determined scope
            AuthScope authScope = getAuthScope(host, port, realm, scheme);
            NTCredentials credentials = new NTCredentials(
                username, password, agentHost, realm);

            client.getState().setCredentials(authScope, credentials);

            if (LOG.isTraceEnabled()) {
              LOG.trace("Credentials - username: " + username
                  + "; set for AuthScope - " + "host: " + host
                  + "; port: " + port + "; realm: " + realm
                  + "; scheme: " + scheme);
            }

          } else {
            if (LOG.isWarnEnabled())
              LOG.warn("Bad auth conf file: Element <"
                  + scopeElement.getTagName() + "> not recognized in "
                  + authFile + " - expected <authscope>");
          }
        }
      }
      is.close(); // Close once, after all credentials have been read
    }
  }

  /**
   * If credentials for the authentication scope determined from the
   * specified <code>url</code> is not already set in the HTTP client,
   * then this method sets the default credentials to fetch the
   * specified <code>url</code>. If credentials are found for the
   * authentication scope, the method returns without altering the
   * client.
   *
   * @param url URL to be fetched
   */
  private void resolveCredentials(URL url) {

    if (defaultUsername != null && defaultUsername.length() > 0) {

      int port = url.getPort();
      if (port == -1) {
        if ("https".equals(url.getProtocol()))
          port = 443;
        else
          port = 80;
      }

      AuthScope scope = new AuthScope(url.getHost(), port);

      if (client.getState().getCredentials(scope) != null) {
        if (LOG.isTraceEnabled())
          LOG.trace("Pre-configured credentials with scope - host: "
              + url.getHost() + "; port: " + port
              + "; found for url: " + url);

        // Credentials are already configured, so do nothing and return
        return;
      }

      if (LOG.isTraceEnabled())
        LOG.trace("Pre-configured credentials with scope - host: "
            + url.getHost() + "; port: " + port
            + "; not found for url: " + url);

      AuthScope serverAuthScope = getAuthScope(
          url.getHost(), port, defaultRealm, defaultScheme);

      NTCredentials serverCredentials = new NTCredentials(
          defaultUsername, defaultPassword,
          agentHost, defaultRealm);

      client.getState().setCredentials(
          serverAuthScope, serverCredentials);
    }
  }

  /**
   * Returns an authentication scope for the specified
   * <code>host</code>, <code>port</code>, <code>realm</code> and
   * <code>scheme</code>.
   *
   * @param host    Host name or address.
   * @param port    Port number.
   * @param realm   Authentication realm.
   * @param scheme  Authentication scheme.
   */
  private static AuthScope getAuthScope(String host, int port,
      String realm, String scheme) {
    
    if (host.length() == 0)
      host = null;

    if (port < 0)
      port = -1;

    if (realm.length() == 0)
      realm = null;

    if (scheme.length() == 0)
      scheme = null;

    return new AuthScope(host, port, realm, scheme);
  }

  /**
   * Returns an authentication scope for the specified
   * <code>host</code>, <code>port</code> and <code>realm</code>.
   *
   * @param host    Host name or address.
   * @param port    Port number.
   * @param realm   Authentication realm.
   */
  private static AuthScope getAuthScope(String host, int port,
      String realm) {

      return getAuthScope(host, port, realm, "");
  }
}
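
For reference, the file named by 'http.auth.file' is read by
setCredentials() in the source above. Below is a minimal sketch of the
layout that method appears to expect, using only the JDK's DOM parser.
The element and attribute names are taken from that method; the class
name AuthFileSketch and the user name, password, host, port and realm
are made-up sample values, and the sketch only lists what it finds
instead of configuring an HTTP client.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; not part of Nutch.
class AuthFileSketch {

  // Made-up sample values in the layout that setCredentials() walks.
  static final String SAMPLE =
      "<auth-configuration>"
      + "<credentials username=\"alice\" password=\"secret\">"
      + "<default realm=\"\" scheme=\"\"/>"
      + "<authscope host=\"intranet.example.com\" port=\"8080\""
      + " realm=\"docs\" scheme=\"basic\"/>"
      + "</credentials>"
      + "</auth-configuration>";

  // Returns one line per <default> or <authscope> element found.
  static List<String> describe(String xml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(
              xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception ex) {
      throw new IllegalStateException("Could not parse sample", ex);
    }

    List<String> out = new ArrayList<String>();
    NodeList creds = doc.getDocumentElement().getChildNodes();
    for (int i = 0; i < creds.getLength(); i++) {
      Node credNode = creds.item(i);
      if (!(credNode instanceof Element)) continue; // skip text nodes
      Element cred = (Element) credNode;
      if (!"credentials".equals(cred.getTagName())) continue;
      String user = cred.getAttribute("username");

      NodeList scopes = cred.getChildNodes();
      for (int j = 0; j < scopes.getLength(); j++) {
        Node scopeNode = scopes.item(j);
        if (!(scopeNode instanceof Element)) continue;
        Element scope = (Element) scopeNode;
        // getAttribute() returns "" when the attribute is missing.
        String realm = scope.getAttribute("realm");
        String scheme = scope.getAttribute("scheme");

        if ("default".equals(scope.getTagName())) {
          out.add("default user=" + user + " realm=" + realm
              + " scheme=" + scheme);
        } else if ("authscope".equals(scope.getTagName())) {
          int port = -1; // -1 stands for "any port", as in the source
          try {
            port = Integer.parseInt(scope.getAttribute("port"));
          } catch (NumberFormatException ex) {
            // keep -1 when the port is missing or not a number
          }
          out.add("authscope user=" + user
              + " host=" + scope.getAttribute("host")
              + " port=" + port + " realm=" + realm
              + " scheme=" + scheme);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (String line : describe(SAMPLE)) {
      System.out.println(line);
    }
  }
}
```

These entries cover authentication to web servers only; the proxy
user name, password and realm still come from 'conf/nutch-site.xml',
as described earlier in the thread.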
