Your configuration seems fine. Ideally, http.agent.url should point to
a page where you describe your crawler (rather than http://google.com),
but that shouldn't cause an error.
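
For example (the URL below is only a placeholder - substitute a page on
your own site that describes the crawler), the property could look like
this:

  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value>
  </property>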

If you are facing any problem, please post the relevant logs from
logs/hadoop.log and describe your problem in detail.
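
For example, something like the following usually pulls out the relevant
errors and stack traces (just a suggestion - any excerpt that shows the
failure is fine):

  grep -iE -A 5 "error|exception" logs/hadoop.log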

Regards,
Susam Pal

On 1/1/08, Nidhi malik <[EMAIL PROTECTED]> wrote:
> I am forwarding my nutch-site.xml.
>  Please correct it.
>
> ---------- Forwarded message ----------
> From: Nidhi malik <[EMAIL PROTECTED]>
> Date: Jan 1, 2008 11:47 PM
> Subject: nutch-site.xml
> To: [EMAIL PROTECTED]
>
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value>digvijay</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>
>   NOTE: You should also check other related properties:
>
>     http.robots.agents
>     http.agent.description
>     http.agent.url
>     http.agent.email
>     http.agent.version
>
>   and set their values appropriately.
>
>   </description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value>digvijay crawler</value>
>   <description>Further description of our bot; this text is used in
>   the User-Agent header.  It appears in parentheses after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://google.com</value>
>   <description>A URL to advertise in the User-Agent header.  This will
>    appear in parentheses after the agent name. Custom dictates that this
>    should be a URL of a page explaining the purpose and behavior of this
>    crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>    header and User-Agent header. A good practice is to mangle this
>    address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
>
> <property>
>   <name>http.proxy.host</name>
>   <value>netmon.iitb.ac.in</value>
>   <description>The proxy hostname.  If empty, no proxy is
> used.</description>
> </property>
>
> <property>
>   <name>http.proxy.port</name>
>   <value>80</value>
>   <description>The proxy port.</description>
> </property>
>
>
> <property>
>   <name>http.proxy.username</name>
>   <value>xyz</value>
>   <description>Username for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   NOTE: For NTLM authentication, do not prefix the username with the
>   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.password</name>
>   <value>xyz</value>
>   <description>Password for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.realm</name>
>   <value>Squid proxy-caching web server</value>
>   <description>Authentication realm for proxy. Do not define a value
>   if realm is not required or authentication should take place for any
>   realm. NTLM does not use the notion of realms. Specify the domain name
>   of NTLM authentication as the value for this property. To use this,
>   'protocol-httpclient' must be present in the value of
>   'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.host</name>
>   <value>10.129.30.14</value>
>   <description>Name or IP address of the host on which the Nutch crawler
>   would be running. Currently this is used by 'protocol-httpclient'
>   plugin.
>   </description>
> </property>
>
> <property>
>   <name>searcher.dir</name>
>   <value>crawl</value>
>   <description>
>   Path to root of crawl.  This directory is searched (in
>   order) for either the file search-servers.txt, containing a list of
>   distributed search servers, or the directory "index" containing
>   merged indexes, or the directory "segments" containing segment
>   indexes.
>   </description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need to include at least the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with
>   the underlying commons-httpclient library.
>   </description>
> </property>
>
> <property>
>   <name>http.timeout</name>
>   <value>1000000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
>

