Your configuration seems fine. Ideally http.agent.url should point to a page where you describe your crawler, but that shouldn't cause an error.
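As a rough sketch of what I mean, the property could look something like the fragment below. The address http://www.example.org/crawler.html is only a hypothetical placeholder for a page you would host yourself; it is not a value Nutch requires.

```xml
<!-- Sketch only: the URL below is a hypothetical placeholder. Replace it
     with a page you control that explains the purpose and behavior of
     your crawler. -->
<property>
  <name>http.agent.url</name>
  <value>http://www.example.org/crawler.html</value>
  <description>A URL to advertise in the User-Agent header.</description>
</property>
```

This property would replace the existing http.agent.url entry inside the configuration element of your nutch-site.xml.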
If you are facing any problem, please post the relevant logs from
logs/hadoop.log and describe your problem in detail.

Regards,
Susam Pal

On 1/1/08, Nidhi malik <[EMAIL PROTECTED]> wrote:
> I am forwarding my nutch-site.xml.
> Please correct it.
>
> ---------- Forwarded message ----------
> From: Nidhi malik <[EMAIL PROTECTED]>
> Date: Jan 1, 2008 11:47 PM
> Subject: nutch-site.xml
> To: [EMAIL PROTECTED]
>
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value>digvijay</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>
>   NOTE: You should also check other related properties:
>
>     http.robots.agents
>     http.agent.description
>     http.agent.url
>     http.agent.email
>     http.agent.version
>
>   and set their values appropriately.
>
>   </description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value>digvijay crawler</value>
>   <description>Further description of our bot- this text is used in
>   the User-Agent header. It appears in parenthesis after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://google.com</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
>
> <property>
>   <name>http.proxy.host</name>
>   <value>netmon.iitb.ac.in</value>
>   <description>The proxy hostname. If empty, no proxy is
>   used.</description>
> </property>
>
> <property>
>   <name>http.proxy.port</name>
>   <value>80</value>
>   <description>The proxy port.</description>
> </property>
>
>
> <property>
>   <name>http.proxy.username</name>
>   <value>xyz</value>
>   <description>Username for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   NOTE: For NTLM authentication, do not prefix the username with the
>   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.password</name>
>   <value>xyz</value>
>   <description>Password for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.realm</name>
>   <value>Squid proxy-caching web server</value>
>   <description>Authentication realm for proxy. Do not define a value
>   if realm is not required or authentication should take place for any
>   realm. NTLM does not use the notion of realms. Specify the domain name
>   of NTLM authentication as the value for this property. To use this,
>   'protocol-httpclient' must be present in the value of
>   'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.host</name>
>   <value>10.129.30.14</value>
>   <description>Name or IP address of the host on which the Nutch crawler
>   would be running. Currently this is used by 'protocol-httpclient'
>   plugin.
>   </description>
> </property>
>
> <property>
>   <name>searcher.dir</name>
>   <value>crawl</value>
>   <description>
>   Path to root of crawl. This directory is searched (in
>   order) for either the file search-servers.txt, containing a list of
>   distributed search servers, or the directory "index" containing
>   merged indexes, or the directory "segments" containing segment
>   indexes.
>   </description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> <property>
>   <name>http.timeout</name>
>   <value>1000000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
> --
> Sent from Gmail for mobile | mobile.google.com
