Hi,

I am not able to crawl internet URLs through the proxy. I am getting HTTP
code=407.

These are the settings I have given in nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>vimal</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value>172.19.144.174</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value>vimal</value>
  <description>Username for proxy. This will be used by
        'protocol-httpclient', if the proxy server requests basic, digest
   and/or NTLM authentication. To use this, 'protocol-httpclient' must
   be present in the value of 'plugin.includes' property.
   NOTE: For NTLM authentication, do not prefix the username with the
   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
 </description>
</property>
<property>
 <name>http.proxy.password</name>
 <value>Password</value>
 <description>Password for proxy. This will be used by
 'protocol-httpclient', if the proxy server requests basic, digest
 and/or NTLM authentication. To use this, 'protocol-httpclient' must
 be present in the value of 'plugin.includes' property.
 </description>
</property>

<property>
 <name>http.proxy.realm</name>
  <value>INDIA</value><!--MY NETWORK DOMAIN-->
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property> 

<property>
  <name>http.robots.agents</name>
  <value>vimal,*</value>
  <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
   </description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins. </description>
</property>

<property>
  <name>plugin.folders</name>
  <value>D:\nutch-0.9\build\plugins\</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
  <description>Regular expression naming plugin directory names to
include.  Any plugin not matching this expression is excluded. In any case
you need at least include the nutch-extensionpoints plugin. By default
Nutch includes crawling just HTML and plain text via HTTP, and basic
indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with
the underlying commons-httpclient library. </description>
</property>


</configuration>
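
One thing that may be worth checking, going by the property descriptions above: http.proxy.username, http.proxy.password and http.proxy.realm are described as being used by 'protocol-httpclient', which must be present in 'plugin.includes', while the value above lists protocol-http. A possible variant of that property is sketched below. It is the same value with only the protocol plugin swapped. This is an untested assumption, and whether it is enough to get past the 407 on this proxy is not confirmed.

```xml
<!-- Sketch only (untested assumption): same plugin list as above,
     with protocol-http replaced by protocol-httpclient so that the
     http.proxy.username / password / realm settings can be picked up. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
</property>
```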

Vimal Varghese