Hi All,

I am facing a problem running Nutch when proxy authentication is required to reach a site (e.g. google.com, yahoo.com). I am able to crawl sites that do not require proxy authentication from our domain (e.g. abc.com); for those, the crawl folder and its 5 subfolders are created successfully. I have put all the values in conf/nutch-site.xml and conf/nutch-default.xml as shown below. Listed below are all the entries I have modified to run Nutch (urls/urls.txt, conf/crawl-urlfilter.txt, conf/nutch-site.xml, conf/nutch-default.xml), followed by the crawl.log text for your reference.
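For completeness, the crawl is launched from Cygwin with the one-step crawl command; the exact invocation below is reconstructed from the parameters in the log (rootUrlDir = urls, depth = 3, topN = 50; threads = 10 is the fetcher default), not copied verbatim:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50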
While crawling through Cygwin, every fetch fails with HTTP code 407 (Proxy Authentication Required), and the crawl then aborts with the exception below:

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=====================================================================>>>>>>>>>>>>>>>>
=======> crawl.log

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122052
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122052
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122052]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122101
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122101
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122101]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122110
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122110
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122110]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080109122052
LinkDb: adding segment: crawl/segments/20080109122101
LinkDb: adding segment: crawl/segments/20080109122110
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080109122052
Indexer: adding segment: crawl/segments/20080109122101
Indexer: adding segment: crawl/segments/20080109122110
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=====================================================================>>>>>>>>>>>>>>>>
========> urls/urls.txt

http://www.yahoo.com

=====================================================================>>>>>>>>>>>>>>>>
=======> conf/crawl-urlfilter.txt

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.

=====================================================================>>>>>>>>>>>>>>>>
====> conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>abc</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Naveen</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.abc.com</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[EMAIL PROTECTED]</value>
    <description>Description</description>
  </property>
</configuration>

=====================================================================>>>>>>>>>>>>>>>>
========> conf/nutch-default.xml

---------some default properties------------

<property>
  <name>http.agent.name</name>
  <value>abc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

---------some default properties------------

<property>
  <name>http.agent.description</name>
  <value>Naveen</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.
  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.abc.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

---------some default properties------------

<property>
  <name>http.proxy.host</name>
  <value>xyz.abc.com</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>

---------some default properties------------

<property>
  <name>http.proxy.username</name>
  <value>abc</value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>password</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain
  name of NTLM authentication as the value for this property. To use
  this, 'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.agent.host</name>
  <value>10.105.115.89</value>
  <description>Name or IP address of the host on which the Nutch
  crawler would be running. Currently this is used by
  'protocol-httpclient' plugin.
  </description>
</property>

---------some default properties------------

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>

---------some default properties------------
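One doubt I have about placement: as far as I understand, conf/nutch-default.xml is not meant to be edited, and site-specific overrides belong in conf/nutch-site.xml. If that is the problem, I assume the proxy entries would just be repeated in nutch-site.xml with the same values, something like:

<!-- assumed placement in conf/nutch-site.xml; values copied from my
     nutch-default.xml entries above -->
<property>
  <name>http.proxy.host</name>
  <value>xyz.abc.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>abc</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>password</value>
</property>

And if our proxy happens to use NTLM, I gather from the http.proxy.realm description above that the NTLM domain name would have to be supplied as that property's value.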
=====================================================================>>>>>>>>>>>>>>>>

Please help me out: what do I have to do to run Nutch successfully, and where do I have to put the entries to pass through proxy authentication?

Thanks & Regards,
Naveen Goswami