I still can't see any DEBUG logs in your log file. Did you go through my earlier mail?
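If DEBUG output is simply not being written at all, it may help to raise the log level for the HTTP plugin in conf/log4j.properties before re-running the crawl. A minimal sketch (assuming the stock Nutch 0.9 layout, where output goes to logs/hadoop.log; the exact logger names below are assumptions and may differ in your build):

```properties
# Log the HTTP/proxy conversation at DEBUG level
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG
```

After re-running bin/nutch crawl, grepping logs/hadoop.log for "407" or "Proxy-Authenticate" should show which authentication scheme the proxy is requesting.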
Regards,
Susam Pal

On Wed, Mar 12, 2008 at 9:39 PM, <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I am facing a problem running Nutch where proxy authentication is
> required to crawl sites (e.g. google.com, yahoo.com).
> I am able to crawl sites that do not require proxy authentication
> from our domain (e.g. abc.com); the crawl successfully creates a crawl
> folder and its five subfolders.
> I have put all the values in conf/nutch-site.xml and
> conf/nutch-default.xml as given below.
> Listed below are all the entries I modified to run Nutch
> (settings in urls/urls.txt, conf/crawl-urlfilter.txt,
> conf/nutch-site.xml, conf/nutch-default.xml).
> I have also included the crawl.log text for your reference.
>
> While crawling through Cygwin, it throws an exception. Please help me
> work out what I have to do to run Nutch successfully (where I have to
> put an entry to pass through proxy authentication):
>
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> =====================================================================
>
> =======> crawl.log
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122052
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122052
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122052]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122101
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122101
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122101]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122110
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122110
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122110]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080109122052
> LinkDb: adding segment: crawl/segments/20080109122101
> LinkDb: adding segment: crawl/segments/20080109122110
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080109122052
> Indexer: adding segment: crawl/segments/20080109122101
> Indexer: adding segment: crawl/segments/20080109122110
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> =====================================================================
>
> ========> urls/urls.txt
>
> http://www.yahoo.com
>
> =====================================================================
>
> =======> conf/crawl-urlfilter.txt
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.
> # The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> =====================================================================
>
> ====> conf/nutch-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>abc</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.description</name>
>     <value>Naveen</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.url</name>
>     <value>http://www.abc.com</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.email</name>
>     <value>[EMAIL PROTECTED]</value>
>     <description>Description</description>
>   </property>
> </configuration>
>
> =====================================================================
>
> ========> conf/nutch-default.xml
>
> --------- some default properties ------------
>
> <property>
>   <name>http.agent.name</name>
>   <value>abc</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your
>   organization.
>
>   NOTE: You should also check other related properties:
>
>     http.robots.agents
>     http.agent.description
>     http.agent.url
>     http.agent.email
>     http.agent.version
>
>   and set their values appropriately.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.agent.description</name>
>   <value>Naveen</value>
>   <description>Further description of our bot - this text is used in
>   the User-Agent header. It appears in parenthesis after the agent
>   name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://www.abc.com</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.version</name>
>   <value>Nutch-0.9</value>
>   <description>A version string to advertise in the User-Agent
>   header.</description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.proxy.host</name>
>   <value>xyz.abc.com</value>
>   <description>The proxy hostname. If empty, no proxy is
>   used.</description>
> </property>
>
> <property>
>   <name>http.proxy.port</name>
>   <value>8080</value>
>   <description>The proxy port.</description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.proxy.username</name>
>   <value>abc</value>
>   <description>Username for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication.
>   To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   NOTE: For NTLM authentication, do not prefix the username with the
>   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.password</name>
>   <value>password</value>
>   <description>Password for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.realm</name>
>   <value></value>
>   <description>Authentication realm for proxy. Do not define a value
>   if realm is not required or authentication should take place for any
>   realm. NTLM does not use the notion of realms. Specify the domain name
>   of NTLM authentication as the value for this property. To use this,
>   'protocol-httpclient' must be present in the value of
>   'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.host</name>
>   <value>10.105.115.89</value>
>   <description>Name or IP address of the host on which the Nutch crawler
>   would be running. Currently this is used by 'protocol-httpclient'
>   plugin.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints
>   plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> =====================================================================
>
> Please help me out: what do I have to do to run Nutch successfully
> (where do I have to put an entry to pass through proxy authentication)?
>
> Thanks & Regards,
> Naveen Goswami