Hi Susam,

I have mailed the list twice, but both mails bounced back with the
following message:

ezmlm-reject: fatal: Sorry, I don't accept messages larger than 100000
bytes (#5.2.3)
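
(The list caps messages at 100000 bytes, so large logs have to be
compressed or trimmed before they can be attached. A minimal sketch,
assuming hadoop.log and crawl.log sit in the current directory:

    # Pack both logs into one small attachment
    tar czf nutch-logs.tar.gz hadoop.log crawl.log

    # Or attach only the tail of each log
    tail -n 300 hadoop.log > hadoop-tail.log
    tail -n 300 crawl.log > crawl-tail.log
)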


Thanks & Regards,
Naveen Goswami

-----Original Message-----
From: Naveen Goswami (WT01 - E-ENABLING)
Sent: Saturday, March 15, 2008 5:01 PM
To: 'nutch-dev@lucene.apache.org'
Cc: '[EMAIL PROTECTED]'
Subject: RE: Problem in running Nutch where proxy authentication is
required.

Hi Susam,


Thanks for the help. Yes, I received your earlier mail and have
followed all the steps you gave. I am attaching hadoop.log and
crawl.log for your reference.

I used the following command to run the crawl:
 bin/nutch crawl urls -dir crawl -depth 1 -threads 1 >& crawl.log
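
(A note on the redirection: ">&" sends both stdout and stderr to
crawl.log in csh and bash; the portable POSIX-shell spelling of the
same capture is:

    bin/nutch crawl urls -dir crawl -depth 1 -threads 1 > crawl.log 2>&1
)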

Please tell me what the problem is.

Thanks & Regards,
Naveen Goswami
91 9899547886

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 14, 2008 11:12 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Problem in running Nutch where proxy authentication is
required.

I still can't see any DEBUG logs in your log file. Did you go through my
earlier mail?

Regards,
Susam Pal
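
(The DEBUG output Susam asks about is controlled by log4j. A minimal
sketch of the change, assuming the stock conf/log4j.properties that
ships with Nutch 0.9 and writes hadoop.log through the DRFA appender;
logger and appender names may differ in other versions:

    # conf/log4j.properties -- raise Nutch logging to DEBUG so that
    # proxy authentication messages from protocol-httpclient show up
    log4j.logger.org.apache.nutch=DEBUG,DRFA

After rerunning the crawl, the DEBUG lines should land in
logs/hadoop.log.)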

On Wed, Mar 12, 2008 at 9:39 PM,  <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
>  I am facing a problem running Nutch where proxy authentication is
>  required to crawl a site (e.g. google.com, yahoo.com). I am able to
>  crawl sites from our own domain (e.g. abc.com) which do not require
>  proxy authentication; for those it successfully creates a crawl
>  folder and 5 subfolders.
>  I have put all the values in conf/nutch-site.xml and
>  conf/nutch-default.xml as given. Below are all the entries I modified
>  to run Nutch (settings in urls/urls.txt, conf/crawl-urlfilter.txt,
>  conf/nutch-site.xml and conf/nutch-default.xml), followed by the
>  crawl.log text for your reference.
>
>  While crawling through Cygwin it throws an exception. Please help me
>  figure out what I have to do to run Nutch successfully (i.e. where I
>  have to put an entry to pass through proxy authentication).
>
>  Dedup: starting
>  Dedup: adding indexes in: crawl/indexes
>  Exception in thread "main" java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>  =====================================================================
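
(The stack trace shows only that the dedup job failed; the underlying
error is recorded in the Hadoop log. A quick way to pull it out,
assuming the default logs/hadoop.log location:

    # Show each exception with a little surrounding context
    grep -n -B 2 -A 10 'Exception' logs/hadoop.log | tail -n 40
)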
>
>
>  =======>crawl.log
>
>  crawl started in: crawl
>  rootUrlDir = urls
>  threads = 10
>  depth = 3
>  topN = 50
>  Injector: starting
>  Injector: crawlDb: crawl/crawldb
>  Injector: urlDir: urls
>  Injector: Converting injected urls to crawl db entries.
>  Injector: Merging injected urls into crawl db.
>  Injector: done
>  Generator: Selecting best-scoring urls due for fetch.
>  Generator: starting
>  Generator: segment: crawl/segments/20080109122052
>  Generator: filtering: false
>  Generator: topN: 50
>  Generator: jobtracker is 'local', generating exactly one partition.
>  Generator: Partitioning selected urls by host, for politeness.
>  Generator: done.
>  Fetcher: starting
>  Fetcher: segment: crawl/segments/20080109122052
>  Fetcher: threads: 10
>  fetching http://www.yahoo.com/
>  fetch of http://www.yahoo.com/ failed with:
>  Http code=407, url=http://www.yahoo.com/
>  Fetcher: done
>  CrawlDb update: starting
>  CrawlDb update: db: crawl/crawldb
>  CrawlDb update: segments: [crawl/segments/20080109122052]
>  CrawlDb update: additions allowed: true
>  CrawlDb update: URL normalizing: true
>  CrawlDb update: URL filtering: true
>  CrawlDb update: Merging segment data into db.
>  CrawlDb update: done
>  Generator: Selecting best-scoring urls due for fetch.
>  Generator: starting
>  Generator: segment: crawl/segments/20080109122101
>  Generator: filtering: false
>  Generator: topN: 50
>  Generator: jobtracker is 'local', generating exactly one partition.
>  Generator: Partitioning selected urls by host, for politeness.
>  Generator: done.
>  Fetcher: starting
>  Fetcher: segment: crawl/segments/20080109122101
>  Fetcher: threads: 10
>  fetching http://www.yahoo.com/
>  fetch of http://www.yahoo.com/ failed with:
>  Http code=407, url=http://www.yahoo.com/
>  Fetcher: done
>  CrawlDb update: starting
>  CrawlDb update: db: crawl/crawldb
>  CrawlDb update: segments: [crawl/segments/20080109122101]
>  CrawlDb update: additions allowed: true
>  CrawlDb update: URL normalizing: true
>  CrawlDb update: URL filtering: true
>  CrawlDb update: Merging segment data into db.
>  CrawlDb update: done
>  Generator: Selecting best-scoring urls due for fetch.
>  Generator: starting
>  Generator: segment: crawl/segments/20080109122110
>  Generator: filtering: false
>  Generator: topN: 50
>  Generator: jobtracker is 'local', generating exactly one partition.
>  Generator: Partitioning selected urls by host, for politeness.
>  Generator: done.
>  Fetcher: starting
>  Fetcher: segment: crawl/segments/20080109122110
>  Fetcher: threads: 10
>  fetching http://www.yahoo.com/
>  fetch of http://www.yahoo.com/ failed with:
>  Http code=407, url=http://www.yahoo.com/
>  Fetcher: done
>  CrawlDb update: starting
>  CrawlDb update: db: crawl/crawldb
>  CrawlDb update: segments: [crawl/segments/20080109122110]
>  CrawlDb update: additions allowed: true
>  CrawlDb update: URL normalizing: true
>  CrawlDb update: URL filtering: true
>  CrawlDb update: Merging segment data into db.
>  CrawlDb update: done
>  LinkDb: starting
>  LinkDb: linkdb: crawl/linkdb
>  LinkDb: URL normalize: true
>  LinkDb: URL filter: true
>  LinkDb: adding segment: crawl/segments/20080109122052
>  LinkDb: adding segment: crawl/segments/20080109122101
>  LinkDb: adding segment: crawl/segments/20080109122110
>  LinkDb: done
>  Indexer: starting
>  Indexer: linkdb: crawl/linkdb
>  Indexer: adding segment: crawl/segments/20080109122052
>  Indexer: adding segment: crawl/segments/20080109122101
>  Indexer: adding segment: crawl/segments/20080109122110
>  Optimizing index.
>  Indexer: done
>  Dedup: starting
>  Dedup: adding indexes in: crawl/indexes
>  Exception in thread "main" java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>  =====================================================================
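
(Every fetch above failed with HTTP 407, "Proxy Authentication
Required", so no page was indexed -- which is the likely reason the
final dedup job had nothing to work on and failed. The proxy
credentials can be verified outside Nutch first; a sketch using the
placeholder proxy values from the configuration further below:

    # -x names the proxy, --proxy-user supplies its credentials;
    # anything other than a 407 response means they were accepted.
    curl -v -x http://xyz.abc.com:8080 --proxy-user abc:password \
        http://www.yahoo.com/
)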
>
>
>  ========>urls/urls.txt
>  http://www.yahoo.com
>
>
>
>  =====================================================================
>
>  =======>conf/crawl-urlfilter.txt
>
>  # The url filter file used by the crawl command.
>
>  # Better for intranet crawling.
>  # Be sure to change MY.DOMAIN.NAME to your domain name.
>
>  # Each non-comment, non-blank line contains a regular expression
>  # prefixed by '+' or '-'.  The first matching pattern in the file
>  # determines whether a URL is included or ignored.  If no pattern
>  # matches, the URL is ignored.
>
>  # skip file:, ftp:, & mailto: urls
>  -^(file|ftp|mailto):
>
>  # skip image and other suffixes we can't yet parse
>  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
>  # skip URLs containing certain characters as probable queries, etc.
>  -[?*!@=]
>
>  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>  -.*(/.+?)/.*?\1/.*?\1/
>
>  # accept hosts in MY.DOMAIN.NAME
>  +^http://([a-z0-9]*\.)*yahoo.com/
>
>  # skip everything else
>  -.
>
>
>
>  =====================================================================
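
(A quick sanity check of the accept pattern against the seed URL.
Nutch's regex filter uses Java regular expressions while grep -E uses
POSIX ERE, but this particular pattern behaves the same in both:

    # Prints the URL if the accept pattern matches; note the trailing
    # slash, which the crawl.log above shows Nutch's URL normalizer
    # adding to the bare hostname from urls.txt.
    echo 'http://www.yahoo.com/' | grep -E '^http://([a-z0-9]*\.)*yahoo.com/'
)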
>
>  ====> conf/nutch-site.xml
>
>  <?xml version="1.0"?>
>  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
>  <!-- Put site-specific property overrides in this file. -->
>
>  <configuration>
>  <property>
>   <name>http.agent.name</name>
>   <value>abc</value>
>   <description>Description</description>
>  </property>
>
>  <property>
>   <name>http.agent.description</name>
>   <value>Naveen</value>
>   <description>Description</description>
>  </property>
>
>  <property>
>   <name>http.agent.url</name>
>   <value>http://www.abc.com</value>
>   <description>Description</description>
>  </property>
>
>  <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>Description</description>
>  </property>
> </configuration>
>
>
>  =====================================================================
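
(Local overrides belong in conf/nutch-site.xml and take precedence
over conf/nutch-default.xml, so the proxy settings shown in the next
section could live here instead of being edited into the defaults
file. A sketch using the placeholder values from this thread, not
working credentials:

    <property>
      <name>http.proxy.host</name>
      <value>xyz.abc.com</value>
    </property>
    <property>
      <name>http.proxy.port</name>
      <value>8080</value>
    </property>
    <property>
      <name>http.proxy.username</name>
      <value>abc</value>
    </property>
    <property>
      <name>http.proxy.password</name>
      <value>password</value>
    </property>
)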
>
>  ========>conf/nutch-default.xml
>
>
>  ---------some default properties------------
>
>  <property>
>   <name>http.agent.name</name>
>   <value>abc</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your
> organization.
>
>   NOTE: You should also check other related properties:
>
>   http.robots.agents
>   http.agent.description
>   http.agent.url
>   http.agent.email
>   http.agent.version
>
>   and set their values appropriately.
>
>   </description>
>  </property>
>
>  ---------some default properties------------
>
>  <property>
>   <name>http.agent.description</name>
>   <value>Naveen</value>
>   <description>Further description of our bot- this text is used in
>   the User-Agent header.  It appears in parenthesis after the agent
>   name.
>   </description>
>  </property>
>
>  <property>
>   <name>http.agent.url</name>
>   <value>http://www.abc.com</value>
>   <description>A URL to advertise in the User-Agent header.  This will
>    appear in parenthesis after the agent name. Custom dictates that this
>    should be a URL of a page explaining the purpose and behavior of this
>    crawler.
>   </description>
>  </property>
>
>  <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>    header and User-Agent header. A good practice is to mangle this
>    address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
>  </property>
>
>  <property>
>   <name>http.agent.version</name>
>   <value>Nutch-0.9</value>
>   <description>A version string to advertise in the User-Agent
>    header.</description>
>  </property>
>
>  ---------some default properties------------
>
>  <property>
>   <name>http.proxy.host</name>
>   <value>xyz.abc.com</value>
>   <description>The proxy hostname.  If empty, no proxy is used.
>   </description>
>  </property>
>
>  <property>
>   <name>http.proxy.port</name>
>   <value>8080</value>
>   <description>The proxy port.</description>
>  </property>
>
>  ---------some default properties------------
>
>  <property>
>   <name>http.proxy.username</name>
>   <value>abc</value>
>   <description>Username for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   NOTE: For NTLM authentication, do not prefix the username with the
>   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>   </description>
>  </property>
>
>  <property>
>   <name>http.proxy.password</name>
>   <value>password</value>
>   <description>Password for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   </description>
>  </property>
>
>  <property>
>   <name>http.proxy.realm</name>
>   <value></value>
>   <description>Authentication realm for proxy. Do not define a value
>   if realm is not required or authentication should take place for any
>   realm. NTLM does not use the notion of realms. Specify the domain name
>   of NTLM authentication as the value for this property. To use this,
>   'protocol-httpclient' must be present in the value of
>   'plugin.includes' property.
>   </description>
>  </property>
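
(For an NTLM proxy the realm carries the domain name, per the
description above. A sketch with a hypothetical domain MYDOMAIN:

    <property>
      <name>http.proxy.realm</name>
      <value>MYDOMAIN</value>
    </property>
)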
>
>  <property>
>   <name>http.agent.host</name>
>   <value>10.105.115.89</value>
>   <description>Name or IP address of the host on which the Nutch crawler
>   would be running. Currently this is used by 'protocol-httpclient'
>   plugin.
>   </description>
>  </property>
>
>  ---------some default properties------------
>
>  <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
>  </property>
>
>  ---------some default properties------------
>
>
>
>  =====================================================================
>
>
>
>  Please help me figure out what I have to do to run Nutch successfully
>  (i.e. where I have to put an entry to pass through proxy
>  authentication).
>
>
>  Thanks & Regards,
>  Naveen Goswami

