Hi All,

I am facing a problem running Nutch when proxy authentication is required to reach a site (e.g. google.com, yahoo.com). I am able to crawl sites that do not require proxy authentication from our domain (e.g. abc.com); for those, the crawl folder and its 5 subfolders are created successfully. I have put all the values in conf/nutch-site.xml and conf/nutch-default.xml as shown below. Listed below are all the entries I have modified to run Nutch (urls/urls.txt, conf/crawl-urlfilter.txt, conf/nutch-site.xml, conf/nutch-default.xml), followed by the crawl.log text for your reference.
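For completeness, the crawl is launched from Cygwin with the one-step crawl command; the exact invocation below is reconstructed from the parameters in the log (rootUrlDir = urls, depth = 3, topN = 50; threads = 10 is the fetcher default), not copied verbatim:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50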
While crawling through Cygwin, every fetch fails with HTTP code 407 (Proxy Authentication Required), and the crawl then aborts with the exception below:

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=====================================================================>>>>>>>>>>>>>>>>
=======> crawl.log

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122052
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122052
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122052]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122101
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122101
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122101]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122110
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122110
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122110]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080109122052
LinkDb: adding segment: crawl/segments/20080109122101
LinkDb: adding segment: crawl/segments/20080109122110
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080109122052
Indexer: adding segment: crawl/segments/20080109122101
Indexer: adding segment: crawl/segments/20080109122110
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=====================================================================>>>>>>>>>>>>>>>>
========> urls/urls.txt

http://www.yahoo.com

=====================================================================>>>>>>>>>>>>>>>>
=======> conf/crawl-urlfilter.txt

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.

=====================================================================>>>>>>>>>>>>>>>>
====> conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>abc</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Naveen</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.abc.com</value>
    <description>Description</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[EMAIL PROTECTED]</value>
    <description>Description</description>
  </property>
</configuration>

=====================================================================>>>>>>>>>>>>>>>>
========> conf/nutch-default.xml

---------some default properties------------

<property>
  <name>http.agent.name</name>
  <value>abc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

---------some default properties------------

<property>
  <name>http.agent.description</name>
  <value>Naveen</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.
  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.abc.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

---------some default properties------------

<property>
  <name>http.proxy.host</name>
  <value>xyz.abc.com</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>

---------some default properties------------

<property>
  <name>http.proxy.username</name>
  <value>abc</value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>password</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain
  name of NTLM authentication as the value for this property. To use
  this, 'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.agent.host</name>
  <value>10.105.115.89</value>
  <description>Name or IP address of the host on which the Nutch
  crawler would be running. Currently this is used by
  'protocol-httpclient' plugin.
  </description>
</property>

---------some default properties------------

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>

---------some default properties------------
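One doubt I have about placement: as far as I understand, conf/nutch-default.xml is not meant to be edited, and site-specific overrides belong in conf/nutch-site.xml. If that is the problem, I assume the proxy entries would just be repeated in nutch-site.xml with the same values, something like:

<!-- assumed placement in conf/nutch-site.xml; values copied from my
     nutch-default.xml entries above -->
<property>
  <name>http.proxy.host</name>
  <value>xyz.abc.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>abc</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>password</value>
</property>

And if our proxy happens to use NTLM, I gather from the http.proxy.realm description above that the NTLM domain name would have to be supplied as that property's value.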
=====================================================================>>>>>>>>>>>>>>>>

Please help me out: what do I have to do to run Nutch successfully, and where do I have to put the entries to pass through proxy authentication?

Thanks & Regards,
Naveen Goswami