I still can't see any DEBUG logs in your log file. Did you go through my earlier mail?
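If DEBUG output is simply not being written at all, it may help to raise the log level for the HTTP plugin in conf/log4j.properties before re-running the crawl. A minimal sketch (assuming the stock Nutch 0.9 layout, where output goes to logs/hadoop.log; the exact logger names below are assumptions and may differ in your build):

```properties
# Log the HTTP/proxy conversation at DEBUG level
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG
```

After re-running bin/nutch crawl, grepping logs/hadoop.log for "407" or "Proxy-Authenticate" should show which authentication scheme the proxy is requesting.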
Regards,
Susam Pal

On Wed, Mar 12, 2008 at 9:39 PM, <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I am facing a problem running Nutch where proxy authentication is
> required to crawl sites (e.g. google.com, yahoo.com).
> I am able to crawl sites that do not require proxy authentication
> from our domain (e.g. abc.com); the crawl successfully creates a crawl
> folder and its five subfolders.
> I have put all the values in conf/nutch-site.xml and
> conf/nutch-default.xml as given below.
> Listed below are all the entries I modified to run Nutch
> (settings in urls/urls.txt, conf/crawl-urlfilter.txt,
> conf/nutch-site.xml, conf/nutch-default.xml).
> I have also included the crawl.log text for your reference.
>
> While crawling through Cygwin, it throws an exception. Please help me
> work out what I have to do to run Nutch successfully (where I have to
> put an entry to pass through proxy authentication):
>
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> =====================================================================
>
> =======> crawl.log
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122052
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122052
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122052]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122101
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122101
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122101]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080109122110
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080109122110
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> fetch of http://www.yahoo.com/ failed with:
> Http code=407, url=http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080109122110]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080109122052
> LinkDb: adding segment: crawl/segments/20080109122101
> LinkDb: adding segment: crawl/segments/20080109122110
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080109122052
> Indexer: adding segment: crawl/segments/20080109122101
> Indexer: adding segment: crawl/segments/20080109122110
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> =====================================================================
>
> ========> urls/urls.txt
>
> http://www.yahoo.com
>
> =====================================================================
>
> =======> conf/crawl-urlfilter.txt
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.
> # The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
> =====================================================================
>
> ====> conf/nutch-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>abc</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.description</name>
>     <value>Naveen</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.url</name>
>     <value>http://www.abc.com</value>
>     <description>Description</description>
>   </property>
>
>   <property>
>     <name>http.agent.email</name>
>     <value>[EMAIL PROTECTED]</value>
>     <description>Description</description>
>   </property>
> </configuration>
>
> =====================================================================
>
> ========> conf/nutch-default.xml
>
> --------- some default properties ------------
>
> <property>
>   <name>http.agent.name</name>
>   <value>abc</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your
>   organization.
>
>   NOTE: You should also check other related properties:
>
>     http.robots.agents
>     http.agent.description
>     http.agent.url
>     http.agent.email
>     http.agent.version
>
>   and set their values appropriately.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.agent.description</name>
>   <value>Naveen</value>
>   <description>Further description of our bot - this text is used in
>   the User-Agent header. It appears in parenthesis after the agent
>   name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://www.abc.com</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.version</name>
>   <value>Nutch-0.9</value>
>   <description>A version string to advertise in the User-Agent
>   header.</description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.proxy.host</name>
>   <value>xyz.abc.com</value>
>   <description>The proxy hostname. If empty, no proxy is
>   used.</description>
> </property>
>
> <property>
>   <name>http.proxy.port</name>
>   <value>8080</value>
>   <description>The proxy port.</description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>http.proxy.username</name>
>   <value>abc</value>
>   <description>Username for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication.
>   To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   NOTE: For NTLM authentication, do not prefix the username with the
>   domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.password</name>
>   <value>password</value>
>   <description>Password for proxy. This will be used by
>   'protocol-httpclient', if the proxy server requests basic, digest
>   and/or NTLM authentication. To use this, 'protocol-httpclient' must
>   be present in the value of 'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.proxy.realm</name>
>   <value></value>
>   <description>Authentication realm for proxy. Do not define a value
>   if realm is not required or authentication should take place for any
>   realm. NTLM does not use the notion of realms. Specify the domain name
>   of NTLM authentication as the value for this property. To use this,
>   'protocol-httpclient' must be present in the value of
>   'plugin.includes' property.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.host</name>
>   <value>10.105.115.89</value>
>   <description>Name or IP address of the host on which the Nutch crawler
>   would be running. Currently this is used by 'protocol-httpclient'
>   plugin.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints
>   plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> --------- some default properties ------------
>
> =====================================================================
>
> Please help me out: what do I have to do to run Nutch successfully
> (where do I have to put an entry to pass through proxy authentication)?
>
> Thanks & Regards,
> Naveen Goswami