Hi,

Questions of this type should actually go to [email protected] (the nutch-user mailing list), so I am sending my reply to the nutch-user list with you in the CC field.

Regarding your question: you haven't provided the logs for the authentication failure. You describe an "HTTP 407 error authentication failure", but the log you posted shows a permission-denied error for hadoop.log. These are two separate errors.

The first error occurs because you have not set the proxy authentication details. You can do so in conf/nutch-site.xml by adding the following properties:

<property>
  <name>http.proxy.username</name>
  <value></value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value></value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

You have to use protocol-httpclient instead of protocol-http for proxy authentication to happen.
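
If it helps, here is a minimal Python sketch for sanity-checking these settings before running a crawl. It is only an illustration and not part of Nutch. The embedded sample XML, the function names, and the decision to treat only the username and password as required are my own assumptions. It also assumes the file uses the usual <configuration> root element, and it looks for protocol-httpclient with a plain substring test, so a regular expression that spells the plugin name differently would not be recognised.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for conf/nutch-site.xml.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>http.proxy.username</name>
    <value>susam</value>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value></value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic</value>
  </property>
</configuration>
"""

# Assumption: these two must be non-empty for proxy authentication.
REQUIRED = ("http.proxy.username", "http.proxy.password")


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the config."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def find_problems(props):
    """List human-readable problems with the proxy-related settings."""
    problems = []
    for name in REQUIRED:
        if not props.get(name):
            problems.append(f"{name} is missing or empty")
    # Plain substring test; see the caveat in the text above.
    if "protocol-httpclient" not in props.get("plugin.includes", ""):
        problems.append("plugin.includes does not mention protocol-httpclient")
    return problems


if __name__ == "__main__":
    for problem in find_problems(read_properties(SAMPLE_SITE_XML)):
        print("WARNING:", problem)
```

To check a real installation, you could pass the contents of your own conf/nutch-site.xml to read_properties() instead of the sample string. The last check is there because filling in the proxy values is not enough on its own: protocol-httpclient also has to be enabled.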

To enable protocol-httpclient, you have to override the plugin.includes property in conf/nutch-site.xml. Example:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>

The second error probably occurs because you do not have write permission on the log file, hadoop.log. Checking the permissions on that file and correcting them might fix it.

Regards,
Susam Pal

On Dec 28, 2007 4:58 PM, NIDHI MALIK <[EMAIL PROTECTED]> wrote to [EMAIL PROTECTED]:
>
> Hello,
> I am facing problem in using Nutch to crawl data from web. I have
> configured Nutch-site.XML and Nutch-default.XML but still "HTTP 407
> error authentication failure" message is displayed. I have also set
> the http_proxies.
>
> I have also tried wget. at the time of local crawling The following msg is
> displayed.
>
> ------------------------------
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: /home/nidhi/Nutch_Installation/nutch-0.8.1/logs/hadoop.log (Permission denied)
>     at java.io.FileOutputStream.openAppend(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
>     at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
>     at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
>     at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
>     at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
>     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
>     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
>     at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
>     at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
>     at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
>     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
>     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
>     at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
>     at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
>     at org.apache.log4j.Logger.getLogger(Logger.java:104)
>     at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
>     at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
>     at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
>     at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
>     at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
>     at org.apache.nutch.crawl.Injector.<clinit>(Injector.java:40)
>
> ------------------------------
>
> Can anyone plz suggest the solution.
>
> Thanks
