Since it's an IO exception, there is a chance that your urls file isn't in the correct format, or that you didn't configure the options in crawl-urlfilter.txt. Please mail the contents of these files so that we can look for errors.
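
For reference, here is a minimal sketch of what these files usually look like in Nutch 0.9 (the host below is just taken from your fetch log; adjust it to your setup). The urls directory should contain a plain text file with one seed URL per line:

    http://192.168.36.199/

and conf/crawl-urlfilter.txt must contain an accept pattern matching your seed URLs before the final catch-all reject line, roughly like the stock filter file:

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # accept hosts in MY.DOMAIN.NAME (replace with your own host)
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # skip everything else
    -.

Note that since you also crawl the local file system with protocol-file, the '-^(file|ftp|mailto):' line would reject any file: seed URLs at inject time, so drop or adjust it in that case. Once the inject step succeeds, you can confirm that the URLs made it into the crawl db with:

    bin/nutch readdb crawl.iiit/crawldb -stats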
On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> No, I tried with different search strings but it returned 0 results.
>
> Also, one more thing: while indexing the local file system, I changed
> nutch-site.xml as:
>
> <configuration>
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>   </property>
>
>   <property>
>     <name>http.agent.name</name>
>     <value>localSystem</value>
>     <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>     please set this to a single word uniquely related to your
>     organization.
>     ....
>
> I also changed the crawl-urlfilter.txt file; however, it still gives errors:
>
> crawl started in: crawl.iiit
> rootUrlDir = nutch-0.9/urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl.iiit/crawldb
> Injector: urlDir: nutch-0.9/urls
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>
> Kindly help...
>
> Regards,
> Sachin.
>
> > You are searching for 'apache' in the search results. Are you sure the
> > word 'apache' should exist in the search results?
> >
> > You can try some other string instead of 'apache' that you know would
> > surely exist in one of the websites that you have crawled.
> >
> > There are a number of other things that could go wrong too, like the
> > crawl DB might be corrupt or incomplete, you might not have a 'crawl'
> > directory present, etc., but first try out different search strings and
> > see if it works fine.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> Ya, thanks, that solved my problem. However, while checking the
> >> integrity of the indexes I execute the following command:
> >>
> >> bin/nutch org.apache.nutch.searcher.NutchBean apache
> >>
> >> but it returns me 0 hits. Can you please tell me what I am missing?
> >>
> >> Thanks in advance.
> >>
> >> Regards,
> >> Sachin.
> >>
> >> > You need to set the following properties in 'conf/nutch-site.xml'.
> >> > Though, in the example below, I have left the agent description,
> >> > agent url, etc. void, ideally you should set them so that the owner
> >> > of a website can find out who is crawling the site and how to
> >> > reach them.
> >> >
> >> > <property>
> >> >   <name>http.agent.name</name>
> >> >   <value>MySearch</value>
> >> >   <description>My Search Engine</description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.description</name>
> >> >   <value></value>
> >> >   <description>Further description of our bot - this text is used in
> >> >   the User-Agent header. It appears in parenthesis after the agent
> >> >   name.
> >> >   </description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.url</name>
> >> >   <value></value>
> >> >   <description>A URL to advertise in the User-Agent header. This will
> >> >   appear in parenthesis after the agent name. Custom dictates that
> >> >   this should be a URL of a page explaining the purpose and behavior
> >> >   of this crawler.
> >> >   </description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.email</name>
> >> >   <value></value>
> >> >   <description>An email address to advertise in the HTTP 'From'
> >> >   request header and User-Agent header. A good practice is to mangle
> >> >   this address (e.g. 'info at example dot com') to avoid spamming.
> >> >   </description>
> >> > </property>
> >> >
> >> > Regards,
> >> > Susam Pal
> >> > http://susam.in/
> >> >
> >> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> Hi all,
> >> >> I am new to Nutch. While trying to create indexes, I am getting the
> >> >> following errors/exceptions:
> >> >> .
> >> >> .
> >> >> .
> >> >> fetching http://192.168.36.199/
> >> >> fetch of http://192.168.36.199/ failed with: java.lang.RuntimeException:
> >> >> Agent name not configured!
> >> >> Fetcher: done
> >> >> .
> >> >> .
> >> >> .
> >> >> Indexer: done
> >> >> Dedup: starting
> >> >> Dedup: adding indexes in: crawl.iiit/indexes
> >> >> Exception in thread "main" java.io.IOException: Job failed!
> >> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >> >>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >> >>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >> >>
> >> >> Where do we have to configure this Agent Name? (I suppose in
> >> >> conf/nutch-site.xml, but what to configure?)
> >> >>
> >> >> Thanks in advance.
> >> >>
> >> >> Regards,
> >> >> Sachin.
