Since it's an IO exception, there is a chance that your urls file isn't in the correct format, or that you didn't configure the options in crawl-urlfilter.txt. Please mail the contents of these files so that we can look for errors.
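
For reference, here is a minimal sketch of what these files usually look like in Nutch 0.9 (the host below is just taken from your fetch log; adjust it to your setup). The urls directory should contain a plain text file with one seed URL per line:

    http://192.168.36.199/

and conf/crawl-urlfilter.txt must contain an accept pattern matching your seed URLs before the final catch-all reject line, roughly like the stock filter file:

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # accept hosts in MY.DOMAIN.NAME (replace with your own host)
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # skip everything else
    -.

Note that since you also crawl the local file system with protocol-file, the '-^(file|ftp|mailto):' line would reject any file: seed URLs at inject time, so drop or adjust it in that case. Once the inject step succeeds, you can confirm that the URLs made it into the crawl db with:

    bin/nutch readdb crawl.iiit/crawldb -stats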
On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> No, I tried with different search strings but it returned 0 results.
>
> Also, one more thing: while indexing the local file system, I changed
> nutch-site.xml as:
>
> <configuration>
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>   </property>
>
>   <property>
>     <name>http.agent.name</name>
>     <value>localSystem</value>
>     <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>     please set this to a single word uniquely related to your
>     organization.
>     ....
>
> I also changed the crawl-urlfilter.txt file; however, it still gives errors:
>
> crawl started in: crawl.iiit
> rootUrlDir = nutch-0.9/urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl.iiit/crawldb
> Injector: urlDir: nutch-0.9/urls
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>
> Kindly help...
>
> Regards,
> Sachin.
>
> > You are searching for 'apache' in the search results. Are you sure the
> > word 'apache' should exist in the search results?
> >
> > You can try some other string instead of 'apache' that you know would
> > surely exist in one of the websites that you have crawled.
> >
> > There are a number of other things that could go wrong too, like the
> > crawl DB might be corrupt or incomplete, you might not have a 'crawl'
> > directory present, etc., but first try out different search strings and
> > see if it works fine.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> Ya, thanks, that solved my problem. However, while checking the
> >> integrity of the indexes I execute the following command:
> >>
> >> bin/nutch org.apache.nutch.searcher.NutchBean apache
> >>
> >> but it returns me 0 hits. Can you please tell me what I am missing?
> >>
> >> Thanks in advance.
> >>
> >> Regards,
> >> Sachin.
> >>
> >> > You need to set the following properties in 'conf/nutch-site.xml'.
> >> > Though, in the example below, I have left the agent description,
> >> > agent url, etc. void, ideally you should set them so that the owner
> >> > of a website can find out who is crawling the site and how to
> >> > reach them.
> >> >
> >> > <property>
> >> >   <name>http.agent.name</name>
> >> >   <value>MySearch</value>
> >> >   <description>My Search Engine</description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.description</name>
> >> >   <value></value>
> >> >   <description>Further description of our bot - this text is used in
> >> >   the User-Agent header. It appears in parenthesis after the agent
> >> >   name.
> >> >   </description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.url</name>
> >> >   <value></value>
> >> >   <description>A URL to advertise in the User-Agent header. This will
> >> >   appear in parenthesis after the agent name. Custom dictates that
> >> >   this should be a URL of a page explaining the purpose and behavior
> >> >   of this crawler.
> >> >   </description>
> >> > </property>
> >> >
> >> > <property>
> >> >   <name>http.agent.email</name>
> >> >   <value></value>
> >> >   <description>An email address to advertise in the HTTP 'From'
> >> >   request header and User-Agent header. A good practice is to mangle
> >> >   this address (e.g. 'info at example dot com') to avoid spamming.
> >> >   </description>
> >> > </property>
> >> >
> >> > Regards,
> >> > Susam Pal
> >> > http://susam.in/
> >> >
> >> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> Hi all,
> >> >> I am new to Nutch. While trying to create indexes, I am getting the
> >> >> following errors/exceptions:
> >> >> .
> >> >> .
> >> >> .
> >> >> fetching http://192.168.36.199/
> >> >> fetch of http://192.168.36.199/ failed with: java.lang.RuntimeException:
> >> >> Agent name not configured!
> >> >> Fetcher: done
> >> >> .
> >> >> .
> >> >> .
> >> >> Indexer: done
> >> >> Dedup: starting
> >> >> Dedup: adding indexes in: crawl.iiit/indexes
> >> >> Exception in thread "main" java.io.IOException: Job failed!
> >> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >> >>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >> >>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >> >>
> >> >> Where do we have to configure this Agent Name? (I suppose in
> >> >> conf/nutch-site.xml, but what to configure?)
> >> >>
> >> >> Thanks in advance.
> >> >>
> >> >> Regards,
> >> >> Sachin.
