Re: Problem in creating Index

Susam Pal Tue, 21 Aug 2007 05:03:25 -0700

You are searching the index for 'apache'. Are you sure the word
'apache' actually occurs in any of the pages you crawled?


You can try some other string instead of 'apache' that you know would
surely exist in one of the websites that you have crawled.

Several other things could also go wrong: the crawl DB might be
corrupt or incomplete, you might not have a 'crawl' directory present,
and so on. But first try different search strings and see whether any
of them return hits.
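
On the 'crawl' directory point: as far as I know, NutchBean locates the
index through the 'searcher.dir' property, which defaults to 'crawl'
relative to the directory you run the command from. Your earlier log
shows the indexes being written under 'crawl.iiit', so a setting along
these lines in 'conf/nutch-site.xml' may be worth trying. This is only a
sketch: the value is an assumption taken from that log line, and the
description text is my own wording.

```xml
<!-- Assumption: the crawl lives in 'crawl.iiit', as the
     "Dedup: adding indexes in: crawl.iiit/indexes" log line suggests.
     Change the value to wherever your crawl output actually is. -->
<property>
  <name>searcher.dir</name>
  <value>crawl.iiit</value>
  <description>Directory holding the crawl data that the searcher
  should read.</description>
</property>
```

Alternatively, running the NutchBean command from a directory that
contains a 'crawl' subdirectory should have the same effect.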

Regards,
Susam Pal
http://susam.in/

On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Yes, thanks, that solved my problem. However, while checking the
> integrity of the indexes, I executed the following command:
>
> bin/nutch org.apache.nutch.searcher.NutchBean apache
>
> but it returns 0 hits. Can you please tell me what I am missing?
>
> Thanks in Advance.
>
> Regards,
> Sachin.
>
> > You need to set the following properties in 'conf/nutch-site.xml'.
> > In the example below, I have left the agent description, agent URL,
> > etc. empty, but ideally you should set them so that the owner of a
> > website can find out who is crawling the site and how to reach them.
> >
> > <property>
> >   <name>http.agent.name</name>
> >   <value>MySearch</value>
> >   <description>My Search Engine</description>
> > </property>
> >
> > <property>
> >   <name>http.agent.description</name>
> >   <value></value>
> >   <description>Further description of our bot. This text is used in
> >   the User-Agent header. It appears in parentheses after the agent name.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.url</name>
> >   <value></value>
> >   <description>A URL to advertise in the User-Agent header. This will
> >    appear in parentheses after the agent name. Custom dictates that this
> >    should be a URL of a page explaining the purpose and behavior of this
> >    crawler.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.email</name>
> >   <value></value>
> >   <description>An email address to advertise in the HTTP 'From' request
> >    header and User-Agent header. A good practice is to mangle this
> >    address (e.g. 'info at example dot com') to avoid spamming.
> >   </description>
> > </property>
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >> I am new to Nutch. While trying to create indexes, I am getting the
> >> following errors/exceptions:
> >> .
> >> .
> >> .
> >> fetching http://192.168.36.199/
> >> fetch of http://192.168.36.199/ failed with: java.lang.RuntimeException: Agent name not configured!
> >> Fetcher: done
> >> .
> >> .
> >> .
> >> Indexer: done
> >> Dedup: starting
> >> Dedup: adding indexes in: crawl.iiit/indexes
> >> Exception in thread "main" java.io.IOException: Job failed!
> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >>
> >>
> >> Where do we have to configure this agent name? (I suppose in
> >> conf/nutch-site.xml, but what should I configure?)
> >>
> >> Thanks in advance.
> >>
> >> Regards,
> >> Sachin.
