Hi,

No, I tried with different search strings, but it still returned 0 results.

Also, one more thing: while indexing the local file system, I changed
nutch-site.xml as follows:
<configuration>

      <property>
        <name>plugin.includes</name>
        <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>

      <property>
        <name>http.agent.name</name>
        <value>localSystem</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty;
        please set this to a single word uniquely related to your
        organization.
        ...
        </description>
      </property>
</configuration>
I also changed the crawl-urlfilter.txt file (see the sketch after the
log below); however, it still gives errors:

crawl started in: crawl.iiit
rootUrlDir = nutch-0.9/urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl.iiit/crawldb
Injector: urlDir: nutch-0.9/urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
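
For reference, the relevant part of my crawl-urlfilter.txt now looks
roughly like this (a sketch from memory, so the exact lines may differ;
the idea is to skip http: URLs and accept file: URLs when crawling the
local file system):

# skip http:, ftp: and mailto: urls -- we only want the local file system
-^(http|ftp|mailto):
# accept file: urls
+^file:
# skip everything else
-.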

Kindly help...

Regards,
Sachin.

> You are searching for 'apache'. Are you sure the word 'apache'
> actually occurs in the pages that you crawled?
>
> You can try some other string instead of 'apache' that you know would
> surely exist in one of the websites that you have crawled.
>
> There are a number of other things that could go wrong too: the crawl
> DB might be corrupt or incomplete, you might not have a 'crawl'
> directory present, and so on. But first try out different search
> strings and see if that works.
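>
> For example, with a word that you know occurs in one of the crawled
> pages (assuming you run this from the directory containing the crawl
> data; if I remember correctly, NutchBean reads the index location from
> the 'searcher.dir' property, which defaults to 'crawl'):
>
> bin/nutch org.apache.nutch.searcher.NutchBean linux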
>
> Regards,
> Susam Pal
> http://susam.in/
>
> On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Yes, thanks, that solved my problem. However, while checking the
>> integrity of the indexes I executed the following command:
>>
>> bin/nutch org.apache.nutch.searcher.NutchBean apache
>>
>> but it returns 0 hits. Can you please tell me what I am missing?
>>
>> Thanks in Advance.
>>
>> Regards,
>> Sachin.
>>
>> > You need to set the following properties in 'conf/nutch-site.xml'.
>> > In the example below I have left the agent description, agent URL,
>> > etc. empty, but ideally you should set them so that the owner of a
>> > website can find out who is crawling the site and how to reach you.
>> >
>> > <property>
>> >   <name>http.agent.name</name>
>> >   <value>MySearch</value>
>> >   <description>My Search Engine</description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.description</name>
>> >   <value></value>
>> >   <description>Further description of our bot; this text is used in
>> >   the User-Agent header. It appears in parentheses after the agent
>> >   name.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.url</name>
>> >   <value></value>
>> >   <description>A URL to advertise in the User-Agent header. This
>> >    will appear in parentheses after the agent name. Custom dictates
>> >    that this should be a URL of a page explaining the purpose and
>> >    behavior of this crawler.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.email</name>
>> >   <value></value>
>> >   <description>An email address to advertise in the HTTP 'From'
>> >    request header and User-Agent header. A good practice is to mangle
>> >    this address (e.g. 'info at example dot com') to avoid spamming.
>> >   </description>
>> > </property>
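>> >
>> > After setting 'http.agent.name', re-run the whole crawl so that the
>> > fetcher picks up the new value; with Nutch 0.9 that would be
>> > something like the following (the directory names and parameters
>> > here are placeholders, so adjust them to your setup):
>> >
>> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50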
>> >
>> > Regards,
>> > Susam Pal
>> > http://susam.in/
>> >
>> > On 8/21/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> Hi all,
>> >> I am new to Nutch. While trying to create indexes, I am getting the
>> >> following errors/exceptions:
>> >> .
>> >> .
>> >> .
>> >> fetching http://192.168.36.199/
>> >> fetch of http://192.168.36.199/ failed with:
>> >> java.lang.RuntimeException: Agent name not configured!
>> >> Fetcher: done
>> >> .
>> >> .
>> >> .
>> >> Indexer: done
>> >> Dedup: starting
>> >> Dedup: adding indexes in: crawl.iiit/indexes
>> >> Exception in thread "main" java.io.IOException: Job failed!
>> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>> >>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>> >>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>> >>
>> >>
>> >> Where do we have to configure this agent name? (I suppose in
>> >> conf/nutch-site.xml, but what exactly do we configure?)
>> >>
>> >> Thanks in advance.
>> >>
>> >> Regards,
>> >> Sachin.