Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is meant to override the properties defined in the 'conf/nutch-default.xml' file. To override a property, just copy that property from nutch-default.xml into nutch-site.xml and change the value inside its <value> tags.
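For example, a minimal 'conf/nutch-site.xml' that overrides only a single property could look like the sketch below (the agent name here is just an illustrative value; use your own):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Single property copied from nutch-default.xml, with its value changed -->
  <property>
    <name>http.agent.name</name>
    <value>MySearch</value>
    <description>My Search Engine</description>
  </property>
</configuration>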
To minimize confusion, I am including my 'conf/nutch-site.xml' here so that you can see how it is put together:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

<!-- Agent properties -->
<property>
  <name>http.robots.agents</name>
  <value>MySearch,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value>MySearch</value>
  <description>My Search Engine</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>My Search Engine</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<!-- Plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

</configuration>

Apart from this, please go through the tutorial at
http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch 0.8 or above.

If you still fail to resolve the problem, please include the following
information the next time you send a mail:

1. The version of Nutch you are using.
2. The command you enter to run the Nutch crawl.
3. The content of your seed URLs file.
4. The logs.

Regards,
Susam Pal

On Nov 16, 2007 3:18 PM, crazy <[EMAIL PROTECTED]> wrote:
>
> hi,
> tks for your answer but i don't understand what i should do exactly
> this is my file crawl-urlfilter.txt:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in lucene.apache.org/nutch
> +^http://([a-z0-9]*\.)*localhost:8080/
>
> # skip everything else
> +.
>
> and what about nutch-site.xml? this file is empty,
> i have just the http.agent.name
> should i insert the plugin.includes in this file?
>
> tks a lot and i hope to get an answer as soon as possible
>
>
> crazy wrote:
> >
> > Hi,
> > i installed nutch for the first time and i want to index word and excel
> > documents
> > i even changed the nutch-default.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > even with this modification i still get the following message:
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: crawl
> >
> > plz can someone help me, it's urgent
> >
>
> --
> View this message in context:
> http://www.nabble.com/indexing-word-file-tf4819567.html#a13790069
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
