I would like to mention that it is not good practice to change 'conf/nutch-default.xml'. We always modify 'conf/nutch-site.xml' to override the properties defined in 'conf/nutch-default.xml'. However, this is not the real cause of your problem.
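For example, rather than editing 'conf/nutch-default.xml', you would put the override in 'conf/nutch-site.xml'. A minimal sketch (the value below is simply your plugin list copied over; nutch-site.xml uses the same `<configuration>`/`<property>` layout as nutch-default.xml):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the plugin.includes default from conf/nutch-default.xml -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
  </property>
</configuration>
```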
You can find the real cause of your problem in these three lines of your logs:

> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.

As you can see, no URLs were selected. Either your seed list is empty or you haven't configured your URL filter properly. Since you are using it for the first time, I assume you are using the 'bin/nutch crawl' command to crawl. In that case, you should modify 'conf/crawl-urlfilter.txt', especially the following line:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change MY.DOMAIN.NAME to whatever your domain name is. If you are crawling a single server, you can change it to something like:

+^http://www.example.com/

or

+^http://192.168.101.3/

(where www.example.com or 192.168.101.3 is the address of the web server). If you want to crawl everything, you can ignore this part and instead change the last line of this file from:

-.

to:

+.

Regards,
Susam Pal

On Nov 16, 2007 1:45 PM, crazy <[EMAIL PROTECTED]> wrote:
>
> Hi,
> i install nutch for the first time and i want to index word and excel
> document
> even i change the nutch-default.xml :
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please enable
> protocol-httpclient, but be aware of possible intermittent problems with the
> underlying commons-httpclient library.
> </description>
> </property>
>
> even with this modification i still have the following message
>
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
>
> plz some one can help me its urgent
> --
> View this message in context:
> http://www.nabble.com/indexing-word-file-tf4819567.html#a13788425
> Sent from the Nutch - User mailing list archive at Nabble.com.
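P.S. If you want to sanity-check a filter pattern before running a crawl, grep's extended regex syntax is close enough to Nutch's for a quick test. This is just an illustration with www.example.com substituted for MY.DOMAIN.NAME (drop the leading '+' from the filter line):

```shell
# Prints the URL only if it would be accepted by the filter pattern
echo "http://www.example.com/index.html" | grep -E '^http://([a-z0-9]*\.)*example\.com/'
```

If the URL is printed, the filter would accept it; if grep prints nothing, the generator will select no records from that seed, which is exactly the symptom in your logs.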
