Hi,
Thanks for your answer, but I don't understand exactly what I should do.
This is my crawl-urlfilter.txt file:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in lucene.apache.org/nutch
+^http://([a-z0-9]*\.)*localhost:8080/

# skip everything else
+.
And what about nutch-site.xml? That file is empty apart from
the http.agent.name property.
Should I put the plugin.includes property in this file?
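
For reference, here is a minimal sketch of what such an override in nutch-site.xml might look like, on the assumption that properties set there take precedence over nutch-default.xml. The agent name my-crawler is only a placeholder, and msexcel is assumed to match a parse-msexcel plugin directory in this Nutch version (not verified here). The rest of the value is copied from the plugin.includes setting in the quoted message.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- placeholder agent name (hypothetical value) -->
  <property>
    <name>http.agent.name</name>
    <value>my-crawler</value>
  </property>
  <!-- same value as in the quoted message, with msexcel added
       (assumption: a parse-msexcel plugin directory exists in this install) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|msexcel|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

The value is kept on a single line because a line break inside the value element would become part of the regular expression.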

Thanks a lot. I hope to get an answer as soon as possible.

crazy wrote:
> 
> Hi,
> I installed Nutch for the first time and I want to index Word and Excel
> documents.
> I changed nutch-default.xml as follows:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
> Even with this modification I still get the following message:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
> Could someone please help me? It's urgent.
> 

-- 
View this message in context: 
http://www.nabble.com/indexing-word-file-tf4819567.html#a13790069
Sent from the Nutch - User mailing list archive at Nabble.com.
