Hi,

Thanks for your answer, but I don't understand exactly what I should do. This is my crawl-urlfilter.txt file:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in lucene.apache.org/nutch
+^http://([a-z0-9]*\.)*localhost:8080/
# skip everything else
+.

And what about nutch-site.xml? That file is nearly empty: it only contains http.agent.name. Should I add plugin.includes to this file?

Thanks a lot. I hope to get an answer as soon as possible.

crazy wrote:
>
> Hi,
> I installed Nutch for the first time and I want to index Word and Excel
> documents.
> I even changed nutch-default.xml:
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please enable
> protocol-httpclient, but be aware of possible intermittent problems with the
> underlying commons-httpclient library.
> </description>
> </property>
> Even with this modification I still get the following message:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
> Please, can someone help me? It's urgent.
>

--
View this message in context: http://www.nabble.com/indexing-word-file-tf4819567.html#a13790069
Sent from the Nutch - User mailing list archive at Nabble.com.
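
As a rough aid for reading the crawl-urlfilter.txt rules quoted in this message, here is a small Python sketch that applies them to a few sample URLs. It is only an approximation and rests on these assumptions: the rules are tried in file order, the first pattern found anywhere in the URL decides the outcome ('+' accepts, '-' rejects), and a URL that matches no rule is rejected. Python's `re` module stands in for the Java regex engine, the function name `accepts` and the sample URLs are made up for illustration, and the `[EMAIL PROTECTED]` line is left out because its original pattern is not recoverable from the archive.

```python
import re

# Rules copied from the crawl-urlfilter.txt quoted in this message, in file order.
# The obfuscated "[EMAIL PROTECTED]" line is omitted on purpose (pattern unknown).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*localhost:8080/"),
    ("+", r"."),
]


def accepts(url):
    """Return True if the first matching rule is a '+' rule.

    Assumed semantics: first match wins; no match at all means rejected.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the original crawl.
    samples = [
        "http://localhost:8080/docs/report.doc",
        "http://localhost:8080/docs/report.xls",
        "mailto:someone@example.com",
        "http://example.com/page.html",
    ]
    for url in samples:
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions, a URL ending in `.xls` or `.ppt` is rejected by the suffix rule before the localhost rule is ever reached, and the final `+.` line accepts any URL that was not rejected earlier, even though its comment says "skip everything else".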
