Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is meant to override the properties defined in the 'conf/nutch-default.xml' file. To override a property, just copy that property from nutch-default.xml into nutch-site.xml and change the value inside its <value> tags.
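For example, a minimal 'conf/nutch-site.xml' that overrides only a single property could look like the sketch below (the agent name here is just an illustrative value; use your own):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Single property copied from nutch-default.xml, with its value changed -->
  <property>
    <name>http.agent.name</name>
    <value>MySearch</value>
    <description>My Search Engine</description>
  </property>
</configuration>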
To minimize confusion, I am including my 'conf/nutch-site.xml' here so that you can see how it is put together:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

<!-- Agent properties -->
<property>
  <name>http.robots.agents</name>
  <value>MySearch,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value>MySearch</value>
  <description>My Search Engine</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>My Search Engine</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<!-- Plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

</configuration>

Apart from this, please go through the tutorial at
http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch 0.8 or above.

If you still fail to resolve the problem, please include the following
information the next time you send a mail:

1. The version of Nutch you are using.
2. The command you enter to run the Nutch crawl.
3. The content of your seed URLs file.
4. The logs.

Regards,
Susam Pal

On Nov 16, 2007 3:18 PM, crazy <[EMAIL PROTECTED]> wrote:
>
> hi,
> tks for your answer but i don't understand what i should do exactly
> this is my file crawl-urlfilter.txt:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in lucene.apache.org/nutch
> +^http://([a-z0-9]*\.)*localhost:8080/
>
> # skip everything else
> +.
>
> and what about nutch-site.xml? this file is empty,
> i have just the http.agent.name
> should i insert the plugin.includes in this file?
>
> tks a lot and i hope to get an answer as soon as possible
>
>
> crazy wrote:
> >
> > Hi,
> > i installed nutch for the first time and i want to index word and excel
> > documents
> > i even changed the nutch-default.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > even with this modification i still get the following message:
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: crawl
> >
> > plz can someone help me, it's urgent
> >
>
> --
> View this message in context:
> http://www.nabble.com/indexing-word-file-tf4819567.html#a13790069
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
