* My Nutch version is 0.9.
* The command I enter to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
* The content of my seed URLs file: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
* Logs:

2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-11-16 10:20:08,373 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2007-11-16 10:20:09,161 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2007-11-16 10:20:09,162 INFO crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
2007-11-16 10:20:09,162 WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.

When I apply your configuration I get this error:

  Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc:
  java.net.MalformedURLException: no protocol: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
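That MalformedURLException points at the missing URL scheme: the seed URL has no protocol, and Nutch's injector only accepts absolute URLs. A minimal corrected seed file, assuming the document is in fact served over plain HTTP, would contain:

  http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc

The crawl command quoted above is also missing a value after -depth. For reference, a complete Nutch 0.9 invocation looks like this (the depth and topN values here are only illustrative):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50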
I also want to point out that when I index an HTML or PDF file I have no problem; the problem appears only when I want to index an MS Word or MS Excel document. Thanks for the help.

Susam Pal wrote:
>
> Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
> meant to override the properties defined in the 'conf/nutch-default.xml'
> file. To override a property, you just need to copy the same property
> from nutch-default.xml into nutch-site.xml and change the value inside
> the tags.
>
> To minimize confusion, I am including my 'conf/nutch-site.xml' so that
> you can see and understand.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>MySearch,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> <property>
>   <name>http.agent.name</name>
>   <value>MySearch</value>
>   <description>My Search Engine</description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value>My Search Engine</value>
>   <description>Further description of our bot - this text is used in
>   the User-Agent header. It appears in parenthesis after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://www.example.com/</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> </configuration>
>
> Apart from this, please go through the tutorial at
> http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
> 0.8 or above. If you still fail to resolve the problem, please include
> the following information next time you send a mail:
>
> 1. The version of Nutch you are using.
> 2. The command you enter to run the Nutch crawl.
> 3. The content of your seed URLs file.
> 4. Logs.
>
> Regards,
> Susam Pal
>
> On Nov 16, 2007 3:18 PM, crazy wrote:
>>
>> Hi,
>> Thanks for your answer, but I don't understand what I should do
>> exactly. This is my file crawl-urlfilter.txt:
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/.+?)/.*?\1/.*?\1/
>>
>> # accept hosts in lucene.apache.org/nutch
>> +^http://([a-z0-9]*\.)*localhost:8080/
>>
>> # skip everything else
>> +.
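One detail worth flagging in the crawl-urlfilter.txt just quoted: the suffix rule skips ppt and xls, so Excel and PowerPoint URLs are rejected before they are ever fetched, regardless of which parse plugins are enabled. If those formats should be crawled, the rule would need the two suffixes removed, roughly like this (a sketch, not the stock file):

  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

Note that doc is not in the suffix list, which is why the Word URL fails later at the MalformedURLException stage rather than here.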
>> And what about nutch-site.xml? This file is empty; I have just
>> http.agent.name. Should I insert the plugin.includes property into
>> this file?
>>
>> Thanks a lot, and I hope to get an answer as soon as possible.
>>
>> crazy wrote:
>> >
>> > Hi,
>> > I installed Nutch for the first time and I want to index Word and
>> > Excel documents, so I changed nutch-default.xml:
>> >
>> > <property>
>> >   <name>plugin.includes</name>
>> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>> >   <description>Regular expression naming plugin directory names to
>> >   include. Any plugin not matching this expression is excluded.
>> >   In any case you need at least include the nutch-extensionpoints
>> >   plugin. By default Nutch includes crawling just HTML and plain
>> >   text via HTTP, and basic indexing and search plugins. In order to
>> >   use HTTPS please enable protocol-httpclient, but be aware of
>> >   possible intermittent problems with the underlying
>> >   commons-httpclient library.
>> >   </description>
>> > </property>
>> >
>> > Even with this modification I still get the following message:
>> >
>> > Generator: 0 records selected for fetching, exiting ...
>> > Stopping at depth=0 - no more URLs to fetch.
>> > No URLs to fetch - check your seed list and URL filters.
>> > crawl finished: crawl
>> >
>> > Please, can someone help me? It's urgent.
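A final observation on the first message above: the plugin.includes value quoted there enables parse-msword and parse-mspowerpoint but not parse-msexcel, so Excel files could not be parsed even once fetched. Following Susam's advice to override properties in conf/nutch-site.xml rather than editing nutch-default.xml, a minimal sketch that adds the missing plugin might look like this (the agent name MySearch is only an illustrative value; Nutch refuses to fetch without an agent name set):

  <?xml version="1.0"?>
  <configuration>
    <!-- Required before Nutch will fetch anything. -->
    <property>
      <name>http.agent.name</name>
      <value>MySearch</value>
    </property>
    <!-- Same list as the quoted one, with parse-msexcel added. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword|msexcel|mspowerpoint|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
  </configuration>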
