Please try mentioning the protocol in the seed URL file. For example:
http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc

I guess it selects the protocol plugin according to the protocol specified in the URL, so a seed URL without a scheme cannot be matched to any protocol plugin.
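As a minimal illustration, the seed file inside your 'urls' directory (the same directory you already pass to the crawl command) would then contain the fully qualified URL on a line of its own, something like:

  http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc

and you can re-run the same crawl command as before.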
Regards,
Susam Pal

On Nov 16, 2007 4:07 PM, crazy <[EMAIL PROTECTED]> wrote:
>
> * My Nutch version is 0.9.
> * Command entered to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
> * The content of my seed URLs file:
>   www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
> * Logs:
>
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> 2007-11-16 10:20:08,373 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2007-11-16 10:20:09,161 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> 2007-11-16 10:20:09,162 INFO crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
> 2007-11-16 10:20:09,162 WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.
>
> When I apply your configuration I get this error:
>
>   Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc :
>   java.net.MalformedURLException: no protocol:
>   www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>
> I also want to point out that when I index an HTML or PDF file I have no
> problem; the problem occurs only when I want to index an MS Word or MS Excel
> document.
>
> Thanks for the help.
>
> Susam Pal wrote:
> >
> > Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
> > meant to override the properties defined in the 'conf/nutch-default.xml'
> > file. To override a property, you just need to copy the same property
> > from nutch-default.xml into nutch-site.xml and change the value inside
> > the tags.
> >
> > To minimize confusion, I am including my 'conf/nutch-site.xml' so that
> > you can see and understand.
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <configuration>
> >
> > <property>
> >   <name>http.robots.agents</name>
> >   <value>MySearch,*</value>
> >   <description>The agent strings we'll look for in robots.txt files,
> >   comma-separated, in decreasing order of precedence. You should
> >   put the value of http.agent.name as the first agent name, and keep the
> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.name</name>
> >   <value>MySearch</value>
> >   <description>My Search Engine</description>
> > </property>
> >
> > <property>
> >   <name>http.agent.description</name>
> >   <value>My Search Engine</value>
> >   <description>Further description of our bot - this text is used in
> >   the User-Agent header. It appears in parenthesis after the agent name.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.url</name>
> >   <value>http://www.example.com/</value>
> >   <description>A URL to advertise in the User-Agent header. This will
> >   appear in parenthesis after the agent name. Custom dictates that this
> >   should be a URL of a page explaining the purpose and behavior of this
> >   crawler.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.email</name>
> >   <value>[EMAIL PROTECTED]</value>
> >   <description>An email address to advertise in the HTTP 'From' request
> >   header and User-Agent header. A good practice is to mangle this
> >   address (e.g. 'info at example dot com') to avoid spamming.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin.
> >   By default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please
> >   enable protocol-httpclient, but be aware of possible intermittent
> >   problems with the underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > </configuration>
> >
> > Apart from this, please go through the tutorial at
> > http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
> > 0.8 or above. If you still fail to resolve the problem, please include
> > the following information next time you send a mail:
> >
> > 1. Version of Nutch you are using.
> > 2. Command you enter to run the Nutch crawl.
> > 3. The content of your seed URLs file.
> > 4. Logs.
> >
> > Regards,
> > Susam Pal
> >
> > On Nov 16, 2007 3:18 PM, crazy wrote:
> >>
> >> Hi,
> >> Thanks for your answer, but I don't understand what I should do exactly.
> >> This is my crawl-urlfilter.txt file:
> >>
> >> # skip file:, ftp:, & mailto: urls
> >> -^(file|ftp|mailto):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >>
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/.+?)/.*?\1/.*?\1/
> >>
> >> # accept hosts in lucene.apache.org/nutch
> >> +^http://([a-z0-9]*\.)*localhost:8080/
> >>
> >> # skip everything else
> >> +.
> >>
> >> And what about nutch-site.xml? This file is empty; I have just set
> >> http.agent.name. Should I insert plugin.includes in this file?
> >>
> >> Thanks a lot, and I hope to get an answer as soon as possible.
> >>
> >> crazy wrote:
> >> >
> >> > Hi,
> >> > I installed Nutch for the first time and I want to index Word and Excel
> >> > documents. I even changed nutch-default.xml:
> >> >
> >> > <property>
> >> >   <name>plugin.includes</name>
> >> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
> >> >   <description>Regular expression naming plugin directory names to
> >> >   include. Any plugin not matching this expression is excluded.
> >> >   In any case you need at least include the nutch-extensionpoints plugin.
> >> >   By default Nutch includes crawling just HTML and plain text via HTTP,
> >> >   and basic indexing and search plugins. In order to use HTTPS please
> >> >   enable protocol-httpclient, but be aware of possible intermittent
> >> >   problems with the underlying commons-httpclient library.
> >> >   </description>
> >> > </property>
> >> >
> >> > Even with this modification I still get the following message:
> >> >
> >> >   Generator: 0 records selected for fetching, exiting ...
> >> >   Stopping at depth=0 - no more URLs to fetch.
> >> >   No URLs to fetch - check your seed list and URL filters.
> >> >   crawl finished: crawl
> >> >
> >> > Please, can someone help me? It is urgent.
