I changed my seed URLs file to this: http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
and I get this result:

fetching http://www.frlii.org/IMG/doc/cactalogue_a_portail_27-09-2004.doc
16 nov. 2007 11:18:55 org.apache.tika.mime.MimeUtils load
INFO: Loading [tika-mimetypes.xml]
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071116111851]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071116111859
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

What can I do now? I feel we are close to the goal. Thanks.
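For what it is worth, that output suggests the .doc itself was fetched in segment 20071116111851, and "0 records selected" at depth=1 is what you would normally see when the only seed is a single document that yields no further outlinks. A rough way to confirm that the page really made it into the crawl db (a sketch, assuming the default crawl layout used above) is:

  bin/nutch readdb crawl/crawldb -stats

If the stats show a fetched entry, the remaining question is whether the Word parser and the indexer actually ran on it, which is what the plugin.includes settings discussed below control.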
Susam Pal wrote:
>
> Please try mentioning the protocol in the seed URL file. For example:
>
> http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>
> I guess it selects the protocol plugin according to the protocol
> specified in the URL.
>
> Regards,
> Susam Pal
>
> On Nov 16, 2007 4:07 PM, crazy <[EMAIL PROTECTED]> wrote:
>>
>> * My Nutch version is 0.9.
>> * Command entered to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
>> * The content of my seed URLs file:
>> www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>> * Logs:
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
>> 2007-11-16 10:20:08,373 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2007-11-16 10:20:09,161 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>> 2007-11-16 10:20:09,162 INFO crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
>> 2007-11-16 10:20:09,162 WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.
>>
>> When I apply your configuration I get this error:
>> Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc : java.net.MalformedURLException: no protocol: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>>
>> I also want to point out that when I index an HTML or PDF file I have no problem;
>> the problem occurs only when I want to index an MS Word or MS Excel document.
>>
>> Thanks for the help.
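As an aside on the command quoted above: the depth value appears to have been cut off. The crawl command takes a numeric depth and, optionally, a topN limit; a sketch of a tutorial-style invocation (the numbers here are only illustrative, not taken from the original mail) would be:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50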
>> Susam Pal wrote:
>> >
>> > Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
>> > meant to override the properties defined in the 'conf/nutch-default.xml'
>> > file. To override a property, you just need to copy the same property
>> > from nutch-default.xml into nutch-site.xml and change the value inside
>> > the tags.
>> >
>> > To minimize confusion, I am including my 'conf/nutch-site.xml' so that
>> > you can see and understand.
>> >
>> > <?xml version="1.0"?>
>> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >
>> > <configuration>
>> >
>> > <property>
>> >   <name>http.robots.agents</name>
>> >   <value>MySearch,*</value>
>> >   <description>The agent strings we'll look for in robots.txt files,
>> >   comma-separated, in decreasing order of precedence. You should
>> >   put the value of http.agent.name as the first agent name, and keep the
>> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.name</name>
>> >   <value>MySearch</value>
>> >   <description>My Search Engine</description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.description</name>
>> >   <value>My Search Engine</value>
>> >   <description>Further description of our bot - this text is used in
>> >   the User-Agent header. It appears in parenthesis after the agent name.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.url</name>
>> >   <value>http://www.example.com/</value>
>> >   <description>A URL to advertise in the User-Agent header. This will
>> >   appear in parenthesis after the agent name. Custom dictates that this
>> >   should be a URL of a page explaining the purpose and behavior of this
>> >   crawler.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.email</name>
>> >   <value>[EMAIL PROTECTED]</value>
>> >   <description>An email address to advertise in the HTTP 'From' request
>> >   header and User-Agent header. A good practice is to mangle this
>> >   address (e.g. 'info at example dot com') to avoid spamming.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>plugin.includes</name>
>> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >   <description>Regular expression naming plugin directory names to
>> >   include. Any plugin not matching this expression is excluded.
>> >   In any case you need at least include the nutch-extensionpoints plugin. By
>> >   default Nutch includes crawling just HTML and plain text via HTTP,
>> >   and basic indexing and search plugins. In order to use HTTPS please enable
>> >   protocol-httpclient, but be aware of possible intermittent problems with the
>> >   underlying commons-httpclient library.
>> >   </description>
>> > </property>
>> >
>> > </configuration>
>> >
>> > Apart from this, please go through the tutorial at
>> > http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
>> > 0.8 or above. If you still fail to resolve the problem, please include
>> > the following information next time you send a mail:
>> >
>> > 1. The version of Nutch you are using.
>> > 2. The command you enter to run the Nutch crawl.
>> > 3. The content of your seed URLs file.
>> > 4. Logs.
>> >
>> > Regards,
>> > Susam Pal
>> >
>> > On Nov 16, 2007 3:18 PM, crazy wrote:
>> >>
>> >> Hi,
>> >> Thanks for your answer, but I don't understand what I should do exactly.
>> >> This is my crawl-urlfilter.txt file:
>> >>
>> >> # skip file:, ftp:, & mailto: urls
>> >> -^(file|ftp|mailto):
>> >>
>> >> # skip image and other suffixes we can't yet parse
>> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> >>
>> >> # skip URLs containing certain characters as probable queries, etc.
>> >> -[?*!@=]
>> >>
>> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> >> -.*(/.+?)/.*?\1/.*?\1/
>> >>
>> >> # accept hosts in lucene.apache.org/nutch
>> >> +^http://([a-z0-9]*\.)*localhost:8080/
>> >>
>> >> # skip everything else
>> >> +.
>> >>
>> >> And what about nutch-site.xml? That file is empty; I have only http.agent.name in it.
>> >> Should I put plugin.includes in this file?
>> >>
>> >> Thanks a lot; I hope to get an answer as soon as possible.
>> >>
>> >> crazy wrote:
>> >> >
>> >> > Hi,
>> >> > I have installed Nutch for the first time and I want to index Word and Excel
>> >> > documents, so I changed nutch-default.xml:
>> >> >
>> >> > <property>
>> >> >   <name>plugin.includes</name>
>> >> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>> >> >   <description>Regular expression naming plugin directory names to
>> >> >   include. Any plugin not matching this expression is excluded.
>> >> >   In any case you need at least include the nutch-extensionpoints plugin. By
>> >> >   default Nutch includes crawling just HTML and plain text via HTTP,
>> >> >   and basic indexing and search plugins. In order to use HTTPS please enable
>> >> >   protocol-httpclient, but be aware of possible intermittent problems with the
>> >> >   underlying commons-httpclient library.
>> >> >   </description>
>> >> > </property>
>> >> >
>> >> > Even with this modification I still get the following message:
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >> > Please, can someone help me? It's urgent.
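Pulling the quoted files together for the Word/Excel goal, two details stand out. First, the plugin.includes value changed in nutch-default.xml above lists parse-msword but not parse-msexcel, and, as Susam notes, the override belongs in conf/nutch-site.xml rather than in nutch-default.xml; a sketch of such a value (illustrative only, not a complete recommended set) could be:

  protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|msexcel|mspowerpoint)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

Second, the suffix filter quoted from crawl-urlfilter.txt explicitly skips xls, ppt and zip, so Excel and PowerPoint URLs would be filtered out before they are ever fetched, whatever parsers are enabled. With those suffixes removed and everything else left as quoted, that line would read:

  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$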
