Please try specifying the protocol in the seed URL file. For example:

http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc

I guess it selects the protocol plugin according to the protocol
specified in the URL.
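
You can see why the protocol prefix matters outside of Nutch, too: the
exception in your log comes from plain java.net.URL. Here is a minimal
sketch (the class name SeedUrlCheck is just for illustration):

import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlCheck {
    public static void main(String[] args) throws Exception {
        // A seed entry without a protocol, exactly as in your seed file.
        String bare = "www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc";
        try {
            new URL(bare);
        } catch (MalformedURLException e) {
            // Prints: java.net.MalformedURLException: no protocol: www.frlii.org/...
            System.out.println(e);
        }
        // With a protocol prefix the URL parses, and getProtocol() returns
        // "http" -- presumably what lets Nutch pick the protocol plugin.
        URL ok = new URL("http://" + bare);
        System.out.println(ok.getProtocol());
    }
}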

Regards,
Susam Pal

On Nov 16, 2007 4:07 PM, crazy <[EMAIL PROTECTED]> wrote:
>
> * My Nutch version is 0.9.
> * Command entered to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
> * The content of my seed URLs file:
> www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
> * Logs:
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-11-16 10:20:08,356 INFO  plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> 2007-11-16 10:20:08,373 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2007-11-16 10:20:08,374 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2007-11-16 10:20:08,374 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2007-11-16 10:20:09,161 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> 2007-11-16 10:20:09,162 INFO  crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
> 2007-11-16 10:20:09,162 WARN  crawl.Crawl - No URLs to fetch - check your seed list and URL filters.
>
> When I apply your configuration, I get this error:
>
> Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc :
> java.net.MalformedURLException: no protocol:
> www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>
> I also want to point out that when I index an HTML or PDF file I have no
> problem; the problem occurs only when I try to index an MS Word or MS Excel
> document.
>
> Thanks for the help.
>
> Susam Pal wrote:
> >
> > Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
> > meant to override the properties defined in the 'conf/nutch-default.xml'
> > file. To override a property, copy the property from nutch-default.xml
> > into nutch-site.xml and change the value inside the <value> tags.
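> >
> > For example, a single override has this general shape (the property name
> > below is just a placeholder; every property sits inside the one
> > <configuration> element):
> >
> >   <property>
> >     <name>some.property.name</name>
> >     <value>the overriding value</value>
> >   </property>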
> >
> > To minimize confusion, I am including my 'conf/nutch-site.xml' so that
> > you can see a complete working example.
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <configuration>
> >
> > <property>
> >   <name>http.robots.agents</name>
> >   <value>MySearch,*</value>
> >   <description>The agent strings we'll look for in robots.txt files,
> >   comma-separated, in decreasing order of precedence. You should
> >   put the value of http.agent.name as the first agent name, and keep the
> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.name</name>
> >   <value>MySearch</value>
> >   <description>My Search Engine</description>
> > </property>
> >
> > <property>
> >   <name>http.agent.description</name>
> >   <value>My Search Engine</value>
> >   <description>Further description of our bot - this text is used in
> >   the User-Agent header. It appears in parentheses after the agent name.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.url</name>
> >   <value>http://www.example.com/</value>
> >   <description>A URL to advertise in the User-Agent header. This will
> >   appear in parentheses after the agent name. Custom dictates that this
> >   should be a URL of a page explaining the purpose and behavior of this
> >   crawler.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.agent.email</name>
> >   <value>[EMAIL PROTECTED]</value>
> >   <description>An email address to advertise in the HTTP 'From' request
> >   header and User-Agent header. A good practice is to mangle this
> >   address (e.g. 'info at example dot com') to avoid spamming.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please
> >   enable protocol-httpclient, but be aware of possible intermittent
> >   problems with the underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > </configuration>
> >
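> > Once nutch-site.xml is saved, you can retry the crawl. A minimal sketch,
> > assuming your seed directory is named 'urls' as in the command you ran
> > (the depth value here is only an example):
> >
> >   mkdir -p urls
> >   echo 'http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc' > urls/seed.txt
> >   bin/nutch crawl urls -dir crawl -depth 3
> >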
> > Apart from this, please go through the tutorial at
> > http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
> > 0.8 or above. If you still fail to resolve the problem, please include
> > the following information the next time you send a mail:
> >
> > 1. Version of Nutch you are using.
> > 2. Command you enter to run the Nutch crawl.
> > 3. The content of your seed URLs file.
> > 4. Logs.
> >
> > Regards,
> > Susam Pal
> >
> > On Nov 16, 2007 3:18 PM, crazy wrote:
> >>
> >>
> >> Hi,
> >> Thanks for your answer, but I don't understand exactly what I should do.
> >> This is my crawl-urlfilter.txt file:
> >> # skip file:, ftp:, & mailto: urls
> >> -^(file|ftp|mailto):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >>
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/.+?)/.*?\1/.*?\1/
> >>
> >> # accept hosts in localhost:8080
> >> +^http://([a-z0-9]*\.)*localhost:8080/
> >>
> >> # accept anything else
> >> +.
> >> And what about nutch-site.xml? That file is nearly empty; I have just
> >> the http.agent.name property. Should I insert the plugin.includes
> >> property in this file?
> >>
> >> Thanks a lot; I hope to get an answer as soon as possible.
> >>
> >> crazy wrote:
> >> >
> >> > Hi,
> >> > I installed Nutch for the first time, and I want to index Word and
> >> > Excel documents. I even changed nutch-default.xml:
> >> >
> >> > <property>
> >> >   <name>plugin.includes</name>
> >> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
> >> >   <description>Regular expression naming plugin directory names to
> >> >   include. Any plugin not matching this expression is excluded.
> >> >   In any case you need at least the nutch-extensionpoints plugin. By
> >> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >> >   and basic indexing and search plugins. In order to use HTTPS please
> >> >   enable protocol-httpclient, but be aware of possible intermittent
> >> >   problems with the underlying commons-httpclient library.
> >> >   </description>
> >> > </property>
> >> >
> >> > Even with this modification, I still get the following message:
> >> > Generator: 0 records selected for fetching, exiting ...
> >> > Stopping at depth=0 - no more URLs to fetch.
> >> > No URLs to fetch - check your seed list and URL filters.
> >> > crawl finished: crawl
> >> > Please, can someone help me? It's urgent.
> >> >
> >>
> >
> >
>