* My Nutch version is 0.9.
* The command I enter to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
* The content of my seed URLs file: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
* Logs:

2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-11-16 10:20:08,373 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2007-11-16 10:20:09,161 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2007-11-16 10:20:09,162 INFO crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
2007-11-16 10:20:09,162 WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.

When I apply your configuration I get this error:

  Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc:
  java.net.MalformedURLException: no protocol: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
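That MalformedURLException points at the missing URL scheme: the seed URL has no protocol, and Nutch's injector only accepts absolute URLs. A minimal corrected seed file, assuming the document is in fact served over plain HTTP, would contain:

  http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc

The crawl command quoted above is also missing a value after -depth. For reference, a complete Nutch 0.9 invocation looks like this (the depth and topN values here are only illustrative):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50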
I also want to point out that when I index an HTML or PDF file I have no problem; the problem appears only when I want to index an MS Word or MS Excel document. Thanks for the help.

Susam Pal wrote:
>
> Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
> meant to override the properties defined in the 'conf/nutch-default.xml'
> file. To override a property, you just need to copy the same property
> from nutch-default.xml into nutch-site.xml and change the value inside
> the tags.
>
> To minimize confusion, I am including my 'conf/nutch-site.xml' so that
> you can see and understand.
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>MySearch,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> <property>
>   <name>http.agent.name</name>
>   <value>MySearch</value>
>   <description>My Search Engine</description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value>My Search Engine</value>
>   <description>Further description of our bot - this text is used in
>   the User-Agent header. It appears in parenthesis after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://www.example.com/</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>[EMAIL PROTECTED]</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> </configuration>
>
> Apart from this, please go through the tutorial at
> http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
> 0.8 or above. If you still fail to resolve the problem, please include
> the following information next time you send a mail:
>
> 1. The version of Nutch you are using.
> 2. The command you enter to run the Nutch crawl.
> 3. The content of your seed URLs file.
> 4. Logs.
>
> Regards,
> Susam Pal
>
> On Nov 16, 2007 3:18 PM, crazy wrote:
>>
>> Hi,
>> Thanks for your answer, but I don't understand what I should do
>> exactly. This is my file crawl-urlfilter.txt:
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/.+?)/.*?\1/.*?\1/
>>
>> # accept hosts in lucene.apache.org/nutch
>> +^http://([a-z0-9]*\.)*localhost:8080/
>>
>> # skip everything else
>> +.
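One detail worth flagging in the crawl-urlfilter.txt just quoted: the suffix rule skips ppt and xls, so Excel and PowerPoint URLs are rejected before they are ever fetched, regardless of which parse plugins are enabled. If those formats should be crawled, the rule would need the two suffixes removed, roughly like this (a sketch, not the stock file):

  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

Note that doc is not in the suffix list, which is why the Word URL fails later at the MalformedURLException stage rather than here.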
>> And what about nutch-site.xml? This file is empty; I have just
>> http.agent.name. Should I insert the plugin.includes property into
>> this file?
>>
>> Thanks a lot, and I hope to get an answer as soon as possible.
>>
>> crazy wrote:
>> >
>> > Hi,
>> > I installed Nutch for the first time and I want to index Word and
>> > Excel documents, so I changed nutch-default.xml:
>> >
>> > <property>
>> >   <name>plugin.includes</name>
>> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>> >   <description>Regular expression naming plugin directory names to
>> >   include. Any plugin not matching this expression is excluded.
>> >   In any case you need at least include the nutch-extensionpoints
>> >   plugin. By default Nutch includes crawling just HTML and plain
>> >   text via HTTP, and basic indexing and search plugins. In order to
>> >   use HTTPS please enable protocol-httpclient, but be aware of
>> >   possible intermittent problems with the underlying
>> >   commons-httpclient library.
>> >   </description>
>> > </property>
>> >
>> > Even with this modification I still get the following message:
>> >
>> > Generator: 0 records selected for fetching, exiting ...
>> > Stopping at depth=0 - no more URLs to fetch.
>> > No URLs to fetch - check your seed list and URL filters.
>> > crawl finished: crawl
>> >
>> > Please, can someone help me? It's urgent.
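A final observation on the first message above: the plugin.includes value quoted there enables parse-msword and parse-mspowerpoint but not parse-msexcel, so Excel files could not be parsed even once fetched. Following Susam's advice to override properties in conf/nutch-site.xml rather than editing nutch-default.xml, a minimal sketch that adds the missing plugin might look like this (the agent name MySearch is only an illustrative value; Nutch refuses to fetch without an agent name set):

  <?xml version="1.0"?>
  <configuration>
    <!-- Required before Nutch will fetch anything. -->
    <property>
      <name>http.agent.name</name>
      <value>MySearch</value>
    </property>
    <!-- Same list as the quoted one, with parse-msexcel added. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword|msexcel|mspowerpoint|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
  </configuration>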
