I changed my seed URLs file to this: http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
and I get this result:

fetching http://www.frlii.org/IMG/doc/cactalogue_a_portail_27-09-2004.doc
16 nov. 2007 11:18:55 org.apache.tika.mime.MimeUtils load
INFO: Loading [tika-mimetypes.xml]
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071116111851]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071116111859
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

What can I do now? I feel we are close to the goal. Thanks.
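For what it is worth, that output suggests the .doc itself was fetched in segment 20071116111851, and "0 records selected" at depth=1 is what you would normally see when the only seed is a single document that yields no further outlinks. A rough way to confirm that the page really made it into the crawl db (a sketch, assuming the default crawl layout used above) is:

  bin/nutch readdb crawl/crawldb -stats

If the stats show a fetched entry, the remaining question is whether the Word parser and the indexer actually ran on it, which is what the plugin.includes settings discussed below control.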
Susam Pal wrote:
>
> Please try mentioning the protocol in the seed URL file. For example:
>
> http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>
> I guess it selects the protocol plugin according to the protocol
> specified in the URL.
>
> Regards,
> Susam Pal
>
> On Nov 16, 2007 4:07 PM, crazy <[EMAIL PROTECTED]> wrote:
>>
>> * My Nutch version is 0.9.
>> * Command entered to run the Nutch crawl: bin/nutch crawl urls -dir crawl -depth
>> * The content of my seed URLs file:
>> www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>> * Logs:
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-11-16 10:20:08,356 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
>> 2007-11-16 10:20:08,373 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2007-11-16 10:20:08,374 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2007-11-16 10:20:09,161 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>> 2007-11-16 10:20:09,162 INFO crawl.Crawl - Stopping at depth=0 - no more URLs to fetch.
>> 2007-11-16 10:20:09,162 WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.
>>
>> When I apply your configuration I get this error:
>> Skipping www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc : java.net.MalformedURLException: no protocol: www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
>>
>> I also want to point out that when I index an HTML or PDF file I have no problem;
>> the problem occurs only when I want to index an MS Word or MS Excel document.
>>
>> Thanks for the help.
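As an aside on the command quoted above: the depth value appears to have been cut off. The crawl command takes a numeric depth and, optionally, a topN limit; a sketch of a tutorial-style invocation (the numbers here are only illustrative, not taken from the original mail) would be:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50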
>> Susam Pal wrote:
>> >
>> > Your 'conf/crawl-urlfilter.txt' seems right. 'conf/nutch-site.xml' is
>> > meant to override the properties defined in the 'conf/nutch-default.xml'
>> > file. To override a property, you just need to copy the same property
>> > from nutch-default.xml into nutch-site.xml and change the value inside
>> > the tags.
>> >
>> > To minimize confusion, I am including my 'conf/nutch-site.xml' so that
>> > you can see and understand.
>> >
>> > <?xml version="1.0"?>
>> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >
>> > <configuration>
>> >
>> > <property>
>> >   <name>http.robots.agents</name>
>> >   <value>MySearch,*</value>
>> >   <description>The agent strings we'll look for in robots.txt files,
>> >   comma-separated, in decreasing order of precedence. You should
>> >   put the value of http.agent.name as the first agent name, and keep the
>> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.name</name>
>> >   <value>MySearch</value>
>> >   <description>My Search Engine</description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.description</name>
>> >   <value>My Search Engine</value>
>> >   <description>Further description of our bot - this text is used in
>> >   the User-Agent header. It appears in parenthesis after the agent name.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.url</name>
>> >   <value>http://www.example.com/</value>
>> >   <description>A URL to advertise in the User-Agent header. This will
>> >   appear in parenthesis after the agent name. Custom dictates that this
>> >   should be a URL of a page explaining the purpose and behavior of this
>> >   crawler.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.agent.email</name>
>> >   <value>[EMAIL PROTECTED]</value>
>> >   <description>An email address to advertise in the HTTP 'From' request
>> >   header and User-Agent header. A good practice is to mangle this
>> >   address (e.g. 'info at example dot com') to avoid spamming.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>plugin.includes</name>
>> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >   <description>Regular expression naming plugin directory names to
>> >   include. Any plugin not matching this expression is excluded.
>> >   In any case you need at least include the nutch-extensionpoints plugin. By
>> >   default Nutch includes crawling just HTML and plain text via HTTP,
>> >   and basic indexing and search plugins. In order to use HTTPS please enable
>> >   protocol-httpclient, but be aware of possible intermittent problems with the
>> >   underlying commons-httpclient library.
>> >   </description>
>> > </property>
>> >
>> > </configuration>
>> >
>> > Apart from this, please go through the tutorial at
>> > http://lucene.apache.org/nutch/tutorial8.html if you are using Nutch
>> > 0.8 or above. If you still fail to resolve the problem, please include
>> > the following information next time you send a mail:
>> >
>> > 1. The version of Nutch you are using.
>> > 2. The command you enter to run the Nutch crawl.
>> > 3. The content of your seed URLs file.
>> > 4. Logs.
>> >
>> > Regards,
>> > Susam Pal
>> >
>> > On Nov 16, 2007 3:18 PM, crazy wrote:
>> >>
>> >> Hi,
>> >> Thanks for your answer, but I don't understand what I should do exactly.
>> >> This is my crawl-urlfilter.txt file:
>> >>
>> >> # skip file:, ftp:, & mailto: urls
>> >> -^(file|ftp|mailto):
>> >>
>> >> # skip image and other suffixes we can't yet parse
>> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> >>
>> >> # skip URLs containing certain characters as probable queries, etc.
>> >> -[?*!@=]
>> >>
>> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> >> -.*(/.+?)/.*?\1/.*?\1/
>> >>
>> >> # accept hosts in lucene.apache.org/nutch
>> >> +^http://([a-z0-9]*\.)*localhost:8080/
>> >>
>> >> # skip everything else
>> >> +.
>> >>
>> >> And what about nutch-site.xml? That file is empty; I have only http.agent.name in it.
>> >> Should I put plugin.includes in this file?
>> >>
>> >> Thanks a lot; I hope to get an answer as soon as possible.
>> >>
>> >> crazy wrote:
>> >> >
>> >> > Hi,
>> >> > I have installed Nutch for the first time and I want to index Word and Excel
>> >> > documents, so I changed nutch-default.xml:
>> >> >
>> >> > <property>
>> >> >   <name>plugin.includes</name>
>> >> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
>> >> >   <description>Regular expression naming plugin directory names to
>> >> >   include. Any plugin not matching this expression is excluded.
>> >> >   In any case you need at least include the nutch-extensionpoints plugin. By
>> >> >   default Nutch includes crawling just HTML and plain text via HTTP,
>> >> >   and basic indexing and search plugins. In order to use HTTPS please enable
>> >> >   protocol-httpclient, but be aware of possible intermittent problems with the
>> >> >   underlying commons-httpclient library.
>> >> >   </description>
>> >> > </property>
>> >> >
>> >> > Even with this modification I still get the following message:
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >> > Please, can someone help me? It's urgent.
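Pulling the quoted files together for the Word/Excel goal, two details stand out. First, the plugin.includes value changed in nutch-default.xml above lists parse-msword but not parse-msexcel, and, as Susam notes, the override belongs in conf/nutch-site.xml rather than in nutch-default.xml; a sketch of such a value (illustrative only, not a complete recommended set) could be:

  protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|msexcel|mspowerpoint)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

Second, the suffix filter quoted from crawl-urlfilter.txt explicitly skips xls, ppt and zip, so Excel and PowerPoint URLs would be filtered out before they are ever fetched, whatever parsers are enabled. With those suffixes removed and everything else left as quoted, that line would read:

  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$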
