Hi Michael,

Me too. Sorry for the delay; I was on leave yesterday.

I have added the plugins as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
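
As a quick check of which plugin directories this expression covers, here is a minimal sketch in Python. It assumes Nutch compares the whole plugin directory name against the expression (full-match semantics); that is an assumption based on the description text, not something confirmed in this thread. The helper name `included` is hypothetical, and the directory names tested are taken from the two plugin.includes values quoted in this thread.

```python
import re

# The plugin.includes value from the property above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
    "index-basic|query-(basic|site|url)"
)


def included(plugin_dir: str, expression: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole directory name matches the expression.

    Assumption: the expression must match the complete directory name,
    not just a part of it.
    """
    return re.fullmatch(expression, plugin_dir) is not None


if __name__ == "__main__":
    # Directory names that appear in the plugin.includes values in this thread.
    for name in (
        "parse-mspowerpoint",
        "parse-msexcel",
        "parse-rtf",
        "index-more",
        "nutch-extensionpoints",
    ):
        print(name, "included" if included(name) else "excluded")
```

Under that assumption, parse-mspowerpoint and parse-msexcel are matched by the value above, while nutch-extensionpoints, parse-rtf and index-more are not.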



thanks,
Ayyanar

--- Michael Nebel <[EMAIL PROTECTED]> wrote:

> Hi Ayyanar,
> 
> sorry for the delay, but I've been out of office for
> some hours.
> 
> Have you activated the plugins? You need to extend
> the plugin.includes property.
> Mine looks like this, for example:
> 
> <property>
>    <name>plugin.includes</name>
>    <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
>    <description>Regular expression naming plugin directory names to
>    include.  Any plugin not matching this expression is excluded.
>    In any case you need to include at least the nutch-extensionpoints
>    plugin. By default Nutch includes crawling just HTML and plain text
>    via HTTP,  and basic indexing and search plugins.
>    boost-urlpattern|
>    </description>
> </property>
> 
> Regards
> 
>       Michael
> 
> 
> Ayyanar Inbamohan wrote:
> 
> > Hi Michael,
> > 
> > I have enabled the ppt extension in
> > crawl-urlfilter.txt, and now it is fetching the
> > PowerPoint files.
> > 
> > But I am getting the following error, because the
> > content type of the ppt files is not handled by Nutch:
> > 
> > 
> > 
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal3.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testpdf.pdf
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal10.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testdoc.doc
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal2.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal4.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal6.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testexcel.xls
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/javaCertStudyNotes.pdf
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal7.ppt
> > 050906 175342 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal3.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal8.ppt
> > 050906 175343 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal8.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175343 fetching
> > http://localhost:8080/search_sample/kmportal9.ppt
> > 050906 175344 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal9.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175344 fetching
> > http://localhost:8080/search_sample/kmportal11.ppt
> > 050906 175347 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal4.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175348 fetching
> > http://localhost:8080/search_sample/kmportal5.ppt
> > 050906 175348 fetching
> > http://localhost:8080/search_sample/kmportal1.ppt
> > 050906 175350 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal7.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175351 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal10.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175353 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal6.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175354 fetch okay, but can't parse
> > http://localhost:8080/search_sample/testexcel.xls,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/vnd.ms-excel
> > 050906 175355 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal11.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175356 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal5.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175358 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal1.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175359 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal2.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 
> > 
> > thanks,
> > Ayyanar..
> > 
> > --- Michael Nebel <[EMAIL PROTECTED]> wrote:
> > 
> > 
> >>Hi,
> >>
> >>Have you checked the filters (regex-urlfilter or
> >>crawl-urlfilter)? The
> >>ending ".ppt" is disabled by default.
> >>
> >>Regards
> >>
> >>    Michael
> >>
> >>Ayyanar Inbamohan wrote:
> >>
> >>
> >>>Hi all,
> >>>
> >>>I am using the PowerPoint plugin from JIRA, and when I
> >>>crawl my application, which has a link to a ppt file, Nutch 0.7
> >>>is not fetching the PowerPoint files at all.
> >>>
> >>>I am crawling my local application:
> >>>
> >>>http://localhost:8080/search_sample/index.html
> >>>
> >>>I have put this URL in url.intranet.
> >>>
> >>>I added some hrefs to PowerPoint files in index.html
> >>>and then started the crawl, but they are not being crawled.
> >>>
> >>>
> >>>
> >>>Thanks in advance..
> >>>
> >>>thanks,
> >>>Ayyanar....
> >>>
> >>
> -- 
> Michael Nebel
> http://www.nebel.de/
> http://www.netluchs.de/
> 
> 


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
