Hi Ayyanar,
sorry for the delay, I have been out of the office for a few hours.
Have you activated the plugins? You need to extend plugin.includes.
Mine looks like this, for example:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. boost-urlpattern|
</description>
</property>
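If you are unsure whether a given plugin is covered by the expression, you can test it outside Nutch. The following is only a rough sketch in Python: it assumes the expression has to match the complete plugin directory name, and the directory names in the loop are illustrative, so check them against your own plugins folder.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|"
    "index-(basic|more)|query-(basic|site|url)|"
    "language-identifier|clustering-carrot2"
)


def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the whole directory name matches the expression.

    Assumption: whole-name matching, which is how this sketch reads
    "any plugin not matching this expression is excluded".
    """
    return re.fullmatch(pattern, plugin_dir) is not None


# Illustrative directory names, not a listing of a real installation.
for name in ["parse-mspowerpoint", "parse-msexcel", "parse-swf", "protocol-http"]:
    print(name, is_included(name))
```

With the value above, parse-mspowerpoint and parse-msexcel are included, while a name that is not listed in the parse-(...) group is not.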
Regards
Michael
Ayyanar Inbamohan wrote:
Hi Michael,
I have enabled the ppt extension in
crawl-urlfilter.txt, and now it is fetching the PowerPoint
files.
But I am getting the following error, because the content type
of the ppt files is not accepted by Nutch:
050906 175342 fetching
http://localhost:8080/search_sample/kmportal3.ppt
050906 175342 fetching
http://localhost:8080/search_sample/testpdf.pdf
050906 175342 fetching
http://localhost:8080/search_sample/kmportal10.ppt
050906 175342 fetching
http://localhost:8080/search_sample/testdoc.doc
050906 175342 fetching
http://localhost:8080/search_sample/kmportal2.ppt
050906 175342 fetching
http://localhost:8080/search_sample/kmportal4.ppt
050906 175342 fetching
http://localhost:8080/search_sample/kmportal6.ppt
050906 175342 fetching
http://localhost:8080/search_sample/testexcel.xls
050906 175342 fetching
http://localhost:8080/search_sample/javaCertStudyNotes.pdf
050906 175342 fetching
http://localhost:8080/search_sample/kmportal7.ppt
050906 175342 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal3.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175342 fetching
http://localhost:8080/search_sample/kmportal8.ppt
050906 175343 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal8.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175343 fetching
http://localhost:8080/search_sample/kmportal9.ppt
050906 175344 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal9.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175344 fetching
http://localhost:8080/search_sample/kmportal11.ppt
050906 175347 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal4.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175348 fetching
http://localhost:8080/search_sample/kmportal5.ppt
050906 175348 fetching
http://localhost:8080/search_sample/kmportal1.ppt
050906 175350 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal7.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175351 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal10.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175353 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal6.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175354 fetch okay, but can't parse
http://localhost:8080/search_sample/testexcel.xls,
reason: failed(2,203): Content-Type not
application/msword: application/vnd.ms-excel
050906 175355 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal11.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175356 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal5.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175358 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal1.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
050906 175359 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal2.ppt,
reason: failed(2,203): Content-Type not
application/msword: application/powerpoint
thanks,
Ayyanar..
--- Michael Nebel <[EMAIL PROTECTED]> wrote:
Hi,
have you checked the filters (regex-urlfilter or
crawl-urlfilter)? The
ending ".ppt" is disabled by default.
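To illustrate what such a filter rule does, here is a small Python sketch. The two patterns are hypothetical, shortened stand-ins for the suffix rule in the filter file, not the real default line; in the file itself the rule is prefixed with "-" to mean "skip URLs matching this expression".

```python
import re

# Hypothetical, shortened suffix rules (assumption, not the shipped default).
DEFAULT_SKIP = r"\.(gif|jpg|css|zip|ppt|xls|exe)$"
EDITED_SKIP = r"\.(gif|jpg|css|zip|exe)$"  # ppt and xls removed


def is_skipped(url, skip_pattern):
    """Return True if the URL ends with one of the skipped suffixes."""
    return re.search(skip_pattern, url) is not None


url = "http://localhost:8080/search_sample/kmportal3.ppt"
print(is_skipped(url, DEFAULT_SKIP))  # rule still lists ppt
print(is_skipped(url, EDITED_SKIP))   # rule after removing ppt
```

Once "ppt" is taken out of the suffix list, the URL is no longer rejected by this rule and can be fetched.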
Regards
Michael
Ayyanar Inbamohan wrote:
Hi all,
I am using the PowerPoint plugin from JIRA, but
when I
crawl my application, which links to ppt files, Nutch
0.7
does not fetch the PowerPoint files at all.
I am crawling my local application at
http://localhost:8080/search_sample/index.html
I have put this URL in url.intranet,
added some hrefs to PowerPoint files in index.html,
and then started the crawl, but the files are not crawled.
Thanks in advance..
thanks,
Ayyanar....
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/