Hi Michael,

I have enabled the ppt extension in crawl-urlfilter.txt, and Nutch is now fetching the PowerPoint files. But I am getting the following error, because the content type of the ppt files is not accepted by Nutch:

050906 175342 fetching http://localhost:8080/search_sample/kmportal3.ppt
050906 175342 fetching http://localhost:8080/search_sample/testpdf.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal10.ppt
050906 175342 fetching http://localhost:8080/search_sample/testdoc.doc
050906 175342 fetching http://localhost:8080/search_sample/kmportal2.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal4.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal6.ppt
050906 175342 fetching http://localhost:8080/search_sample/testexcel.xls
050906 175342 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal7.ppt
050906 175342 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175342 fetching http://localhost:8080/search_sample/kmportal8.ppt
050906 175343 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal8.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175343 fetching http://localhost:8080/search_sample/kmportal9.ppt
050906 175344 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal9.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175344 fetching http://localhost:8080/search_sample/kmportal11.ppt
050906 175347 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal4.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175348 fetching http://localhost:8080/search_sample/kmportal5.ppt
050906 175348 fetching http://localhost:8080/search_sample/kmportal1.ppt
050906 175350 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal7.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175351 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal10.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175353 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal6.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175354 fetch okay, but can't parse http://localhost:8080/search_sample/testexcel.xls, reason: failed(2,203): Content-Type not application/msword: application/vnd.ms-excel
050906 175355 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal11.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175356 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal5.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175358 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal1.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175359 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal2.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint

thanks,
Ayyanar..

--- Michael Nebel <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Have you checked the filters (regex-urlfilter or
> crawl-urlfilter)? The ending ".ppt" is disabled by default.
>
> Regards
>
> Michael
>
> Ayyanar Inbamohan wrote:
>
> > Hi all,
> >
> > I am using the powerpoint plugin from JIRA, and when I
> > crawl my application, which has a link to the ppt, nutch 7.0
> > is not fetching the powerpoint files at all.
> >
> > I am crawling my local application
> >
> > http://localhost:8080/search_sample/index.html
> >
> > I have given this url in the url.intranet.
> >
> > I put some hrefs to powerpoint files in index.html
> > and then started the crawl, but it is not crawling them.
> >
> > Thanks in advance..
> >
> > thanks,
> > Ayyanar....
>
> --
> Michael Nebel
> http://www.nebel.de/
> http://www.netluchs.de/

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
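
The parse failures in the log above all have the same shape, so they can be tallied by the Content-Type the parser expected versus the one it was given. The following is only a minimal sketch for reading such a log, not part of Nutch: the names `FAILURE`, `summarize_parse_failures`, and `sample` are hypothetical, and the pattern assumes the message format exactly as it appears in this log.

```python
import re
from collections import Counter

# Assumed message shape, copied from the log in this thread:
#   ... can't parse <url>, reason: failed(2,203): Content-Type not <expected>: <actual>
FAILURE = re.compile(
    r"can't parse (?P<url>\S+), reason: failed\(\d+,\d+\): "
    r"Content-Type not (?P<expected>\S+): (?P<actual>\S+)"
)


def summarize_parse_failures(log_text):
    """Count parse failures per (expected, actual) Content-Type pair."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        (m.group("expected"), m.group("actual")) for m in FAILURE.finditer(flat)
    )


# Three entries taken from the log above (hypothetical variable name).
sample = """\
050906 175342 fetching http://localhost:8080/search_sample/testdoc.doc
050906 175342 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175354 fetch okay, but can't parse http://localhost:8080/search_sample/testexcel.xls, reason: failed(2,203): Content-Type not application/msword: application/vnd.ms-excel
"""

if __name__ == "__main__":
    for (expected, actual), count in summarize_parse_failures(sample).items():
        print(f"{count} x parser expected {expected}, document was {actual}")
```

Grouping by the pair makes it visible that every failure reports `application/msword` as the expected type, for the .ppt files as well as for the .xls file.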
