Hi Ayyanar,

Sorry for the delay; I've been out of the office for some hours.

Have you activated the plugins? You need to extend the plugin.includes property. Mine looks like this, for example:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text
  via HTTP, and basic indexing and search plugins.
  </description>
</property>
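
To see why the parse-mspowerpoint entry matters, here is a small sketch of how such an expression selects plugin directories. It is only an illustration in Python, not Nutch's own code: it assumes the whole directory name has to match the expression, and the DEFAULT_LIKE value and the directory name "parse-ext" are examples chosen for the comparison, not copied from a Nutch release.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|"
    "index-(basic|more)|query-(basic|site|url)|"
    "language-identifier|clustering-carrot2"
)

# Assumption: a narrower, default-style value that only parses text and HTML.
DEFAULT_LIKE = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|html)|index-basic|query-(basic|site|url)"
)


def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the whole directory name matches the expression."""
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    for name in ["parse-mspowerpoint", "parse-msexcel", "parse-ext"]:
        print(name, is_included(name), is_included(name, DEFAULT_LIKE))
```

With the narrower value, parse-mspowerpoint is not selected at all, so a PowerPoint parser in the plugins directory would never be loaded.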

Regards

        Michael


Ayyanar Inbamohan wrote:

Hi Michael,

I have enabled the ppt extension in crawl-urlfilter.txt, and Nutch is
now fetching the PowerPoint files.

But I am getting the following error, because the content type of the
ppt files is not handled by Nutch:



050906 175342 fetching http://localhost:8080/search_sample/kmportal3.ppt
050906 175342 fetching http://localhost:8080/search_sample/testpdf.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal10.ppt
050906 175342 fetching http://localhost:8080/search_sample/testdoc.doc
050906 175342 fetching http://localhost:8080/search_sample/kmportal2.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal4.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal6.ppt
050906 175342 fetching http://localhost:8080/search_sample/testexcel.xls
050906 175342 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal7.ppt
050906 175342 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175342 fetching http://localhost:8080/search_sample/kmportal8.ppt
050906 175343 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal8.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175343 fetching http://localhost:8080/search_sample/kmportal9.ppt
050906 175344 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal9.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175344 fetching http://localhost:8080/search_sample/kmportal11.ppt
050906 175347 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal4.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175348 fetching http://localhost:8080/search_sample/kmportal5.ppt
050906 175348 fetching http://localhost:8080/search_sample/kmportal1.ppt
050906 175350 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal7.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175351 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal10.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175353 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal6.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175354 fetch okay, but can't parse http://localhost:8080/search_sample/testexcel.xls, reason: failed(2,203): Content-Type not application/msword: application/vnd.ms-excel
050906 175355 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal11.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175356 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal5.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175358 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal1.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175359 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal2.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint


thanks,
Ayyanar..

--- Michael Nebel <[EMAIL PROTECTED]> wrote:


Hi,

Have you checked the filters (regex-urlfilter or
crawl-urlfilter)? The ".ppt" ending is disabled by default.
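
Roughly, the filter behaves like the sketch below. This is a simplified Python illustration rather than Nutch's implementation: it assumes each rule is a "+" or "-" followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is skipped. Both rule lists are made-up examples, not the contents of the shipped filter files.

```python
import re


def accepts(url, rules):
    """First matching rule wins: '+' accepts, '-' rejects, no match rejects."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# Hypothetical rule sets for illustration only.
RULES_PPT_SKIPPED = [
    r"-\.(gif|jpg|css|zip|ppt|exe)$",   # suffix rule that still lists ppt
    r"+^http://localhost:8080/",
]
RULES_PPT_ALLOWED = [
    r"-\.(gif|jpg|css|zip|exe)$",       # same rule with ppt removed
    r"+^http://localhost:8080/",
]

if __name__ == "__main__":
    url = "http://localhost:8080/search_sample/kmportal3.ppt"
    print(accepts(url, RULES_PPT_SKIPPED))
    print(accepts(url, RULES_PPT_ALLOWED))
```

As long as "ppt" is in the suffix rule, the URL is dropped before it is ever fetched; taking it out of that rule lets the later "+" rule accept it.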

Regards

        Michael

Ayyanar Inbamohan wrote:


Hi all,

I am using the PowerPoint plugin from JIRA, and when I crawl my
application, which has a link to the ppt, Nutch 7.0 is not fetching
the PowerPoint files at all.

I am crawling my local application at
http://localhost:8080/search_sample/index.html

I have put this URL in url.intranet, and I added some hrefs to
PowerPoint files in index.html.

Then I started the crawl, but it is not crawling them.



Thanks in advance..

thanks,
Ayyanar....


--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
