Is there a plugin of some sort that I need in order to crawl the non-HTML files of a web site (one which serves up a collection of XML documents)?

I have tried to crawl an Apache server of mine that has a directory listing of several hundred XML files, but the crawl failed with:

051108 113634 fetching http://www.example.com/example.xml.43779
051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792, reason: failed(2,203): Content-Type not text/html: application/xml
051108 113640 fetching http://buildhost.kozoru.com/example.xml.43812

When I then stripped out all the XML and left just raw text, I received the following error:

051108 113634 fetching http://www.example.com/example.xml.43779-txt
051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792-txt, reason: failed(2,203): Content-Type not text/html: application/xml
051108 113640 fetching http://www.example.com/example.xml.43812-txt

So you can see that neither set of files is parsed correctly, and I'm not entirely sure why. Is there any way I can parse a collection of non-HTML files and be able to search it?
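
In case it helps anyone reproduce this, here is a minimal sketch of how I would tally which Content-Type values the parser rejected. It assumes the log line layout shown above (other Nutch versions may format these lines differently), and the names `FAILURE` and `rejected_types` are just ones I made up for illustration:

```python
import re
from collections import Counter

# Assumption: parse-failure lines look like the ones in my log above.
FAILURE = re.compile(
    r"can't parse (?P<url>\S+), reason: .*"
    r"Content-Type not (?P<expected>\S+): (?P<actual>\S+)"
)

def rejected_types(log_lines):
    """Count the Content-Type values reported in parse-failure lines."""
    counts = Counter()
    for line in log_lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group("actual")] += 1
    return counts

# Two lines copied from the log above; only the second is a parse failure.
sample = [
    "051108 113634 fetching http://www.example.com/example.xml.43779",
    "051108 113640 fetch okay, but can't parse "
    "http://www.example.com/example.xml.43792, reason: failed(2,203): "
    "Content-Type not text/html: application/xml",
]
print(rejected_types(sample))
```

Running this over both logs would show whether the stripped-down text files are still being reported as application/xml rather than as plain text.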

I guess I'm confused as to the fundamentals of Nutch. If someone could please point me in the right direction, that'd be greatly appreciated. Thanks.
