Is there a plugin of some sort that I need in order to crawl the non-HTML
files of a web site (one that serves up a collection of XML documents)?
I have tried to crawl an Apache server of mine that has a directory listing
of several hundred XML files, but it failed with:
051108 113634 fetching http://www.example.com/example.xml.43779
051108 113640 fetch okay, but can't parse
http://www.example.com/example.xml.43792, reason: failed(2,203):
Content-Type not text/html: application/xml
051108 113640 fetching http://buildhost.kozoru.com/example.xml.43812
When I stripped out all the XML and left just raw text, I received the
following error:
051108 113634 fetching http://www.example.com/example.xml.43779-txt
051108 113640 fetch okay, but can't parse
http://www.example.com/example.xml.43792-txt, reason: failed(2,203):
Content-Type not text/html: application/xml
051108 113640 fetching http://www.example.com/example.xml.43812-txt
So you can see that neither is parsing correctly, and I'm not entirely sure
why. Is there any way I can parse a collection of non-HTML files and be able
to search it?
I guess I'm confused as to the fundamentals of Nutch. If someone could
please point me in the right direction, that'd be greatly appreciated.
Thanks.