Parsing XML files

Mike Reynols Tue, 08 Nov 2005 09:49:50 -0800

Is there a plugin of some sort that I need in order to take a web site (which serves up a collection of XML documents) and crawl its non-HTML files?

I have tried to crawl an Apache server of mine that has a directory listing of several hundred XML files, but it failed with:


051108 113634 fetching http://www.example.com/example.xml.43779

051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792, reason: failed(2,203): Content-Type not text/html: application/xml

051108 113640 fetching http://buildhost.kozoru.com/example.xml.43812

Now, when I stripped out all the XML and left just raw text, I received the following error:


051108 113634 fetching http://www.example.com/example.xml.43779-txt

051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792-txt, reason: failed(2,203): Content-Type not text/html: application/xml

051108 113640 fetching http://www.example.com/example.xml.43812-txt

So you can see that neither is parsing correctly, and I'm not entirely sure why. Is there any way I can parse a collection of non-HTML files and be able to search it?

I guess I'm confused as to the fundamentals of Nutch. If someone could please point me in the right direction, that'd be greatly appreciated. Thanks.

