Thanks for the help bud.

Yes, I changed the names of my files, but that was not the only change that was made. All XML nodes/tags were also removed from the documents.

I figured out that Nutch would not even attempt to parse these documents while 'xml' was in the file name. So I simply renamed them (again), dropping the .xml portion, and they parsed just fine.

Again, thanks for the help.

From: Jérôme Charron <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Parsing XML files
Date: Tue, 8 Nov 2005 21:33:34 +0100

On 11/8/05, Mike Reynols <[EMAIL PROTECTED]> wrote:
>
> Is there a plugin of some sort that I need in order to take a web site
> (which serves up a collection of XML documents) and crawl its non-HTML
> files?

Hi Mike,

First of all, which nutch version are you using?
Concerning XML, there is currently no parse-xml plugin in Nutch.
We are currently discussing with two other Nutch developers how to
provide such a plugin... but it is still in the early stages.

> Now when I stripped out all the XML and left just raw text, I received the
> following error:

Ok, you renamed your documents... but what is the mime-type returned by your
server?
It seems it is application/xml => there's no parse plugin that handles such a
content type.
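
One way to check what the server really returns is to look at the Content-Type header of the response. The following is only a minimal sketch in Python, not part of Nutch: the function names media_type and served_media_type are made up for illustration, and it assumes the server answers HEAD requests.

```python
from urllib.request import Request, urlopen


def media_type(header_value):
    """Strip parameters such as '; charset=UTF-8' from a Content-Type value."""
    return header_value.split(";", 1)[0].strip().lower()


def served_media_type(url, timeout=10):
    """Send a HEAD request and return the media type the server reports.

    Assumption: the server supports HEAD; some servers only answer GET.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return media_type(response.headers.get("Content-Type", ""))
```

Calling served_media_type() on the URL of one of the renamed documents should show whether the server still reports application/xml or something like text/plain.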
Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
