Thanks for the help bud.
Yes, I changed the names of my files, but that was not the only change that
was made. All xml nodes/tags were removed from the documents.
I figured out that Nutch would not even attempt to parse these documents
while 'xml' was in the file name. So I simply renamed them (again),
excluding the .xml portion, and they parsed just fine.
Again, thanks for the help.
From: Jérôme Charron <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Parsing XML files
Date: Tue, 8 Nov 2005 21:33:34 +0100
On 11/8/05, Mike Reynols <[EMAIL PROTECTED]> wrote:
>
> Is there a plugin of some sort that I need in order to take a web site
> (which serves up a collection of xml documents) and crawl its non-HTML
> files?
Hi Mike,
First of all, which nutch version are you using?
Concerning XML, there's actually no parse-xml plugin in Nutch.
We are currently having some discussions with two other Nutch developers to
provide such a plugin... but it is still in the early stages.
> Now when I stripped out all the xml and left just raw text, I received the
> following error:
Ok, you renamed your documents... but what is the mime-type returned by
your server?
It seems it is application/xml => there's no parse plugin that handles such
a content-type.
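
One quick way to see which content type the server actually declares is to send a HEAD request and read the Content-Type header. The following is only a minimal Python sketch and is not part of Nutch; the function names media_type and served_media_type are made up for illustration, and the URL in the comment is a placeholder.

```python
from urllib.request import Request, urlopen


def media_type(content_type_header):
    """Return the bare media type (e.g. 'application/xml'), dropping any
    parameters such as '; charset=UTF-8' and normalising the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def served_media_type(url):
    """Send a HEAD request to `url` and return the media type declared
    in the response's Content-Type header ('' if the header is absent)."""
    request = Request(url, method="HEAD")
    with urlopen(request) as response:
        return media_type(response.headers.get("Content-Type", ""))


# Usage (hypothetical URL, substitute one of the crawled documents):
#   print(served_media_type("http://localhost:8080/docs/sample"))
```

If this reports application/xml even after the files were renamed, the server is still labelling them as XML regardless of the file name.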
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/