> One part of fixing this problem is correct mime type identification for
> document types, which I know that Jerome is working on an update to, and
> will soon have a new mime type registry committed to Nutch.

The future MIME Type Registry will be compatible with the FreeDesktop
Shared MIME-info specification:
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.13.html

As you can see, this specification provides an XML recognition mechanism:
*root-XML* elements, which identify the precise MIME type of an XML
document from its namespaceURI and/or its localName.

This part of the specification is not implemented yet (but it is planned),
so in the near future (I hope!) the MIME Type Registry should be able to
solve your use case.

Regards,
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
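
To illustrate the root-XML idea mentioned above, here is a minimal sketch in Java. It is not the Nutch registry code (that part is not implemented yet): the class name RootXmlSketch, the rule table, and the "application/xml" fallback are all assumptions made for the example, and the matching is simplified to an exact comparison of the root element's namespace URI and local name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class RootXmlSketch {

    // Hypothetical rule table: "namespaceURI|localName" -> MIME type.
    // A real registry would load such rules from the shared-mime-info XML.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("http://www.w3.org/2000/svg|svg", "image/svg+xml");
        RULES.put("http://www.w3.org/1999/xhtml|html", "application/xhtml+xml");
    }

    // Returns the MIME type matching the document's root element,
    // or the generic "application/xml" when no rule applies (assumed fallback).
    public static String detect(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to see namespace URI and local name
        Element root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        String ns = root.getNamespaceURI() == null ? "" : root.getNamespaceURI();
        return RULES.getOrDefault(ns + "|" + root.getLocalName(), "application/xml");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detect("<svg xmlns=\"http://www.w3.org/2000/svg\"/>"));
    }
}
```

A real implementation would probably avoid building a full DOM and stop reading after the first start tag (for example with a streaming parser), since only the root element matters here.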
