> I remember having played with that a wee bit, but the problem was that
> the plugins themselves are riddled with pieces of code like the one
> below, found in MSWordParser in release 0.7:
Yes, it's true: each parse plugin checks the content-type of the provided
content in its own code. As you noticed, there is a real synchronization
problem between the allowed content-types specified in the plugin.xml file
and the ones checked within the code.

I propose two solutions:

1. No default behavior in the ParserFactory: i.e. if it doesn't find a
   suitable plugin for a content-type, it must not parse the content (the
   exact behavior in such a case is still to be defined: throw an
   exception, simply ignore the content...?)

2. Provide in the plugin repository a way to retrieve the content-types
   associated with a plugin, something like:

   public static MimeType[] getAllowedMimeTypes(String pluginid);

It's open for comments... and contributions too ;-)

> > 2. Remember that powerpoint plugin is not part of the Nutch-0.7
> > release...
> Now, you'll have to find a better one than that, Jerome! :)

I would have tested ;-)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
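
P.S. To make the two options a bit more concrete, here is a rough sketch of how they could fit together. It is not Nutch code: the class name PluginContentTypes and the methods register and findPluginFor are invented for illustration, and content types are plain strings matched exactly rather than MimeType objects. Only getAllowedMimeTypes comes from the proposal above, and even there the sketch uses an instance method instead of a static one to stay self-contained.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not Nutch code: one registry acts as the single
 * source of truth for the content types each parse plugin accepts, so the
 * plugins no longer need to re-check the content type themselves.
 */
final class PluginContentTypes {

    // plugin id -> content types declared for it (e.g. as read from plugin.xml)
    private final Map<String, String[]> allowed = new LinkedHashMap<>();

    /** Records the content types declared for a plugin. */
    void register(String pluginId, String... contentTypes) {
        allowed.put(pluginId, contentTypes.clone());
    }

    /** Option 2: the declared content types, or an empty array for an unknown plugin. */
    String[] getAllowedMimeTypes(String pluginId) {
        String[] types = allowed.get(pluginId);
        return types == null ? new String[0] : types.clone();
    }

    /** Option 1: no default parser; fail loudly when nothing is declared for the type. */
    String findPluginFor(String contentType) {
        for (Map.Entry<String, String[]> entry : allowed.entrySet()) {
            if (Arrays.asList(entry.getValue()).contains(contentType)) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException(
                "No parse plugin declared for content type: " + contentType);
    }
}
```

With something like this, a caller would register "parse-msword" for "application/msword" once, and a lookup for any undeclared type would raise an exception instead of silently falling back to a default parser. Whether to throw or to simply skip the content is exactly the open question in option 1; the exception here is only one of the two possibilities.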
