> 
> I remember having played with that a wee bit, but the problem was that
> the plugins themselves are riddled with pieces of code like the one
> below, found in MSWordParser in release 0.7:

Yes, that's true: each parse plugin checks, in its own code, the 
content-type of the content it is given.
As you noticed, there is a real synchronization problem between the 
allowed content-types declared in the plugin.xml file and the ones 
checked within the code.
I propose two solutions:

1. No default behavior in the ParserFactory: i.e. if it doesn't find a 
suitable plugin for a content-type, it must not parse the content (the 
exact behavior in such a case is still to be defined: throw an exception, 
simply ignore the content...?)

2. Provide in the plugin repository a way to retrieve the content-types 
associated with a plugin, something like: 
public static MimeType[] getAllowedMimeTypes(String pluginid);
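
To make the two proposals above a bit more concrete, here is a rough, 
self-contained sketch. It is not existing Nutch code: the class name 
PluginContentTypes, the register and findPluginFor methods, and the 
NoSuitableParserException are all hypothetical, plain Strings stand in 
for MimeType, and the registry is an instance rather than the static 
method suggested above. It assumes the declared content-types have 
already been read from each plugin.xml.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Nutch.
public class PluginContentTypes {

    // Proposal 1: raised instead of silently falling back to a default parser.
    public static class NoSuitableParserException extends Exception {
        public NoSuitableParserException(String contentType) {
            super("No parse plugin declared for content-type: " + contentType);
        }
    }

    // plugin id -> content-types it declares (stand-in for the plugin.xml data)
    private final Map<String, List<String>> declared = new LinkedHashMap<>();

    // Record the content-types a plugin declares (assumed to come from plugin.xml).
    public void register(String pluginId, String... contentTypes) {
        declared.put(pluginId, Arrays.asList(contentTypes));
    }

    // Proposal 2: let a plugin (or anyone else) ask which content-types
    // are declared for a plugin id, so the check lives in one place.
    public String[] getAllowedMimeTypes(String pluginId) {
        List<String> types = declared.get(pluginId);
        return types == null ? new String[0] : types.toArray(new String[0]);
    }

    // Proposal 1: return the first plugin declaring this content-type,
    // or throw if there is none (no default behavior).
    public String findPluginFor(String contentType) throws NoSuitableParserException {
        for (Map.Entry<String, List<String>> entry : declared.entrySet()) {
            if (entry.getValue().contains(contentType)) {
                return entry.getKey();
            }
        }
        throw new NoSuitableParserException(contentType);
    }
}
```

With something like this, a parser would no longer hard-code its own 
content-type test. It could compare the incoming content-type against 
getAllowedMimeTypes for its own id, so plugin.xml stays the single source 
of truth. Whether to throw or just skip the content in findPluginFor is 
exactly the open question from point 1.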

It's open for comments... and contributions too ;-)

> 2. Remember that powerpoint plugin is not part of the Nutch-0.7
> > release...
> Now, you'll have to find a better one than that, Jerome! :)

I would have tested
;-)

Regards 

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
