Jérôme Charron wrote:
I remember having played with that a wee bit, but the problem was that
the plugins themselves are riddled with pieces of code like the one
below, found in MSWordParser in release 0.7:


Yes, it's true: each parse plugin checks the content-type of the provided content in its own code. As you noticed, there's a real synchronization problem between the allowed content-types specified in
the plugin.xml file and the ones checked within the code.
I propose two solutions:

1. No default behavior in the ParserFactory: i.e. if it doesn't find a suitable plugin for a content-type, it must not parse the content (the exact behavior in such a case is still to be defined: throw an exception, simply ignore the content...?)

2. Provide a way in the plugin repository to retrieve the content-types associated with a plugin, something like: public static MimeType[] getAllowedMimeTypes(String pluginid);

Yes, that will definitely be needed sooner or later.


It's open for comments... and contributions too ;-)

3. Implement a catch-all plugin, equivalent to the Unix command strings(1) (I have an implementation of that which I can contribute), and make it switchable in the config: if it's off, unknown content is skipped and logged; if it's on, a best effort is made to extract text.
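
To make the three proposals concrete, here is a rough sketch of how they could fit together. It is not Nutch code: the class ParserSelectionSketch, the TextExtractor interface, and the register/parse methods are all hypothetical, and content types are plain strings rather than MimeType objects. Only the name getAllowedMimeTypes is taken from proposal 2. The fallback keeps runs of printable ASCII bytes, which only approximates what strings(1) does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Nutch under these names.
class ParserSelectionSketch {

    /** Minimal stand-in for a parse plugin: turns raw bytes into text. */
    interface TextExtractor {
        String extract(byte[] content);
    }

    // Content types declared per plugin id (what plugin.xml would declare).
    private final Map<String, Set<String>> typesByPlugin = new HashMap<>();
    // Reverse lookup used at parse time, so plugins need no type checks of their own.
    private final Map<String, TextExtractor> extractorByType = new HashMap<>();
    // Config switch for the catch-all extractor (proposal 3).
    private final boolean catchAllEnabled;

    ParserSelectionSketch(boolean catchAllEnabled) {
        this.catchAllEnabled = catchAllEnabled;
    }

    void register(String pluginId, TextExtractor extractor, String... contentTypes) {
        Set<String> types = typesByPlugin.computeIfAbsent(pluginId, k -> new LinkedHashSet<>());
        for (String type : contentTypes) {
            types.add(type);
            extractorByType.put(type, extractor);
        }
    }

    /** Proposal 2: content types associated with a plugin (empty if the id is unknown). */
    Set<String> getAllowedMimeTypes(String pluginId) {
        Set<String> types = typesByPlugin.get(pluginId);
        return types == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(types);
    }

    /**
     * Proposal 1: no silent default parser. An empty result means the caller
     * should skip and log the content. Proposal 3: if the catch-all is enabled,
     * unknown content goes through a best-effort text extraction instead.
     */
    Optional<String> parse(String contentType, byte[] content) {
        TextExtractor extractor = extractorByType.get(contentType);
        if (extractor != null) {
            return Optional.of(extractor.extract(content));
        }
        if (catchAllEnabled) {
            return Optional.of(extractPrintableRuns(content, 4));
        }
        return Optional.empty();
    }

    /** Keeps runs of at least minRun printable ASCII bytes, one run per line. */
    static String extractPrintableRuns(byte[] content, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : content) {
            if (b >= 0x20 && b <= 0x7e) {
                run.append((char) b);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun);
        return out.toString();
    }

    // Appends the current run to the output if it is long enough, then resets it.
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(run);
        }
        run.setLength(0);
    }
}
```

With the switch off, parse() returns an empty Optional for an unregistered content type, leaving the skip-and-log decision to the caller. With it on, the same call returns whatever printable runs were found. Whether the "no plugin" case should rather throw an exception is exactly the open question in proposal 1; the Optional here is just one possible choice.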


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
