Jérôme Charron wrote:
I remember having played with that a wee bit, but the problem was that
the plugins themselves are riddled with pieces of code like the one
below, found in MSWordParser in release 0.7:
Yes, it's true: each parse plugin checks the content-type of the
provided content in its own code.
As you noticed, there's a real synchronization problem between the allowed
content-types specified in
the plugin.xml file and the ones checked within the code.
I propose two solutions:
1. No default behavior in the ParserFactory: i.e. if it doesn't find a
suitable plugin for a content-type, it must not parse the content (the
exact behavior in such a case is still to be defined: throw an exception,
simply ignore the content...?)
2. Provide in the plugin repository a way to retrieve the content-types
associated with a plugin, something like:
public static MimeType[] getAllowedMimeTypes(String pluginid);
Yes, that will definitely be needed sooner or later.
It's open for comments... and contributions too ;-)
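To make solutions 1 and 2 concrete, here is a minimal sketch of what such a lookup might look like. It is not taken from the Nutch code base: the class ContentTypeRegistry, its register and findPluginFor methods, and the plugin ids are hypothetical, and plain strings stand in for the MimeType class to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical registry for illustration only; not the Nutch plugin repository.
final class ContentTypeRegistry {
    // plugin id -> content types declared for it (as plugin.xml would declare them)
    private final Map<String, Set<String>> declared = new LinkedHashMap<>();

    void register(String pluginId, String... contentTypes) {
        declared.put(pluginId, Set.of(contentTypes));
    }

    // Counterpart of the proposed getAllowedMimeTypes(String pluginid),
    // returning plain strings (an assumption) in sorted order.
    String[] getAllowedMimeTypes(String pluginId) {
        Set<String> types = declared.getOrDefault(pluginId, Set.of());
        return types.stream().sorted().toArray(String[]::new);
    }

    // Solution 1: no default parser. An empty result means "do not parse";
    // the caller decides whether to throw or to ignore the content.
    Optional<String> findPluginFor(String contentType) {
        for (Map.Entry<String, Set<String>> e : declared.entrySet()) {
            if (e.getValue().contains(contentType)) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ContentTypeRegistry registry = new ContentTypeRegistry();
        // Illustrative ids and types only.
        registry.register("parse-msword", "application/msword");
        registry.register("parse-html", "text/html", "application/xhtml+xml");
        System.out.println(registry.findPluginFor("application/msword"));
        System.out.println(registry.findPluginFor("application/x-unknown"));
    }
}
```

With the declared types as the single source of truth, a plugin would no longer need its own hard-coded content-type check, which is where the two lists drift apart.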
3. Implement a catch-all plugin, equivalent to the Unix command
strings(1) (I have an implementation of that which I can contribute),
and make it switchable in the config: if it's off, unknown content
is skipped and logged; if it's on, make a best effort to extract
text.
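
For illustration, a rough sketch of the catch-all idea follows. This is not the implementation mentioned above: the class StringsFallback, its methods, and the boolean switch are hypothetical, and "printable" is assumed here to mean ASCII 0x20 to 0x7e plus tab, with a minimum run length of 4 as in the strings(1) default.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical strings(1)-like fallback; names and behavior are assumptions.
final class StringsFallback {
    // Collects every run of at least minLength printable bytes.
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        int start = -1; // index where the current run began, -1 if none
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i]);
            if (printable) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                // Run ended (or end of input reached): keep it if long enough.
                if (i - start >= minLength) {
                    runs.add(new String(data, start, i - start, StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return runs;
    }

    private static boolean isPrintable(byte b) {
        return (b >= 0x20 && b <= 0x7e) || b == '\t';
    }

    // enabled == false: skip and log the unknown content;
    // enabled == true: best-effort text extraction.
    static String parseUnknown(byte[] data, boolean enabled) {
        if (!enabled) {
            System.err.println("skipping content of unknown type (" + data.length + " bytes)");
            return "";
        }
        return String.join(" ", extract(data, 4));
    }
}
```

In a real setup the boolean would come from the configuration, and the extracted text would be wrapped in whatever parse result the other plugins return.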
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com