Andrzej Bialecki wrote:
3. implement a catch-all plugin, equivalent to the Unix command
strings(1) (I have an implementation of that which I can contribute),
and turn it off/on in the config. If it's off, unknown content
is skipped and logged; if it's on, make a best effort to extract
text.
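For readers unfamiliar with strings(1), here is a minimal sketch of the kind of best-effort extraction such a catch-all plugin might do. It is not Andrzej's implementation and not Nutch code; the class and method names are hypothetical, and it only keeps runs of printable ASCII of at least four characters (the default minimum in strings(1)).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: a rough analogue of strings(1).
class StringsExtractor {
    // strings(1) reports runs of at least 4 printable characters by default.
    static final int MIN_RUN = 4;

    // Returns every run of printable ASCII bytes that is at least MIN_RUN long.
    static List<String> extract(byte[] content) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : content) {
            int c = b & 0xff;
            if (c >= 0x20 && c <= 0x7e) {
                // Printable ASCII: extend the current run.
                current.append((char) c);
            } else {
                // Non-printable byte ends the run; keep it only if long enough.
                if (current.length() >= MIN_RUN) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the content.
        if (current.length() >= MIN_RUN) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Short runs are dropped on purpose, since they are mostly noise in binary formats.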
This is possible now by simply configuring a catch-all plugin to match
the empty suffix and removing the empty suffix from other plugins. So
it seems the problem is not that this is currently impossible, but
rather that it would be better to alter the configuration than the
plugin definitions.
So we might have ParserFactory read a config file that maps content
types and url suffixes to plugins. Folks can edit this file instead of
modifying the plugin declarations. It can also define default handlers
for unknown content types and unknown suffixes. This could either
augment or entirely replace the specifications in the plugins
themselves. Does this make sense?
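To make the proposal concrete, here is a rough sketch of what such a lookup could look like. Everything in it is an assumption for illustration: the class name, the properties-style file format, the key prefixes (type., suffix., default.type, default.suffix) and the plugin ids are invented, and this is not the existing ParserFactory code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

// Hypothetical sketch: maps content types and url suffixes to plugin ids.
class ParserMapping {
    private final Properties props = new Properties();

    // configText uses an assumed format, e.g.:
    //   type.application/pdf=parse-pdf
    //   suffix.ppt=parse-mspowerpoint
    //   default.suffix=parse-strings
    ParserMapping(String configText) {
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tries the content type first, then the suffix, then the matching default.
    // An empty result means "no handler": the caller would skip and log.
    Optional<String> pluginFor(String contentType, String suffix) {
        boolean hasType = contentType != null && !contentType.isEmpty();
        if (hasType) {
            String byType = props.getProperty("type." + contentType);
            if (byType != null) {
                return Optional.of(byType);
            }
        }
        if (suffix != null) {
            String bySuffix = props.getProperty("suffix." + suffix);
            if (bySuffix != null) {
                return Optional.of(bySuffix);
            }
        }
        // Separate defaults for unknown content types and unknown suffixes.
        String fallbackKey = hasType ? "default.type" : "default.suffix";
        return Optional.ofNullable(props.getProperty(fallbackKey));
    }
}
```

With this shape, turning the catch-all on or off is just adding or removing a default line in the file, with no change to any plugin declaration.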
Doug