Sebastian Nagel created NUTCH-3090:
--------------------------------------
Summary: Plugin for MIME type detection
Key: NUTCH-3090
URL: https://issues.apache.org/jira/browse/NUTCH-3090
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.21
Reporter: Sebastian Nagel
Fix For: 1.21
(suggested by [~hiranchaudhuri] in NUTCH-3089)
- introduce a new plugin extension point
-- this allows providing (and trying out) different MIME detection tools
-- as a first step, we'll move the Tika MIME magic detector from Nutch core
into a plugin
--- reduce the Nutch core dependencies
--- would allow including the [container aware
detection|https://tika.apache.org/3.0.0/detection.html#Container_Aware_Detection]
in the plugin without adding tika-parsers-standard and its dependencies
to the Nutch core dependencies. Cf. NUTCH-3089.
-- although maybe not more than one detector should be active at the same
time, or we'd need to define how their results are weighted / combined
- provide a simple fall-back (cleansed HTTP Content-Type header) in case no
mime-identifier plugin is activated via plugin.includes
- sharing Tika modules between parse-tika, mime-identifier-tika and
language-identifier is possible if we create a lib-tika plugin, since plugins
can depend on other plugins. It might even be split further, e.g. into
lib-tika-core and lib-tika-parsers, or anything else.
- one remark: Content objects are created in protocol plugins as part of the
ProtocolResponse. That is, we'll call a plugin from within a plugin. But this
is not a problem: parse filter plugins are also called from within parser
plugins.
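
To make the extension point and the fall-back more concrete, here is a minimal sketch in Java. Everything in it is an assumption for discussion, not existing Nutch API: the interface name MimeTypeDetector, its detect() signature, the class HeaderMimeTypeDetector, and the choice of application/octet-stream as the default when the header is missing or unusable.

```java
import java.util.Locale;

/**
 * Hypothetical extension point (name and signature are assumptions):
 * a detector gets the URL, the raw content and the HTTP Content-Type
 * header and returns a MIME type string.
 */
interface MimeTypeDetector {
  String detect(String url, byte[] content, String contentTypeHeader);
}

/**
 * Sketch of the simple fall-back: no content sniffing at all, only the
 * cleansed HTTP Content-Type header is used.
 */
final class HeaderMimeTypeDetector implements MimeTypeDetector {

  /** Assumed default when the header is missing or unusable. */
  static final String DEFAULT_TYPE = "application/octet-stream";

  @Override
  public String detect(String url, byte[] content, String contentTypeHeader) {
    return cleanse(contentTypeHeader);
  }

  /** Strips parameters (e.g. charset), trims and lowercases the header value. */
  static String cleanse(String header) {
    if (header == null) {
      return DEFAULT_TYPE;
    }
    // Drop everything after the first ';' (parameters such as charset).
    int semi = header.indexOf(';');
    String type = (semi >= 0 ? header.substring(0, semi) : header)
        .trim().toLowerCase(Locale.ROOT);
    // Minimal sanity check: exactly the shape "type/subtype", no spaces.
    int slash = type.indexOf('/');
    boolean malformed = slash <= 0
        || slash == type.length() - 1
        || type.indexOf('/', slash + 1) >= 0
        || type.indexOf(' ') >= 0;
    return malformed ? DEFAULT_TYPE : type;
  }
}
```

A Tika-based plugin would implement the same interface and inspect the content bytes instead of trusting the header. Whether the fall-back should return a default type or signal "unknown" is one of the points that needs specification.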
Comments are welcome! This idea needs some specification.