Sebastian Nagel created NUTCH-3090:
--------------------------------------

             Summary: Plugin for MIME type detection
                 Key: NUTCH-3090
                 URL: https://issues.apache.org/jira/browse/NUTCH-3090
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
    Affects Versions: 1.21
            Reporter: Sebastian Nagel
             Fix For: 1.21

(suggested by [~hiranchaudhuri] in NUTCH-3089)

- introduce a new plugin extension point
-- this allows providing (and trying) different MIME detection tools
-- but we'll start by moving the Tika Mime Magic detector from Nutch core into a plugin
--- this reduces the Nutch core dependencies
--- it would allow including the [container aware detection|https://tika.apache.org/3.0.0/detection.html#Container_Aware_Detection] in the plugin without adding tika-parsers-standard and its dependencies to the Nutch core dependencies. Cf. NUTCH-3089.
-- although maybe not two detection tools at the same time, or we'd need to define how their results are weighted / combined
- provide a simple fall-back (cleansed HTTP Content-Type header) in case no mime-identifier plugin is activated via plugin.includes
- sharing Tika modules between parse-tika, mime-identifier-tika and language-identifier is possible if we create a lib-tika plugin, since plugins can depend on other plugins. It might even be split into lib-tika-core and lib-tika-parsers, or anything else.
- one remark: Content objects are created in protocol plugins as part of the ProtocolResponse. That is, we'll call a plugin from within a plugin. But this is no problem: the parse filter plugins are also called from within parser plugins.

Comments are welcome! This idea needs some specification.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
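
To make the fall-back idea a bit more concrete, here is a minimal sketch of what the extension point and the header-based fall-back might look like. Everything in it is an assumption for discussion: the interface name MimeTypeIdentifier, its method signature and the class HeaderMimeTypeIdentifier do not exist in Nutch, and the cleansing rules (drop parameters, lowercase, require a type/subtype pair) are only one possible choice.

```java
import java.util.Locale;

/** Hypothetical extension point; name and signature are assumptions, not Nutch API. */
interface MimeTypeIdentifier {
    /** Returns the detected MIME type, or null if it cannot be determined. */
    String identify(String url, byte[] content, String contentTypeHeader);
}

/** Hypothetical fall-back for the case that no mime-identifier plugin is activated. */
final class HeaderMimeTypeIdentifier implements MimeTypeIdentifier {

    @Override
    public String identify(String url, byte[] content, String contentTypeHeader) {
        // The fall-back ignores URL and content and trusts the HTTP header only.
        return cleanse(contentTypeHeader);
    }

    /** Reduces a raw Content-Type header value to a bare "type/subtype", or null. */
    static String cleanse(String header) {
        if (header == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = header.indexOf(';');
        String type = (semicolon >= 0 ? header.substring(0, semicolon) : header)
            .trim()
            .toLowerCase(Locale.ROOT);
        // Accept only a plausible "type/subtype" pair, reject everything else.
        if (!type.matches("[a-z0-9!#$&^_.+-]+/[a-z0-9!#$&^_.+-]+")) {
            return null;
        }
        return type;
    }
}
```

A Tika-based plugin would implement the same interface and could inspect the content bytes as well. How a null result is handled (for example, falling back to a default type) is left open here and would be part of the specification.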