Sebastian Nagel created NUTCH-3090:
--------------------------------------

             Summary: Plugin for MIME type detection
                 Key: NUTCH-3090
                 URL: https://issues.apache.org/jira/browse/NUTCH-3090
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
    Affects Versions: 1.21
            Reporter: Sebastian Nagel
             Fix For: 1.21


(suggested by [~hiranchaudhuri] in NUTCH-3089)

- introduce a new plugin extension point
  -- allow providing (and trying) different MIME detection tools
  -- but we'll start by moving the Tika Mime Magic detector from Nutch core 
into a plugin
     --- reduce the Nutch core dependencies
     --- would allow including the [container aware 
detection|https://tika.apache.org/3.0.0/detection.html#Container_Aware_Detection]
 in the plugin without adding tika-parsers-standard and its dependencies 
to the Nutch core dependencies. Cf. NUTCH-3089.
  -- although probably not two detectors at the same time, or we'd need to 
define how their results are weighted / combined
- provide a simple fall-back (cleansed HTTP Content-Type header) in case no 
mime-identifier plugin is activated via plugin.includes
- sharing Tika modules between parse-tika, mime-identifier-tika and 
language-identifier is possible if we create a lib-tika plugin, since plugins 
can depend on other plugins. It might even be split into lib-tika-core and 
lib-tika-parsers, or anything else.
- one remark: Content objects are created in protocol plugins as part of the 
ProtocolResponse. That is, we'll call a plugin from within a plugin. But this 
is no problem: the parse filter plugins are likewise called from within 
parser plugins.
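
To make the discussion a bit more concrete, here is a minimal Java sketch of 
what the extension point and the header-based fall-back could look like. All 
names (MimeTypeIdentifier, HeaderFallbackIdentifier, cleanse) and the method 
signature are hypothetical and not existing Nutch API. The cleansing rules 
(strip parameters, lowercase, require type/subtype) are only an assumption 
about what "cleansed" might mean.

```java
import java.util.Locale;

/** Hypothetical extension point; name and signature are assumptions, not Nutch API. */
interface MimeTypeIdentifier {
  /**
   * @param url         URL of the fetched document
   * @param content     raw content bytes (may be empty)
   * @param contentType raw HTTP Content-Type header value, may be null
   * @return the detected MIME type, or null if it cannot be determined
   */
  String identify(String url, byte[] content, String contentType);
}

/** Assumed fall-back when no mime-identifier plugin is active: use the cleansed header only. */
class HeaderFallbackIdentifier implements MimeTypeIdentifier {

  @Override
  public String identify(String url, byte[] content, String contentType) {
    // No content sniffing here: URL and bytes are deliberately ignored.
    return cleanse(contentType);
  }

  /** Strips parameters such as "; charset=...", trims, lowercases, and checks for type/subtype. */
  static String cleanse(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semicolon = contentType.indexOf(';');
    String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    type = type.trim().toLowerCase(Locale.ROOT);
    int slash = type.indexOf('/');
    // Reject values without a non-empty type and subtype, or with embedded spaces.
    if (slash <= 0 || slash == type.length() - 1 || type.indexOf(' ') >= 0) {
      return null;
    }
    return type;
  }
}
```

A Tika-based plugin would implement the same interface but inspect the content 
bytes (and possibly the URL) instead of trusting the header alone.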

Comments are welcome! This idea needs some specification.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
