[ https://issues.apache.org/jira/browse/TIKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-527:
-----------------------------

    Description: 
Background
-----------------
As of Tika 0.7, tika-config.xml is no longer mandatory, and loading 3rd party 
parsers as plugins through the service architecture is supported.

This introduces great flexibility, and even allows for extending Tika's file 
format support by simply dropping JARs onto the classpath. This is great for 
configuring Tika when it's embedded as part of another application such as Solr 
or Nutch. You can easily add support for e.g. a commercial document filter with 
a Tika wrapper without changing Tika or the consuming application, or even 
maintaining a tika-config.xml.

This serves the majority of all use cases.

Problem
------------
However, as the variety of 3rd party document parsers increases, we'll start 
seeing an overlap of parsers supporting the same mime-types. A very likely 
scenario is a company specializing in document filters packaging its parsers 
as a Tika plugin, under whatever license it chooses.

In this scenario, a system integrator (working with e.g. Solr) wants to gather 
all the parsers that the particular customer needs, and then choose which 
parser should handle each mime-type. She may want to let a 3rd party parser 
plugin handle Word files but have the Tika-supplied POI parser handle Excel.

Today, the last parser plugin that gets loaded by the class-loader happens to 
"win" the mime-types it supports. As it is not uncommon for one parser to 
register multiple mime-types, re-claiming a subset of the types is not possible 
unless you are consuming Tika directly.

We thus need an "override" mime-to-parser mapping by configuration, and Tika 
needs to look for this config by default when starting.
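
To make the "last loaded wins" behaviour concrete, here is a minimal Java 
sketch. It does not use Tika's API: the class name, the plugin names and the 
plain map standing in for the parser registry are all hypothetical, and the 
insertion order of the map stands in for the class-loader's load order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not how Tika is implemented.
class LastLoadedWins {

    /**
     * Registers plugins in load order (the iteration order of the map).
     * Each plugin claims every mime-type it lists, so a plugin loaded
     * later silently overwrites the claims of an earlier one.
     */
    static Map<String, String> register(Map<String, List<String>> pluginsInLoadOrder) {
        Map<String, String> mimeToParser = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> plugin : pluginsInLoadOrder.entrySet()) {
            for (String mime : plugin.getValue()) {
                mimeToParser.put(mime, plugin.getKey());
            }
        }
        return mimeToParser;
    }

    public static void main(String[] args) {
        // Assumed plugin names and mime-type lists, for illustration.
        Map<String, List<String>> plugins = new LinkedHashMap<>();
        plugins.put("tika-poi",
                Arrays.asList("application/msword", "application/vnd.ms-excel"));
        plugins.put("vendor-filter",
                Arrays.asList("application/msword", "application/vnd.ms-excel"));

        // Both types now map to the vendor plugin, because it was loaded last.
        System.out.println(register(plugins));
    }
}
```

With this behaviour there is no way to hand Excel back to the first plugin 
while leaving Word with the second, which is the gap an override mapping 
would fill.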


  was:
h2. Background
As of Tika 0.7, tika-config.xml is no longer mandatory, and loading 3rd party 
parsers as plugins through the service architecture is supported.

This introduces great flexibility, and even allows for extending Tika's file 
format support by simply dropping JARs onto the classpath. This is great for 
configuring Tika when it's embedded as part of another application such as Solr 
or Nutch. You can easily add support for e.g. a commercial document filter with 
a Tika wrapper without changing Tika or the consuming application, or even 
maintaining a tika-config.xml.

This serves the majority of all use cases.

h2. Problem
However, as the variety of 3rd party document parsers increases, we'll start 
seeing an overlap of parsers supporting the same mime-types. A very likely 
scenario is a company specializing in document filters packaging its parsers 
as a Tika plugin, under whatever license it chooses.

In this scenario, a system integrator (working with e.g. Solr) wants to gather 
all the parsers that the particular customer needs, and then choose which 
parser should handle each mime-type. She may want to let a 3rd party parser 
plugin handle Word files but have the Tika-supplied POI parser handle Excel.

Today, the last parser plugin that gets loaded by the class-loader happens to 
"win" the mime-types it supports. As it is not uncommon for one parser to 
register multiple mime-types, re-claiming a subset of the types is not possible 
unless you are consuming Tika directly.

h2. Solution
Allow for an "override" style mime-to-parser mapping by configuration. To keep 
the number of config files down, this is probably best done as an extension of 
the tika-config.xml syntax, allowing for specifying only the *changes*, without 
repeating all the mappings. Tika should look for tika-config.xml on the 
classpath by default, even if the file is not bundled by default.

Say we add a parser plugin which supports a bunch of Office formats, but whose 
Excel parser is poor. We want to explicitly give control over 
application/vnd.ms-excel to the POI parser. Here's how that could be done by 
adding support for an "append" attribute on the <parsers> and <parser> tags, 
instead of repeating all mime types:

{code:xml}
<properties>
    <parsers append="true">
        <parser name="parse-office" class="org.apache.tika.parser.microsoft.OfficeParser" append="true">
            <mime>application/vnd.ms-excel</mime>
            <mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
        </parser>
    </parsers>
</properties>
{code}

When Tika sees append="true", it will first initialize everything with the 
defaults, and then re-apply only the mappings explicitly specified. If you want 
to remove support for a parser by config, you could specify a <parser...> tag 
without an append attribute and with no <mime> sub-tags.

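The proposed "append" semantics can be sketched in plain Java as follows. This 
is only one possible reading of the proposal, not Tika code: the class name, 
the method applyParserTag and the com.example.VendorOfficeParser class are 
hypothetical, and a map from mime-type to parser class name stands in for 
Tika's registry.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed semantics; not part of Tika.
class AppendOverrideSketch {

    /**
     * Applies one <parser> tag on top of the default mime-to-parser mapping.
     * append == true : keep all defaults, then re-map only the listed types.
     * append == false: first drop every default mapping of this parser, then
     *                  map the listed types (an empty list removes the parser).
     */
    static Map<String, String> applyParserTag(Map<String, String> defaults,
                                              String parserClass,
                                              boolean append,
                                              List<String> mimes) {
        Map<String, String> result = new HashMap<>(defaults);
        if (!append) {
            result.values().removeIf(parserClass::equals);
        }
        for (String mime : mimes) {
            result.put(mime, parserClass);
        }
        return result;
    }

    public static void main(String[] args) {
        String vendor = "com.example.VendorOfficeParser"; // assumed 3rd party parser
        String poi = "org.apache.tika.parser.microsoft.OfficeParser";

        // Assumed outcome of default loading: the vendor plugin won both types.
        Map<String, String> defaults = new HashMap<>();
        defaults.put("application/msword", vendor);
        defaults.put("application/vnd.ms-excel", vendor);

        // append="true": only Excel is handed back to the POI parser.
        Map<String, String> merged = applyParserTag(
                defaults, poi, true, Arrays.asList("application/vnd.ms-excel"));
        System.out.println(merged.get("application/msword"));
        System.out.println(merged.get("application/vnd.ms-excel"));

        // No append attribute and no <mime> tags: the vendor parser is removed.
        Map<String, String> removed = applyParserTag(
                defaults, vendor, false, Collections.<String>emptyList());
        System.out.println(removed.isEmpty());
    }
}
```

Copying the defaults before applying the tag keeps the default mapping intact, 
so several tags could be applied one after another in the same way.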

> Allow override mapping mime<-->parsers through config
> -----------------------------------------------------
>
>                 Key: TIKA-527
>                 URL: https://issues.apache.org/jira/browse/TIKA-527
>             Project: Tika
>          Issue Type: Improvement
>          Components: config
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>
> Background
> -----------------
> As of Tika 0.7, tika-config.xml is no longer mandatory, and loading 3rd party 
> parsers as plugins through the service architecture is supported.
> This introduces great flexibility, and even allows for extending Tika's file 
> format support by simply dropping JARs onto the classpath. This is great 
> for configuring Tika when it's embedded as part of another application such 
> as Solr or Nutch. You can easily add support for e.g. a commercial document 
> filter with a Tika wrapper without changing Tika or the consuming 
> application, or even maintaining a tika-config.xml.
> This serves the majority of all use cases.
> Problem
> ------------
> However, as the variety of 3rd party document parsers increases, we'll start 
> seeing an overlap of parsers supporting the same mime-types. A very likely 
> scenario is a company specializing in document filters packaging its parsers 
> as a Tika plugin, under whatever license it chooses.
> In this scenario, a system integrator (working with e.g. Solr) wants to 
> gather all the parsers that the particular customer needs, and then choose 
> which parser should handle each mime-type. She may want to let a 3rd party 
> parser plugin handle Word files but have the Tika-supplied POI parser handle 
> Excel.
> Today, the last parser plugin that gets loaded by the class-loader happens to 
> "win" the mime-types it supports. As it is not uncommon for one parser to 
> register multiple mime-types, re-claiming a subset of the types is not 
> possible unless you are consuming Tika directly.
> We thus need an "override" mime-to-parser mapping by configuration, and Tika 
> needs to look for this config by default when starting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
