[ 
https://issues.apache.org/jira/browse/TIKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919647#action_12919647
 ] 

Jukka Zitting commented on TIKA-527:
------------------------------------

In revision 1006336 I added an org.apache.tika.parser.DefaultParser class that 
can be used to achieve most of this use case together with the existing 
TikaConfig functionality. Here's an example:

<properties>
    <parsers>
        <!-- Load all available parsers -->
        <parser class="org.apache.tika.parser.DefaultParser"/>

        <!-- Override parsing of all types supported by CustomParser -->
        <parser class="org.example.CustomParser"/>

        <!-- Explicitly disable parsing of Zip archives -->
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>application/zip</mime>
        </parser>
    </parsers>
</properties>
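
The comments in the example imply that entries are applied in document order, with a later entry taking over the types it claims from an earlier one. The following is a minimal sketch of that resolution rule in plain Java, not Tika's actual implementation. The `ParserOverrideSketch` and `ParserEntry` names are hypothetical, and the media types listed for DefaultParser and CustomParser are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserOverrideSketch {

    /** Hypothetical stand-in for one parser entry: a class name plus the types it claims. */
    public static final class ParserEntry {
        final String className;
        final List<String> mimeTypes;

        public ParserEntry(String className, List<String> mimeTypes) {
            this.className = className;
            this.mimeTypes = mimeTypes;
        }
    }

    /** Walks the entries in order; a later entry replaces an earlier one for each type it claims. */
    public static Map<String, String> resolve(List<ParserEntry> entries) {
        Map<String, String> parserByType = new LinkedHashMap<>();
        for (ParserEntry entry : entries) {
            for (String type : entry.mimeTypes) {
                parserByType.put(type, entry.className);
            }
        }
        return parserByType;
    }

    public static void main(String[] args) {
        // Assumed type lists, mirroring the three entries in the config above.
        List<ParserEntry> entries = List.of(
                new ParserEntry("org.apache.tika.parser.DefaultParser",
                        List.of("application/pdf", "application/zip", "text/plain")),
                new ParserEntry("org.example.CustomParser",
                        List.of("application/pdf")),
                new ParserEntry("org.apache.tika.parser.EmptyParser",
                        List.of("application/zip")));

        resolve(entries).forEach((type, parser) ->
                System.out.println(type + " -> " + parser));
    }
}
```

Under these assumptions, application/pdf ends up with CustomParser, application/zip with EmptyParser, and text/plain stays with DefaultParser.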


> Allow override mapping mime<-->parsers through config
> -----------------------------------------------------
>
>                 Key: TIKA-527
>                 URL: https://issues.apache.org/jira/browse/TIKA-527
>             Project: Tika
>          Issue Type: Improvement
>          Components: config
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>
> h2. Background
> As of Tika 0.7, tika-config.xml is no longer mandatory and loading 3rd party 
> parsers as plugins through the service architecture is supported.
> This introduces great flexibility, and even allows for extending Tika's file 
> format support by simply dropping in JARs on the classpath. This is great 
> for configuring Tika when it's embedded as part of another application such 
> as Solr or Nutch. You can easily add support for e.g. a commercial document 
> filter with Tika wrapper without changing Tika or the consuming application, 
> or even maintaining a tika-config.xml.
> This serves the majority of all use cases.
> h2. Problem
> However, as the variety of 3rd party document parsers increases, we'll start 
> seeing an overlap of parsers supporting the same mime-types. A very likely 
> scenario is a company specialized in document filters packaging their parsers 
> as a Tika plugin, under whatever license they choose.
> In this scenario, a system integrator (working with e.g. Solr) wants to 
> gather all the parsers that the particular customer needs, and then choose 
> which parser should handle each mime-type. She may want to let a 3rd party 
> parser plugin handle Word files but the Tika supplied POI parser handle Excel.
> Today, the last parser plugin that gets loaded by the class-loader happens to 
> "win" the mime-types it supports. As it is not uncommon for one parser to 
> register multiple mime-types, re-claiming a subset of the types is not 
> possible unless you are consuming Tika directly.
> h2. Solution
> Allow for an "override" style mime-to-parser mapping by configuration. To 
> keep the number of config files down, this is probably best done as an 
> extension of the tika-config.xml syntax, allowing for specifying only the 
> *changes*, without repeating all the mappings. Tika should look for 
> tika-config.xml on class-path by default even if it's not bundled by default.
> Say we add a parser plugin which supports a bunch of Office formats, but 
> their Excel parser sucks. We want to explicitly give control over 
> application/vnd.ms-excel to the POI parser. Here's how that could be done by 
> adding support for an "append" attribute on the <parsers> and <parser> tags, 
> instead of repeating all mime types:
> {code:xml} 
> <properties>
>     <parsers append="true">
>         <parser name="parse-office" class="org.apache.tika.parser.microsoft.OfficeParser" append="true">
>             <mime>application/vnd.ms-excel</mime>
>             <mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
>         </parser>
>     </parsers>
> </properties>
> {code}
> When Tika sees append="true", it will first initialize everything as default, 
> and then re-do the mappings explicitly specified. If you want to remove 
> support for a parser by config, you could specify a <parser...> tag without 
> an append attribute and with no <mime> sub-tags specified.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
