[ https://issues.apache.org/jira/browse/TIKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919647#action_12919647 ]
Jukka Zitting commented on TIKA-527:
------------------------------------
In revision 1006336 I added an org.apache.tika.parser.DefaultParser class that
can be used to achieve most of this use case together with the existing
TikaConfig functionality. Here's an example:
<properties>
  <parsers>
    <!-- Load all available parsers -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- Override parsing of all types supported by CustomParser -->
    <parser class="org.example.CustomParser"/>
    <!-- Explicitly disable parsing of Zip archives -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
    </parser>
  </parsers>
</properties>
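
The comments in the example imply that entries are applied in order, with a later <parser> taking over the types it claims from earlier ones. Below is a minimal sketch of that resolution rule in plain Java. It is not Tika's actual implementation: the Entry class, the resolve method, and the MIME type sets assigned to DefaultParser and CustomParser are all hypothetical, chosen only to illustrate the ordering.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserOverrideSketch {

    /** Hypothetical stand-in for one <parser> entry: a name plus the types it claims. */
    static final class Entry {
        final String parserName;
        final List<String> mimeTypes;

        Entry(String parserName, String... mimeTypes) {
            this.parserName = parserName;
            this.mimeTypes = Arrays.asList(mimeTypes);
        }
    }

    /** Walks entries in config order; a later entry replaces an earlier one for each type it claims. */
    static Map<String, String> resolve(List<Entry> entries) {
        Map<String, String> parserByType = new LinkedHashMap<>();
        for (Entry entry : entries) {
            for (String type : entry.mimeTypes) {
                parserByType.put(type, entry.parserName);
            }
        }
        return parserByType;
    }

    /** Mirrors the three <parser> entries above; the type sets are made up for illustration. */
    static List<Entry> exampleEntries() {
        return Arrays.asList(
            // Assumption: DefaultParser covers these three types (the real set is much larger).
            new Entry("DefaultParser", "application/pdf", "application/zip", "application/msword"),
            // Assumption: the hypothetical CustomParser supports Word documents only.
            new Entry("CustomParser", "application/msword"),
            // From the config: EmptyParser is explicitly mapped to Zip archives.
            new Entry("EmptyParser", "application/zip"));
    }

    public static void main(String[] args) {
        resolve(exampleEntries())
            .forEach((type, parser) -> System.out.println(type + " -> " + parser));
    }
}
```

Under these assumed type sets, application/pdf stays with DefaultParser, application/msword goes to CustomParser, and application/zip ends up with EmptyParser, which is how the example disables Zip parsing.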
> Allow override mapping mime<-->parsers through config
> -----------------------------------------------------
>
> Key: TIKA-527
> URL: https://issues.apache.org/jira/browse/TIKA-527
> Project: Tika
> Issue Type: Improvement
> Components: config
> Affects Versions: 0.7
> Reporter: Jan Høydahl
>
> h2. Background
> As of Tika 0.7, tika-config.xml is no longer mandatory, and loading 3rd party
> parsers as plugins through the service architecture is supported.
> This introduces great flexibility, and even allows for extending Tika's file
> format support by simply dropping JARs onto the classpath. This is great
> for configuring Tika when it's embedded as part of another application such
> as Solr or Nutch. You can easily add support for e.g. a commercial document
> filter with a Tika wrapper without changing Tika or the consuming application,
> or even maintaining a tika-config.xml.
> This serves the majority of all use cases.
> h2. Problem
> However, as the variety of 3rd party document parsers increases, we'll start
> seeing an overlap of parsers supporting the same mime-types. A very likely
> scenario is a company specialized in document filters packaging their parsers
> as a Tika plugin, under whatever license they choose.
> In this scenario, a system integrator (working with e.g. Solr) wants to
> gather all the parsers that the particular customer needs, and then choose
> which parser should handle each mime-type. She may want to let a 3rd party
> parser plugin handle Word files but the Tika supplied POI parser handle Excel.
> Today, the last parser plugin that gets loaded by the class-loader happens to
> "win" the mime-types it supports. As it is not uncommon for one parser to
> register multiple mime-types, re-claiming a subset of the types is not
> possible unless you are consuming Tika directly.
> h2. Solution
> Allow for an "override" style mime-to-parser mapping by configuration. To
> keep the number of config files down, this is probably best done as an
> extension of the tika-config.xml syntax, allowing for specifying only the
> *changes*, without repeating all the mappings. Tika should look for
> tika-config.xml on the classpath by default, even if it's not bundled by default.
> Say we add a parser plugin which supports a bunch of Office formats, but
> their Excel parser sucks. We want to explicitly give control over
> application/vnd.ms-excel to the POI parser. Here's how that could be done by
> adding support for an "append" attribute on the <parsers> and <parser> tags,
> instead of repeating all mime types:
> {code:xml}
> <properties>
>   <parsers append="true">
>     <parser name="parse-office"
>             class="org.apache.tika.parser.microsoft.OfficeParser" append="true">
>       <mime>application/vnd.ms-excel</mime>
>       <mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
>     </parser>
>   </parsers>
> </properties>
> {code}
> When Tika sees append="true", it will first initialize everything as default,
> and then re-do the mappings explicitly specified. If you want to remove
> support for a parser by config, you could specify a <parser...> tag without
> an append attribute and with no <mime> sub-tags specified.