Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=2&rev2=3

Comment:
Start on config

  The right strategy for one user may not be the right one for another. The right strategy for one file may not be the right one for another. We therefore need to allow users to pick their strategy, both on an overall basis and on a per-file basis.

  == From TikaConfig ==
- ''TODO''
+ Currently, a great many Tika users just call {{{TikaConfig.getDefaultConfig()}}} and go with that.
+ 
+ It might be nice if they could also do things like {{{TikaConfig.getMaxiumMetadataConfig()}}} or {{{TikaConfig.getTryEachInTurnConfig()}}} to pick a different strategy.
+ 
+ (Naming TBC, align with above)

  == With a Tika Configuration file ==
- ''TODO''
+ Users may wish to have full control over which parsers are used, which strategies are used for which MIME types, etc.
+ 
+ For example, they might want default behaviour for most types, but to send XML through a fallback parser, and to combine Image + GDAL + OCR for JPEG. The configuration file needs to support this.
+ 
+ {{{
+ <parsers>
+   <!-- Most things can use the default -->
+   <parser class="org.apache.tika.parser.DefaultParser">
+     <mime-exclude>image/jpeg</mime-exclude>
+     <mime-exclude>application/xml</mime-exclude>
+     <mime-exclude>application/pdf</mime-exclude>
+   </parser>
+ 
+   <!-- No PDF, thank you! -->
+   <parser class="org.apache.tika.parser.EmptyParser">
+     <mime>application/pdf</mime>
+   </parser>
+ 
+   <!-- JPEG needs special handling -->
+   <!-- XML needs special handling -->
+ </parsers>
+ }}}

  == In Code ==
  ''TODO''
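
As an illustration only, the two strategies named above ("try each in turn" for a fallback chain such as the XML case, and "maximum metadata" for combining several parsers such as Image + GDAL + OCR) might be sketched roughly as follows. This is not Tika code: `StrategySketch`, `SimpleParser`, `tryEachInTurn` and `maximumMetadata` are hypothetical names made up for the sketch, metadata is reduced to a plain string map, and the rule that earlier parsers win on conflicting keys is an assumption rather than anything decided on the page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of two composite strategies; none of these types are Tika APIs. */
public class StrategySketch {

    /** Stand-in for a parser: returns metadata for a file, or throws on failure. */
    public interface SimpleParser {
        Map<String, String> parse(String fileName) throws Exception;
    }

    /** "Try each in turn": return the result of the first parser that succeeds. */
    public static Map<String, String> tryEachInTurn(List<SimpleParser> parsers, String fileName) {
        for (SimpleParser parser : parsers) {
            try {
                return parser.parse(fileName);
            } catch (Exception e) {
                // This parser failed; fall through to the next one in the list.
            }
        }
        // No parser succeeded, so there is no metadata to report.
        return new LinkedHashMap<>();
    }

    /** "Maximum metadata": run every parser and merge whatever each one returns. */
    public static Map<String, String> maximumMetadata(List<SimpleParser> parsers, String fileName) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (SimpleParser parser : parsers) {
            try {
                // Assumption: when two parsers supply the same key, the earlier parser wins.
                parser.parse(fileName).forEach(merged::putIfAbsent);
            } catch (Exception e) {
                // Ignore the failure and keep what the other parsers found.
            }
        }
        return merged;
    }
}
```

In a real design the per-MIME-type choice between these two behaviours would presumably come from the configuration file or from the selected `TikaConfig`, rather than being hard-coded by the caller as it is here.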
