[ 
https://issues.apache.org/jira/browse/JCR-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved JCR-2642.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2.0
         Assignee: Jukka Zitting

We still need the custom tika-config.xml file, since by default we want to 
disable text extraction for package and image file formats so that no excess 
resources are spent on them.

However, in revision 1038125 I modified our custom tika-config.xml file to use 
the new DefaultParser class in Tika 0.8 to automatically pick up all available 
parser classes through the service provider mechanism used by Tika. The 
selected package and image formats are still disabled by explicitly mapping 
them to the dummy EmptyParser class.
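
As a rough illustration of that arrangement, here is a minimal, self-contained 
sketch in plain Java. It does not use the Tika API: SketchParser, 
ParserRegistrySketch, build() and extract() are hypothetical names invented 
for this example, and the MIME types are only sample values. It assumes a 
simple map from MIME type to parser, where the automatically discovered 
parsers are registered first and the explicitly disabled types are then 
mapped to a do-nothing parser.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parser; NOT org.apache.tika.parser.Parser.
interface SketchParser {
    String extractText(String content);
}

final class ParserRegistrySketch {

    // Plays the role of a dummy "empty" parser: it extracts no text at all.
    static final SketchParser EMPTY = content -> "";

    // Start from the discovered parsers, then let explicit overrides win
    // for every MIME type that should be disabled.
    static Map<String, SketchParser> build(Map<String, SketchParser> discovered,
                                           Iterable<String> disabledTypes) {
        Map<String, SketchParser> parsers = new LinkedHashMap<>(discovered);
        for (String type : disabledTypes) {
            parsers.put(type, EMPTY);
        }
        return parsers;
    }

    // Unknown types fall back to the empty parser in this sketch.
    static String extract(Map<String, SketchParser> parsers, String type, String content) {
        return parsers.getOrDefault(type, EMPTY).extractText(content);
    }

    public static void main(String[] args) {
        // Assumed result of automatic discovery (sample values only).
        Map<String, SketchParser> discovered = new LinkedHashMap<>();
        discovered.put("text/plain", content -> content);
        discovered.put("image/png", content -> "metadata of " + content);
        discovered.put("application/zip", content -> "entries of " + content);

        Map<String, SketchParser> parsers =
                build(discovered, Arrays.asList("image/png", "application/zip"));

        System.out.println("[" + extract(parsers, "text/plain", "hello") + "]");
        System.out.println("[" + extract(parsers, "image/png", "logo") + "]");
    }
}
```

In this sketch the override step runs after discovery, so a newly discovered 
parser for a disabled type stays disabled until the override is removed.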

> JackrabbitParser and tika 0.7 parser
> ------------------------------------
>
>                 Key: JCR-2642
>                 URL: https://issues.apache.org/jira/browse/JCR-2642
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.1.0
>            Reporter: Dan Ducar
>            Assignee: Jukka Zitting
>             Fix For: 2.2.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Hi,
> I was trying to implement a custom parser and found the following problem.
> Since Tika 0.7 it is possible to implement a custom parser and register it 
> in a service provider configuration file 
> (META-INF/services/org.apache.tika.parser.Parser). That way there is no 
> need to maintain a custom tika-config.xml file just to add a custom 
> parser.
> The problem I had was in the JackrabbitParser: I wasn't able to have the 
> AutoDetectParser instantiated with its default constructor, in which case 
> it would be set up using the default TikaConfig constructor.
> Basically, since Tika 0.7, TikaConfig.getTikaConfig() instantiates the 
> TikaConfig using the default constructor instead of reading the 
> tika-config.xml file from within the package. The default constructor reads 
> the service provider configuration files and populates the parsers map.
> What I'm proposing is to change the JackrabbitParser to instantiate the 
> AutoDetectParser using the default constructor. That way, with Tika 
> versions >= 0.7 we could easily implement our own parsers and there would be 
> no reason to maintain the tika-config.xml. A sort of "backward" 
> compatibility would also be maintained: with the AutoDetectParser default 
> constructor the TikaConfig is obtained through TikaConfig.getTikaConfig(), 
> which for Tika versions < 0.7 calls the TikaConfig(InputStream) constructor, 
> which reads the configuration directly from the package.
> Basically the JackrabbitParser should look like this:
>     public JackrabbitParser() {
>         parser = new AutoDetectParser();
>     }
>  
> Thanks,
> Dan

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
