[ https://issues.apache.org/jira/browse/JCR-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting reopened JCR-1878:
--------------------------------
We need the ooxml-schemas dependency in any case if we want to support
Microsoft Office 2007 files (see JCR-1887). I think that's a pretty important
improvement that's definitely worth keeping, even if it notably increases the
standalone jar size.
I'll ping the POI people on whether the ooxml-schemas jar could be trimmed down
somehow.
Also, in Tika we could perhaps find some ways to reduce the size of the
dependencies, as not all of the included functionality is really needed (text
extraction is typically just a part of the functionality included in the parser
libraries).
Anyway, I'm reopening this issue until we have a solution that satisfies
everyone.
> Use Apache Tika for text extraction
> -----------------------------------
>
> Key: JCR-1878
> URL: https://issues.apache.org/jira/browse/JCR-1878
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: jackrabbit-text-extractors
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.6.0
>
>
> Once Apache Tika is released with a resolution to TIKA-175 (making Tika
> available to Java 1.4 projects), we should replace our direct parser library
> dependencies with Tika parsers. Ideally we'd just use the Tika
> AutoDetectParser that'll automatically detect the type of a binary and parse
> it accordingly, solving JCR-728.
> I guess we should keep some level of backwards compatibility with existing
> textFilterClasses="..." configurations, perhaps by keeping the existing
> TextExtractor classes as wrappers around respective Tika parsers.
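
The wrapper idea in the description could be sketched roughly as follows. This
is only an illustration under stated assumptions: SimpleParser and
LegacyTextExtractor are hypothetical, simplified stand-ins defined here so the
example is self-contained. They are not the real Tika Parser or Jackrabbit
TextExtractor interfaces, and the encoding argument is ignored for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class WrapperSketch {

    /** Hypothetical stand-in for a Tika-style parser; NOT the real Tika API. */
    interface SimpleParser {
        String parseToText(InputStream stream, String contentType) throws IOException;
    }

    /** Hypothetical stand-in for a legacy extractor interface; simplified. */
    interface LegacyTextExtractor {
        String[] getContentTypes();
        Reader extractText(InputStream stream, String type, String encoding) throws IOException;
    }

    /** Adapter: keeps the legacy-facing interface, delegates the work to a parser. */
    static final class ParserBackedExtractor implements LegacyTextExtractor {
        private final String[] contentTypes;
        private final SimpleParser parser;

        ParserBackedExtractor(String[] contentTypes, SimpleParser parser) {
            this.contentTypes = contentTypes.clone();
            this.parser = parser;
        }

        public String[] getContentTypes() {
            return contentTypes.clone();
        }

        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            // Assumption: the wrapped parser handles decoding itself, so the
            // encoding hint is not used in this sketch.
            try {
                return new StringReader(parser.parseToText(stream, type));
            } finally {
                stream.close();
            }
        }
    }

    /** Drains a Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A trivial "parser" that treats the input as UTF-8 plain text.
        SimpleParser plainText =
            (stream, type) -> new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        LegacyTextExtractor extractor =
            new ParserBackedExtractor(new String[] {"text/plain"}, plainText);

        InputStream in = new ByteArrayInputStream(
            "hello repository".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(extractor.extractText(in, "text/plain", "UTF-8")));
    }
}
```

With an adapter of this shape, an existing textFilterClasses="..." entry could
keep naming the old class while the actual parsing happens elsewhere.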