[jira] Reopened: (JCR-1878) Use Apache Tika for text extraction

Jukka Zitting (JIRA) Thu, 16 Apr 2009 04:13:39 -0700

     [ https://issues.apache.org/jira/browse/JCR-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting reopened JCR-1878:
--------------------------------


We need the ooxml-schemas dependency in any case if we want to support 
Microsoft Office 2007 files (see JCR-1887). I think that's a pretty important 
improvement that's definitely worth keeping, even if it notably increases the 
standalone jar size.

I'll ping the POI people about whether the ooxml-schemas jar could be trimmed 
down somehow.

Also, in Tika we could perhaps find some ways to reduce the size of the 
dependencies, as not all of the included functionality is really needed (text 
extraction is typically just a part of the functionality included in the parser 
libraries).

Anyway, I'm reopening this issue until we have a solution that satisfies 
everyone.

> Use Apache Tika for text extraction
> -----------------------------------
>
>                 Key: JCR-1878
>                 URL: https://issues.apache.org/jira/browse/JCR-1878
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-text-extractors
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.6.0
>
>
> Once Apache Tika is released with a resolution to TIKA-175 (making Tika 
> available to Java 1.4 projects), we should replace our direct parser library 
> dependencies with Tika parsers. Ideally we'd just use the Tika 
> AutoDetectParser that'll automatically detect the type of a binary and parse 
> it accordingly, solving JCR-728.
> I guess we should keep some level of backwards compatibility with existing 
> textFilterClasses="..." configurations, perhaps by keeping the existing 
> TextExtractor classes as wrappers around respective Tika parsers.
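
As a rough illustration of the wrapper idea in the description above, here is a
minimal sketch of a legacy-style extractor that delegates to a parser. Both
interfaces (LegacyTextExtractor, SketchParser) and the class names are
hypothetical stand-ins made up for this sketch; they are not the actual
Jackrabbit TextExtractor interface or the Tika Parser API, and a real wrapper
would presumably delegate to a Tika parser instead of the toy one used here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the existing extractor contract (assumption,
// not the real Jackrabbit interface).
interface LegacyTextExtractor {
    String[] getContentTypes();
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

// Hypothetical stand-in for a Tika-style parser (assumption, not Tika's API).
interface SketchParser {
    String parseToText(InputStream stream) throws IOException;
}

// Adapter: existing configurations keep referring to an extractor class,
// while the actual parsing is delegated to the wrapped parser.
class ParserBackedExtractor implements LegacyTextExtractor {
    private final SketchParser parser;
    private final String[] contentTypes;

    ParserBackedExtractor(SketchParser parser, String... contentTypes) {
        this.parser = parser;
        this.contentTypes = contentTypes.clone();
    }

    public String[] getContentTypes() {
        return contentTypes.clone();
    }

    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        // The type and encoding hints are ignored in this sketch.
        return new StringReader(parser.parseToText(stream));
    }
}

class WrapperDemo {
    // Runs the adapter with a toy parser that decodes the input as UTF-8.
    static String demo() {
        SketchParser plain =
                in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        LegacyTextExtractor extractor =
                new ParserBackedExtractor(plain, "text/plain");
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(data)) {
            Reader reader = extractor.extractText(in, "text/plain", "UTF-8");
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the adapter shape is that callers only ever see the old extractor
contract, so the parser behind it can be swapped without touching existing
configuration.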

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
