[
https://issues.apache.org/jira/browse/JCR-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587895#action_12587895
]
Marcel Reutegger commented on JCR-1530:
---------------------------------------
There are at least two minor issues with using Tika in Jackrabbit.
- Tika is still in incubation. I'd prefer to introduce a dependency on it
only once it is out of incubation.
- Tika requires Java 1.5, whereas Jackrabbit currently is fine with 1.4.
We might want to provide an adapter that implements the Jackrabbit
TextExtractor interface and uses Tika to extract the text. Users can then
decide whether they want to use it and therefore need Java 1.5.
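A rough sketch of what such an adapter could look like follows. To keep it
self-contained it does not depend on Jackrabbit or Tika: the nested
TextExtractor interface is only a stand-in assumed to mirror the shape of the
Jackrabbit interface (the extractText signature is taken from the issue
description below), and ParserBackend is a hypothetical seam where the actual
Tika call would go. The class names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

class TikaAdapterSketch {

    /** Stand-in assumed to mirror Jackrabbit's TextExtractor (not the real interface). */
    interface TextExtractor {
        String[] getContentTypes();
        Reader extractText(InputStream stream, String type, String encoding)
                throws IOException;
    }

    /** Hypothetical seam: a real adapter would call into Tika here. */
    interface ParserBackend {
        String parseToString(InputStream stream, String type) throws IOException;
    }

    /** Adapter: exposes a parser backend through the TextExtractor shape. */
    static class DelegatingTextExtractor implements TextExtractor {
        private final String[] types;
        private final ParserBackend backend;

        DelegatingTextExtractor(String[] types, ParserBackend backend) {
            this.types = types.clone();
            this.backend = backend;
        }

        public String[] getContentTypes() {
            return types.clone();
        }

        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            try {
                return new StringReader(backend.parseToString(stream, type));
            } catch (RuntimeException e) {
                // Same fallback as the snippet in the issue: give up quietly
                // and return an empty reader instead of failing.
                return new StringReader("");
            } finally {
                try { stream.close(); } catch (IOException ignored) {}
            }
        }
    }

    /** Helper for the usage example: drains a Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

With this split, only the class that implements ParserBackend would need Tika
and Java 1.5. The rest of Jackrabbit would see just another TextExtractor.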
> MsPowerPointTextExtractor does not extract from PPTs with € sign
> ----------------------------------------------------------------
>
> Key: JCR-1530
> URL: https://issues.apache.org/jira/browse/JCR-1530
> Project: Jackrabbit
> Issue Type: Bug
> Components: jackrabbit-text-extractors
> Affects Versions: 1.4
> Reporter: Dirk Feufel
>
> The MsPowerPointTextExtractor class has a problem reading PPTs that contain
> a € sign. All text following that sign is ignored. Perhaps the POI
> PowerPointExtractor should be used instead of parsing the data by hand. As a
> side effect, this would simplify the code. Extracting could be done as follows:
> public Reader extractText(InputStream stream, String type, String encoding)
>         throws IOException {
>     try {
>         PowerPointExtractor extractor = new PowerPointExtractor(stream);
>         return new StringReader(extractor.getText(true, true));
>     } catch (RuntimeException e) {
>         logger.warn("Failed to extract PowerPoint text content", e);
>         return new StringReader("");
>     } finally {
>         try { stream.close(); } catch (IOException ignored) {}
>     }
> }