[jira] Commented: (JCR-1530) MsPowerPointTextExtractor does not extract from PPTs with € sign

Alexander Klimetschek (JIRA) Thu, 10 Apr 2008 05:20:58 -0700

    [ 
https://issues.apache.org/jira/browse/JCR-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587611#action_12587611
 ]


Alexander Klimetschek commented on JCR-1530:
--------------------------------------------

Hmm, IMHO it shouldn't be Jackrabbit's concern to handle such "details", 
especially as text extraction from binary files is not a mandatory aspect of 
the JCR API.

What about using Apache Tika? It aims to collect all the various extraction 
libraries and self-built classes of the Apache project and to build a proper 
re-usable framework. It recently pushed out its first release. Jukka, you 
probably know more about it - is it already useful for Jackrabbit? You 
mentioned in JCR-1290 that this could be a task for Jackrabbit 1.5.

http://incubator.apache.org/tika/

> MsPowerPointTextExtractor does not extract from PPTs with € sign
> ----------------------------------------------------------------
>
>                 Key: JCR-1530
>                 URL: https://issues.apache.org/jira/browse/JCR-1530
>             Project: Jackrabbit
>          Issue Type: Bug
>          Components: jackrabbit-text-extractors
>    Affects Versions: 1.4
>            Reporter: Dirk Feufel
>
> The MsPowerPointTextExtractor class fails on PPTs that contain an € sign: all 
> text following that sign is ignored. Perhaps the POI PowerPointExtractor 
> should be used instead of parsing the data by hand. As a side effect, this 
> would simplify the code. Extraction could be done as follows:
>       public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
>               try {
>                       PowerPointExtractor extractor = new PowerPointExtractor(stream);
>                       return new StringReader(extractor.getText(true, true));
>               } catch (RuntimeException e) {
>                       logger.warn("Failed to extract PowerPoint text content", e);
>                       return new StringReader("");
>               } finally {
>                       try { stream.close(); } catch (IOException ignored) {}
>               }
>       }
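
The error-handling shape of the snippet quoted above (return an empty Reader
when extraction fails, always close the stream) can be tried without POI on
the classpath. The following is only a sketch under assumptions: the
SafeExtraction class, the TextSource interface and the readAll helper are
hypothetical names introduced for illustration, and TextSource merely stands
in for a real extractor such as POI's PowerPointExtractor. It does not
reproduce the € bug itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class SafeExtraction {

    /** Hypothetical stand-in for a real extractor (not a POI or Jackrabbit type). */
    @FunctionalInterface
    public interface TextSource {
        String getText(InputStream stream) throws IOException;
    }

    /**
     * Returns the extracted text as a Reader, or an empty Reader if the
     * extractor throws a RuntimeException. The stream is closed either way.
     */
    public static Reader extractOrEmpty(InputStream stream, TextSource source)
            throws IOException {
        try {
            return new StringReader(source.getText(stream));
        } catch (RuntimeException e) {
            // The quoted snippet logs through a logger; stderr keeps this self-contained.
            System.err.println("Failed to extract text content: " + e);
            return new StringReader("");
        } finally {
            try { stream.close(); } catch (IOException ignored) { }
        }
    }

    /** Drains a Reader into a String. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u20ac is the euro sign, written as an escape to avoid source-encoding issues.
        byte[] data = "Price: 5 \u20ac per slide".getBytes(StandardCharsets.UTF_8);

        // An extractor that decodes the whole stream (readAllBytes needs Java 9+).
        TextSource decoding = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        // An extractor that fails the way a parser might on unexpected input.
        TextSource failing = in -> { throw new IllegalStateException("corrupt record"); };

        System.out.println(readAll(extractOrEmpty(new ByteArrayInputStream(data), decoding)));
        System.out.println(readAll(extractOrEmpty(new ByteArrayInputStream(data), failing)).isEmpty());
    }
}
```

One consequence of this design is that a failed extraction is indistinguishable
from an empty document for the caller; only the log shows the difference.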

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
