Frank Refol created TIKA-2148:
---------------------------------

             Summary: Tika app is unable to parse a password protected PowerPoint (97-2003) document
                 Key: TIKA-2148
                 URL: https://issues.apache.org/jira/browse/TIKA-2148
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 1.13
         Environment: Windows console.
            Reporter: Frank Refol


Using the Tika command-line application to extract text from a 
password-protected PowerPoint 97-2003 document fails, even when the password 
is supplied on the command line. Here is the command that was used:
{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password protected (Created with MS 2003).ppt"
{quote}

The following exception is thrown on the console:
{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@62204612
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
        at org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.<init>(HSLFSlideShowEncrypted.java:106)
        at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
        at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
        at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
        at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:182)
        at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more
{noformat}
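
For triage: the outer TikaException in the trace above only wraps the real 
failure, which is the innermost POI exception. Below is a minimal sketch of 
unwrapping such a cause chain. The class name RootCause and the method find 
are hypothetical, and plain JDK exceptions stand in for the Tika and POI types 
so that the snippet runs without either library on the classpath.

```java
public class RootCause {
    // Follow getCause() links until the innermost exception is reached.
    public static Throwable find(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referential cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins (assumptions) for the POI exception and the TikaException wrapping it.
        Exception inner = new IllegalStateException("PowerPoint file is encrypted.");
        Exception outer = new RuntimeException("Unexpected RuntimeException from parser", inner);

        Throwable root = find(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In a real caller the same loop would presumably be applied to the caught 
TikaException, so that the encrypted-file condition can be reported separately 
from other parser failures.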

Note that this happens with PPT files created using Office 2010, Office 2007, 
or Office 2003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
