[ 
https://issues.apache.org/jira/browse/TIKA-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952801#comment-14952801
 ] 

Andriy Budzinskyy commented on TIKA-1761:
-----------------------------------------

Well, I would expect that no password is needed to extract text when a file 
(doc or ppt) is protected only against modification.
The thing is that my attached files were created with the same protection 
setting but using different MS Office versions.

> Error Parsing PPT (97-2003) files with password protection against 
> modification which were created using Office 2013
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1761
>                 URL: https://issues.apache.org/jira/browse/TIKA-1761
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7, 1.10
>            Reporter: Andriy Budzinskyy
>            Assignee: Tim Allison
>         Attachments: test-2007.ppt, test-2013.ppt
>
>
> PPT documents created (or saved) in PowerPoint 97-2003 format and protected 
> with a password against modification using Office 2013 fail during text 
> extraction.
> The same works fine for PowerPoint 97-2003 files saved using Office 2007.
> {noformat}
> java -jar tika-app-1.10.jar --text test_2003.ppt
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@22b0f5af
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:185)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: 
> PowerPoint file is encrypted. The correct password needs to be set via 
> Biff8EncryptionKey.setCurrentUserPassword()
>         at 
> org.apache.poi.hslf.EncryptedSlideShow.<init>(EncryptedSlideShow.java:102)
>         at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:259)
>         at 
> org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:250)
>         at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:165)
>         at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>         at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>         at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 5 more
> {noformat}
> I've debugged the Tika library and found that it fails on the 
> UserEditAtom.encryptSessionPersistIdRef property. This property is empty in 
> files created with Office 2007 and non-empty with Office 2013.
> I've defragmented PPT files as described in 
> https://social.msdn.microsoft.com/Forums/en-US/e33189a5-0b00-44b7-b084-f2757e9b7536/powerpoint-binary-file-format-decryption?forum=os_binaryfile
> Is this a bug in Tika or in the POI library? 
> Should this be supported per Apache POI [encryption 
> support|https://poi.apache.org/encryption.html]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
