[ https://issues.apache.org/jira/browse/TIKA-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950239#comment-14950239 ]
Tim Allison commented on TIKA-1761:
-----------------------------------

And the other question...if we add support for pw protection for doc and ppt, is that really an unrelated issue? That is, would you expect to need a password to extract text if the doc is protected from edits instead of the usual pw protection to open/read the document?

For PDFs, depending on the settings, you are allowed to extract text even if the document permissions don't allow edits.

I haven't looked at the MS spec or even broken out a hex editor on your files yet.

> Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1761
>                 URL: https://issues.apache.org/jira/browse/TIKA-1761
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7, 1.10
>            Reporter: Andriy Budzinskyy
>            Assignee: Tim Allison
>         Attachments: test-2007.ppt, test-2013.ppt
>
>
> PPT documents created (or saved) as Powerpoint 97-2003 format and protected
> with password against modification using Office 2013 fail during extracting
> text.
> But it works fine for Powerpoint 97-2003 files saved using Office 2007.
> {noformat}
> java -jar tika-app-1.10.jar --text test_2003.ppt
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@22b0f5af
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:185)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
>     at org.apache.poi.hslf.EncryptedSlideShow.<init>(EncryptedSlideShow.java:102)
>     at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:259)
>     at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:250)
>     at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:165)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 5 more
> {noformat}
> I've debugged the Tika library and found that it fails on the
> UserEditAtom.encryptSessionPersistIdRef property. This property is empty in
> files created with Office 2007 and non-empty in files created with Office 2013.
> I've defragmented the PPT files as described in
> https://social.msdn.microsoft.com/Forums/en-US/e33189a5-0b00-44b7-b084-f2757e9b7536/powerpoint-binary-file-format-decryption?forum=os_binaryfile
> Is this a bug in Tika or in the POI library?
> Should it be supported per Apache POI [encryption
> support|https://poi.apache.org/encryption.html]?
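
In the trace above, the POI EncryptedPowerPointFileException only shows up as the cause of a TikaException, so a caller that wants to treat "file is encrypted" differently from other parse failures has to walk the cause chain. Below is a minimal sketch of that check using only the JDK. The names CauseChain and findCause and the stand-in exception class are hypothetical, not part of Tika or POI. Matching on a class-name fragment is an assumption made to keep the example free of library dependencies; code that already has POI on the classpath would more likely use instanceof.

```java
import java.util.Optional;

// Hypothetical helper, not part of Tika or POI.
final class CauseChain {
    private CauseChain() {}

    // Stand-in for the wrapped POI exception, so the sketch runs without POI.
    static final class StandInEncryptedFileException extends RuntimeException {
        StandInEncryptedFileException(String message) {
            super(message);
        }
    }

    /**
     * Returns the first throwable in the cause chain (starting with top itself)
     * whose fully qualified class name contains the given fragment.
     * The depth limit guards against unusually long or cyclic chains.
     */
    static Optional<Throwable> findCause(Throwable top, String classNameFragment) {
        int depth = 0;
        for (Throwable t = top; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t.getClass().getName().contains(classNameFragment)) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Mimic the shape of the reported trace: an outer wrapper with an
        // "encrypted file" exception as its cause.
        Throwable wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new StandInEncryptedFileException("PowerPoint file is encrypted."));

        Optional<Throwable> encrypted = findCause(wrapped, "Encrypted");
        if (encrypted.isPresent()) {
            System.out.println("Encrypted input detected: " + encrypted.get().getMessage());
        } else {
            System.out.println("Some other parse failure: " + wrapped.getMessage());
        }
    }
}
```

This only classifies the failure. It does not answer the question above of whether a file that is protected against modification should need a password for text extraction at all.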