[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292203#comment-14292203 ]
Tim Allison edited comment on TIKA-1489 at 1/26/15 7:17 PM:
------------------------------------------------------------

I haven't been able to find standards in XMP or elsewhere. DC's [[accessRights]] and [[rights]] are as close as I could find, but they aren't a good fit. Has anyone had any luck finding a standard?

I did just open up MSWord to see what is available there with the current document format. I don't have Information Rights Management (IRM) set up so I can't see exactly what that offers, but it looks like MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there are also some differences.

For most of Tika's use cases (I think), we'd want to set a general Tika Metadata key/value for "do not extract text" if both PDF fields were false or if the MSOffice Copy permission were false?

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

Should we start with PDFBox's AccessPermission as a model and add where necessary from there?

was (Author: talli...@mitre.org):
I haven't been able to find standards in XMP or elsewhere. Has anyone had any luck?

I did just open up MSWord to see what is available there with the current document format.
I don't have Information Rights Management (IRM) set up so I can't see exactly what that offers, but it looks like MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there are also some differences.

For most of Tika's use cases (I think), we'd want to set a general Tika Metadata key/value for "do not extract text" if both PDF fields were false or if the MSOffice Copy permission were false?

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

> PDF Text extraction without permission
> --------------------------------------
>
> Key: TIKA-1489
> URL: https://issues.apache.org/jira/browse/TIKA-1489
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Tilman Hausherr
>
> In TIKA-1442 text extraction from files like 717226.pdf that don't have text
> extraction permission works. The permissions in PDF files are only enforced
> by the application (i.e. PDFBox), i.e. the text information isn't stored
> separately in encrypted form.
> PDFBox ExtractText command line does throw an exception.
> So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call
> used bypasses the permission checking.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
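
For reference, the "do not extract text" rule floated in the comment above could be sketched roughly as below. This is only an illustration of the proposed logic, not anything that exists in Tika or PDFBox: the class name, the method names, and the metadata key `doNotExtractText` are all made up for the example, and a `null` argument is assumed to mean "this format does not define that permission".

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: none of these names exist in Tika or PDFBox.
final class ExtractionPolicySketch {

    // Hypothetical metadata key; the thread has not settled on a name.
    static final String DO_NOT_EXTRACT_TEXT = "doNotExtractText";

    // Each argument is null when the format does not define that permission.
    static boolean doNotExtractText(Boolean pdfCanExtractContent,
                                    Boolean pdfCanExtractForAccessibility,
                                    Boolean msOfficeCopyPermission) {
        // PDF: block only when BOTH extraction permissions are explicitly false.
        boolean pdfDenies = Boolean.FALSE.equals(pdfCanExtractContent)
                && Boolean.FALSE.equals(pdfCanExtractForAccessibility);
        // MSOffice: block when the Copy permission is explicitly false.
        boolean officeDenies = Boolean.FALSE.equals(msOfficeCopyPermission);
        return pdfDenies || officeDenies;
    }

    // Records the decision as a single general key/value pair.
    static Map<String, String> toMetadata(Boolean pdfCanExtractContent,
                                          Boolean pdfCanExtractForAccessibility,
                                          Boolean msOfficeCopyPermission) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(DO_NOT_EXTRACT_TEXT, String.valueOf(doNotExtractText(
                pdfCanExtractContent,
                pdfCanExtractForAccessibility,
                msOfficeCopyPermission)));
        return metadata;
    }
}
```

With this reading, a PDF that forbids general extraction but still allows extraction for accessibility would not be flagged, which matches the "both PDF fields were false" wording; whether that is the right default is still the open question in the comment.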