[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

Not sure I understand. Is this the snippet that you refer to in PDNameTreeNode:
{noformat}
public Map<String, COSObjectable> getNames() throws IOException
{
    COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
{noformat}
The above throws a class cast exception, but the code that you show doesn't? Are you getting a class cast exception on the document that you submitted with this issue or is it a different document?

Thank you, again.

Embedded files not extracted properly from PDF
----------------------------------------------

Key: TIKA-1228
URL: https://issues.apache.org/jira/browse/TIKA-1228
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: CentOS 6.5 VM
Reporter: Jason Sherman
Labels: easyfix
Fix For: 1.5
Attachments: pdf_with_doc_and_text_attached.pdf

IAW pdfbox example here: http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
the PDF parser does not check for additional entries under Kids node when Names node does not exist.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890607#comment-13890607 ]

Jason Sherman commented on TIKA-1228:
-------------------------------------

Tim,

I saw you already added a test and fix to the codebase. Thanks! I'm going to clone it and use it if you don't mind.

Jason
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

Yes. That's the point of open source. :) Enjoy!

Now that I'm looking at this issue again, I dragged out some of my pre-Tika code for PDF attachments, which used a different PDF library. It looks like the PDF files I was coding against could have the file name in a parent node and the actual bytes in a child or more distant descendant node. I will see if I can dig up the triggering files and check whether Tika needs any more modifications to PDF attachment extraction.
{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach, int recursiveDepth){
    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            // the file name may be stored at this level...
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            // ...while the bytes are stored at this or a deeper level
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        // descend into the EF dictionary; note the depth is tracked but never checked
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}
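The lookup described in the comment above (file name in a parent node, bytes in a descendant) can be sketched without any PDF library. This is only an illustration under stated assumptions: plain maps stand in for PDF dictionaries, and the class AttachmentLookupSketch, the Attachment holder and the MAX_DEPTH limit are all hypothetical names, not part of Tika, PDFBox or the library the original snippet was written against.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in types: a Map plays the role of a PDF dictionary,
// a String under "F" is a file name and a byte[] under "F" is the stream.
class AttachmentLookupSketch {

    // Assumed guard against cyclic or very deep structures (arbitrary value).
    static final int MAX_DEPTH = 10;

    static final class Attachment {
        String name;
        byte[] bytes;
    }

    // Returns the attachment once bytes are found, or null if no stream is reachable.
    static Attachment lookForByteStream(Map<String, Object> dict, Attachment attach, int depth) {
        if (dict == null || depth > MAX_DEPTH) {
            return null;
        }
        Object f = dict.get("F");
        if (f instanceof String) {
            attach.name = (String) f;      // name found at this level
        } else if (f instanceof byte[]) {
            attach.bytes = (byte[]) f;     // bytes found, lookup is complete
            return attach;
        }
        Object ef = dict.get("EF");
        if (ef instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> child = (Map<String, Object>) ef;
            return lookForByteStream(child, attach, depth + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> child = new HashMap<>();
        child.put("F", new byte[] {1, 2, 3});
        Map<String, Object> parent = new HashMap<>();
        parent.put("F", "report.doc");
        parent.put("EF", child);

        Attachment found = lookForByteStream(parent, new Attachment(), 0);
        System.out.println(found.name + " (" + found.bytes.length + " bytes)");
        // prints "report.doc (3 bytes)"
    }
}
```

Unlike the snippet in the comment, this sketch actually checks the depth counter, which is one way to keep a malformed file from causing unbounded recursion.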
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890608#comment-13890608 ]

Jason Sherman commented on TIKA-1228:
-------------------------------------

Tim,

Dang. During my troubleshooting, I first updated pdfbox to 1.8.3 and was using that source to step through the code. After the weirdness with the exception in code, but not in my expression evaluator, I reverted to the original Tika code, but failed to revert the pdfbox code. I apologize for the confusion.

Thanks again for your fast responses.

Jason
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue?

Thanks again.
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890639#comment-13890639 ]

Jason Sherman commented on TIKA-1228:
-------------------------------------

Correct. The PDNameTreeNode class cast exception is a non-issue.
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

I won't have time to fix this for a week or so, but it looks like the client (Tika) needs to look through the kids of embeddedFiles recursively (well, in this file, just one level down) to get the non-null embeddedFileNames. Something like this does pull out the .doc file:
{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
processEmbedded(embeddedFileNames, embeddedExtractor);
List<PDNameTreeNode> kids = embeddedFiles.getKids();
if (kids != null){
    for (PDNameTreeNode n : kids){
        // names stored one level down, under a kid node
        Map<String, COSObjectable> kidNames = n.getNames();
        processEmbedded(kidNames, embeddedExtractor);
    }
}
{noformat}
where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
    ...
}
{noformat}
We can fix this at the Tika level in the short term. I'm not sure if this is the expected behavior in PDFBox. At the least, we might want to request that this line in the javadoc for PDDocumentNameDictionary:

"The value in this name tree will be PDComplexFileSpecification objects."

be changed to:

"The value in this name tree or its children will be PDComplexFileSpecification objects."
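The comment above handles one level of kids, which is enough for the attached file. A fully recursive walk of a name tree might look like the following minimal sketch. It deliberately avoids the PDFBox API: NameTreeSketch, Node and collectNames are hypothetical names, and plain strings stand in for the file specification objects, so treat it as an illustration of the traversal only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name-tree node; NOT PDFBox's PDNameTreeNode.
class NameTreeSketch {

    static final class Node {
        final Map<String, String> names; // null when the node only has kids
        final List<Node> kids;           // null when the node is a leaf

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }
    }

    // Collects name -> value entries from the node and all of its descendants.
    static Map<String, String> collectNames(Node node) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        if (node.kids != null) {
            for (Node kid : node.kids) {
                collect(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        // Mirrors the reported case: the root has no names, only a kid that does.
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("attached.doc", "doc bytes");
        leafNames.put("attached.txt", "text bytes");
        List<Node> kids = new ArrayList<>();
        kids.add(new Node(leafNames, null));
        Node root = new Node(null, kids);

        System.out.println(collectNames(root).keySet());
        // prints "[attached.doc, attached.txt]"
    }
}
```

A production version would probably also want a depth limit or a visited set, since a malformed PDF could contain a cyclic Kids array.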
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889889#comment-13889889 ]

Jason Sherman commented on TIKA-1228:
-------------------------------------

Thanks for the help.

Another possibly related issue: when I was stepping through the pdfbox code, line 286 throws an exception when running, but evaluates properly in my evaluation dialog (IntelliJ 13):

    namesArray = (COSArray)((COSDictionary)((COSArray)node.getDictionaryObject(COSName.KIDS)).get(0)).getDictionaryObject(COSName.NAMES);

Throws:

    org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSDictionary

Do you want to pass that on to the pdfbox folks, or should I report it separately?