[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

Y. That's the point of open source. :) Enjoy!

Now that I'm looking at this issue again, I dragged out some of my pre-Tika code for pdf attachments using a different pdf library. It looks like the pdf files I was coding against could have the file name in a parent node and the actual bytes in a child or more distant descendant node. Will see if I can dig up the triggering files and see if Tika needs any more mods on PDF attachment extraction.

{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach, int recursiveDepth){
    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}

> Embedded files not extracted properly from PDF
> ----------------------------------------------
>
>                 Key: TIKA-1228
>                 URL: https://issues.apache.org/jira/browse/TIKA-1228
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>         Environment: CentOS 6.5 VM
>            Reporter: Jason Sherman
>              Labels: easyfix
>             Fix For: 1.5
>
>         Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when
> Names node does not exist.
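
For illustration only, the traversal the issue asks for (keep descending into Kids even when a node has no Names entry) can be sketched against a small self-contained model. NameTreeNode, EmbeddedFileWalker, and MAX_DEPTH are hypothetical names made up for this sketch. They are not PDFBox or Tika classes, and this is not the fix that went into Tika. The depth cap is also just an assumption, there to guard against malformed or cyclic trees.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name-tree node; not a PDFBox or Tika class.
class NameTreeNode {
    // Leaf entries (file name -> embedded bytes); null when the node has no Names entry.
    final Map<String, byte[]> names;
    // Child nodes; null when the node has no Kids entry.
    final List<NameTreeNode> kids;

    NameTreeNode(Map<String, byte[]> names, List<NameTreeNode> kids) {
        this.names = names;
        this.kids = kids;
    }
}

class EmbeddedFileWalker {
    // Arbitrary cap for this sketch, guarding against malformed or cyclic trees.
    static final int MAX_DEPTH = 32;

    // Collects every embedded file reachable from the root, in traversal order.
    static Map<String, byte[]> collect(NameTreeNode root) {
        Map<String, byte[]> found = new LinkedHashMap<>();
        walk(root, 0, found);
        return found;
    }

    private static void walk(NameTreeNode node, int depth, Map<String, byte[]> found) {
        if (node == null || depth > MAX_DEPTH) {
            return;
        }
        if (node.names != null) {
            found.putAll(node.names);
        }
        // Descend even when Names is absent: intermediate nodes may carry only Kids.
        if (node.kids != null) {
            for (NameTreeNode kid : node.kids) {
                walk(kid, depth + 1, found);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> leafA = new LinkedHashMap<>();
        leafA.put("attached.doc", new byte[] {1, 2, 3});
        Map<String, byte[]> leafB = new LinkedHashMap<>();
        leafB.put("attached.txt", new byte[] {4});

        List<NameTreeNode> kids = new ArrayList<>();
        kids.add(new NameTreeNode(leafA, null));
        kids.add(new NameTreeNode(leafB, null));

        // The root has Kids but no Names, which is the case described in the issue.
        NameTreeNode root = new NameTreeNode(null, kids);

        System.out.println(collect(root).keySet()); // prints [attached.doc, attached.txt]
    }
}
```

A walker that only reads Names at the root would return nothing for this tree, since both attachments sit one level down under Kids.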