[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605
 ] 

Tim Allison commented on TIKA-1228:
---

Not sure I understand.  Is this the snippet that you refer to in PDNameTreeNode:
{noformat}
public Map<String, COSObjectable> getNames() throws IOException
{
    COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
{noformat}

The above throws a class cast exception, but the code that you show doesn't?

Are you getting a class cast exception on the document that you submitted with 
this issue or is it a different document?

Thank you, again.

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890607#comment-13890607
 ] 

Jason Sherman commented on TIKA-1228:
-

Tim,

I saw you already added a test and fix to the codebase.  Thanks!  I'm going to 
clone it and use it if you don't mind. 

Jason



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610
 ] 

Tim Allison commented on TIKA-1228:
---

Y.  That's the point of open source. :)  Enjoy!

Now that I'm looking at this issue again, I dragged out some of my pre-Tika 
code for pdf attachments using a different pdf library.  It looks like the pdf 
files I was coding against could have the file name in a parent node and the 
actual bytes in a child or more distant descendant node.

Will see if I can dig up the triggering files and see if Tika needs any more 
mods on PDF attachment extraction.

{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach,
        int recursiveDepth){

    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            // F holds the file name; keep looking for the bytes
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            // F holds the actual bytes; done
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        // descend into the EF dictionary
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}
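
For illustration only, here is a self-contained sketch of the same parent-name/descendant-bytes search that does not depend on any PDF library. Everything in it is an assumption made for the example: dictionaries are modelled as plain maps, `Attachment` and `MAX_DEPTH` are hypothetical, and the depth check is an addition (the snippet above passes a depth counter along but never tests it).

```java
import java.util.HashMap;
import java.util.Map;

public class ByteStreamSearchSketch {

    /** Hypothetical depth limit; not part of the original snippet. */
    static final int MAX_DEPTH = 10;

    /** Hypothetical holder for an attachment's name and bytes. */
    static class Attachment {
        String name;
        byte[] bytes;
    }

    /**
     * Dictionaries are modelled as plain maps: "F" holds either a String
     * (file name) or a byte[] (file content); "EF" holds a nested map.
     * Returns the attachment once bytes are found, or null otherwise.
     */
    static Attachment lookForByteStream(Map<String, Object> dict, Attachment attach, int depth) {
        if (dict == null || depth > MAX_DEPTH) {
            return null; // give up on missing or suspiciously deep structures
        }
        Object f = dict.get("F");
        if (f instanceof String) {
            attach.name = (String) f;  // name found, keep looking for bytes
        } else if (f instanceof byte[]) {
            attach.bytes = (byte[]) f; // bytes found, done
            return attach;
        }
        Object ef = dict.get("EF");
        if (ef instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> child = (Map<String, Object>) ef;
            return lookForByteStream(child, attach, depth + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        // Name in the parent, bytes one level down; the values are made up.
        Map<String, Object> child = new HashMap<>();
        child.put("F", new byte[] {1, 2, 3});
        Map<String, Object> parent = new HashMap<>();
        parent.put("F", "example.doc");
        parent.put("EF", child);

        Attachment found = lookForByteStream(parent, new Attachment(), 0);
        System.out.println(found.name + ": " + found.bytes.length + " bytes");
    }
}
```

The depth limit is one way to guard against a dictionary that (directly or indirectly) refers back to itself; whether real files need it is not something this thread establishes.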



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890608#comment-13890608
 ] 

Jason Sherman commented on TIKA-1228:
-

Tim,

Dang.  During my troubleshooting, I first updated pdfbox to 1.8.3 and was using 
that source to step through the code.  After the weirdness with the exception 
in code, but not in my expression evaluator, I reverted to the original tika 
code, but failed to revert the pdfbox code.  I apologize for the confusion.  
Thanks again for your fast responses.

Jason



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613
 ] 

Tim Allison commented on TIKA-1228:
---

Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue?

Thanks again.



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890639#comment-13890639
 ] 

Jason Sherman commented on TIKA-1228:
-

Correct.  The PDNameTreeNode class cast exception is a non-issue.



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison commented on TIKA-1228:
---

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

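{noformat}
// names stored directly on the root node (the existing behavior)
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
processEmbedded(embeddedFileNames, embeddedExtractor);
// names stored one level down, under Kids
List<PDNameTreeNode> kids = embeddedFiles.getKids();
if (kids != null){
    for (PDNameTreeNode n : kids){
        Map<String, COSObjectable> kidNames = n.getNames();
        processEmbedded(kidNames, embeddedExtractor);
    }
}
{noformat}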

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
    ...
}
{noformat}
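
The recursive lookup described above can be sketched without PDFBox on the classpath. This is a minimal illustration under stated assumptions, not the actual Tika fix: `Node` is a hypothetical stand-in for PDNameTreeNode whose `getNames()` and `getKids()` are assumed to return null when the corresponding entry is absent, and the file names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeSketch {

    /** Hypothetical stand-in for PDNameTreeNode; not the PDFBox class. */
    static class Node {
        private final Map<String, String> names; // null when the node has no Names entry
        private final List<Node> kids;           // null when the node has no Kids entry

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }

        Map<String, String> getNames() { return names; }
        List<Node> getKids() { return kids; }
    }

    /** Collects the names of this node and of all its descendants, depth first. */
    static Map<String, String> collectNames(Node node) {
        Map<String, String> result = new LinkedHashMap<>();
        collect(node, result);
        return result;
    }

    private static void collect(Node node, Map<String, String> result) {
        if (node == null) {
            return;
        }
        Map<String, String> names = node.getNames();
        if (names != null) {
            result.putAll(names);
        }
        List<Node> kids = node.getKids();
        if (kids != null) {
            for (Node kid : kids) {
                collect(kid, result);
            }
        }
    }

    public static void main(String[] args) {
        // Root has no Names entry, only Kids, as in the reported PDF.
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("a.doc", "file spec for a.doc");
        leafNames.put("b.txt", "file spec for b.txt");
        List<Node> kids = new ArrayList<>();
        kids.add(new Node(leafNames, null));
        Node root = new Node(null, kids);

        System.out.println(collectNames(root).keySet());
    }
}
```

Recursing over every level, rather than stopping one level down, costs little and also covers deeper name trees than the one in the attached file.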

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least, we might want to request that 
this line in the javadoc for PDDocumentNameDictionary: "The value in this name 
tree will be PDComplexFileSpecification objects." be changed to: "The value in 
this name tree or its children will be PDComplexFileSpecification objects."



[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889889#comment-13889889
 ] 

Jason Sherman commented on TIKA-1228:
-

Thanks for the help.  Another possibly related issue:
when I was stepping through the pdfbox code, line 286 throws an exception when 
running, but evaluates properly in my evaluation dialog (IntelliJ 13):

namesArray = (COSArray)((COSDictionary)((COSArray)node.getDictionaryObject(COSName.KIDS)).get(0)).getDictionaryObject(COSName.NAMES);

Throws:
org.apache.pdfbox.cos.COSObject cannot be cast to 
org.apache.pdfbox.cos.COSDictionary

Do you want to pass that on to the pdfbox folks, or should I report it 
separately?
