[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

Nick Burch (JIRA) Tue, 06 Aug 2013 04:06:01 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730649#comment-13730649
 ]


Nick Burch commented on TIKA-1124:
----------------------------------

I don't know the PDF code well, but at first glance the patch looks good.

It might be worth expanding the test a little, to check that all the expected 
parts are found, and in the right order. POIContainerExtractionTest has some 
examples of doing that for other embedded resources, so it may be worth a look 
as a guide.
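
As a rough illustration of the kind of check meant here, the sketch below 
records each embedded part as it is reported and then compares the recorded 
names against an expected list, position by position. It does not use the 
Tika API. RecordingHandler, firstMismatch and the file names are all made up 
for the example, and a real test would feed the handler from the container 
extraction code rather than by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedOrderCheck {

    /** Hypothetical stand-in for a handler that is called once per embedded part. */
    static class RecordingHandler {
        final List<String> names = new ArrayList<>();
        final List<String> types = new ArrayList<>();

        void handle(String name, String mediaType) {
            names.add(name);
            types.add(mediaType);
        }
    }

    /**
     * Returns null when both lists hold the same entries in the same order,
     * otherwise a short description of the first difference.
     */
    static String firstMismatch(List<String> expected, List<String> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return "index " + i + ": expected " + expected.get(i)
                        + " but got " + actual.get(i);
            }
        }
        if (expected.size() != actual.size()) {
            return "expected " + expected.size() + " parts but got " + actual.size();
        }
        return null;
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();

        // Assumption: in a real test these calls come from the extractor,
        // one per embedded part. The names here are invented.
        handler.handle("outer.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        handler.handle("middle.pdf", "application/pdf");
        handler.handle("inner.txt", "text/plain");

        List<String> expected = Arrays.asList("outer.docx", "middle.pdf", "inner.txt");
        String problem = firstMismatch(expected, handler.names);
        if (problem != null) {
            throw new AssertionError(problem);
        }
        System.out.println("all parts found in the expected order");
    }
}
```

Comparing by position rather than by set membership means a part that is 
missing, duplicated or reported out of order fails the check with a message 
naming the first offending index.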
                
> Nested documents not extracted if a PDF file is in the chain
> ------------------------------------------------------------
>
>                 Key: TIKA-1124
>                 URL: https://issues.apache.org/jira/browse/TIKA-1124
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: pdf_attachment_issues.zip, TIKA-1124.patch
>
>
> Tika 1.3 is not able to get attachments from the attached PDF.
> The trunk is able to get attachments from the PDF.  However, if that PDF is 
> then embedded in another document, the docs embedded in the PDF are not 
> extracted.
> I'm not sure of a solution, but I found two things that might help with the 
> diagnosis:
> 1) If you modify the code in PDFParser so that it doesn't wrap the handler in 
> a BodyContentHandler, everything works (in trunk).
> 2) If you modify BodyContentHandler to use my toy 
> SimpleBodyMatchingContentHandler, the problem is also solved.
> The cause may be in the MatchingContentHandler.
