[ 
https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097212#comment-13097212
 ] 

Jukka Zitting commented on TIKA-704:
------------------------------------

See also revisions 1165230 and 1165259 for follow-up work.

Jeremy, I notice you didn't select the "Grant license to ASF for inclusion in 
ASF works (as per the Apache License §5)" option when uploading the test 
documents. Was this on purpose, or is it OK if we include at least the 
testWithPdf.docx document as a test case in Tika?

The email attachment in testWithOutlook.docx contains a PDF under Yamaha 
copyright, so we probably can't use that in any case. It would be great if you 
could create an alternative test file without any external content that we 
could include in Tika.

> PDF and Outlook docs embedded in MS Word documents not parsed
> -------------------------------------------------------------
>
>                 Key: TIKA-704
>                 URL: https://issues.apache.org/jira/browse/TIKA-704
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows 7 64-bit
>            Reporter: Jeremy Anderson
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: TestWithOutlook.docx, TestWithPdf.docx, 
> recursiveUsage.txt
>
>
> Currently there appear to be issues with embedded PDFs and Outlook MSG files 
> contained in MS Word documents. I'll attach a sample for each and my 
> recursive parser (in case the problem lies in there).
> From what I see, when these embedded objects are parsed, they're initially 
> identified as vnd.openxmlformats-officedocument.oleObject in the metadata's 
> Content-Type field. After a call to the RecurciveParsers super parse class 
> the Content-Types update to the following:
> PDFs: application/vnd.ms-works
> .MSG: application/x-tika-msoffice
> The internal AutoDetectParser is unable to properly identify these PDFs and 
> therefore does not call the PDFParser on them.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

