Hi,

2011/9/5 Maxim Valyanskiy <[email protected]>:
> On 05.09.2011 at 16:23, Jukka Zitting wrote:
>> This was my attempt at properly handling the embedded PDF in
>> TestWithPdf.docx. It was included in an OLE object with the PDF
>> document as its "CONTENTS" entry. I restored this functionality with
>> some more specific checks in revision 1165259, and the resulting code
>> should now work correctly with all the test documents we have.
>
> Hm, that is strange - current version of
> OfficeParser.POIFSDocumentType.detectType()
> thinks that "CONTENTS" part identifies POI filesystem as MS Works document.
> Maybe this is not right.

I think we have some MS Works test files that do contain the "CONTENTS"
entry, though I'm not sure if that's the best possible heuristic for
detecting MS Works documents. My fix in revision 1165259 also checks
for the presence of explicit OLE entries, which I believe should help
prevent collisions with actual embedded MS Works documents.

> Please add unit test with that TestWithPdf.docx.

The file was uploaded without the "grant license" option (and I
couldn't create a similar document myself), so I unfortunately couldn't
add the test case along with my original commit. I asked for the
required license grant in TIKA-704 and will add the test case if
approved.

BR,

Jukka Zitting
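
For illustration only, the heuristic discussed above (a "CONTENTS" entry
plus explicit OLE entries means an embedded object, "CONTENTS" alone
falls back to MS Works) could be sketched roughly as follows. This is
not the code from revision 1165259: the class, enum and method names
are hypothetical, and the choice of "\u0001Ole" and "\u0001CompObj" as
the OLE marker entries is an assumption made for the sketch.

```java
import java.util.Set;

// Hypothetical sketch of the heuristic; not the actual OfficeParser code.
final class PoifsTypeSketch {

    enum Kind { EMBEDDED_OLE_OBJECT, MS_WORKS, UNKNOWN }

    // Assumption: these stream names mark an explicit OLE wrapper.
    static final Set<String> OLE_MARKERS = Set.of("\u0001Ole", "\u0001CompObj");

    // Classifies a POIFS directory from the names of its top-level entries.
    static Kind guessKind(Set<String> entryNames) {
        if (!entryNames.contains("CONTENTS")) {
            return Kind.UNKNOWN; // this sketch only covers the "CONTENTS" case
        }
        for (String marker : OLE_MARKERS) {
            if (entryNames.contains(marker)) {
                // "CONTENTS" next to explicit OLE entries: treat the
                // "CONTENTS" stream as the embedded payload (e.g. a PDF).
                return Kind.EMBEDDED_OLE_OBJECT;
            }
        }
        // "CONTENTS" without OLE entries: keep the MS Works guess.
        return Kind.MS_WORKS;
    }
}
```

Checking the OLE entries first means a real MS Works file, which by
assumption lacks them, still ends up as MS Works.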
