https://issues.apache.org/bugzilla/show_bug.cgi?id=51891
--- Comment #4 from Daniel Bonniot <[email protected]> ---

I think I might have cracked this nut. The two inputs from Bug 46392 have the same storage class ID:

0003000C-0000-0000-C000-000000000046

It's hard to find much information about this class ID, but it seems to be associated with some kind of "Package" (see for instance http://www.lookas.net/ftp/Software/Tools/LitWin_98/regist~1.reg). This in turn suggests that the parsing done by Ole10Native might be valid only for this specific kind of content.

If that's indeed the case, we can change the logic to always use "plain", except for content with exactly this storage class ID. This still passes all the known test cases, and feels much more correct than the previous attempt.

I'll attach a new patch. It uses this new logic, and also adds one more test case from https://issues.apache.org/jira/browse/TIKA-1072, which is also fixed by this change.

Note that this suggests the "structured" parsing done by Ole10Native might not belong here at all: it is tied to one specific kind of content, so it would logically belong to the client of POI. However, I might be wrong here, and it does not cost much to keep providing this feature instead of breaking it and the corresponding API.
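The proposed rule (always "plain", except for content with exactly that storage class ID) could be sketched roughly as follows. This is only an illustration and not the attached patch: the class name, the ParseMode enum and the chooseMode helper are all hypothetical, and the class ID is handled as a plain string rather than through any POI type.

```java
import java.util.Locale;

// Hypothetical sketch; none of these names come from POI or from the patch.
class Ole10NativeDispatchSketch {

    // Storage class ID quoted in the comment, apparently tied to "Package" content.
    static final String PACKAGE_CLSID = "0003000C-0000-0000-C000-000000000046";

    // The two parsing modes discussed in the comment.
    enum ParseMode { PLAIN, STRUCTURED }

    // Returns STRUCTURED only for an exact match on the class ID, PLAIN otherwise.
    static ParseMode chooseMode(String storageClassId) {
        if (storageClassId == null) {
            return ParseMode.PLAIN;
        }
        String normalized = storageClassId.trim().toUpperCase(Locale.ROOT);
        // Assumption: tolerate the registry-style form wrapped in braces.
        if (normalized.length() >= 2
                && normalized.startsWith("{") && normalized.endsWith("}")) {
            normalized = normalized.substring(1, normalized.length() - 1);
        }
        return PACKAGE_CLSID.equals(normalized)
                ? ParseMode.STRUCTURED
                : ParseMode.PLAIN;
    }

    public static void main(String[] args) {
        // The class ID shared by the two inputs from Bug 46392.
        System.out.println(chooseMode("0003000C-0000-0000-C000-000000000046"));
        // Any other class ID (an arbitrary placeholder here) falls back to plain.
        System.out.println(chooseMode("00000000-0000-0000-0000-000000000000"));
    }
}
```

Defaulting to the plain mode keeps the structured parsing confined to the one kind of content it is known to work for, which is the point of the change.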
