https://issues.apache.org/bugzilla/show_bug.cgi?id=51891

--- Comment #4 from Daniel Bonniot <[email protected]> ---
I think I might have cracked this nut. The two inputs from Bug 46392 have the
same storage class ID:
0003000C-0000-0000-C000-000000000046

It's hard to find much information about this class ID, but it seems to be
associated with some kind of "Package" (see for instance
http://www.lookas.net/ftp/Software/Tools/LitWin_98/regist~1.reg). This in turn
seems to suggest that the parsing done by Ole10Native might actually be valid
only for this specific kind of content. If that's indeed the case, we can
change the logic to always use "plain" parsing, except for content with exactly
this storage class ID. This still passes all the known test cases, and feels
much more correct than the previous attempt.
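
To make the rule above concrete, here is a minimal sketch in plain Java. It is
not the attached patch and does not use POI's API: the class name
Ole10ModeSketch, the Mode enum and the chooseMode method are hypothetical names
chosen for illustration. Accepting the class ID as a string, and tolerating
surrounding braces and lower case, are also assumptions.

```java
import java.util.Locale;

public class Ole10ModeSketch {
    /** Hypothetical labels for the two parsing strategies discussed in this bug. */
    public enum Mode { PLAIN, STRUCTURED }

    /** Storage class ID shared by the two inputs from Bug 46392. */
    public static final String PACKAGE_CLASS_ID =
            "0003000C-0000-0000-C000-000000000046";

    /**
     * Returns STRUCTURED only when the storage class ID is exactly the
     * "Package" one; every other value (including null) falls back to PLAIN.
     */
    public static Mode chooseMode(String storageClassId) {
        if (storageClassId == null) {
            return Mode.PLAIN;
        }
        // Assumption: normalise case and optional braces before comparing.
        String id = storageClassId.trim().toUpperCase(Locale.ROOT);
        if (id.startsWith("{") && id.endsWith("}")) {
            id = id.substring(1, id.length() - 1);
        }
        return PACKAGE_CLASS_ID.equals(id) ? Mode.STRUCTURED : Mode.PLAIN;
    }

    public static void main(String[] args) {
        // The "Package" class ID selects structured parsing.
        System.out.println(chooseMode("0003000C-0000-0000-C000-000000000046"));
        // Any other class ID selects plain parsing.
        System.out.println(chooseMode("00020906-0000-0000-C000-000000000046"));
    }
}
```

The point of the sketch is that "plain" is the default and the structured path
is opt-in for a single, exactly matched class ID, so unknown content is never
run through the structured parser by accident.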

I'll attach a new patch. It uses this new logic, and also adds one more test
case from https://issues.apache.org/jira/browse/TIKA-1072 which is also fixed
by this.

Note that this suggests the "structured" parsing done by Ole10Native might not
belong here at all: it is tied to one specific kind of content, so it would
logically belong to the client of POI. However, I might be wrong here, and it
also does not cost much to keep providing this feature, instead of breaking it
and the corresponding API.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
