https://bz.apache.org/bugzilla/show_bug.cgi?id=50955

--- Comment #9 from Tim Allison <[email protected]> ---
I figured out how to read the old font table, which includes codepage info.
This doesn't solve all of our problems, but it helps.  Testing with
OpenOffice, I found that I can't have two different codepages in one
document.  That may be a limitation of OpenOffice rather than of the format,
but this hack/heuristic works with all of the files attached here, the files
from TIKA-2313, and files generated with OpenOffice.

So, the current temporary solution is to read through the font table and pick
the codepage that isn't "default" or "symbol."
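
As a rough illustration of that heuristic (not the actual commit), here is a
minimal sketch.  FontEntry, pickCharset and the class name are hypothetical
and are not POI API.  The constants 1 and 2 are the Windows LOGFONT charset
ids DEFAULT_CHARSET and SYMBOL_CHARSET, which I'm assuming is what "default"
and "symbol" mean in the font table.

```java
import java.util.List;
import java.util.OptionalInt;

// Sketch only: these names are illustrative, not part of POI.
class FontTableCodepageSketch {
    // Windows LOGFONT charset ids; assumed to match "default" and "symbol".
    static final int DEFAULT_CHARSET = 1;
    static final int SYMBOL_CHARSET = 2;

    /** Hypothetical stand-in for one entry of the old font table. */
    static final class FontEntry {
        final String name;
        final int charset;

        FontEntry(String name, int charset) {
            this.name = name;
            this.charset = charset;
        }
    }

    /**
     * Returns the charset of the first font that is neither "default" nor
     * "symbol", or an empty result if every entry is one of those two.
     */
    static OptionalInt pickCharset(List<FontEntry> fontTable) {
        for (FontEntry font : fontTable) {
            if (font.charset != DEFAULT_CHARSET && font.charset != SYMBOL_CHARSET) {
                return OptionalInt.of(font.charset);
            }
        }
        return OptionalInt.empty();
    }
}
```

The result is a charset id that still has to be mapped to an actual codepage,
and the caller decides what to fall back to when nothing qualifies.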

Ideally, we'd be able to map each run to a font table.  If anyone has
recommendations, let me know.


Side note:
I also fixed a bug in PapInTable:

-   if ( papx.getGrpprl() == null || papx.getGrpprl().length == 0 )
+   if ( papx.getGrpprl() == null || papx.getGrpprl().length <= 2 )

The issue is that there were some grpprls with size 1 in the old docs, and this
caused an array out of bounds exception when copying because we start at offset
2.
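
To show why the length check matters, here is a small standalone sketch of
the same guard.  GrpprlGuardSketch and sprmBytes are hypothetical names, and
the copy via Arrays.copyOfRange is only a stand-in for the real copying code;
the one thing taken from the fix is that the copy starts at offset 2.

```java
import java.util.Arrays;

// Sketch only: not the PapInTable code, just the guard in isolation.
class GrpprlGuardSketch {
    // The copy skips the first two bytes of the grpprl.
    static final int START_OFFSET = 2;

    /** Returns the bytes after the 2-byte prefix, or an empty array if there are none. */
    static byte[] sprmBytes(byte[] grpprl) {
        // Without this check, a 1-byte grpprl makes copyOfRange throw
        // ArrayIndexOutOfBoundsException because the start offset exceeds the length.
        if (grpprl == null || grpprl.length <= START_OFFSET) {
            return new byte[0];
        }
        return Arrays.copyOfRange(grpprl, START_OFFSET, grpprl.length);
    }
}
```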

Commit to come shortly.
