https://bz.apache.org/bugzilla/show_bug.cgi?id=50955
--- Comment #9 from Tim Allison ---
I figured out how to read the old font table, which includes codepage info. This doesn't solve all of our problems, but it helps.

Via testing with OpenOffice, I found that I can't have two different codepages in one document. That may be a feature of OpenOffice and not reality, but this hack/heuristic works with all files attached here, TIKA-2313, and files generated with OpenOffice.

So, the current temporary solution is to read through the font table and pick the codepage that isn't "default" or "symbol." Ideally, we'd be able to map each run to a font table. If anyone has recommendations, let me know.

Side note: I also fixed a bug in PapInTable:

- if ( papx.getGrpprl() == null || papx.getGrpprl().length == 0 )
+ if ( papx.getGrpprl() == null || papx.getGrpprl().length <= 2 )

The issue is that there were some grpprls with size 1 in the old docs, and this caused an array out of bounds exception when copying because we start at offset 2.

Commit to come shortly.
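
For anyone following along, here is a rough Java sketch of the two ideas in this comment: the "skip default and symbol" codepage heuristic and the length guard before copying from offset 2. This is not the POI code. The class `OldDocSketch`, both method names, and the representation of the font table as a plain list of charset ids are hypothetical. Mapping "default" and "symbol" to the Windows charset ids 1 and 2 is also an assumption about what the comment means.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in POI.
class OldDocSketch {
    // Windows GDI charset ids. Treating these two as the "default" and
    // "symbol" entries mentioned in the comment is an assumption.
    static final int DEFAULT_CHARSET = 1;
    static final int SYMBOL_CHARSET = 2;

    // Walk the font table's charset ids and return the first one that is
    // neither "default" nor "symbol". Fall back if no such entry exists.
    static int pickCharset(List<Integer> fontTableCharsets, int fallback) {
        for (int chs : fontTableCharsets) {
            if (chs != DEFAULT_CHARSET && chs != SYMBOL_CHARSET) {
                return chs;
            }
        }
        return fallback;
    }

    // Copy the bytes of a grpprl that follow the first two bytes.
    // Arrays of length 0, 1, or 2 have nothing after offset 2, so they
    // are treated as empty instead of attempting the copy.
    static byte[] copyGrpprlPayload(byte[] grpprl) {
        if (grpprl == null || grpprl.length <= 2) {
            return new byte[0];
        }
        byte[] payload = new byte[grpprl.length - 2];
        System.arraycopy(grpprl, 2, payload, 0, payload.length);
        return payload;
    }

    public static void main(String[] args) {
        // 204 is RUSSIAN_CHARSET in the Windows charset id list.
        System.out.println(pickCharset(Arrays.asList(1, 2, 204), 0));
        System.out.println(copyGrpprlPayload(new byte[] {7}).length);
    }
}
```

With only an `== 0` check, the one-byte array in `main` would reach the copy step with a negative payload length, which is the kind of failure the `<= 2` guard avoids.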
