https://bz.apache.org/bugzilla/show_bug.cgi?id=60936
Bug ID: 60936
Summary: Figure out charset in Word 6.0 files
Product: POI
Version: 3.16-dev
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: ---
On TIKA-2313, Steven Hall submitted an example Word 6.0 file whose extracted
text is garbage.
From what I can tell so far, our more modern code to check for isUnicode in
TextPieceTable should not be used on Word 6.0 files. If I disable that, the
text is correctly extracted.
We should figure out what mechanism Word 6.0 files use to determine the
codepage, and we should look into disabling the isUnicode check for those
files.
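
To illustrate the suspected failure mode, here is a minimal, self-contained
sketch. It is not POI code: decodePiece and its parameters are hypothetical
names made up for this example, and windows-1252 is only an assumed fallback,
since the real codepage mechanism is exactly what still needs to be figured
out. It shows how 8-bit text decoded as UTF-16LE turns into garbage, and how
skipping the Unicode path for Word 6.0 input would avoid that.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Word6TextPieceSketch {

    /**
     * Hypothetical helper (not part of POI): decodes the raw bytes of one
     * text piece. The Unicode flag is ignored when the file is Word 6.0.
     */
    static String decodePiece(byte[] raw, boolean flaggedUnicode,
                              boolean isWord6, Charset fallback) {
        // Assumption: Word 6.0 text is always 8-bit, whatever the flag says.
        boolean treatAsUnicode = flaggedUnicode && !isWord6;
        Charset charset = treatAsUnicode ? StandardCharsets.UTF_16LE : fallback;
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumed fallback codepage, for illustration only.
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] raw = "Word 6.0".getBytes(cp1252);

        // Unicode flag honoured: each byte pair becomes one unrelated character.
        System.out.println(decodePiece(raw, true, false, cp1252));

        // Unicode flag ignored for Word 6.0: the original text comes back.
        System.out.println(decodePiece(raw, true, true, cp1252));
    }
}
```

In a real fix the fallback charset would presumably come from the document
itself rather than being hard-coded as it is here.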