https://bz.apache.org/bugzilla/show_bug.cgi?id=60936

            Bug ID: 60936
           Summary: Figure out charset in Word 6.0 files
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

On TIKA-2313, Steven Hall submitted an example Word 6.0 file whose extracted
text is garbage.

From what I can tell so far, our more modern code to check for isUnicode in
TextPieceTable should not be used on Word 6.0 files.  If I disable that, the
text is correctly extracted.

We should figure out what mechanism Word 6.0 files use to determine the
codepage, and we should look into disabling the isUnicode check for Word 6.0
files.
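
To illustrate the symptom, here is a minimal standalone sketch in plain Java.
It is not POI's TextPieceTable code: the class Word6DecodeSketch, the method
decodePiece, and the isWord6 parameter are all hypothetical, and windows-1252
is only an assumed codepage, since how Word 6.0 actually records its codepage
is the open question of this bug. It shows 8-bit text turning into garbage
when a Unicode flag is trusted and the bytes are decoded as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Word6DecodeSketch {

    /**
     * Hypothetical helper: decode the raw bytes of one text piece.
     * Assumption taken from the report: for Word 6.0 files the Unicode
     * flag should be ignored and the bytes treated as 8-bit text.
     */
    static String decodePiece(byte[] raw, boolean unicodeFlag,
                              boolean isWord6, Charset codepage) {
        boolean treatAsUnicode = unicodeFlag && !isWord6;
        return treatAsUnicode
                ? new String(raw, StandardCharsets.UTF_16LE)
                : new String(raw, codepage);
    }

    public static void main(String[] args) {
        // Assumed codepage; the real one would have to come from the file.
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] raw = "Word 6.0".getBytes(cp1252);

        // Flag ignored for Word 6.0: the 8-bit text round-trips.
        System.out.println(decodePiece(raw, true, true, cp1252));

        // Flag trusted: byte pairs are fused into unrelated UTF-16 characters.
        System.out.println(decodePiece(raw, true, false, cp1252));
    }
}
```

In this sketch the codepage is passed in by the caller; a real fix would
still need to find where a Word 6.0 file stores that information.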

