https://bz.apache.org/bugzilla/show_bug.cgi?id=50955
--- Comment #11 from Tim Allison <[email protected]> ---

Turns out that 51944.doc is not UTF-16LE. Judging from this file and 2 other files from our common crawl corpus, it looks like the text is actually Big5, but MS appears to zero-pad ASCII characters.

Has anyone worked with this? Do we have something in our codebase that deals with this already? If not, we may need some extra code to imitate MS's Big5 en/decoding... not within the scope of this ticket.

Based on ~1300 Word 6.0 files in our corpus, the proposed solution appears to work. Unfortunately, there are only a few handfuls of files that aren't encoded with WIN-1252.
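
To make the "Big5 with zero-padded ASCII" idea concrete, here is a rough Python sketch of what such a decoder might look like. It is only an illustration, not code from our codebase. The function name decode_padded_big5 is hypothetical, and the byte layout is an assumption that the comment above does not confirm: every character is taken to occupy two bytes, with ASCII stored as (char, 0x00) and Big5 characters stored low byte first.

```python
def decode_padded_big5(data: bytes, encoding: str = "big5") -> str:
    """Decode 2-byte units: zero-padded single bytes, otherwise Big5.

    Assumed layout (hypothetical, not confirmed by the bug report):
      - ASCII character c -> bytes (c, 0x00)
      - Big5 character    -> bytes (trail, lead), i.e. low byte first
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = []
    for i in range(0, len(data), 2):
        low, high = data[i], data[i + 1]
        if high == 0x00:
            # Zero-padded unit: treat the low byte as a single code point.
            out.append(chr(low))
        else:
            # Restore lead/trail order before handing the pair to the codec;
            # undecodable pairs become U+FFFD instead of raising.
            out.append(bytes([high, low]).decode(encoding, errors="replace"))
    return "".join(out)


if __name__ == "__main__":
    # Build a sample in the assumed layout: "A" followed by one Big5 character.
    lead, trail = "中".encode("big5")
    sample = bytes([0x41, 0x00, trail, lead])
    print(decode_padded_big5(sample))
```

If the real files use the opposite byte order, or MS's cp950 dialect rather than plain Big5, the swap and the encoding argument would need to change accordingly; Python ships a "cp950" codec that can be passed as the encoding.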
