Re: [sw-discussion] What is the best way to prevent data loss by choosing filters which does not support unicode?

Jeongkyu Kim Fri, 12 Dec 2008 12:04:37 -0800

On Thu, Nov 20, 2008 at 5:31 PM, Caolán McNamara <[email protected]> wrote:
> On Thu, 2008-11-20 at 01:38 +0900, Jeongkyu Kim wrote:
>> For instance, MS Word 6.0 and 95 format filters
>> are almost useless and even dangerous for Korean users (unless we
>> implement application-level character set handling).
>
> It might be worth seeing if you can improve the handling of Korean text
> on ww6/ww95 export. See getScriptClass in writerwordglue.cxx which
> maps character ranges to 8bit encodings for best-pick on export to
> 8 bit ww6 and ww95 formats. GetPseudoCharRuns uses that to create
> sections of text that share the same properties and character encoding
> and they get written out with SwWW8Writer::OutSwString using the
> encoding guessed with getScriptClass. Some tweaking around
> getScriptClass might improve the situation.
>
> C.
>


Hello Caolán,

Simply adding support for the Hangul (Korean) scripts to
getScriptClass(), as shown below, certainly improves the situation.

        static ScriptTypeList aScripts[] =
        {
            { UnicodeScript_kBasicLatin, UnicodeScript_kBasicLatin, RTL_TEXTENCODING_MS_1252},
            { UnicodeScript_kLatin1Supplement, UnicodeScript_kLatin1Supplement, RTL_TEXTENCODING_MS_1252},
            ...
+           { UnicodeScript_kHangulJamo, UnicodeScript_kHangulJamo, RTL_TEXTENCODING_MS_949},
+           { UnicodeScript_kHangulCompatibilityJamo, UnicodeScript_kHangulCompatibilityJamo, RTL_TEXTENCODING_MS_949},
+           { UnicodeScript_kHangulSyllable, UnicodeScript_kHangulSyllable, RTL_TEXTENCODING_MS_949},
            { UnicodeScript_kScriptCount, UnicodeScript_kScriptCount, RTL_TEXTENCODING_MS_1252}

When I tested this fix, the exported files looked OK in MS Word 2007.
The font name was still '???', but it does not matter here because
preserving data is my main concern. I am very happy with the result
and thank you very much for your help.
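
For readers who do not have writerwordglue.cxx at hand, the idea behind
the table can be sketched in a standalone form. This is only an
illustration, not the OO.o code: RangeEncoding, kRanges and pickEncoding
are hypothetical names, the encodings are plain strings rather than
rtl_TextEncoding values, and the ranges are the Unicode block boundaries
rather than the UnicodeScript_* constants.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the real table: each entry maps an inclusive
// range of UTF-16 code units to the name of the code page used on export.
struct RangeEncoding {
    std::uint16_t first;
    std::uint16_t last;
    const char*   encoding;
};

static const RangeEncoding kRanges[] = {
    { 0x0000, 0x007F, "MS_1252" },  // Basic Latin
    { 0x0080, 0x00FF, "MS_1252" },  // Latin-1 Supplement
    { 0x1100, 0x11FF, "MS_949"  },  // Hangul Jamo
    { 0x3130, 0x318F, "MS_949"  },  // Hangul Compatibility Jamo
    { 0xAC00, 0xD7AF, "MS_949"  },  // Hangul Syllables
};

// Returns the encoding chosen for one code unit. Anything not covered by
// the table falls back to MS_1252, mirroring the last entry of the
// original table.
std::string pickEncoding(std::uint16_t c) {
    for (const RangeEncoding& r : kRanges) {
        if (c >= r.first && c <= r.last) {
            return r.encoding;
        }
    }
    return "MS_1252";
}
```

With such a table, a Hangul syllable like U+D55C selects MS_949, while a
character outside every listed range still falls through to MS_1252,
which is where the data loss came from before the three entries were
added.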

However, there is one more piece of homework left for me. When I opened
the exported file with OO.o Writer, the Korean characters were broken. I
have been playing around with some functions in ww8par.cxx, such as
ReadPlainChars(), GetCurrentCharSet(), and Custom8BitToUnicode(), but I
have had no luck yet. Now I need some hints on how to handle Korean
characters correctly in the import filter. FYI, Korean text uses a
double-byte character set (DBCS), and its encoding is MS949.
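
To make the DBCS point concrete, here is a small standalone sketch (not
ww8par.cxx code, and not a diagnosis of the import problem). It assumes
that MS949 lead bytes lie in 0x81..0xFE and that bytes below 0x80 are
single-byte ASCII; isCp949LeadByte and countCp949Chars are hypothetical
helper names.

```cpp
#include <cstddef>
#include <string>

// Assumption: in MS949 a byte in 0x81..0xFE starts a two-byte character.
bool isCp949LeadByte(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Counts characters in an MS949 byte string. A lead byte consumes the
// following byte as well; a dangling lead byte at the end of the input
// is counted as one (broken) character.
std::size_t countCp949Chars(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        const bool twoBytes = isCp949LeadByte(b) && i + 1 < bytes.size();
        i += twoBytes ? 2 : 1;
        ++count;
    }
    return count;
}
```

For example, the five bytes 41 C7 D1 B1 DB hold three characters ("A"
followed by two Hangul syllables), so any reader that converts one byte
at a time would split each Korean character in half.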

Thanks,
Jeongkyu
-- 
Jeongkyu Kim
OpenOffice.org Korean community lead

Community website http://openoffice.or.kr
Personal blog     http://openoffice.or.kr/gomme
