Hi,
I am trying to decode MIME encoded words (the original string is in Chinese)
using DecoderUtil.decodeEncodedWords().
Here is the sample code:
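import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.junit.Test;
// DecoderUtil also needs an import; its package depends on the library version.

@Test
public void testEncoding() throws IOException {
    // Two encoded words separated by folding whitespace, both labelled gb2312.
    String str = "=?gb2312?B?ztKyu8rH1tCH+LmyrmEudHh0?="
            + "\r\n "
            + "=?gb2312?B?ztLKx9bQufrIyy50eHQ=?=";
    str = DecoderUtil.decodeEncodedWords(str);
    File file = new File("C:/chinese2.txt");
    // try-with-resources closes the stream even if write() throws.
    try (FileOutputStream fileOut = new FileOutputStream(file)) {
        fileOut.write(str.getBytes("gb2312"));
    }
}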
With the above code, the decoded characters come out corrupted.
The problem is the character set: most mail clients label the text as GB2312,
but the encoded bytes can include characters outside GB2312 (as in the first
encoded word above), so to decode them correctly I had to use GB18030 in the
above code. (See this for more info:
http://stackoverflow.com/questions/3856920/character-corruption-for-chinese-simple-and-traditional-and-korean-texts)
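To make the difference concrete, here is a minimal JDK-only sketch that decodes the Base64 payload of the first encoded word by hand, once as GB2312 and once as GB18030. The class and method names (GbDecodeDemo, decode) are made up for illustration, and it assumes a JDK that ships both charsets. It does not go through DecoderUtil at all.

```java
import java.nio.charset.Charset;
import java.util.Base64;

public class GbDecodeDemo {
    // Decodes a Base64 payload and interprets the bytes in the given charset.
    // new String(...) substitutes U+FFFD for bytes the charset cannot decode.
    static String decode(String base64Payload, String charsetName) {
        byte[] raw = Base64.getDecoder().decode(base64Payload);
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Payload of the first encoded word from the sample above.
        String payload = "ztKyu8rH1tCH+LmyrmEudHh0";
        String asGb2312 = decode(payload, "GB2312");
        String asGb18030 = decode(payload, "GB18030");
        System.out.println("GB2312 has U+FFFD:  " + (asGb2312.indexOf('\uFFFD') >= 0));
        System.out.println("GB18030 has U+FFFD: " + (asGb18030.indexOf('\uFFFD') >= 0));
    }
}
```

The payload contains the byte pairs 0x87 0xF8 and 0xAE 0x61, which lie outside the GB2312 byte range but are valid in GBK and GB18030, so only the GB2312 decode should show replacement characters.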
Below is the generalization I came up with for replacing the character sets
sent by mail clients so that the characters decode correctly:
1. For any of the following Chinese charsets:
iso-ir-58,chinese,gbk,cn-gb,csgb2312,csiso58gb231280,euc-cn,euc_cn,euccn,gb2312,gb_2312-80,x-EUC-CN,gb2312-1980,gb2312-80
replace it with: GB18030
2. For any of the following Korean charsets:
5601,ksc5601-1987,ksc5601_1987,euckr,ksc5601,ksc_5601,euc_kr,csEUCKR,ks_c_5601-1987
replace it with: EUC-KR
3. For any of the following Thai charsets:
ms-874,ms874,windows-874,cp874,874,cs874,ibm874
replace it with: TIS-620
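The three rules above can be sketched as a simple case-insensitive lookup table. This is only an illustration of the mapping: the class name CharsetFallback and the method resolveName are hypothetical and are not part of DecoderUtil or any existing library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CharsetFallback {
    // Lower-cased declared charset name -> charset name to decode with instead.
    private static final Map<String, String> FALLBACK = new HashMap<>();

    private static void register(String target, String... aliases) {
        for (String alias : aliases) {
            FALLBACK.put(alias.toLowerCase(Locale.ROOT), target);
        }
    }

    static {
        register("GB18030", "iso-ir-58", "chinese", "gbk", "cn-gb", "csgb2312",
                "csiso58gb231280", "euc-cn", "euc_cn", "euccn", "gb2312",
                "gb_2312-80", "x-EUC-CN", "gb2312-1980", "gb2312-80");
        register("EUC-KR", "5601", "ksc5601-1987", "ksc5601_1987", "euckr",
                "ksc5601", "ksc_5601", "euc_kr", "csEUCKR", "ks_c_5601-1987");
        register("TIS-620", "ms-874", "ms874", "windows-874", "cp874", "874",
                "cs874", "ibm874");
    }

    // Returns the replacement charset name, or the declared name unchanged
    // (trimmed) when no rule applies.
    public static String resolveName(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        return FALLBACK.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }
}
```

For example, CharsetFallback.resolveName("GB2312") gives "GB18030", while a name with no rule, such as "utf-8", is passed through as is.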
I suggest that DecoderUtil.decodeEncodedWords() itself should provide this
charset fallback.
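Until something like that exists in the decoder, one possible workaround is to rewrite the charset label of each encoded word before handing the string to DecoderUtil.decodeEncodedWords(). The sketch below does this with a regular expression. The class EncodedWordCharsetRewriter and its rewrite method are hypothetical names, the mapping is supplied by the caller, and encoded words that carry a language suffix (charset*lang) are not handled.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodedWordCharsetRewriter {
    // Matches the "=?charset?B?" or "=?charset?Q?" prefix of an encoded word.
    // '=' is excluded from the charset so Base64 padding before "?=" cannot match.
    private static final Pattern PREFIX =
            Pattern.compile("=\\?([^?=\\s]+)\\?([BbQq])\\?");

    // Replaces the charset label of every encoded word using charsetMapper;
    // payloads and everything outside the prefixes are left untouched.
    public static String rewrite(String header, UnaryOperator<String> charsetMapper) {
        Matcher m = PREFIX.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String mapped = charsetMapper.apply(m.group(1));
            String prefix = "=?" + mapped + "?" + m.group(2) + "?";
            m.appendReplacement(out, Matcher.quoteReplacement(prefix));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Usage would look roughly like this: call rewrite(str, name -> name.equalsIgnoreCase("gb2312") ? "GB18030" : name) and pass the result to DecoderUtil.decodeEncodedWords(). This keeps the workaround outside the decoder, at the cost of scanning the header twice.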
For more information, see also http://wiki.whatwg.org/wiki/Web_Encodings.
Please reply with your comments.
Thanks
Ashish Sharma