1) If I copy/paste these CJK chars from Google Groups into two of my interactive interpreters (not the "dos/cmd console"), I have no problem.

>>> import unicodedata as ud
>>> ud.name('工')
'CJK UNIFIED IDEOGRAPH-5DE5'
>>> ud.name('具')
'CJK UNIFIED IDEOGRAPH-5177'
>>> hex(ord(('工')))
'0x5de5'
>>> hex(ord('具'))
'0x5177'
>>>

2) It seems the mbcs codec has some difficulties with these chars.

>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'

3) On the usage of mbcs in file I/O: that is a question for the core devs.

My conclusion: the bottleneck is on the mbcs side.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list
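
For anyone who wants to repeat the check in 2) outside an interactive session, here is a minimal sketch. The helper name can_encode is made up for illustration. Since the mbcs codec only exists on Windows, cp1252 is used as a stand-in, on the assumption that the ANSI code page of the machine is a Western European one; gbk is included only as an example of a legacy code page that does cover these two characters.

```python
def can_encode(text, codec):
    """Return True if every character of text can be encoded with codec.

    Hypothetical helper, not part of any library.
    """
    try:
        text.encode(codec)  # default 'strict' error handling raises on failure
    except UnicodeEncodeError:
        return False
    return True


# The two characters from the post: U+5DE5 and U+5177.
chars = '\u5de5\u5177'

# 'cp1252' is an assumed stand-in for 'mbcs' (Western European ANSI code page).
for codec in ('cp1252', 'utf-8', 'utf-32-be', 'gbk'):
    print(codec, can_encode(chars, codec))
```

On Windows, 'mbcs' can be passed in place of 'cp1252' to test against whatever ANSI code page is actually active.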