On Jan 13, 2020, at 19:32, Stephen J. Turnbull 
<turnbull.stephen...@u.tsukuba.ac.jp> wrote:
> 
>  There is still tons of data in legacy
> applications, both as text files and in various application data
> formats, that use legacy encodings (in Japanese, that means MBCS).
> Sadly, it's not as simple as running "iconv -f shift_jis -t utf-8" on
> all the .txt files in sight.  That WFM (well, I had to do a few .tex
> and .rst files too ;-), but most people are dependent on Word, Excel,
> and other application formats, and it's a PITA

But those are binary formats, not something you can just read in as text in 
Python even if you know the encoding. And surely nobody is extracting the text 
out of those file formats manually anyway? Unless you’re actually working on a 
tool like wv or antiword, you just use one of those tools (or the Python 
wrappers around them), or talk to Word or Word Viewer over COM, and either way 
you just get Unicode. (And even if you are working on a tool like wv, while 
PITA is an understatement, the hard part is navigating the insane formats to 
find and order the text chunks; once you can do that, dealing with UCS2 vs. 
ANSI vs. ALTANSI chunks and knowing where to find the code pages for the latter 
two in the structure is the comparatively easy part.)

I can think of one place where mbcs would be useful and you can’t just iconv. 
IIRC, with old-school RTF export you have to unescape and then convert and then 
reescape? But everything else I remember dealing with is either a binary format 
you need a library for, or plain text like TXT and CSV.
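
To make the unescape/convert/re-escape round trip concrete, here is a rough sketch. It assumes the legacy text shows up as runs of \'hh hex escapes, that the bytes are cp932 (Shift-JIS), and that writing them back as \uN? Unicode escapes with a '?' fallback is acceptable. The function names are made up for illustration, and a real RTF file needs a proper parser that tracks \ansicpg, font charsets and group nesting, which this ignores.

```python
import re

# One or more consecutive RTF hex escapes, e.g. \'93\'fa
_HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")
_HEX_BYTE = re.compile(r"\\'([0-9a-fA-F]{2})")


def unescape_hex_run(run: str) -> bytes:
    # Step 1 (unescape): collect the hex pairs of the run into raw bytes.
    return bytes(int(pair, 16) for pair in _HEX_BYTE.findall(run))


def escape_unicode(text: str) -> str:
    # Step 3 (re-escape): emit each UTF-16 code unit as a signed 16-bit
    # \uN control word followed by a one-character '?' fallback.
    out = []
    for ch in text:
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little")
            if unit > 32767:
                unit -= 65536
            out.append(f"\\u{unit}?")
    return "".join(out)


def convert_fragment(fragment: str, encoding: str = "cp932") -> str:
    # Step 2 (convert) sits in the middle: decode the bytes of each run
    # with the assumed legacy encoding. Everything else is left untouched.
    def _replace(match: re.Match) -> str:
        return escape_unicode(unescape_hex_run(match.group(0)).decode(encoding))

    return _HEX_RUN.sub(_replace, fragment)
```

Called as convert_fragment(r"{\b \'93\'fa\'96\'7b}"), this rewrites only the escaped run and leaves the group braces and the \b control word alone. Note that the decode has to happen on the whole run, not byte by byte, because a double-byte character spans two escapes.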

But even then, can you even rely on mbcs? I know it used to be a problem (early 
00s) in many Japanese shops that you had Shift-JIS docs on Windows boxes or in 
Notes servers or whatever where the default codepage wasn’t Shift-JIS.
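
A small sketch of why that matters in Python terms: the "mbcs" codec exists only on Windows and follows whatever the ANSI code page of that machine happens to be, whereas naming the code page gives the same answer everywhere. The helper names below are hypothetical, and cp932 as the default is an assumption about the documents, not something the file itself tells you.

```python
import codecs


def have_mbcs() -> bool:
    # "mbcs" is registered only on Windows builds of Python.
    try:
        codecs.lookup("mbcs")
    except LookupError:
        return False
    return True


def decode_legacy(data: bytes, encoding: str = "cp932") -> str:
    # Name the code page explicitly instead of trusting the machine default.
    return data.decode(encoding)


sample = "日本語".encode("cp932")

# Same result on any box, whatever its default code page is.
text = decode_legacy(sample)

# What a box with a Western default code page would make of the same bytes:
# no error is raised, you just silently get mojibake.
wrong = sample.decode("cp1252")
```

The nasty part is the last line: the wrong code page often decodes without raising anything, so a mismatch between the document and the machine default goes unnoticed until someone reads the output.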
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6LX3ZQX4HCYWJ7OVDAZXKX5ZVRQEEBVK/