On Jan 13, 2020, at 19:32, Stephen J. Turnbull <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>
> There is still tons of data in legacy
> applications, both as text files and in various application data
> formats, that use legacy encodings (in Japanese, that means MBCS).
> Sadly, it's not as simple as running "iconv -f shift_jis -t utf-8" on
> all the .txt files in sight. That WFM (well, I had to do a few .tex
> and .rst files too ;-), but most people are dependent on Word, Excel,
> and other application formats, and it's a PITA

But those are binary formats, not something you can just read in as text in Python even if you know the encoding. And surely nobody is extracting the text out of those file formats manually anyway? Unless you’re actually working on a tool like wv or antiword, you just use one of those tools (or the Python wrappers around them), or talk to Word or Word Viewer over COM, and either way you just get Unicode. (And even if you are working on a tool like wv, while PITA is an understatement, the hard part is navigating the insane formats to find and order the text chunks; once you can do that, dealing with UCS2 vs. ANSI vs. ALTANSI chunks and knowing where to find the code pages for the latter two in the structure is the comparatively easy part.)

I can think of one place where mbcs would be useful and you can’t just iconv. IIRC, with old-school RTF export you have to unescape, then convert, and then re-escape? But everything else I remember dealing with is either a binary format you need a library for, or plain text like TXT and CSV.

But even then, can you even rely on mbcs? I know it used to be a problem (early 00s) in many Japanese shops that you had Shift-JIS docs on Windows boxes or in Notes servers or whatever where the default code page wasn’t Shift-JIS.
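
To make that last point concrete, here is a rough sketch (mine, not anything from the thread) of the difference between naming the codec and leaning on mbcs. The helper names `decode_legacy` and `transcode` are made up for illustration, and I'm assuming the files are really cp932 (Microsoft's Shift-JIS variant), which you'd have to confirm for your own data.

```python
import locale
import sys

# Hypothetical sample data: Japanese text encoded as cp932 bytes.
text = "日本語のテキスト"
raw = text.encode("cp932")


def decode_legacy(data: bytes, encoding: str = "cp932") -> str:
    """Decode with an explicitly named codec, independent of machine settings."""
    return data.decode(encoding)


def transcode(data: bytes, src: str = "cp932", dst: str = "utf-8") -> bytes:
    """Roughly what `iconv -f shift_jis -t utf-8` does, applied to a bytes object."""
    return data.decode(src).encode(dst)


# An explicit codec gives the same answer on every platform.
assert decode_legacy(raw) == text
assert transcode(raw) == text.encode("utf-8")

# Guessing a different legacy code page yields mojibake (or replacement
# characters), which is the failure mode when the default isn't Shift-JIS.
assert raw.decode("cp1252", errors="replace") != text

if sys.platform == "win32":
    # "mbcs" exists only on Windows and means "the active ANSI code page",
    # so whether this round-trips depends on how the box is configured.
    print(locale.getpreferredencoding(False))
    print(raw.decode("mbcs", errors="replace") == text)
```

On a box whose ANSI code page is 932 the last line should print True; anywhere else it generally won't, which is why I wouldn't lean on mbcs for files that move between machines.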