Andrew Barnert writes:

> On Jan 13, 2020, at 19:32, Stephen J. Turnbull
> <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
> >
> > There is still tons of data in legacy
> > applications, both as text files and in various application data
> > formats, that use legacy encodings (in Japanese, that means MBCS).
> > Sadly, it's not as simple as running "iconv -f shift_jis -t utf-8" on
> > all the .txt files in sight. That WFM (well, I had to do a few .tex
> > and .rst files too ;-), but most people are dependent on Word, Excel,
> > and other application formats, and it's a PITA
>
> But those are binary formats, not something you can just read in as
> text in Python even if you know the encoding. And surely nobody is
> extracting the text out of those file formats manually anyway?
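
For reference, the iconv one-liner quoted above can be approximated in Python. This is only a minimal sketch for plain-text files: the helper name `transcode` and its parameters are hypothetical, and it assumes the input really is Shift JIS. Python's `shift_jis` codec is stricter than `cp932` (the Windows superset), so files written on Windows may need `from_enc="cp932"` instead.

```python
from pathlib import Path


def transcode(src, dst, from_enc="shift_jis", to_enc="utf-8"):
    """Rough equivalent of `iconv -f shift_jis -t utf-8 src > dst`.

    `transcode` is a hypothetical helper, not part of any library.
    """
    # Work on raw bytes so no newline translation happens, and decode
    # strictly so bytes that are invalid in from_enc raise
    # UnicodeDecodeError instead of being silently mangled.
    text = Path(src).read_bytes().decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc))
    # Return the number of characters converted.
    return len(text)
```

A strict decode is a deliberate choice here: a `UnicodeDecodeError` is a useful signal that the file is in one of the other "native" encodings rather than the one assumed.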
"My Name Is Nobody", aka The Exception to Prove the Rule. ;-)

> [Y]ou just use one of those tools (or the Python wrappers around
> them), or talk to Word or Word Viewer over COM, and either way you
> just get Unicode.

True, but I'm really mostly talking about those "other application
formats" such as Ichitaro, email clients whose names I will not dirty
the keyboard by typing, and more specialized stuff (stats packages,
archivers, etc.) whose names would not be familiar here.

> But even then, can you even rely on mbcs?

No (we are discussing a nation where even today you may run into 5
different "native" encodings in the same day, after all), but on
Windows what else were you going to use for the default, if not UTF-8?

> I know it used to be a problem (early 00s) in many Japanese shops
> that you had Shift-JIS docs on Windows boxes or in Notes servers or
> whatever where the default codepage wasn’t Shift-JIS.

Never saw one of those. If the box was running Windows, IME the
default code page was 932 until the late noughties, when UTF-8 started
to become common (although file systems were still mostly Shift JIS, to
the enjoyment of all who weren't using Python 3 with PEP 393 ;-). If
the box was running Unix, the encoding was usually packed EUC-JP, but I
never dealt with Notes (banzai! the gods smiled on me).

OTOH, I have seen filesystem paths on Sun boxen and in zipfiles with
multiple encodings in them. :-)

Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/OG45ZJEO3VK4WWGBBOAL6IRFQKARKNPM/
Code of Conduct: http://python.org/psf/codeofconduct/