Andrew Barnert writes:

 > On Jan 13, 2020, at 19:32, Stephen J. Turnbull wrote:
 > > 
 > >  There is still tons of data in legacy
 > > applications, both as text files and in various application data
 > > formats, that use legacy encodings (in Japanese, that means MBCS).
 > > Sadly, it's not as simple as running "iconv -f shift_jis -t utf-8" on
 > > all the .txt files in sight.  That WFM (well, I had to do a few .tex
 > > and .rst files too ;-), but most people are dependent on Word, Excel,
 > > and other application formats, and it's a PITA
 > But those are binary formats, not something you can just read in as
 > text in Python even if you know the encoding. And surely nobody is
 > extracting the text out of those file formats manually anyway?

"My Name Is Nobody", aka The Exception to Prove the Rule. ;-)

 > [Y]ou just use one of those tools (or the Python wrappers around
 > them), or talk to Word or Word Viewer over COM, and either way you
 > just get Unicode.

True, but I'm really mostly talking about those "other application
formats" such as Ichitaro, email clients I will not dirty the keyboard
by typing their names, and more specialized stuff (stats packages,
archivers, etc) whose names would not be familiar here.

 > But even then, can you even rely on mbcs?

No (we are discussing a nation where even today you may run into 5
different "native" encodings in the same day, after all), but on
Windows what else were you going to use for default, if not UTF-8?
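
For illustration only, here is a minimal sketch of the guessing game
that "5 different native encodings" implies: try candidate codecs in
order and keep the first that decodes without error.  The helper name
`decode_legacy` and the candidate order are assumptions made for this
example, not anything proposed in the thread, and a clean decode does
not prove the guess was right.

```python
# Hypothetical helper: guess among a few encodings common in Japan.
# Order matters: stricter codecs go first, cp932 (Shift JIS) last.
CANDIDATES = ("utf-8", "euc_jp", "cp932")


def decode_legacy(data: bytes, candidates=CANDIDATES):
    """Return (text, codec) for the first codec that decodes cleanly."""
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("no candidate codec decoded the data")


# Example: bytes written by a Shift JIS application.
text, codec = decode_legacy("日本語".encode("cp932"))
```

Short or unusual byte strings can decode "successfully" under the
wrong codec, so in practice a guess like this would need a sanity
check by a human or a stronger heuristic.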

 > I know it used to be a problem (early 00s) in many Japanese shops
 > that you had Shift-JIS docs on Windows boxes or in Notes servers or
 > whatever where the default codepage wasn’t Shift-JIS.

Never saw one of those.  If the box was running Windows, IME the
default code page was 932, until the late noughties, when UTF-8 started
to become common (although file systems were still mostly Shift JIS, to
the enjoyment of all who weren't using Python 3 with PEP 383 ;-).  If
the box was running Unix, the encoding was usually packed EUC-JP, but
I never dealt with Notes (banzai! the gods smiled on me).  OTOH, I
have seen filesystem paths on Sun boxen and in zipfiles with multiple
encodings in them. :-)
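
A minimal sketch of how Python 3 can cope with such a path, assuming
the `surrogateescape` error handler (PEP 383) is used: undecodable
bytes are kept as lone surrogates, so the original bytes survive a
round trip.  The path below is made up for illustration.

```python
# Hypothetical path: one component in EUC-JP, one in Shift JIS (cp932).
raw = b"/home/" + "日本".encode("euc_jp") + b"/" + "語".encode("cp932") + b".txt"

# No single codec fits the whole path.  surrogateescape maps each
# undecodable byte to a lone surrogate (U+DC80..U+DCFF) instead of failing.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Once a component's real encoding is known, its text can be recovered.
first = text.split("/")[2]
recovered = first.encode("utf-8", errors="surrogateescape").decode("euc_jp")
```

On POSIX systems `os.fsdecode()` and `os.fsencode()` apply the same
handler with the filesystem encoding, which is why such names can be
listed and reopened even though they cannot be displayed properly.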

Python-ideas mailing list