On Fri, Dec 26, 2025 at 05:29:14PM -0800, American Citizen wrote:
> Loren:
>
> All your comments are well spoken and I have come to the same viewpoints
> as yours.
>
> Lots of websites on the internet seem to be too brief and succinct, yet
> pretend to be informed when it comes to file encoding conversions.
>
> I am not sure that all the bugs are shaken out of TexStudio yet. I did
> have one "mysterious" crash, and their error reporter wanted permission
> from me to email them about this crash.
>
> One more comment: using the "file" command to determine the encoding
> does NOT always work, at least for long files, 100K or more in size. I
> did find that "uchardet" does work.
Yes, file only looks at a sample of any one file and uses heuristics which
can sometimes be wrong. One of the tricks I use just to see if a file is
pure ASCII or UTF-8 is to run it through iconv and see if it finds any
errors. Using this:

  $ iconv -f ascii -o /dev/null cool-characters.txt
  iconv: illegal input sequence at position 0
  $ iconv -f utf-8 -o /dev/null cool-characters.txt

I can see that the file cool-characters.txt contains non-ASCII characters,
because it threw an error trying to convert from ASCII, but it is 100%
valid UTF-8. Most non-ASCII encodings will appear as malformed if decoded
as UTF-8.

Do note this test doesn't work for all encodings. For example, the classic
8-bit encoding used in the US on UNIX systems, ISO-8859-1, also known as
latin1, is a single-byte encoding with a defined meaning for all 256
values. This means that even an arbitrary binary file can be converted
from latin1, even though the result will be meaningless. And, really, if
you have a file using a lot of control characters in the 0x00-0x1f and
0x80-0x9f ranges, it's probably not latin1 text, even though iconv will
happily convert it.

> Thanks for the posts. I appreciate them.
>
> Randall
>
> On 12/26/25 13:57, Loren M. Lang wrote:
> > On Wed, Dec 24, 2025 at 07:40:12PM -0800, American Citizen wrote:
> > > Hi:
> > >
> > > I have a set of tex files which are in pure ascii format.
> > > Unfortunately, when I copy material from the internet (Mozilla
> > > Firefox browser) it is in UTF-8 format, not ascii. This appears to
> > > be standard behavior for internet browsers.
> > >
> > > When I paste the material into the tex document (using TexStudio)
> > > the paste goes okay. It only blows up when I try to save the newer
> > > file. The UTF-8 characters cannot be saved in ascii format, and for
> > > some bizarre reason TexStudio won't change the encoding to UTF-8
> > > even though I have the option set that the editor is working with
> > > the UTF-8 character set.
> > >
> > > iconv won't work either. I run "iconv -f ASCII -t UTF-8 input_file
> > > -o output_file" and the file remains ascii.
> > This is because US-ASCII is a strict subset of UTF-8; that was
> > deliberate in the design of UTF-8. A file containing strictly ASCII
> > characters is also 100% valid UTF-8. To make a file that shows as
> > UTF-8 and not ASCII requires adding some character to it that is
> > beyond simple ASCII. Some software does use a special Unicode
> > character, called the BOM or Byte Order Mark, as the first UTF-8
> > character in a file to tell the reader that this file is indeed UTF-8
> > and not some other encoding. This is Unicode character U+FEFF, which
> > is encoded into UTF-8 as the 3-byte sequence 0xEF 0xBB 0xBF. Other
> > software uses other clues to identify it. For example, XML files
> > often start with this:
> >
> >   <?xml version="1.0" encoding="utf-8"?>
> >
> > If the XML file only contains characters available in US-ASCII, then
> > the file will still be 100% ASCII bytes. Only where there are
> > characters beyond ASCII will you notice any UTF-8 encoding. As a side
> > note, you can identify the difference by looking at the most
> > significant bit of each byte. ASCII characters only use 7 bits, and
> > the most significant bit is always clear. When you have characters
> > beyond ASCII in UTF-8 encoding, those characters will be in a
> > multi-byte sequence where all bytes in the sequence have the most
> > significant bit set.
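> >
> > If it helps to see that test in code, a rough Python sketch of the
> > same idea might look like this (just an illustration I typed up, so
> > treat it as a sketch rather than a polished tool; it reads the whole
> > file into memory):
> >
> >   #!/usr/bin/env python
> >   # Sketch only: classify a file as pure ASCII, valid UTF-8, or
> >   # neither.  As with iconv, no latin1 test is attempted, since
> >   # almost any sequence of bytes decodes "successfully" as latin1.
> >   import sys
> >
> >   def classify(path):
> >       with open(path, 'rb') as f:
> >           data = f.read()
> >       # Pure ASCII: every byte has the most significant bit clear.
> >       if all(byte < 0x80 for byte in data):
> >           return 'ASCII'
> >       try:
> >           # Strict decode; this is what iconv -f utf-8 checks.
> >           data.decode('utf-8')
> >           return 'UTF-8 (contains non-ASCII characters)'
> >       except UnicodeDecodeError:
> >           return 'neither ASCII nor UTF-8'
> >
> >   if __name__ == '__main__':
> >       print(classify(sys.argv[1]))
> >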
> > Perl and Python use something similar. In Perl, if I have the line:
> >
> >   use utf8;
> >
> > at the top, it will tell the interpreter that this file is saved in
> > UTF-8. Since UTF-8 is compatible with ASCII, it doesn't interfere
> > with the hashbang line at the top. In Python, I would use this at the
> > top of my file:
> >
> >   #!/usr/bin/env python
> >   # vim: set fileencoding=utf-8 :
> >
> > Incidentally, this also tells Vim to open the file in the UTF-8
> > encoding while editing. Now I can include any Unicode characters I
> > want in strings.
> >
> > Also, while this command won't ever do anything:
> >
> >   iconv -f ASCII -t UTF-8 input_file -o output_file
> >
> > going in the other direction with this command:
> >
> >   iconv -f UTF-8 -t ASCII input_file -o output_file
> >
> > will only ever either throw an error, if the UTF-8 file contains any
> > characters not representable in ASCII, or pass the file through
> > unchanged, since all ASCII characters use the same byte
> > representation in UTF-8. It's useful as a check of whether a file is
> > within the ASCII subset or not, but not much more.
> >
> > > Does anyone have an idea of how I can get TexStudio to wake up and
> > > change the file encoding on the current ascii file to UTF-8?
> > I am not familiar with TexStudio, but it should just come down to
> > making sure you tell the editor the correct encoding to save the file
> > as. From a quick Google, there seems to be a settings screen for it.
> > Also, you can try adding this to your TeX file:
> >
> >   % !TEX TS-program = lualatex
> >   % !TEX encoding = UTF-8 Unicode
> >   % !TEX spellcheck = en_US
> >
> > You also have to make sure you use a TeX engine that supports UTF-8.
> > Any engine based on ε-TeX should, which includes LuaTeX.
> >
> > > I cannot get iconv to change the ascii file to UTF-8, so I am stuck
> > > between the devil and the deep blue sea.
> > A file will be ASCII until the first non-ASCII character is added to
> > it. The key is really just to make sure that your editor saves it as
> > UTF-8 and not something else. Vim, for example, has traditionally
> > defaulted to latin1 and will throw errors when adding characters
> > beyond latin1 if you don't have :set encoding=utf-8 set up in your
> > Vim environment.
> >
> > > Randall
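
Since everything above hinges on ASCII being a subset of UTF-8, here is
one last quick Python demonstration of that point (my own throwaway
example; the string "café" is just a stand-in for any non-ASCII text):

  #!/usr/bin/env python
  # vim: set fileencoding=utf-8 :
  # Demonstration only: ASCII text is byte-for-byte identical in
  # UTF-8, while encoding non-ASCII text as ASCII fails, the same
  # way iconv -f UTF-8 -t ASCII does.

  plain = 'plain ASCII text'
  assert plain.encode('ascii') == plain.encode('utf-8')

  fancy = 'café'                  # contains U+00E9
  print(fancy.encode('utf-8'))    # b'caf\xc3\xa9': e-acute becomes a
                                  # 2-byte sequence, both bytes with
                                  # the most significant bit set
  try:
      fancy.encode('ascii')
  except UnicodeEncodeError as err:
      print('not representable in ASCII:', err)
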
-- 
Loren M. Lang
[email protected]
http://www.north-winds.org/
IRC: penguin359
Public Key: http://www.north-winds.org/lorenl_pubkey.asc
Fingerprint: 7896 E099 9FC7 9F6C E0ED E103 222D F356 A57A 98FA
