Loren:

All your comments are well put, and I have come to the same conclusions as you.

Many websites are far too brief on the subject of file encoding conversions, yet present themselves as authoritative.

I am not sure all the bugs have been shaken out of TexStudio yet. I did have one "mysterious" crash, and its error reporter asked my permission to email the developers about it.

One more comment: using the "file" command to determine the encoding does NOT always work, at least for files 100K or more in size. I did find that "uchardet" does work.
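As a quick cross-check alongside uchardet, a short script can report the offset of the first non-ASCII byte, which is what makes a file stop being plain ASCII in the first place. This is a minimal sketch, not a replacement for a real encoding detector:

```python
def first_non_ascii(data: bytes):
    """Return the offset of the first byte >= 0x80, or None if pure ASCII."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return i
    return None

print(first_non_ascii(b"plain ascii text"))          # None: file is pure ASCII
print(first_non_ascii("caf\u00e9".encode("utf-8")))  # 3: first byte of the e-acute sequence
```

Knowing the offset of the offending byte also makes it easy to jump straight to the pasted character in an editor.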

Thanks for the posts. I appreciate them.

Randall


On 12/26/25 13:57, Loren M. Lang wrote:
On Wed, Dec 24, 2025 at 07:40:12PM -0800, American Citizen wrote:
Hi:

I have a set of tex files which are in pure ascii format. Unfortunately when
I copy material from the internet (Mozilla Firefox browser) it is in UTF-8
format, not ascii. This appears to be standard behavior for internet
browsers.

When I paste the material into the tex document (using TexStudio) the paste
goes okay. It only blows up when I try to save the newer file. The UTF-8
characters cannot be saved in ascii format, and for some bizarre reason
TexStudio won't change the encoding to UTF-8 even though I have set the
option that the editor works with the UTF-8 character set.

iconv won't work either. I ran "iconv -f ASCII -t UTF-8 input_file -o
output_file" and the file remains ascii.
This is because US-ASCII is a strict subset of UTF-8; that was a
deliberate part of UTF-8's design. Every ASCII file is already 100%
valid UTF-8. For a file to be detected as UTF-8 rather than ASCII, it
must contain at least one character beyond plain ASCII. Some software
does use a special
Unicode character called the BOM or Byte Order Mark as the first UTF-8
character in a file to tell the reader that this file is indeed UTF-8
and not some other encoding. This is Unicode character U+FEFF which is
encoded into UTF-8 as the 3-byte sequence 0xEF 0xBB 0xBF. Other
software uses other clues to identify it. For example, XML files often
start with this:

<?xml version="1.0" encoding="utf-8"?>

If the XML file only contains characters available in US-ASCII, then the
file will still be 100% ASCII bytes. Only where there are characters
beyond ASCII will you notice any UTF-8 encoding. As a side note, you can
identify the difference by looking at the most significant bit of each
byte. ASCII characters only use 7 bits, so the most significant bit is
always clear. When you have characters beyond ASCII in UTF-8 encoding,
then those characters will be in a multi-byte sequence where all bytes
in the sequence will have the most significant bit set.
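Both points above (the BOM's byte sequence and the most-significant-bit rule) can be checked directly; this is just an illustrative sketch of the byte patterns described, nothing TeX-specific:

```python
# The Unicode BOM, U+FEFF, encodes to the 3-byte UTF-8 sequence EF BB BF.
bom = "\ufeff".encode("utf-8")
print(bom.hex(" "))  # ef bb bf

# ASCII bytes always have the most significant bit clear...
ascii_bytes = "hello".encode("utf-8")
print(all(b & 0x80 == 0 for b in ascii_bytes))  # True

# ...while every byte of a multi-byte UTF-8 sequence has it set.
eacute = "\u00e9".encode("utf-8")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
print(eacute.hex(" "))             # c3 a9
print(all(b & 0x80 != 0 for b in eacute))  # True
```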

Perl and Python use something similar. In Perl, if I have the line:

use utf8;

at the top, it tells the interpreter that this file is saved in
UTF-8. Since UTF-8 is compatible with ASCII, it doesn't interfere
with the shebang line at the top. In Python, I would use this
at the top of my file:

#!/usr/bin/env python
# vim: set fileencoding=utf-8 :

Incidentally, this also tells Vim to open the file in the UTF-8
encoding while editing. Now I can include any Unicode characters I
want in strings.

Also, while this command won't ever do anything:

iconv -f ASCII -t UTF-8 input_file -o output_file

Going in the other direction with this command:

iconv -f UTF-8 -t ASCII input_file -o output_file

will either throw an error, if the UTF-8 file contains any characters
not representable in ASCII, or pass the file through unchanged, since
all ASCII characters use the same byte representation in UTF-8. It's
useful as a check of whether a file stays within the ASCII subset, but
not much more.
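A rough Python equivalent of that iconv check (an illustrative sketch, not part of the commands above) is to attempt a strict ASCII decode and see whether it fails:

```python
def is_pure_ascii(data: bytes) -> bool:
    """Mimic `iconv -f UTF-8 -t ASCII` used as a yes/no check:
    succeed only if every byte is a plain ASCII character."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

print(is_pure_ascii(b"Hello, TeX!"))                 # True
print(is_pure_ascii("na\u00efve".encode("utf-8")))   # False: contains U+00EF
```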

Does anyone have an idea of how I can get TexStudio to wake up and change
the file encoding on the current ascii file to UTF-8?
I am not familiar with TexStudio, but it should just come down to
making sure you tell the editor the correct encoding to save the file
as. From a quick Google search, there seems to be a settings screen for
it. Also, you can try adding this to your TeX file:

% !TEX TS-program = lualatex
% !TEX encoding = UTF-8 Unicode
% !TEX spellcheck = en_US

You also have to make sure you use a TeX engine that supports UTF-8.
Any engine based on ε-TeX should, and that includes LuaTeX.

I cannot get iconv to change the ascii file to UTF-8, so I am stuck between
the devil and the deep blue sea.
A file will be ASCII until the first non-ASCII character is added to it.
The key is more just to make sure that your editor saves it as UTF-8 and
not something else. Vim, for example, has traditionally defaulted to
latin1 and will throw errors when you add characters beyond latin1 if
you don't have :set encoding=utf-8 set up in your Vim environment.
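For reference, a minimal ~/.vimrc fragment along those lines might look like this (these are stock Vim option names; adjust the fallback list to taste):

```vim
" Use UTF-8 internally and write files as UTF-8 by default.
set encoding=utf-8
set fileencoding=utf-8
" Encodings Vim tries, in order, when reading an existing file.
set fileencodings=ucs-bom,utf-8,latin1
```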

Randall

