On Wed, Dec 24, 2025 at 07:40:12PM -0800, American Citizen wrote:
> Hi:
> 
> I have a set of tex files which are in pure ascii format. Unfortunately when
> I copy material from the internet (Mozilla Firefox browser) it is in UTF-8
> format, not ascii. This appears to be standard behavior for the internet
> browsers.
> 
> When I paste the material into the tex document (using TexStudio) the paste
> goes okay. It only blows up when I try to save the newer file. The UTF-8
> characters cannot be saved in ascii format and for some bizarre reason Tex
> Studio wont' change the encoding to UTF-8 even though I have the option set
> that the editor is working with UTF-8 character set.
> 
> iconv won't work either, I do the "iconv -f ASCII -t UTF-8 input_file -o
> output_file and the file remains ascii.

This is because US-ASCII is a strict subset of UTF-8, which was a
deliberate design goal of UTF-8. Every ASCII file is also 100% valid
UTF-8. To make a file that shows as UTF-8 and not ASCII, you have to add
some character to it that is beyond plain ASCII. Some software uses a
special Unicode character called the BOM, or Byte Order Mark, as the
first character in a file to tell the reader that the file is indeed
UTF-8 and not some other encoding. This is Unicode character U+FEFF,
which is encoded in UTF-8 as the 3-byte sequence 0xEF 0xBB 0xBF. Other
software uses other clues to identify it. For example, XML files often
start with this:

<?xml version="1.0" encoding="utf-8"?>

If the XML file only contains characters available in US-ASCII, then the
file will still be 100% ASCII bytes. Only where there are characters
beyond ASCII will you notice any UTF-8 encoding. As a side note, you can
tell the difference by looking at the most significant bit of each
byte. ASCII characters only use 7 bits, so the most significant bit is
always clear. Characters beyond ASCII are encoded in UTF-8 as a
multi-byte sequence in which every byte has the most significant bit
set.
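
To see this at the byte level, here is a small Python 3 sketch. It is
only an illustration of the points above; the helper name high_bit_set
and the sample strings are my own, not anything from a library.

```python
import codecs

def high_bit_set(data: bytes) -> list:
    """For each byte, report whether its most significant bit is set."""
    return [bool(b & 0x80) for b in data]

ascii_text = "plain TeX"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e (U+00E9)

# Pure ASCII text produces identical bytes in both encodings.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# U+00E9 becomes the two-byte sequence 0xC3 0xA9 in UTF-8.
encoded = accented.encode("utf-8")
print(encoded)                # b'caf\xc3\xa9'
print(high_bit_set(encoded))  # [False, False, False, True, True]

# The BOM, U+FEFF, is the three bytes EF BB BF in UTF-8.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True
```

The three ASCII letters keep their single-byte form with the high bit
clear, and only the accented letter turns into bytes with the high bit
set.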

Perl and Python use something similar. In Perl, if I have the line:

use utf8;

at the top, it tells the interpreter that the source file is saved in
UTF-8. Since UTF-8 is compatible with ASCII, it doesn't interfere with
the hashbang line at the top. In Python, I would use this at the top of
my file:

#!/usr/bin/env python
# vim: set fileencoding=utf-8 :

Incidentally, this also tells Vim to open the file in the UTF-8
encoding while editing. Now I can include any Unicode characters I
want in strings.

Also, while this command will never change the file's contents:

iconv -f ASCII -t UTF-8 input_file -o output_file

Going in the other direction with this command:

iconv -f UTF-8 -t ASCII input_file -o output_file

will either throw an error, if the UTF-8 file contains any characters
not representable in ASCII, or pass the file through unchanged, since
all ASCII characters use the same byte representation in UTF-8. It's
useful as a check of whether a file is within the ASCII subset, but not
much more.
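
If you would rather do that check without iconv, the same idea can be
sketched in Python 3. This is just a rough equivalent, not what iconv
does internally; the function name is_pure_ascii and the throwaway file
names are made up for the example.

```python
import os
import tempfile

def is_pure_ascii(path: str) -> bool:
    """Return True if every byte in the file is in the ASCII range."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("ascii")  # fails on any byte with the high bit set
    except UnicodeDecodeError:
        return False
    return True

# Demo with two throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.tex")
    mixed = os.path.join(tmp, "mixed.tex")
    with open(plain, "wb") as f:
        f.write(b"\\section{Intro}\n")
    with open(mixed, "wb") as f:
        f.write("na\u00efve\n".encode("utf-8"))  # contains U+00EF
    print(is_pure_ascii(plain))  # True
    print(is_pure_ascii(mixed))  # False
```

A True result means the file is already valid as both ASCII and UTF-8,
so there is nothing for a conversion to do.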

> 
> Does anyone have an idea of how I can get TexStudio to wake up and change
> the file encoding on the current ascii file to UTF-8?

I am not familiar with TeXstudio, but it should just come down to making
sure you tell the editor the correct encoding to save the file as. From
a quick Google search, there seems to be a settings screen for it. Also,
you can try adding this to your TeX file:

% !TEX TS-program = lualatex
% !TEX encoding = UTF-8 Unicode
% !TEX spellcheck = en_US

You also have to make sure you use a TeX engine that supports UTF-8. Any
engine based on ε-TeX should, which includes LuaTeX.

> 
> I cannot get iconv to change the ascii file to UTF-8, so I am stuck between
> the devil and the deep blue sea.

A file will be ASCII until the first non-ASCII character is added to it.
The key is really just to make sure that your editor saves it as UTF-8
and not something else. Vim, for example, has traditionally defaulted to
latin1 and will throw errors when adding characters beyond latin1 if
you don't have :set encoding=utf-8 set up in your Vim environment.
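
The same save-time failure is easy to reproduce outside of any editor.
Here is a Python 3 sketch, assuming a pasted string with a single
non-ASCII character; the file name notes.tex and the variable names are
invented for the example, and this is not how TeXstudio itself works.

```python
import os
import tempfile

pasted = "Schr\u00f6dinger"  # one non-ASCII character (o-umlaut, U+00F6)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.tex")  # hypothetical file name

    # Saving as ASCII fails, much like an editor refusing to save.
    try:
        with open(path, "w", encoding="ascii") as f:
            f.write(pasted)
        saved_as_ascii = True
    except UnicodeEncodeError:
        saved_as_ascii = False

    # Saving as UTF-8 works; only the umlaut takes more than one byte.
    with open(path, "w", encoding="utf-8") as f:
        f.write(pasted)
    with open(path, "rb") as f:
        raw = f.read()

print(saved_as_ascii)  # False
print(raw)             # b'Schr\xc3\xb6dinger'
```

Everything except the umlaut is still stored as plain ASCII bytes, which
is why the file only stops being ASCII once such a character is saved
into it.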

> 
> Randall
> 
> 

-- 
Loren M. Lang
[email protected]
http://www.north-winds.org/
IRC: penguin359


Public Key: http://www.north-winds.org/lorenl_pubkey.asc
Fingerprint: 7896 E099 9FC7 9F6C E0ED  E103 222D F356 A57A 98FA
