On Wed, Dec 24, 2025 at 07:40:12PM -0800, American Citizen wrote:
> Hi:
>
> I have a set of tex files which are in pure ascii format. Unfortunately when
> I copy material from the internet (Mozilla Firefox browser) it is in UTF-8
> format, not ascii. This appears to be standard behavior for the internet
> browsers.
>
> When I paste the material into the tex document (using TexStudio) the paste
> goes okay. It only blows up when I try to save the newer file. The UTF-8
> characters cannot be saved in ascii format and for some bizarre reason Tex
> Studio wont' change the encoding to UTF-8 even though I have the option set
> that the editor is working with UTF-8 character set.
>
> iconv won't work either, I do the "iconv -f ASCII -t UTF-8 input_file -o
> output_file and the file remains ascii.

This is because US-ASCII is a strict subset of UTF-8, which was a deliberate
part of UTF-8's design. Every file that contains only ASCII characters is
also 100% valid UTF-8. To make a file that shows up as UTF-8 and not ASCII,
you have to add some character to it that is beyond plain ASCII.

Some software uses a special Unicode character called the BOM, or Byte Order
Mark, as the first character in a file to tell the reader that the file is
indeed UTF-8 and not some other encoding. This is Unicode character U+FEFF,
which is encoded in UTF-8 as the 3-byte sequence 0xEF 0xBB 0xBF.

Other software uses other clues to identify it. For example, XML files often
start with this:

    <?xml version="1.0" encoding="utf-8"?>

If the XML file only contains characters available in US-ASCII, then the file
will still be 100% ASCII bytes. Only where there are characters beyond ASCII
will you notice any UTF-8 encoding.

As a side note, you can tell the difference by looking at the most
significant bit of each byte. ASCII characters only use 7 bits, so the most
significant bit is always clear. Characters beyond ASCII are encoded in UTF-8
as a multi-byte sequence in which every byte has the most significant bit
set.

Perl and Python use something similar. In Perl, if I have the line:

    use utf8;

at the top, it tells the interpreter that this source file is saved in UTF-8.
Since UTF-8 is compatible with ASCII, it doesn't interfere with the hashbang
line at the top. In Python, I would use this at the top of my file:

    #!/usr/bin/env python
    # vim: set fileencoding=utf-8 :

Incidentally, this also tells Vim to open the file in the UTF-8 encoding
while editing. Now I can include any Unicode characters I want in strings.

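The points above about the ASCII subset, the BOM bytes and the most
significant bit can be checked with a few lines of Python. This is only a
minimal sketch: the helper names is_pure_ascii and has_utf8_bom are made up
for this illustration and are not part of any tool mentioned in this thread.

```python
import codecs


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte has its most significant bit clear."""
    return all(byte < 0x80 for byte in data)


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


plain = "Hello, TeX"
accented = "caf\u00e9"  # ends with U+00E9, which is outside ASCII

# ASCII-only text produces identical bytes under both encodings.
print(plain.encode("ascii") == plain.encode("utf-8"))  # True

# One non-ASCII character is enough to make the file "not ASCII".
print(is_pure_ascii(plain.encode("utf-8")))     # True
print(is_pure_ascii(accented.encode("utf-8")))  # False

# Every byte of a multi-byte sequence has the high bit set.
print([hex(b) for b in "\u00e9".encode("utf-8")])  # ['0xc3', '0xa9']

# U+FEFF encodes to the 3-byte sequence 0xEF 0xBB 0xBF.
print("\ufeff".encode("utf-8") == b"\xef\xbb\xbf")  # True
print(has_utf8_bom(b"\xef\xbb\xbf% a TeX comment"))  # True
```

Reading a file in binary mode and passing its bytes to is_pure_ascii gives
the same yes/no answer as asking whether the file is still within the ASCII
subset.
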
Also, while this command won't ever do anything:

    iconv -f ASCII -t UTF-8 input_file -o output_file

going in the other direction with this command:

    iconv -f UTF-8 -t ASCII input_file -o output_file

will only ever do one of two things: throw an error if the UTF-8 file
contains any character not representable in ASCII, or pass the file through
unchanged, since all ASCII characters use the same byte representation in
UTF-8. It's useful as a check of whether a file is within the ASCII subset,
but not much more.

>
> Does anyone have an idea of how I can get TexStudio to wake up and change
> the file encoding on the current ascii file to UTF-8?

I am not familiar with TexStudio, but it should just come down to making
sure you tell the editor the correct encoding to save the file as. From a
quick Google, there seems to be a settings screen for it. Also, you can try
adding this to your TeX file:

    % !TEX TS-program = lualatex
    % !TEX encoding = UTF-8 Unicode
    % !TEX spellcheck = en_US

You also have to make sure you use a TeX engine that supports UTF-8. Any
engine based on ε-TeX should, and that includes LuaTeX.

>
> I cannot get iconv to change the ascii file to UTF-8, so I am stuck between
> the devil and the deep blue sea.

A file will be ASCII until the first non-ASCII character is added to it. The
key is just to make sure that your editor saves it as UTF-8 and not something
else. Vim, for example, has traditionally defaulted to latin1 and will throw
errors when adding characters beyond latin1 if you don't have

    :set encoding=utf-8

set up in your Vim environment.

>
> Randall
>
>

-- 
Loren M. Lang
[email protected]
http://www.north-winds.org/
IRC: penguin359

Public Key: http://www.north-winds.org/lorenl_pubkey.asc
Fingerprint: 7896 E099 9FC7 9F6C E0ED E103 222D F356 A57A 98FA
