There are a couple of postings about this topic, and I have certainly spent many hours fighting with special character issues and stray ? Åö ☐ characters. A couple of weeks back I prepared an explanation for a question I received about a superscript 3 (³) getting messed up in a complex processing flow through many programs, spreadsheets, databases, editors and browsers.
I think this information could be helpful for many others, and I may have more to understand, so feel free to correct or clarify. This is a bit long. I hope I'm not violating a forum rule.

Due to the mix of authoring, editing, transfer, storage and processing software, plus the legacy techniques, data markup, and older programs, the character issue becomes a bit complex. FrameMaker 9.0 and up can handle UTF-8 encoded character data. Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, each with its own value, or code point. That's far more than the basic 256 ANSI characters, which include most of the western European characters used for French, Spanish, and other languages. The printable ANSI characters use the values, a.k.a. code points, between 32 and 255, or in Unicode hexadecimal representation U+0020 to U+00FF (the values 0 to 31 are control codes). Many other characters have values above the basic 256 and can be troublesome, such as:

☢ ≤ ≥ ∂ ∆ € ℓ ∑ ☒ £ ₇ ⁸ √x₍̅₁̅₂̅₃₎̅ № ℥ ℃ ⅓ ⅘ ⅚ ⅞ ↺ ✔☑ ☐ ✈ ど カ ␍␊

Just in case e-mail messes up the characters above, here is an image of the characters.
[image: Inline image 1]

There are five parts to the "character" puzzle:
1) The value used to represent a character: ASCII, ANSI, Unicode. (I recommend using *Unicode*)
2) The encoding, or how the character's value is stored using 1, 2, 3, or 4 bytes, where a byte is 8 bits (ones and zeros). (I recommend *UTF-8*)
3) The font: the set of glyphs that define how the characters will appear. (I recommend a font with broad *Unicode* coverage)
4) The character set declarations in XML, XSL, CSS, HTML, and the software coding options for file open, read and write statements.
5) The capabilities and limitations of various programs (browsers, spreadsheets, editors, etc.) and data transfer methods.

The 1963 ASCII standard character set used a 7-bit encoding and hence was limited to 128 values, 2⁷.
Later the ANSI standard used all 8 bits, providing 2⁸ or 256 codepoints or characters (0 - 255). In order to display mathematical or other special symbols not defined by the ASCII or ANSI standards, custom fonts were developed, such as Symbol, Wingdings and Wingdings 2, that displayed the limited 256 values differently. So π, pi, could be displayed in a browser using <font face="symbol">p</font>, or in MS Word by changing the font for the character with the value of 112, i.e. the "p", to the Symbol font. So the value of 112 is not always a "p" unless a font is being used that assigns the "p" glyph to the decimal value of 112 (hex 0x70). Using the Symbol font, 112 is a pi symbol π, and in the Wingdings 3 font it is a solid triangle ▲.

Unicode assigns a unique value to every character. Its code space runs from U+0000 to U+10FFFF, which takes 21 bits and provides 1,114,112 codepoints; 1,112,064 of them can be assigned to characters, and the other 2,048 are reserved as UTF-16 surrogates. Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, all with unique codepoint values. In Unicode pi is U+03C0 (960 decimal) and the triangle is U+25B2 (9,650 decimal). It is customary to represent the Unicode values as hexadecimal preceded by "U+".

Since codepoints 03C0 (960) and 25B2 (9,650) cannot be stored using 1 byte, i.e. 8 bits, multiple bytes are required. This is where "encoding" comes in. ANSI (Windows-1252), ISO-8859-1, UTF-8, UTF-16, and UTF-32 are encodings, i.e. the way that the values are stored using 8, 16, 24, or 32 bits (ones and zeros). ASCII uses 7 bits and is limited to 128 characters. ANSI and ISO-8859-1 are 1-byte encodings that use all 8 bits, with a limit of 256 codepoint values. UTF-8 uses 1 byte for the first 128 characters and then uses 2, 3, or 4 bytes as required. UTF-16 uses 2 bytes for the first 65,536 codepoints and 4 bytes for everything above that, and UTF-32 always uses 4 bytes, even for the basic 256 characters. See the UTF-8 reference below for details on the binary encoding. UTF stands for Unicode Transformation Format. The UTF encodings all assume the Unicode codepoint values are being used.

Then there is the font.
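Before moving on to the font, the byte counts described above can be checked with a short Python sketch. This is only an illustration: the helper name byte_lengths is made up for the example, and it relies on nothing more than Python's built-in str.encode with the standard codec names.

```python
def byte_lengths(ch):
    """Return the codepoint of one character and its encoded size in bytes."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" codecs leave out the byte order mark, so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# "p" (112), pi, the solid triangle, and a musical G clef (U+1D11E),
# which lies above the first 65,536 codepoints.
for ch in ("p", "π", "▲", "\U0001D11E"):
    print(ch, byte_lengths(ch))
```

For π this reports 2 bytes in UTF-8 and UTF-16 and 4 bytes in UTF-32. The triangle needs 3 bytes in UTF-8, and the G clef needs 4 bytes in every one of the UTF encodings.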
If the font does not have a glyph defined for the character codepoint (value), it will typically display as a box or a question mark. *Arial Unicode MS* is a Windows font that supports glyphs for a large share of the first 65,536 Unicode codepoints. *Verdana* defines 780 Unicode codepoints, while *Century Schoolbook* defines 650 Unicode codepoints and 20 "Private Use" glyphs. The basic 256 characters are the same across the three fonts: not in appearance, but an "a" is an "a". The rest of the characters in each font use the same Unicode values, but each font contains a different collection of values and characters. One may contain the infinity symbol while another may not.

When a single character displays like Åö, it is due to mismatched encoding, not the font. When a character's codepoint is larger than 127, U+007F, and it is saved to a file as UTF-8, it uses 2, 3, or 4 bytes. If a program such as a browser, editor or spreadsheet is expecting, reading, and displaying the data as single-byte characters such as ANSI, then the 2 bytes are displayed as if they were 2 characters. In other words, the 16 or 24 binary ones and zeros are decoded incorrectly as two or three characters, and the wrong font glyphs are displayed, resulting in the Åö looking stuff.

I am trying to move to using Unicode codepoints, UTF-8 encoding, and compatible fonts for everything I do, but it's not easy. When the codepoint for a character creates a problem in one of the storage, transfer, or processing programs, my solution was to use a named entity such as &le; for the less-than-or-equal-to symbol ≤, and then transform the &le; to a character, a character wrapped in a Symbol font, a numeric entity for HTML, or use a FrameMaker Read/Write rule. In the past I also used named entities for many of the western European characters, such as &eacute; for é. Now that I understand more, I no longer need to do that as much.
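Going back to the Åö problem for a moment, the mismatched-encoding effect is easy to reproduce. The following Python sketch is only an illustration, and it assumes that "ANSI" means Windows code page 1252 (Python's cp1252 codec); the function name misread_as_ansi is made up for the example.

```python
def misread_as_ansi(text):
    """Encode text as UTF-8, then decode the bytes as if they were cp1252."""
    raw = text.encode("utf-8")
    # errors="replace" stands in for programs that show a substitute
    # character when a byte has no meaning in cp1252.
    return raw.decode("cp1252", errors="replace")

for ch in ("³", "é", "≤"):
    # Show the character, its UTF-8 bytes in hex, and the misread result.
    print(ch, ch.encode("utf-8").hex(), "->", misread_as_ansi(ch))
```

The superscript 3 comes out as Â³, the é as Ã©, and the ≤ as â‰¤: one garbled character per UTF-8 byte, which matches the pattern of two or three stray characters where one was expected.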
I still use named entities like &le; in some of my XML data, but convert them to Unicode values for FrameMaker and HTML product files, and use either the Arial Unicode MS or MS Gothic fonts for those characters. I could use the numeric entities in my XML, but it makes authoring difficult, since looking at &#x2264; doesn't tell me what the character is supposed to be as well as &le; does. In addition, some processes will convert the numeric entities to the actual character, and then subsequent programs might choke and convert the character to a question mark. So the conversion to numeric entities is always near the last step in my processing.

My goal is to someday use every character directly, and have it transfer correctly between text editors, document and publication editors, spreadsheets, databases, browsers and book readers. Some software and operating systems have to catch up to the Unicode and UTF-8 standards.

I also just learned that in MS Word I can enter a Unicode value like 2264 or 2A81, then press Alt-x, and it converts the value to the Unicode character, using the MS Gothic font when required. I also use http://graphemica.com to look up Unicode values by searching for the character, value or name, such as ∞, 221E, or infinity.

Codepoints (a.k.a. character values), encoding and fonts cover the first three parts of the puzzle. Parts 4 and 5 of the "character" puzzle have to do with file declarations, programming statements, and understanding software limitations.

An XML UTF-8 file must include the declaration
<?xml version="1.0" encoding="UTF-8"?>
An HTML5 UTF-8 file must include
<meta charset="UTF-8">
An HTML4 or XHTML UTF-8 file must include
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">

And the files must be *saved as UTF-8* if that is what is intended. To open and edit HTML files saved as UTF-8 using some text editors, the file must also include the XML declaration described above.
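As a rough illustration of how the declaration and the save step work together, here is a small Python sketch. It assumes Python's standard xml.etree.ElementTree parser as a stand-in for the consuming program; the function name write_and_read_xml and the <note> element are made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def write_and_read_xml(text):
    """Save text in a UTF-8 XML file with a declaration, then parse it back."""
    doc = f'<?xml version="1.0" encoding="UTF-8"?>\n<note>{text}</note>\n'
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        # The encoding argument is what actually saves the file as UTF-8;
        # the declaration inside the file only states what was intended.
        with open(path, "w", encoding="utf-8") as f:
            f.write(doc)
        # The parser reads the raw bytes and follows the declared encoding.
        return ET.parse(path).getroot().text
    finally:
        os.remove(path)

print(write_and_read_xml("x ≤ 3 m³"))
```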
These XML and HTML declarations tell browsers, editors, and other software what the encoding is; without them the program may assume the file is encoded as ANSI, using one byte for each character. Many programs will detect the UTF-8 encoding when opening a file, but some may need an option selected. When saving a file in other than the native format, such as saving a new text file in TextPad, saving an Excel spreadsheet as a Tab Delimited file, or copying and pasting from one program to another, special options and settings may have to be specified. Programs written in JavaScript, Perl, C#, Visual Basic or other languages must open files for reading or writing, and then read and write the data, using the appropriate options for ANSI, UTF-8 or another encoding as required.

I use Structured FrameMaker to open and produce PDF files for publications that are maintained as single source XML files. So I can't really make specific FrameMaker .FM encoding recommendations, but I think FM, as of version 9, saves files as UTF-8 and can support the Unicode character values.

If single source data is being used for multiple processing streams, then the source data must be such that it can be transformed to support the limitations of the software, programs, and processes that consume and display the data. I think Unicode and UTF-8 encoding are the best standards to use at this time.

If the hexadecimal numbers like 2A3B are messing with your brain, I can provide some insight on the decimal, binary, hexadecimal, byte, and bit jargon as well, but I'd probably take it off line since it's not Frame specific.

Ed Nodland

*Additional References*
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
https://en.wikipedia.org/wiki/Character_encoding
