Sent: Tuesday, July 9, 2013 8:37 PM
To: Unicode Discussion
Subject: Re: Ways to show Unicode contents on Windows?
On Wed, Jul 10, 2013 at 04:24:36AM +0000, Murray Sargent wrote:
Ilya asked, Are there any other ways to show Unicode on Windows?
You can download Unibook (http://www.unicode.org/unibook
Albrecht notes that
The complete RTF clipboard content is this, created by Adobe Acrobat 9 Pro,
Version 9.5.1:
0000: 7B 5C 72 74 66 31 5C 61 6E 73 69 5C 61 6E 73 69 {\rtf1\ansi\ansi
0010: 63 70 67 31 32 35 32 5C 75 63 31 20 7B 5C 66 6F cpg1252\uc1 {\fo
0020: 6E 74 74 62 6C 5C 66 30 5C 66
If you include a {\fonttbl...} entry that defines \f0 as an Arabic font, Word
displays it correctly. For example, include {\fonttbl{\f0\fswiss\fcharset177
Arial;}}
as in
{\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}}
\pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502
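The \uN \'xx pairs in the fragment above follow a mechanical pattern: each character is written as its decimal Unicode value, followed by one fallback byte for readers that ignore \uN. A minimal sketch in Python, assuming a hypothetical helper name (rtf_unicode_escape) and assuming code page 1255 as the fallback encoding; the fallback bytes shown above (F7, E5, E3) are consistent with that code page, but the message itself does not name it:

```python
def rtf_unicode_escape(text, fallback_codepage="cp1255"):
    # Hypothetical helper: emit each character as an RTF \uN keyword
    # followed by a single fallback byte written as \'xx (as with \uc1).
    parts = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("this sketch handles BMP characters only")
        # The \uN parameter is a signed 16-bit integer in RTF.
        signed = code - 0x10000 if code > 0x7FFF else code
        # Fallback byte for readers that skip \uN; '?' when unmappable.
        fallback = ch.encode(fallback_codepage, errors="replace")[0]
        parts.append("\\u%d \\'%02X" % (signed, fallback))
    return "".join(parts)


# Hebrew qof, vav, dalet: the first three characters of the fragment.
print(rtf_unicode_escape("\u05e7\u05d5\u05d3"))
```

The output can be compared by eye with the first three \uN \'xx pairs in the RTF fragment.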
Bing or google the clipboard format string. You'll get the answer in the
first few hits.
Murray
Sent from my Windows Phone
From: Stephan Stiller [mailto:stephan.stil...@gmail.com]
Sent: 2/7/2013 8:51 PM
To: Dreiheller,
by claiming the script has
the reverse directionality. This enables Word to write RTF that represents an
LRO...PDF embedding.
Murray
-Original Message-
From: Asmus Freytag [mailto:asm...@ix.netcom.com]
Sent: Thursday, February 7, 2013 9:28 PM
To: Murray Sargent
Cc: Dreiheller, Albrecht; Raymond
Phillipe commented: (even if later Microsoft decides to map some other
characters in its own windows-1252 charset, like it did several times and
notably when the Euro symbol was mapped).
Personal opinion, but I'd be very surprised if Microsoft ever changed the 1252
charset. The euro was added
Mark E. Shoulson m...@kli.org wrote: Mirroring tends to be done for glyphs
that are used in *pairs*,
open/close things and such.
Not invariably; consider the integral and summation. They don't have mirrored
counterparts and many other mathematical symbols don't either.
Murray
[mailto:unicode-bou...@unicode.org] On Behalf
Of Richard Wordingham
Sent: Saturday, July 21, 2012 4:52 PM
To: Unicode
Subject: User-Hostile Text Editing (was: Unicode String Models)
On Fri, 20 Jul 2012 23:16:17 +0000
Murray Sargent murr...@exchange.microsoft.com wrote:
My latest blog post
Mark wrote: “I put together some notes on different ways for programming
languages to handle Unicode at a low level. Comments welcome.”
Nice article as far as it goes and additions are forthcoming. In addition to
multiple code units per character in UTF-8 and UTF-16, there are variation
QSJN 4 UKR asks, Why did the Unicode Consortium think that combination of one
base character and few combining is possible, and combination of few base
characters with one combining character is not?
E.g. U+0483 tilda has to cover a number. Whole number!
For mathematical constructs in general,
One set of examples of the use of these solidus variations occurs in the
mathematics linear format described in Unicode Technical Note #28
(http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf). The ASCII
solidus (U+002F) described in Section 2.1 is used to represent normal stacked
In the linear format of UTN #28, 1/2/3/4 builds up as ((1/2)/3)/4 as in
computer languages like C. The notation actually started with C semantics and
then added a larger set of operators, and finally adopted the full Unicode set
of mathematical operators. You can try it out in Microsoft Office
It's actually quite easy to convince Uniscribe to treat specific characters as
RTL, others as LTR, and, in general, with whatever classifications you desire.
Pass a preprocessed string to Uniscribe's ScriptItemize(). RichEdit has used
that approach to some degree starting with RichEdit 3.0
You can put diacritics over an arbitrarily large base by using an accent object
in a math zone. For example, in my email editor (Outlook), I type alt+= to
insert a math zone and then (a+b)\tilde<space><space> to get
(wide tilde over a+b). Evidently
In some Microsoft products, e.g., Word, WordPad, OneNote and Outlook, you can
type ctrl+~ followed by n to get ñ. Or you can type F1 alt+x to get ñ. The
alt+x conversion of hex Unicode values is easier than the alt+numpad approach,
since the Unicode Standard is in hex.
Murray
From:
Type F1 alt+x, where F1 means the letter F key followed by the 1 key, not
Function key 1. U+00F1 is the Unicode value of ñ. In general to type in a
character by its Unicode value, type in the hex value and then alt+x. E.g., to
type in math italic a, type 1D44E alt+x, which gives 𝑎.
Murray
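The conversion that alt+x performs on the typed hex digits can be illustrated outside any editor. This is only a sketch of the arithmetic in Python, with hypothetical function names; it is not a description of how Word or RichEdit implement the feature:

```python
def hex_to_char(hex_code):
    # Interpret the typed digits as a hexadecimal Unicode code point.
    return chr(int(hex_code, 16))


def char_to_hex(ch):
    # The reverse direction: a character back to its hex code point.
    return "%X" % ord(ch)


print(hex_to_char("F1"))      # the letter n with tilde, U+00F1
print(hex_to_char("1D44E"))   # math italic a, U+1D44E
print(char_to_hex("\u00f1"))
```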
Alex notes Operands are not operators, e.g. in a+b, a and b are operands, + is
an operator. I'm sure Karl Williamson knows that, but the mathematical
alphanumerics also aren't operators and they nevertheless have the math
property. We need to change the description of the math property to
Contextual rendering is getting to be more common thanks to adoption of
OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support
various contextually dependent OpenType features at the user's discretion. The
choice of glyph for U+002E could be chosen according to an
Andreas Prilop commented A native speaker of English does not /automatically/
know better about English grammar, English punctuation than an informed
Frenchman. So true, so true. Most native speakers of English have only limited
understanding of English grammar. At least in my country. They
Asmus asks, Which implementation makes the required context analysis to
determine whether 002E is part of a number during layout? If it does make this
determination, which OpenType feature does it invoke? Which font supports this
particular OpenType feature?
I haven't looked to see if our
Michael asks, Are or will be OT features supported in, say, filenames? The
answer depends on the renderer. For example, if you display filenames in
NotePad using the Calibri font, default English ligatures are used
automatically using OpenType table info.
Murray
Michael asks, Are or will be OT features supported in, say, filenames? The
answer depends on the
renderer. For example, if you display filenames in NotePad using the Calibri
font, default English
ligatures are used automatically using OpenType table info.
I meant on the desktop or in the
Doug comments:
Murray Sargent murrays at exchange dot microsoft dot com wrote:
It's worth remembering that plain text is a format that was introduced
due to the limitations of early computers. Books have always been
rendered with at least some degree of rich text. And due
Vincent asks, So how does one go about getting buy-in? Are the interested
parties on this mailing list, or do you have contact information for decision
makers in the various voting organizations?
I think you, Khaled, Michael and others have made a very good case for having
some way to render
Khaled notes: There are so many issues with MS implementation(s), for example
you can not combine any arbitrary Arabic diacritical marks on any given base
character. I don't think Unicode need to invent workaround broken vendor
implementations, interested parties should instead pressure on that
Doug asks, Can anyone point me to some *real-world* examples of mathematics
text encoded in Unicode, including (especially) the Mathematical Alphanumeric
Symbols starting at U+1D400?
Here are two documents with such text:
Unicode Technical Report #25 Unicode Support for Mathematics
Couple of notes on Word's support. Word has been based on Unicode since
Word '97, although it certainly didn't support all of Unicode at that
time. Word has displayed ruby in built-up form for several versions now
(the name for it is under Asian formatting and called phonetic guide).
Murray
Wide characters in Windows 2K and XP are used for UTF-16 for most
programs that I know of including the Microsoft Office suite and OS
programs such as NotePad and WordPad. Windows 9x has limited Unicode
support, but many programs do use wide characters for UTF-16 on Windows
9x as well.
Murray
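Because a wide character holds one 16-bit UTF-16 code unit, a character outside the BMP occupies two of them (a surrogate pair). A small Python sketch of the standard UTF-16 arithmetic; the function name is hypothetical and the input is assumed to be a valid code point that is not itself a surrogate:

```python
def utf16_code_units(code_point):
    # BMP characters fit in a single 16-bit code unit.
    if code_point < 0x10000:
        return (code_point,)
    # Higher-plane characters are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return (high, low)


# Math italic i (U+1D456) needs two code units; n with tilde needs one.
print([hex(u) for u in utf16_code_units(0x1D456)])
print([hex(u) for u in utf16_code_units(0xF1)])
```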
Title: Surrogates in WordPad
Type the UTF-32 code for the
character instead of the surrogate pair. For example to get a math italic i,
type 1D456 and then Alt+x. Lone surrogate codes aren't desirable. RichEdit does
allow the high code to come in alone via the WM_CHAR message, since some
Title: Does Java 1.5 support Unicode math alphanumerics as variable names?
E.g., math italic i (U+1D456)? With such usage, Java mathematical programs could look more like the original math.
Thanks
Murray
Mike Ayers asked: On Windows, it is well known that you can generate a
character from its code point by holding down the alt key and typing the
code point in decimal, with a leading 0, on the numeric keypad. I
recall that there is also a method to do this in reverse - given a
character on, say,
Raymond Mercier wrote: In MS Word if you type the Unicode code point,
followed by Alt-X, you get the character (if you have the font). This
works in reverse. Sometimes in a RichEdit control window it will work
in the first direction, but not in reverse.
It does not work in Wordpad, in spite of
Title: RE: character map in Microsoft Word
WordPad uses RichEdit 4.1 on
Windows XP and both RichEdit 4.1 and 3.0 support the Alt+NumPad numbers greater
than 255 as Unicode values. But other editors on XP, e.g., NotePad do not
(sigh). The preferred way with RichEdit is to use the hex code
Patrick asks: «Q.
How can I input any Unicode character if I know its hexadecimal
code?»
You
could use an app that supports the Alt+x input method (like Word or WordPad) and
then copy the result into an app that doesn't.
For
reference, the Alt+x input method works as follows:
A handy
An important part of Ricardo Niemietz's hex digit proposal
(http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2677) is to have columns of
hexadecimal numbers line up properly as columns of decimal numbers do.
This could be achieved using a font with a set of glyph variants for A-F
with a hexadecimal
I think the Euro at 0x80 for 1252 (and several other 125x code pages)
was added in May 1988. Cathy Wissink can confirm this. It certainly
happened before 1999, since we added support for it in RichEdit 3.0
which shipped with Windows 2000 and Office 2000.
Murray
-Original Message-
From:
As KenW pointed out, I meant May 1998, not 1988!
Thanks
Murray
-Original Message-
From: Murray Sargent
Sent: Thursday, February 27, 2003 3:44 PM
To: 'Yung-Fong Tang'
Cc: John Myers; Takayuki Tei; kat momoi; Naoki Hotta; Cathy Wissink;
[EMAIL PROTECTED]
Subject: RE: question about
I think Doug asked for lightweight. HTML and XML markup aren't
lightweight by any means, although a special purpose plain-text oriented
XML (LTML for language-tagged markup language) might not be that much
more involved than plane 14 tags. It would also have the advantage that
standard XSLT tools
Joseph Boyle says: It would be useful to have official names to
distinguish UTF-8 with and without BOM.
To see if a UTF-8 file has no BOM, you can just look at the first three
bytes. Is this a problem? Typically when you care about a file's
encoding form, you plan to read the file.
Thanks
Murray
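The three-byte check described above is short enough to sketch. The following Python is illustrative only; the function names are hypothetical, and the data is assumed to be the raw bytes of the file:

```python
import codecs


def has_utf8_bom(data):
    # The UTF-8 BOM is the byte sequence EF BB BF at the start of the data.
    return data[:3] == codecs.BOM_UTF8


def strip_utf8_bom(data):
    # Return the data without a leading BOM, if one is present.
    return data[3:] if has_utf8_bom(data) else data


print(has_utf8_bom(b"\xef\xbb\xbfhello"))
print(has_utf8_bom(b"hello"))
```

For a file on disk, reading the first three bytes in binary mode and passing them to has_utf8_bom is enough.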
Title: Re: script or block detection needed for Unicode fonts
John Jenkins wrote:
"This just seems wildly inefficient to me, but then I'm coming
from an OS where this isn't done. The app doesn't keep track of
whether or not a particular font can draw a particular character; that's
Michael Everson said:
I don't understand why a particular bit has to be set in
some table. Why can't the OS just accept what's in the font?
The main reason is performance. If an application has to check the font
cmap for every character in a file, it slows down reading the file.
Accordingly
I don't think the idea is that codepage equals language. Rather codepage
equals a writing system, which consists of one or more scripts (e.g., 6
scripts for ShiftJIS). As such the codepage is a useful cue in choosing
an appropriate font for rendering text. In the RichEdit edit engine, we
use a
As Ken says the Unicode interlinear annotation characters are for
internal use only. Specifically, their meanings can be different for
different programs. If you have your nice marked up text in memory and
want to export it for use by some program, you need to use a
higher-level protocol that
Michael Everson said Well then they [interlinear annotation characters]
oughtn't to have been encoded.
Michael, you aren't an implementer. When you implement things
unambiguously, you may need internal code points in your plain-text
stream to attach higher-level protocols (such as formatting
6:11 PM
To: Murray Sargent
Cc: Michael Everson; [EMAIL PROTECTED]
Subject: Re: Furigana
Murray,
It's true implementers need some place to attach higher level
protocols, but they don't need specific points for specific
implementations of internal protocols. If they weren't good enough to be
used
Title: Typing Unicode via Alt+NumPad
Actually any application using RichEdit 3.0 or later (e.g, WordPad and
often Outlook) uses any value higher than 255 as a Unicode value. Values less
than 255 are also Unicode, except for 0128 - 0159. Note that for values less
than 255, you need to
Timothy Partridge included the restriction
- No archaic styles of existing characters. E.g. dotless j.
as something inappropriate. Question: how does one code up (presumably
with markup) a caret over a jk pair in a math expression? The dot on the
j should be missing for this case, but how does
Michael Jansson says:
There are no technical reasons for why css/html4/xhtml can not produce
every bit as high quality
as any other page layout format.
Sadly this is currently far from the case. HTML/CSS even including CSS3
is far from a professional document publishing format. It doesn't
Sentinel is fairly commonly used in computer science and program code for data
delimiters. Delimiter is also a good word for this (I use it in RichEdit code), but
one may well use delimiter to describe a quote character (like U+0022), whereas I've
never seen sentinel used for a quote. As such
Stefan Persson [mailto:[EMAIL PROTECTED]] asks how in the formula
mfågel = 1 kg
would the italic å be encoded?
Mathematics has a set of standard letters for mathematical symbols. They can include
diacritics, which can be expressed using the appropriate combining marks. In your
formula
MathML does have markup to extend diacritics across arbitrary numbers of
characters and it's not likely that MathML would use the CGJ for this
purpose. But it would be handy for representing such expressions in
plain-text Unicode
Murray
I agree that NotePad ought to be able to display a pure LF file
correctly. Word and WordPad do. However they do translate the LFs to
CRLFs on saving, which limits their interoperability with Unix. It would
be fairly easy to have an option to write LF files, if there's
sufficient interest.
David Starner said:
Fraktur is not a different script from the Latin script, and therefore is
not encoded separately.
True, but Fraktur math characters are encoded in plane 1 for use in mathematics. These
characters are not intended to be used for natural language purposes (unless you think
Capital pi is to product as capital sigma is to summation.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Sun 2002/01/20 02:19
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: The benefit of a
Marco Cimarosti writes:
Tom Emerson wrote:
One gotcha, that I run into every six months or so, is forgetting that
the punctuation characters in the Basic Latin block are classified as
Latin script. This trips me up because most of my text processing work
involves CJK, so I'll write
I think I've figured out a way to find the beginning of a GB18030 character starting
anywhere in a document. The algorithm is similar to finding the beginning of a DBCS
character in that you scan backward until you find a byte that can only come at the
start of a character. The main difference
Actually fonts on Windows are normally Unicode based (including MS
Mincho and MS Gothic) and most have in addition some codepage access. So
there is neither a perf hit nor a codepage problem in using such fonts
on NT, Win2000 and WinXP. These considerations are orthogonal to
OpenType.
Murray
Hey guys, Ken is just kidding. He's evidently tired of the current
plethora of ways to represent Unicode let alone all those new ones being
proposed. Sigh, I am too. Carl, you understand the problem of adding yet
another UTF: you too will probably have to support it.
Murray
Carl Brown
If you need to roundtrip 8859-1 through ASCII, you need to use some kind
of escape mechanism inside the ASCII to represent characters that have
their high bit equal to one. A common simple escape is to use the
backslash. So you could represent the codes as \'xx, where xx is the
hexadecimal code.
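A rough Python sketch of such a backslash escape follows. The function names are hypothetical, and escaping the backslash itself (as \'5c) is an added assumption, made so that the round trip is unambiguous; the message above does not say how a literal backslash should be handled:

```python
import re


def escape_latin1(text):
    # Write 8859-1 text as pure ASCII, using \'xx for high-bit bytes.
    out = []
    for byte in text.encode("latin-1"):
        if byte < 0x80 and byte != 0x5C:
            out.append(chr(byte))
        else:
            # High-bit bytes, and the backslash itself (an assumption).
            out.append("\\'%02x" % byte)
    return "".join(out)


def unescape_latin1(ascii_text):
    # Turn every \'xx escape back into the character with that code.
    return re.sub(r"\\'([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  ascii_text)


sample = "caf\u00e9 \\ \u00f1"
escaped = escape_latin1(sample)
print(escaped)
print(unescape_latin1(escaped) == sample)
```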
It's intriguing to think of an encoding for math symbols that breaks
them down into sequences of pieces. For example, NOT EQUAL could be
EQUAL followed by a slash combining mark.
Maybe some day a cleanicode will be developed that handles this and
related characters in a consistent, uniform way.
Unicode has many multiplication signs, e.g., U+00D7, U+00B7, U+2022,
U+2219, U+2299, U+22A0, U+22C6, etc. In this spirit, you can probably
include U+2605 (★)
Murray
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 05, 2001 11:59 AM
To:
Has anyone ever made a character collection for mathematics?
Please check out Unicode 3.1 and 3.2 (coming up). Characters from the
STIX collaboration of a variety of mathematical sources such as AmSTeX and
MathML have been collected into a math character set that seems to have
the vast
The Weierstrass symbol U+2118 isn't a capital letter in spite of its name,
nor is it really an alphabet character. It's sort of a stylized mixture of a
rho and a lower-case script p. However in view of the principle that
character names never change, even if incorrect, this symbol remains the
In some of my talks at the Unicode conferences (see "Tips and Tricks..."), I
have addressed problems with Unicode, notably trying to figure out whether
to use a Chinese Simplified/Traditional, Japanese, or Korean font to render
a Chinese character inserted in a plain-text scenario. This is a
It would be great if things were that easy. But users typically don't want
to worry about fonts. They enter a character, maybe by pasting plain text,
and want it magically to appear as something other than the
"missing-character" glyph. They probably don't even know if it's a
For what it's worth, I've been referring to characters between 0x10000 and
0x10FFFF as "higher-plane" characters as distinguished from BMP characters.
Seems to work well in a general way. For plane 1, I use "plane=1"
characters, etc.
Murray
If you can get the text into a Win32 RichEdit control version 3.0 or later
(Office 2000 and/or Windows 2000 in WordPad), type Shift+Alt+x after the
character and the character will be replaced by its Unicode hexadecimal
value. If you type Alt+x, that code gets converted back into the Unicode
One interesting possibility for representing the APL italic characters would
be to use the math italic alphabet in plane 1. The motivation for their use
in APL is similar to that for the math case: the characters are separate
symbols, e.g., they don't get grouped into natural language words. In
Subject: Re: Plane 14 language tags
Brendan Murray wrote:
Murray Sargent [EMAIL PROTECTED] wrote:
Note that in C, it's essentially just as fast to make character
comparisons with (ch | 0x20) as with ch alone, i.e., if you know
ch is in an ASCII range (0 - 0x7F or 0xE
Note that in C, it's essentially just as fast to make character comparisons
with (ch | 0x20) as with ch alone, i.e., if you know ch is in an ASCII range
(0 - 0x7F or 0xE0000 - 0xE007F), you can do a case insensitive compare as
quickly as a case sensitive one. The problem with assuming lower case
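The (ch | 0x20) comparison can be illustrated as follows. The sketch is in Python rather than the C the message refers to, the function name is hypothetical, and it assumes both inputs are already known to be letters; for other characters the OR can produce false matches, as the last line shows:

```python
def letters_equal_ignore_case(a, b):
    # Setting bit 0x20 maps 'A'..'Z' onto 'a'..'z', so one OR per
    # character is all a case-insensitive letter comparison costs.
    return (ord(a) | 0x20) == (ord(b) | 0x20)


print(letters_equal_ignore_case("A", "a"))
print(letters_equal_ignore_case("A", "b"))
# The same bit trick applies to the plane 14 tag letters.
print(letters_equal_ignore_case(chr(0xE0041), chr(0xE0061)))
# Caveat: non-letters can collide, e.g. '@' (0x40) and '`' (0x60).
print(letters_equal_ignore_case("@", "`"))
```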
[EMAIL PROTECTED] asked: The question is: Is there any way for making True
type fonts and Unicode compatible?
The answer to this question is: Microsoft's implementation of TrueType has
always been based on Unicode, right from the first version in 1992. The
answer to the original question,