From: John Hudson [EMAIL PROTECTED]
Donald Z. Osborn wrote:
According to data from R. Hartell (1993), the latin alpha is used in
Fe'efe'e (a
dialect of Bamileke) in Cameroon. See
http://www.bisharat.net/A12N/CAM-table.htm (full ref. there; Hartell
names her
sources in her book). Not sure
From: Doug Ewell [EMAIL PROTECTED]
António Martins-Tuválkin antonio at tuvalkin dot web dot pt wrote:
Deseret in use (?) by micronation Molossia: It is explained at
http://www.molossia.org/alphabet.html , but they put GIFs on-line,
making no use of the U+10400 block...
I visited their site,
From: Peter Kirk [EMAIL PROTECTED]
By the way, any suggestion of making the QQ distinction with markup is
ruled out by the principle recently expounded on the main Unicode list
that separate markup cannot be applied to combining characters.
Isn't this need of allowing separate markup on
From: Jony Rosenne [EMAIL PROTECTED]
Peter Kirk
You mean, you would represent a black e with a red acute accent as
something like e, ZWJ, red, IBC, acute, /red? That
looks like
a nightmare for all kinds of processing and a nightmare for rendering.
No, it is more like forecolor:black,
From: Asmus Freytag [EMAIL PROTECTED]
At 12:49 AM 9/8/2004, Philippe Verdy wrote:
And still no decision if this invisible base character will be added or
not. It's just a public review for now,
Well, hold your horses for a bit here.
If something's out of review, there won't be a decision until
From: Asmus Freytag [EMAIL PROTECTED]
On the other hand, all aspects to *coloring* of characters
do not belong in the plain text stream - but that was not
the question.
I think suggested solutions that define markup that apply to
combining characters but place that markup outside of the
combining
From: Gerd Schumacher [EMAIL PROTECTED]
2. Another invisible diacritics carrier
I also found an acute on diphthongs, placed on the boundary of both letters
(au, ei, eu, oe, and ui).
Wouldn't such a diacritic be held by the currently proposed invisible base
character (in the Public Review section of
From: Peter Kirk [EMAIL PROTECTED]
Surely the intention is for INVISIBLE LETTER, combining acute to be
equivalent (although it cannot be canonically equivalent) to spacing
acute, U+00B4? But then would this kind of ligature mechanism with ZWNJ
and U+00B4 be appropriate? I would think not.
From: Doug Ewell [EMAIL PROTECTED]
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
I also found an acute on diphthongs, placed on the boundary of both
letters (au, ei, eu, oe, and ui).
Wouldn't such a diacritic be held by the currently proposed invisible
base character (in the Public
To: Philippe Verdy [EMAIL PROTECTED]
Cc: Doug Ewell [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 6:06 PM
Subject: Re: Questions about diacritics
In LaTeX2e with the Cork coding (for TeXnicians: \usepackage[T1]{fontenc})
there is a so-called compound word mark. It has
Since INVISIBLE LETTER is spacing, wouldn't it make more sense to define
Isn't INVISIBLE LETTER rather *non-spacing* (zero-width minimum), even
though it is *not combining*?
I mean here that its width would be zero unless a visible diacritic expands
it. It is then distinct from other
This page:
http://www.omniglot.com/writing/albanian.htm
shows two historic scripts that have been used to write Albanian (Shqip):
- the Elsaban script in the 18th century, which looks like Old Greek, for the
Tosk variant of the language. However there are lots of unique letter forms, and
mapping to Old
From: Doug Ewell [EMAIL PROTECTED]
In the case of INVISIBLE LETTER, it seems likely -- based on the
comments of experts -- that the benefits outweigh the disadvantages.
But new control characters (and quasi-controls like IL) have tended to
cause more problems and confusion for Unicode in the past
From: Doug Ewell [EMAIL PROTECTED]
Marion Gunn mgunn at egt dot ie wrote:
Is it really so hard
to make multi-platform, open-office-type utilities?
Actually, yes, it is. Mac users don't want an application to be too
Windows-like, Windows users don't want an application to be too Mac-like
(we'll
From: Chris Jacobs [EMAIL PROTECTED]
- Original Message -
From: Christopher Fynn [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, September 19, 2004 12:08 AM
Subject: Unicode Shorthand?
Is there any plan to include sets of shorthand (Pitman, Gregg etc.)
symbols in Unicode? Or are
From: D. Starner [EMAIL PROTECTED]
Christopher Fynn wrote:
Is there any plan to include sets of shorthand (Pitman, Gregg etc.)
symbols in Unicode? Or are they something which is specifically excluded?
They're a form of handwriting, which is generally excluded. Why do
they need to be encoded in a
From: Christopher Fynn [EMAIL PROTECTED]
Philippe Verdy wrote:
It's not impossible to create a rendering system for such a stenographic
system; however, the general layout is more complex than with traditional
alphabets, because the layout of characters is highly dependent on the
context
From: Christopher Fynn [EMAIL PROTECTED]
Philippe Verdy wrote:
Not really, because the actual rendering is two-dimensional, not linear.
It's difficult to predict the line height, as the baseline changes
according to the context of previous characters in the word, and its
writing direction
From: Curtis Clark [EMAIL PROTECTED]
on 2004-09-24 10:05 Peter Constable did quote:
After the DNA, the ASCII-Code is the most successful code on this
planet.
Things get more and more complex. DNA is a 2-bit code.
Not completely true. It is a bit less than 2 bits, due to its replication
chains,
From: Terje Bless [EMAIL PROTECTED]
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Theodore H. Smith [EMAIL PROTECTED] wrote:
I'd like to see a UTF-8 stress test file.
The top result on Google for the query UTF-8 Stress Test is
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.
This test
From: Doug Ewell [EMAIL PROTECTED]
Theodore H. Smith delete at elfdata dot com wrote:
- the file mixes UTF-8 and UTF-16
Does this file mix UTF-8 and UTF-16? I thought it just had surrogates
encoded into UTF-8? Of course a surrogate should never exist in UTF-8.
You are right. Philippe's statement
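To make the point about surrogates concrete, here is a small sketch in Python (chosen only for illustration; nothing in the thread uses it). It encodes the two surrogates of U+10400 as separate three-byte sequences, the CESU-8-like form, and checks that a strict UTF-8 decoder rejects the result while accepting the proper four-byte form. The helper names are made up for this example.

```python
def three_byte(cp):
    """Encode a value in 0x0800..0xFFFF with the three-byte UTF-8 bit pattern."""
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_valid_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10400 as a UTF-16 surrogate pair is D801 DC00.
HIGH, LOW = 0xD801, 0xDC00

# Each surrogate encoded on its own: not legal UTF-8.
cesu_like = three_byte(HIGH) + three_byte(LOW)

# The same character as well-formed UTF-8: one four-byte sequence.
proper = "\U00010400".encode("utf-8")
```

Here is_valid_utf8(proper) holds and is_valid_utf8(cesu_like) does not, which is the behaviour a conformant UTF-8 parser is expected to show on that section of the test file.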
From: Clark Cox [EMAIL PROTECTED]
unless the file was used as a test for CESU-8
The whole point of the CESU-8-like section is that it is not legal UTF-8.
Except that the document does not even cite CESU-8 but only UTF-16! The
text itself is puzzling as well as nearly all its suggestions about
From: Philipp Reichmuth [EMAIL PROTECTED]
Don't you think you are stretching things a bit? This is a UTF-8 parser
stress test file. If an application opens it in a different encoding,
well, of course the results will be different, and things will not look
UTF-8-ish. Again, this is a
From: Antoine Leca [EMAIL PROTECTED]
On Tuesday, September 28th, 2004 03:22 Tom wrote:
Let's say. The test engineer ensures the functionality and validates
the input and output on major Latin 1 languages, such as German,
French, Spanish, Italian,
Just a side point: French cannot be fully
About the French ligatures 'oe' (and 'ae'), I should have noted this
excellent summary page (in French) on its usage and history:
http://fr.wikipedia.org/wiki/Ligature_(typographie)
Note that Latin- or Greek-inherited words use the ligature when the vowels
are not to be pronounced separately,
://www.rodage.org/pub/French-Sahel.pdf
- Original Message -
From: Stefan Persson [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Sent: Thursday, September 30, 2004 5:05 PM
Subject: Re: internationalisation assumption
Philippe Verdy wrote:
in addition, French
keyboards typically never
From: Chris Harvey [EMAIL PROTECTED]
The users seem determined to put the entire alphabet into the PUA, thus
making a single character for ng, kw, ii etc. I would like to be
able to present them with something that works and avoid this kind of
catastrophe.
A better alternative to PUAs, which
RE: internationalization assumption
Well the main issue for
internationalization of software is not the character sets with which it was
tested. It is in fact trivial today to make an application compliant with
Unicode text encoding.
What is more complicated is to make sure that the text will be
This page on the French version of wikipedia notes that Polytonic Greek used
in the 3rd century B.C. alternate letters to denote the initial spirits
(pneuma dasú for the hard spirit, and pneuma psílon for the soft
spirit), rather than the modern 9-shaped combining accents.
From: fantasai [EMAIL PROTECTED]
Comments on CSS (but not how-to questions) should be directed to
the www-style mailing list at w3.org, not unicode:
http://lists.w3.org/Archives/Public/www-style/
OK for the numeric versus capitalize|uppercase|lowercase remark, which
is related to form
From: kefas [EMAIL PROTECTED]
Inserting unicode/basic-hebrew results in a convenient
RtL, right-to-left, advance of the cursor, but the
space-character jumps to the far right. Is there a
RtL-space?
In MS-Word and OpenOffice I can only change whole
paragraphs to RtL-entry. But quoting just a few
From: A. Vine [EMAIL PROTECTED]
I'm just curious about the \0 thing. What problems would having a \0 in
UTF-8 present, that are not presented by having \0 in ASCII? I can't see
any advantage there.
Beats me, I wasn't there. None of the Java folks I know were there
either.
The problem is in the
- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: Doug Ewell [EMAIL PROTECTED]
Cc: Unicode Mailing List [EMAIL PROTECTED]; Philippe Verdy
[EMAIL PROTECTED]; Peter Kirk [EMAIL PROTECTED]
Sent: Monday, November 15, 2004 7:05 AM
Subject: Re: U+ in C strings (was: Re: Opinions
From: Christopher Fynn [EMAIL PROTECTED]
Isn't it already deprecated? The URL that started this thread
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html
is marked as part of the Deprecated API
Deprecated does not mean that it is not used. This interface remains
accessible when
From: Peter Kirk [EMAIL PROTECTED]
On the contrary, it is your mobile sync software which is of no use if
communication with the outside world is required, if it doesn't support
standards-conformant mail clients like Thunderbird, but only communicates
in non-standardised ways with the products
From: Edward H. Trager [EMAIL PROTECTED]
Hi, Elaine,
There is of course no limit to how many writing systems
one can have on a Unicode-encoded HTML page.
My recommendations would be to:
(3) Use Cascading Style Sheet (CSS) classes to control display of fonts
...
A better CSS class would
From: E. Keown [EMAIL PROTECTED]
Great idea! I code in the seldom-seen AHTML ('Archaic
HTML'), as you all suspected.
A friend tested a page I wrote last month and found it
wouldn't work on any of his 5 browsers... oh well.
Well, Elaine, if you want maximum compatibility, you had better use
From: Christopher Fynn [EMAIL PROTECTED]
I'd also like to figure out a way to trigger this kind of behavior in
other browsers as well as in IE (using Java Script or Java rather than VB)
as not quite everyone uses IE - (but I guess you are not going to give me
any more clues on how to do that
From: Doug Ewell [EMAIL PROTECTED]
The best advice for Elaine's situation becomes simpler. To maximize the
likelihood that readers will see the right glyphs, add a font-family
style line that lists a variety of available fonts, in decreasing order
of coverage and attractiveness.
My bad advice
From: Doug Ewell [EMAIL PROTECTED]
Cryptically naming these two CSS classes .he and .heb, which
provides no indication of which is the Unicode encoding and which is the
Latin-1 hack, merely makes a bad suggestion worse.
It was not cryptographic: he was meant for Hebrew (generic, properly
Unicode
From: E. Keown [EMAIL PROTECTED]
Dear Doug Ewell, fantasai and List:
I will try to sort out these diverse pieces of advice.
What's the point, really, of going far beyond, even
beyond CSS, into XHTML, where few computational
Hebraists have gone before?
You're right Helen, the web is full of non
From: Edward H. Trager [EMAIL PROTECTED]
Are you saying the difference in names is SIL Ezra vs. Ezra SIL ?
That's too confusing!
You're not alone in being confused. I had completely forgotten the existence of
two versions of the same font design. I may have just seen that it used
PUAs, so I did not
From: Antoine Leca [EMAIL PROTECTED]
I do not know what fully compatible means in such a context. For
example, ASCII as designed allowed (please note I did not write was
designed to allow) the use of the 8th bit as parity bit when transmitted
as an octet on a telecommunication line; I doubt such
You just need a mapping table from Unicode
codepoints to Shift-JIS code positions, and a very simple code point parser to
translate UTF-8 into Unicode code points.
You'll find a mapping table in the Unicode UCD, on
its FTP server. The UTF-8 form is fully documented in the Conformance section
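The two pieces described above (a simple UTF-8 code point parser plus a mapping table) could be sketched roughly as follows. This is only an illustration: SAMPLE_TABLE is a hypothetical four-entry excerpt standing in for the real mapping file, the function names are invented here, and the parser is deliberately simplified (it does not reject every ill-formed sequence the standard forbids).

```python
# Hypothetical excerpt of a Unicode -> Shift-JIS mapping table.
# A real converter would load the complete published mapping instead.
SAMPLE_TABLE = {
    0x3042: 0x82A0,  # HIRAGANA LETTER A
    0x30A2: 0x8341,  # KATAKANA LETTER A
    0x65E5: 0x93FA,  # CJK ideograph "sun/day"
    0x672C: 0x967B,  # CJK ideograph "origin/book"
}


def utf8_code_points(data):
    """Yield code points from UTF-8 bytes (simplified, not a full validator)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif 0xC2 <= lead <= 0xDF:
            extra, cp = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            extra, cp = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            extra, cp = 3, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + extra >= len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, extra + 1):
            trail = data[i + k]
            if trail & 0xC0 != 0x80:
                raise ValueError("invalid trail byte at offset %d" % (i + k))
            cp = (cp << 6) | (trail & 0x3F)
        yield cp
        i += extra + 1


def to_shift_jis(data, table):
    """Convert UTF-8 bytes to Shift-JIS bytes using the given mapping table."""
    out = bytearray()
    for cp in utf8_code_points(data):
        if cp < 0x80:
            # ASCII range passed through unchanged; real tables treat
            # 0x5C and 0x7E specially, which this sketch ignores.
            out.append(cp)
        else:
            out += table[cp].to_bytes(2, "big")  # KeyError if unmapped
    return bytes(out)
```

Calling to_shift_jis("日本".encode("utf-8"), SAMPLE_TABLE) exercises both steps; an unmapped character raises KeyError, which is where a real converter would need a substitution policy.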
From: Antoine Leca [EMAIL PROTECTED]
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag va escriure:
I'm not seeing a lot in this thread that adds to the store of
knowledge on this issue, but I see a number of statements that are
easily misconstrued or misapplied, including the thoroughly
- Original Message -
From: Addison Phillips [wM]
To: pragati ; [EMAIL PROTECTED]
Sent: Thursday, November 25, 2004 6:21 PM
Subject: RE: Shift-JIS conversion.
Dear Pragati,
You can write your own conversion, of course. The mapping tables of
Unicode-SJIS are readily available. You should
From: Antoine Leca [EMAIL PROTECTED]
On Thursday, November 25th, 2004 08:05Z Philippe Verdy va escriure:
In ASCII, or in all other ISO 646 charsets, code positions are ALL in
the range 0 to 127. Nothing is defined outside of this range, exactly
like Unicode does not define or mandate anything
From: Doug Ewell [EMAIL PROTECTED]
My impression is that Unicode and ISO/IEC 10646 are two distinct
standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
which have pledged to work together to keep the standards perfectly
aligned and interoperable, because it would be destructive
From: Mark Davis [EMAIL PROTECTED]
I want to correct some misperceptions about CGJ; it should not be used for
ligatures.
True. CGJ is a combining character that extends the grapheme cluster started
before it, but it does not imply any linking with the next grapheme cluster
starting at a base
- Original Message -
From: Mark Davis [EMAIL PROTECTED]
To: Philippe Verdy [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 26, 2004 9:09 PM
Subject: Re: CGJ , RLM
The statements below are incorrect, but I don't have the time to correct
them all.
From: Doug Ewell [EMAIL PROTECTED]
Perhaps a better question to ask would be why you need to indicate both
hyphenation points and ligation points in text that is going to be
collated.
Because one would want to:
- prepare documents for correct rendering (including both ligatures and
hyphenation
From: Doug Ewell [EMAIL PROTECTED]
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
If I want to encode explicit ligatures for the ffi cluster, if it is
not hyphenated, I need to add ZWJ:
ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce(1)
Great Scott! You can use ZWJ to suggest a ligation
From: Jony Rosenne [EMAIL PROTECTED]
One of the problems in this context is the phrase original meaning. What
we have is a juxtaposition of two words, which is indicated by writing the
letters of one with the vowels of the other. In many cases this does not
cause much of a problem, because the
From: Addison Phillips [wM] [EMAIL PROTECTED]
For example, Dutch sometimes treats the sequence ij as a single letter
(it turns out that there are characters for the letter 'ij' in Unicode
too, but they are for compatibility with an ancient non-Unicode character
set). Software must be modified
From: Peter Kirk [EMAIL PROTECTED]
I don't want to go along with Philippe entirely on this, but surely he
must be right on this last point. Formally, Unicode is effectively the
agent of just one national body in this decision-making process.
To be honest, Peter, I never said that Unicode was a
I'm not the one that proposed encoding a AE ligature with A+ZWJ+E. I just
spoke about cases like true typographical ligatures like ffi. I do know
that AE or ae in French is better encoded with their distinct unique code,
even if French considers this letter as two letters (which may justify the
From: John Cowan [EMAIL PROTECTED]
the need to encode Dutch
ij as a single character, which is neither necessary nor practical.
(U+0132 and U+0133 are encoded for compatibility only.) In cases where
ij is a digraph in Dutch text, i+ZWNJ+j will be effective.
I suppose you wanted to speak about the
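The compatibility status of U+0132 and U+0133 mentioned above can be observed through normalization. The following sketch (Python, used here purely as an illustration; the function name is invented) shows that the compatibility forms fold the single-character ligature to the two-letter sequence, while canonical normalization leaves it alone. Note that NFKC folds many other compatibility characters too, so it is not a Dutch-specific tool.

```python
import unicodedata

IJ_UPPER = "\u0132"  # LATIN CAPITAL LIGATURE IJ
IJ_LOWER = "\u0133"  # LATIN SMALL LIGATURE IJ
ZWNJ = "\u200c"      # ZERO WIDTH NON-JOINER


def fold_compat(text):
    """Apply compatibility normalization (NFKC) to the text."""
    return unicodedata.normalize("NFKC", text)


# The digraph written as two letters with ZWNJ between them, as suggested above.
digraph_with_zwnj = "i" + ZWNJ + "j"
```

fold_compat(IJ_LOWER + "s") yields the plain three-letter word, whereas unicodedata.normalize("NFC", IJ_LOWER) returns the ligature unchanged.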
From: Patrick Andries [EMAIL PROTECTED]
Finally, I am no longer so sure that American companies still regard
Unicode as something strategic; it is mostly a matter of individual
efforts by passionate technicians in these companies, enthusiasts who are
still allowed
From: Otto Stolz [EMAIL PROTECTED]
Note that there is no algorithm to reliably derive the position of the
syllable break from the spelling of a word. You could even concoct pairs
of homographs that differ only in the position of the syllable break
(and, consequently, in their respective meaning).
From: Otto Stolz [EMAIL PROTECTED]
Just because the st ligature is so uncommon (and the long s with its
t ligature is almost extinct), I was looking for an example involving
fl, or fi).
with ff :
affable, baffe, biffer, Buffy, affriolant, effaroucher, effacer, ...
with ffl :
effleurer,
From: Michael Norton (a.k.a. Flarn) [EMAIL PROTECTED]
What's an ideograph? Also, what's a radical?
Are they the same thing?
Some radicals (in the Han script) may be ideographs, but most ideographs are
not radicals: they often (not always) combine 1 or more radicals, with 1 or
more strokes that
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
Doug Ewell wrote:
Robert Finch wrote:
I'm trying to implement a Unicode keyboard device, and I'd rather have
keyboard processing dealing with genuine Unicode characters for the
cursor keys, rather than having to use a mix of keyboard scan codes
and
From: Peter Kirk [EMAIL PROTECTED]
On 30/11/2004 19:53, John Cowan wrote:
Your main misunderstanding seems to be your belief that WG2 is a
democratic body; that is, that it makes decisions by majority vote. ...
Thank you, John. This was in fact my question: will the amendment be
passed
There's no *universal* best encoding.
UTF-8 however is certainly today the best encoding for portable
communications and data storage (but it competes now with SCSU which uses a
compressed form where, on average, each Unicode character is represented by
one byte, in most documents; but other
If you need immutable strings that take as little space as possible in
memory for your running app, then consider using SCSU for the internal
storage of the string object, then have a method return an indexed array of
code points, or a UTF-32 string when you need it to mutate the string
From: Doug Ewell [EMAIL PROTECTED]
I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format. The effort to encode
and decode it, while by no means Herculean as often perceived, is not
trivial once you step outside Latin-1.
I said: for
RE: Nicest UTF
From: Lars Kristan
I agree. But not for reasons you mentioned. There is one other important
advantage:
UTF-8 is stored in a way that permits storing invalid sequences. I will
need to
elaborate that, of course.
Not true for UTF-8. UTF-8 can only store valid sequences of code points,
From: Gary P. Grosso [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 03, 2004 5:10 PM
Subject: RE: OpenType vs TrueType (was current version of unicode-font)
Hi Antoine, others,
Questions about OpenType vs TrueType come up often in my work, so perhaps
the list will suffer a couple
From: Asmus Freytag [EMAIL PROTECTED]
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
1) 1 extra test per character (to see whether it's a surrogate)
2) special handling every 100 to 1000 characters (say 10 instructions)
3) additional cost of accessing 16-bit registers (per
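Item 1 of the cost model above, the per-character surrogate test, looks roughly like this in code. This is a sketch with an invented function name, written in Python for readability rather than speed; it passes unpaired surrogates through unchanged, which is one possible policy among several.

```python
def utf16_code_points(units):
    """Yield code points from a sequence of 16-bit code units."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # The one extra test per character: is this a high surrogate
        # followed by a low surrogate?
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # The rarely taken branch: combine the pair.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # Common case: a BMP code unit (or an unpaired surrogate,
            # passed through as-is in this sketch).
            yield u
            i += 1
```

For text that is almost entirely in the BMP, the branch that combines a pair is taken rarely, which is why the model treats it as special handling every few hundred characters.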
From: Theo [EMAIL PROTECTED]
From: Asmus Freytag [EMAIL PROTECTED]
So, despite it being UTF-8 case insensitive, it was totally blastingly
fast. (One person reported counting words at 1MB/second of pure text, from
within a mixed Basic / C environment). You'll need to keep in mind, that
the
From: Peter Constable [EMAIL PROTECTED]
Why would you think the creation of this site might suggest that
Microsoft is selling off its IP in relation to OpenType to Monotype? If
Motorola created a site www.pentium4.org, would you jump to the
conclusion that they were selling off that IP?
What
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
Random access by code point index means that you don't use strings
as immutable objects,
No. Look at Python, Java and C#: their strings are immutable (don't
change in-place) and are indexed by integers
- Original Message -
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, December 05, 2004 1:37 AM
Subject: Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes:
There's nothing that requires the string storage to use the same
exposed array,
The point
Richard Cook rscook at socrates dot berkeley dot edu wrote:
Script complexity is not so easily quantified. Has anyone tried to
sort scripts by complexity? In terms of the present discussion, Han
would be viewed as a simple script, and yet it is simple only in
terms of the script model in which
From: Ray Mullan [EMAIL PROTECTED]
I don't see how the one million available codepoints in the Unicode
Standard could possibly accommodate a grammatically accurate vocabulary of
all the world's languages.
You have misread the message from Tim: he wanted to use code points above
U+10 within
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points
of channels (networking links, file
storage, database table) with lower throughput than fast but expensive or
restricted internal processing memory (including memory caches if we
consider data locality).
From: D. Starner [EMAIL PROTECTED]
Philippe Verdy writes:
Suppose that Unicode encodes
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Now consider scanning forwards. We want to strip a beginning of a
string. For example the string is an irc message prefixed with a
command and we want to take the message only for further processing.
We have found the end of the prefix and we want
From: Doug Ewell [EMAIL PROTECTED]
Here is a string, expressed as a sequence of bytes in SCSU:
05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
See how long it takes you to decode this to Unicode code points. (Do
not refer to UTN #14; that would be cheating. :-)
Without looking
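For readers who would rather let a program do the decoding, here is a deliberately partial SCSU decoder sketch in Python. It handles only single-byte mode with the SQn (quote) and SCn (select window) tags and the default window positions; the window offsets are reproduced from the SCSU specification (UTS #6) as I recall them and should be checked against it. Any other tag raises an error instead of being guessed at. The function name is invented for this example.

```python
# Static and default dynamic window offsets, as listed in UTS #6.
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data):
    """Decode a small subset of SCSU (single-byte mode, SQn and SCn only)."""
    out = []
    active = 0  # dynamic window 0 is active at the start
    i = 0
    while i < len(data):
        b = data[i]
        if 0x01 <= b <= 0x08:
            # SQn: quote one character from window n; the next byte is data.
            n = b - 0x01
            i += 1
            q = data[i]
            if q < 0x80:
                out.append(chr(STATIC[n] + q))
            else:
                out.append(chr(DYNAMIC[n] + q - 0x80))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif b >= 0x80:
            # High byte: offset into the active dynamic window.
            out.append(chr(DYNAMIC[active] + b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII passes through unchanged.
            out.append(chr(b))
        else:
            raise ValueError("unsupported SCSU tag 0x%02X" % b)
        i += 1
    return "".join(out)
```

Feeding it bytes.fromhex() of the byte string quoted above is enough to try the challenge; a full decoder would also need Unicode mode, window redefinition and extended windows.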
- Original Message -
From: Arcane Jill [EMAIL PROTECTED]
Probably a dumb question, but how come nobody's invented UTF-24 yet? I
just made that up, it's not an official standard, but one could easily
define UTF-24 as UTF-32 with the most-significant byte (which is always
zero) removed,
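The "UTF-24" idea described above is simple enough to sketch. As the message says, it is not an official standard; the big-endian byte order and the function names below are arbitrary choices made for this illustration.

```python
def encode_utf24(text):
    """Encode each code point as three big-endian bytes (hypothetical UTF-24)."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "big")
    return bytes(out)


def decode_utf24(data):
    """Decode the hypothetical UTF-24 form back to a string."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    # chr() raises ValueError for values above 0x10FFFF.
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )
```

Indexing stays O(1) because the nth code point always starts at byte offset 3 * n; the price is three-byte units that do not line up with common machine word sizes.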
From: E. Keown [EMAIL PROTECTED]
I wrote 3 Hebrew diacritics proposals between
May-July. (...)
1. Proposal to add Samaritan Pointing to the UCS
http://www.lashonkodesh.org/samarpro.pdf
WG2 number: N2748
2. Proposal to add Palestinian Pointing to ISO/IEC 10646
From: D. Starner [EMAIL PROTECTED]
If you're talking about a language that hides the structure of strings
and has no problem with variable length data, then it wouldn't matter
what the internal processing of the string looks like. You'd need to
use iterators and discourage the use of arbitrary
From: Kenneth Whistler [EMAIL PROTECTED]
Yes, and pigs could fly, if they had big enough wings.
Once again, this is a creative comment. As if Unicode had to be bound on
architectural constraints such as the requirement of representing code units
(which are architectural for a system) only as
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
I know what you mean here:
most Linux/Unix filesystems (as well as many legacy filesystems for Windows
and MacOS...) do not track the encoding with which filenames were encoded
and, depending on local user preferences when that user created that
De : Michael Everson
But there is already in the pipeline a PHOENICIAN WORD SEPARATOR
[...] The glyphs for
all of these seem indistinguishable, and so are the functions. The only
difference seems to be the scripts they are associated with, but
punctuation marks are supposed to be
Probably the first thing to do for Africa is to
extend the support of software with localized content that can ALREADY be
produced with existing encoded scripts. But even there, software companies are
not progressing much, even if this causes no technical problems with the
existing
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Ok, so it's the conversion from raw text to escaped character
references which should treat combining characters specially.
What about < with combining acute, which doesn't have a precomposed
form? A broken opening tag or a valid text character?
From: D. Starner [EMAIL PROTECTED]
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
If it's a broken character reference, then what about A#769; (769 is
the code for combining acute if I'm not mistaken)?
Please start adding spaces to your entity references or
something, because those of us
From: Antoine Leca [EMAIL PROTECTED]
Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
*nix application that displays filenames needs to know the encoding to use
the correct set of glyphs (but the constraints are much heavier.) Also
Windows NT Unicode applications know it,
From: Carl W. Brown [EMAIL PROTECTED]
Philippe,
Also a broken opening tag for HTML/XML documents
In addition to not having endian problems UTF-8 is also useful when
tracing
intersystem communications data because XML and other tags are usually in
the ASCII subset of UTF-8 and stand out making it
From: Philippe Verdy [EMAIL PROTECTED]
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '<', '>', quotation marks, and with special
behavior for spaces.
The point
From: Kenneth Whistler [EMAIL PROTECTED]
That it has been morphologically reanalyzed is demonstrated by the
fact that it takes regular English verb endings, as in:
I RSVPed yesterday, right after I got the email.
As I said, it is now a bona fide English verb, and most
English speakers will treat it
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
to process it in groups of combining character
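The difference between choices 1 and 2 can be shown with a short sketch. It stores a string as code points (which is what a Python str is) and groups them into combining character sequences on demand, roughly in the spirit of choice 3. The grouping rule used here, a non-zero canonical combining class, is a simplification assumed for this example; it misses combining marks whose class is zero.

```python
import unicodedata


def combining_sequences(text):
    """Group code points into base + combining-mark runs (simplified rule)."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            # A combining mark attaches to the preceding group.
            groups[-1] += ch
        else:
            # Anything else starts a new group.
            groups.append(ch)
    return groups


# "e" + COMBINING ACUTE ACCENT + "a": three code points, two sequences.
sample = "e\u0301a"
```

len(sample) counts code points, while len(combining_sequences(sample)) counts sequences, so the two models already disagree about the length of this three-code-point string.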
From: Doug Ewell [EMAIL PROTECTED]
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question and my response to his problem is that you
MUST not use
From: Séamas Ó Brógáin [EMAIL PROTECTED]
John wrote:
As far as I know, they were first used in formal invitations (to
weddings,
funerals, dances, etc.) in the corner of the card, as both shorter and
more fancy than the older phrase The favor of your reply is requested.
This is correct. The
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen graphically distinguishable) the repertoire
is still finite.
I do think that you are underestimating the repertoire. Also Unicode does
NOT define
From: Michael Everson [EMAIL PROTECTED]
Nonsense. You might as well try to explain SPQR on the same basis.
I won't. I know that SPQR was used on architectural constructions as a
symbol of the Roman Empire, and it was a well-known acronym of a Latin
expression.
It largely predates the invention
RE: Roundtripping in Unicode
My view about this problem of roundtripping is
that if data, supposed to contain only valid UTF-8 sequences, contains some
invalid byte sequences that still need to be roundtripped to some code
point for internal management that can be roundtripped later to the
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application must assign semantics to unassigned codepoints.
If a source sequence is invalid, and you
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Lars Kristan [EMAIL PROTECTED] writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically,
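As an illustration of the tension described above, the sketch below shows a strict UTF-8 decode rejecting a filename that is not valid UTF-8, next to one way of carrying the offending byte through a string and back. It relies on Python's "surrogateescape" error handler, a mechanism that postdates this thread and is not what anyone here proposed; it is offered only as one possible approach, and the sample filename is invented.

```python
# A filename stored as Latin-1 bytes: 0xE9 followed by "." is not valid UTF-8.
RAW_NAME = b"caf\xe9.txt"


def strict_decode(data):
    """Decode as UTF-8, returning None when the bytes are rejected."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_decode(data):
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def lenient_encode(text):
    """Reverse lenient_decode, restoring the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")
```

strict_decode(RAW_NAME) returns None, while lenient_encode(lenient_decode(RAW_NAME)) gives back RAW_NAME byte for byte. The intermediate string contains a lone surrogate, so it is not valid Unicode text and should not be passed to a strict encoder.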