Re: a character for an unknown character

2016-12-25 Thread Jukka K. Korpela
21.12.2016, 4:29, Martin Mueller wrote: Is there a Unicode character that says “I represent an alphanumerical character, but I don’t know which”. I think including such a “character” in Unicode would not fit into the idea of Unicode as a system for encoding plain text characters. You

Re: Fwd: Why incomplete subscript/superscript alphabet ?

2016-10-06 Thread Jukka K. Korpela
6.10.2016, 19:27, Ken Whistler wrote: Their functions have been completely overtaken by markup conventions such as <sub>...</sub> and <sup>...</sup>, which *are* widely supported already, even in most email clients, ri^ght out of the b_ox . They are widely supported, but very widely in a typographically inferior

Re: Why incomplete subscript/superscript alphabet ?

2016-10-06 Thread Jukka K. Korpela
6.10.2016, 17:55, Frédéric Grosshans wrote: Le 06/10/2016 à 09:21, Marcel Schneider a écrit : I did never see that. Would you show us some examples to look up? Iʼm curious whether they could be managed without accented superscripts. Anyway, combining diacritics should be placeable on

Re: Why incomplete subscript/superscript alphabet ?

2016-10-03 Thread Jukka K. Korpela
3.10.2016, 20:40, Leonardo Boiko wrote: Besides, there are already control/formatting characters for such purposes – several ones, even. They look like this: , ^{}, \textsuperscript{}, \*{ \*} … They are not control or formatting characters. They are markup used at higher protocol levels –

Re: Why incomplete subscript/superscript alphabet ?

2016-10-01 Thread Jukka K. Korpela
1.10.2016, 11:29, Khaled Hosny wrote: On Fri, Sep 30, 2016 at 07:31:58PM +0300, Jukka K. Korpela wrote: [...] >> What I was pointing at was that when using rich text or markup, it is complicated or impossible to have typographically correct glyphs used (even when they exist), whereas t

Re: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Jukka K. Korpela
30.9.2016, 19:36, Philippe Verdy wrote: 2016-09-30 17:54 GMT+02:00 Jukka K. Korpela <jkorp...@cs.tut.fi>: Using HTML, for example, the way to achieve that at present would be to use markup like ... (to avoid the problems caused by the defaul

Re: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Jukka K. Korpela
30.9.2016, 19:11, Leonardo Boiko wrote: The Unicode codepoints are not intended as a place to store typographically variant glyphs (much like the Unicode "italic" characters aren't designed as a way of encoding italic faces). There is no disagreement on this. What I was pointing at was that

Re: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Jukka K. Korpela
30.9.2016, 18:19, Philippe Verdy wrote: Note also that many tools generating documentation from source code allow you to insert HTML comments, so you could as well use , Yes, but there’s a serious typographic pitfall with this, as well as with using e.g. subscript or superscript formatting

Re: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Jukka K. Korpela
30.9.2016, 12:57, Gael Lorieul wrote: I wonder why only a subset of the alphabet is available as subscript and/or superscript ? This is explained in section 22.4 of the standard: http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#page=25 To put it briefly, in my interpretation, subscript

Re: Unicode equivalence between Word for Windows/MAC

2015-10-28 Thread Jukka K. Korpela
28.10.2015, 11:59, Rafael Sarabia wrote: I need to use a document both in Word 2007 for Windows and Word 2011 for Mac and I'm finding some incompatibility issues. Before going into the details of plain text file encodings, I think it is important to decide whether you need to use plain text

Re: (R), (c) and ™

2014-12-18 Thread Jukka K. Korpela
2014-12-18, 12:31, Andrea Giammarchi wrote: I wonder if it's by accident that 00AE, 00A9, and 2122 are not listed as standard variant sensitive chars. Why would that be an accident any more than not listing 100,000 other characters there? Or to put it more constructively, why should they

Re: Code charts and code points

2014-10-24 Thread Jukka K. Korpela
2014-10-24 15:05, Shriramana Sharma wrote: Hi Martin. If you haven't noticed it before, opening Unicode charts in PDF readers has something like SECURED on the top i.o.w. the charts are sorta DRM-protected. So you can't copy-paste the characters. Heck you can't even copy-paste the character

Re: FYI: Ruble sign in Windows

2014-08-15 Thread Jukka K. Korpela
2014-08-15 1:52, Peter Constable wrote: For those interested, there is an update for Windows available now to add font, keyboard and locale data support for the Ruble sign that was added in Unicode 7.0. For details, see here: http://support.microsoft.com/kb/2970228 The update seems to have

Re: Thai unalom symbol

2014-07-02 Thread Jukka K. Korpela
2014-07-02 6:10, James Clark wrote: The unalom is widespread in Thailand. For example, the Thai Red Cross Society was originally founded as the Red Unalom Society, and its logo was a red Unalom combined with a cross. It forms the main component of the seal of Rama I (founder of the current Thai

Re: Contrastive use of kratka and breve

2014-07-02 Thread Jukka K. Korpela
2014-07-02 20:34, Philippe Verdy wrote: CGJ would be better used to prevent canonical compositions but it won't normally give a distinctive semantic. In the question, visual difference was desired. The Unicode FAQ says: “The semantics of CGJ are such that it should impact only searching and

Re: Contrastive use of kratka and breve

2014-07-02 Thread Jukka K. Korpela
2014-07-02 19:11, Leo Broukhis wrote: Here https://upload.wikimedia.org/wikipedia/commons/a/a4/Contrastive_use_of_kratka_and_breve.JPG is an example of й and и + U+0306 COMBINING BREVE used contrastively (/j/ vs short /i/) thanks to a difference in typographic style of Cyrillic breve (kratka)

Re: Characters that should be displayed?

2014-06-29 Thread Jukka K. Korpela
2014-06-29 21:44, Koji Ishii wrote: The spec currently has the following text[2]: Control characters (Unicode class Cc) other than tab (U+0009), line feed (U+000A), and carriage return (U+000D) are ignored for the purpose of rendering. (As required by [UNICODE], unsupported Default_ignorable
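
The rule quoted from the spec can be illustrated with a minimal sketch, assuming Python's standard `unicodedata` module. The function name `strip_ignored_controls` is hypothetical, and the sketch models "ignored for the purpose of rendering" as simple removal, which is an assumption rather than what any particular renderer does.

```python
import unicodedata

# Characters the quoted rule exempts: tab, line feed, carriage return.
KEPT_CONTROLS = {"\t", "\n", "\r"}

def strip_ignored_controls(text: str) -> str:
    """Drop every General Category Cc character except tab, LF and CR."""
    return "".join(
        ch for ch in text
        if ch in KEPT_CONTROLS or unicodedata.category(ch) != "Cc"
    )

sample = "a\x00b\tc\x7f\n"          # contains NUL and DEL, both Cc
cleaned = strip_ignored_controls(sample)
```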

Re: Characters that should be displayed?

2014-06-29 Thread Jukka K. Korpela
2014-06-30 0:48, David Starner wrote: On Sun, Jun 29, 2014 at 2:02 PM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: They might be seen as “not displayable by normal rendering”, so yes. On the practical side, although Private Use characters should not be used in public information interchange

Math input methods

2014-06-04 Thread Jukka K. Korpela
2014-06-04 15:32, Hans Aberg wrote under Subject: Re: Swift: On 4 Jun 2014, at 13:58, Leonardo Boiko leobo...@namakajiri.net wrote: I don't think this feature saw much use, since programmers in a global world can't assume that everyone will have easy access to their input methods, and so tend

Re: Math input methods

2014-06-04 Thread Jukka K. Korpela
2014-06-04 17:42, Ian Clifton wrote: Jukka K. Korpela jkorp...@cs.tut.fi writes: As an aside, the ISO 80000-2 standard on mathematical notations describes boldface letters such as boldface R as symbols for commonly known sets of numbers. The double-struck letters like ℝ as mentioned

Re: Swift

2014-06-04 Thread Jukka K. Korpela
2014-06-04 20:15, Andre Schappo wrote: Well because outside of groups like this there is still little awareness of Unicode, little understanding of Unicode, little willingness to use Unicode and little conscious usage of Unicode That’s very true. In the specific case of “using Unicode” (which

Re: Use of Unicode Symbol 26A0

2014-06-03 Thread Jukka K. Korpela
2014-06-03 19:13, Asmus Freytag wrote: Unicode normally does not document all known usages of symbols. Not to mention unknown usages. Characters will be used in different ways, no matter what the Unicode Standard says, and it would be mostly pointless to put restrictions on it. In some

Re: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector?

2014-04-02 Thread Jukka K. Korpela
2014-04-02 21:56, Whistler, Ken wrote: U+23AF is *definitely* not a variation selector at all. It is part of a set of bracket pieces (and other graphic pieces) in the range U+239B..U+23B1. […] These glyphic pieces of symbols are only relevant and useful in the context of mathematical

Re: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

2014-03-31 Thread Jukka K. Korpela
2014-03-29 13:01, Asmus Freytag wrote: On managing some types of spacing between elements in running text: On 3/27/2014 8:04 AM, Jukka K. Korpela wrote: […] The “fixed-width spaces” are mostly just legacy characters, holdover from old typography. They may have their uses, though, in contexts

Re: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

2014-03-27 Thread Jukka K. Korpela
2014-03-27 10:13, William_J_G Overington wrote: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please? It depends, among other things, on what you mean by “space”. There’s U+00A0 NO-BREAK SPACE, which surely isn’t the same

Re: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

2014-03-27 Thread Jukka K. Korpela
2014-03-27 15:10, Kalvesmaki, Joel wrote: William, try the U+2000..U+200A glyphs under General Punctuation--I think that's what you're looking for to manage precise widths of blank space. That range contains some “fixed-width spaces”, yes. Being “fixed-width” is rather relative here, though,
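
As a rough check of the range mentioned here, the following sketch (assuming Python's standard `unicodedata` module; the variable names are illustrative only) lists the General Category of U+2000..U+200A and, for contrast, of U+00A0 and U+200B, which come up elsewhere in this thread listing.

```python
import unicodedata

# General Category of each code point in U+2000..U+200A.
space_range = {
    f"U+{cp:04X}": unicodedata.category(chr(cp))
    for cp in range(0x2000, 0x200B)
}

# For contrast: NO-BREAK SPACE and ZERO WIDTH SPACE.
nbsp_category = unicodedata.category("\u00a0")
zwsp_category = unicodedata.category("\u200b")
```

All eleven characters in the range are space separators (Zs), as is U+00A0, while U+200B is a format character (Cf); software that decides what counts as a "space" by General Category will therefore treat them differently.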

Re: proposal for new character 'soft/preferred line break'

2014-02-10 Thread Jukka K. Korpela
2014-02-10 21:49, Richard Wordingham wrote: U+200B has the distinct advantage of being a character, and therefore readily travelling with the words it separates. Granted, but it’s still a character that the rendering software needs to know and support in order to have the desired effect. As

Re: proposal for new character 'soft/preferred line break'

2014-02-10 Thread Jukka K. Korpela
2014-02-10 22:30, Philippe Verdy wrote: No I make no confusion: <wbr> is a formatting HTML element, SHY (or &shy; in HTML syntax for the defined entity) is a character. Both play equivalent roles in HTML, Not at all. except that &shy; has a defined behavior to insert an hyphen at end of

Re: proposal for new character 'soft/preferred line break'

2014-02-09 Thread Jukka K. Korpela
2014-02-10 9:13, Philippe Verdy wrote: The wbr is enough for this purpose, No, since the purpose was clearly to specify a line break point that is preferred over other possible line break points, or even the only allowed line break point within a string. The wbr tag (an old nonstandard

Re: proposal for new character 'soft/preferred line break'

2014-02-05 Thread Jukka K. Korpela
2014-02-05 18:22, Markus Scherer wrote: On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert rha...@shadowtec.de wrote: Parallel to soft hyphen, a hyphen that is just inserted if the word was broken, it would be practical to have some way to tell browser: if

Re: proposal for new character 'soft/preferred line break'

2014-02-05 Thread Jukka K. Korpela
2014-02-05 23:44, Rhavin Grobert wrote: Wbr gives the opportunity to break at long|awesome. But what i mean is: - non existing sbr in parralell to shy assumed - Just giving a hypothetical character or tag an identifier does not specify its intended meaning. Do you think me gentle,sbr/do

Re: Arabic percent sign and percent signs in RTL scripts

2014-02-04 Thread Jukka K. Korpela
2014-02-04 19:05, James Lin wrote: For Arabic, percentage sign is fixed on the left side of the digit: %10 There seem to be different opinions and practices on this. In the CLDR database, the formats have “%” (the Ascii percent sign) on the right of the number, as far as I can see; Arabic

Re: Diacritical marks: Single character or combined character?

2013-12-06 Thread Jukka K. Korpela
2013-12-06 0:45, Shriramana Sharma wrote: In Unicode the characters with precomposed diacritics are given canonical equivalences to the corresponding sequences of base characters followed by separate diacritics. So Unicode-compliant parsing tools should not distinguish between the two. There
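
The canonical equivalence described in the quoted message can be sketched as follows, assuming Python's standard `unicodedata` module. The helper name `canonically_equal` is hypothetical; comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

raw_equal = precomposed == decomposed                      # code point comparison
canon_equal = canonically_equal(precomposed, decomposed)   # equivalence-aware
```

A plain code point comparison distinguishes the two spellings, whereas the normalized comparison does not.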

Re: How to remove accents while conforming to language standards?

2013-11-04 Thread Jukka K. Korpela
2013-11-04 21:00, Jennifer Wong wrote: The use case is that customers want to integrate data from our enterprise solution to their ASCII-based downstream systems. This is very different from the question about removing accents while conforming to language standards. The very goal makes it

Re: How to remove accents while conforming to language standards?

2013-11-01 Thread Jukka K. Korpela
2013-11-01 17:37, Jennifer Wong wrote: I would like to ask for advice on removing accents from characters. To address first the question you ask in the Subject line, “How to remove accents while conforming to language standards?”, but do not ask in the message body, the answer is: You

Re: Terminology question re ASCII

2013-10-29 Thread Jukka K. Korpela
2013-10-29 6:12, d...@bisharat.net wrote: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended

Re: Why isn't there a RIGHTWARDS BLACK ARROW?

2013-10-26 Thread Jukka K. Korpela
2013-10-26 18:36, Sindre Sorhus wrote: There are: ⬅ LEFTWARDS BLACK ARROW (U+2B05) ⬆ UPWARDS BLACK ARROW (U+2B06) ⬇ DOWNWARDS BLACK ARROW (U+2B07) But no right arrow. Why is that? There is ➡ BLACK RIGHTWARDS ARROW (U+27A1) The code chart at http://www.unicode.org/charts/PDF/U2B00.pdf has a
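
The naming difference between the four arrows can be looked up directly; this is a small sketch assuming Python's standard `unicodedata` module, with `arrow_names` as an illustrative variable name.

```python
import unicodedata

# Map each code point mentioned in the message to its formal character name.
arrow_names = {
    f"U+{cp:04X}": unicodedata.name(chr(cp))
    for cp in (0x2B05, 0x2B06, 0x2B07, 0x27A1)
}

for code_point, name in arrow_names.items():
    print(code_point, name)
```

The rightwards arrow sits in a different block and its name puts BLACK first, which is why a search for "RIGHTWARDS BLACK ARROW" finds nothing.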

Re: ¥ instead of \

2013-10-22 Thread Jukka K. Korpela
2013-10-22 21:38, Jean-François Colson wrote: I know that in some Japanese encodings (JIS, EUC), \ was replaced by a ¥. Some encodings indeed have “¥” U+00A5 YEN SIGN assigned to code point 0x5C, to which Unicode assigns “\” U+005C REVERSE SOLIDUS. This is external to Unicode as such,

Re: Dotted Circle plus Combining Mark as Text

2013-10-20 Thread Jukka K. Korpela
2013-10-20 2:38, Richard Wordingham wrote: Is a sequence of a U+25CC DOTTED CIRCLE plus a combining mark plain text? Well, is <h1>hello</h1> plain text? The answer is that any string of characters may be considered as plain text and any string of characters may be treated as rich text according

Re: Dotted Circle plus Combining Mark as Text

2013-10-20 Thread Jukka K. Korpela
2013-10-20 11:47, Jukka K. Korpela wrote: What you could do in a web page is to put U+00A0 U+25CC in one element and U+0E31 in another and position the elements in the same place, set to have the same width and to be horizontally centered. Oops. I meant U+25CC and U+00A0 U+0E31. But I’m

Re: COMBINING OVER MARK?

2013-10-03 Thread Jukka K. Korpela
2013-10-03 7:46, Martin J. Dürst wrote: On 2013/10/02 9:52, Leo Broukhis wrote: Thanks! That comes out exactly right, although using math markup for linguistic purposes is, IMO, a stretch. Why? Surely like in other fields (Math to start with), there somewhere is a boundary between plain text

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-13 Thread Jukka K. Korpela
2013-09-13 22:02, Whistler, Ken wrote: The *interesting* question, in my opinion, is why folks feel impelled to use U+2026 to render a baseline ellipsis in Latin typography at all, rather than just using U+002E ad libitum... In traditional typography, an ellipsis usually has dots set apart

Empty set

2013-09-12 Thread Jukka K. Korpela
Under Subject: Re: Why blackletter letters? 2013-09-12 20:20, Stephan Stiller wrote: Talking about which ... I confess I usually type a Danish Ø for convenience when I'm using this, though for publication I would tend to substitute the proper ∅. Whenever I saw the empty set symbol in printed

Re: Why blackletter letters?

2013-09-10 Thread Jukka K. Korpela
2013-09-10 20:36, Jukka K. Korpela wrote: 2013-09-10 20:01, Asmus Freytag wrote: This rationale is absent in document WG2 N3907 that requests these characters. If this is document http://std.dkuug.dk/jtc1/SC2/wg2/docs/n3907.pdf then I’m rather confused: it proposes AB51 for LATIN SMALL

Re: Why blackletter letters?

2013-09-10 Thread Jukka K. Korpela
2013-09-10 20:01, Asmus Freytag wrote: This rationale is absent in document WG2 N3907 that requests these characters. If this is document http://std.dkuug.dk/jtc1/SC2/wg2/docs/n3907.pdf then I’m rather confused: it proposes AB51 for LATIN SMALL LETTER BLACKLETTER O and does not include LATIN

Re: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ

2013-08-06 Thread Jukka K. Korpela
2013-08-05 23:46, Richard Wordingham wrote: The requirement is that conformant processes not think they are doing the right thing by treating canonically equivalent strings differently. If there is latitude in a process, e.g. rendering, I can't find a requirement to treat canonically

Re: Unicode code page and ☃.net

2013-08-06 Thread Jukka K. Korpela
2013-08-06 9:38, Christopher Fynn wrote: I wonder why so many servers, database applications, and so on, _still_ don't install with Unicode (in some encoding format) as the *default* installation option. There are probably several reasons, but one obvious reason is this: if the default

Re: Unicode code page and ☃.net

2013-07-31 Thread Jukka K. Korpela
2013-07-30 23:50, James Lin wrote: If you open the Windows character Map, Segoe UI doesn't contain the snowman while font Meiryo has. I wrote about Segoe UI Symbol, not Segoe UI. Meiryo, which is also shipped with Windows 7, indeed contains SNOWMAN. This makes it even more odd if SNOWMAN is

Re: Unicode code page and ☃.net

2013-07-29 Thread Jukka K. Korpela
2013-07-29 23:42, James Lin wrote: I have a question regarding the supported Unicode code page. There are no Unicode code pages. I thought once you have unicode code page loaded, all glyph or character should be able to map and display correctly regardless of which OS or language you are

Re: Unicode code page and ☃.net

2013-07-29 Thread Jukka K. Korpela
2013-07-30 4:03, Buck Golemon wrote: Also, some browsers have odd support for rendering unicode (non-ascii) urls, for security reasons. Both chrome and firefox under Windows 7 render http://www.☃.net/ as http://www.xn--n3h.net/ which is the ascii domain encoding (called
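
The mapping between the two host names can be reproduced with a short sketch, assuming Python's built-in `punycode` and `idna` codecs (the latter implements the older IDNA 2003 rules, which may differ from what current browsers apply). The variable names are illustrative only.

```python
snowman = "\u2603"  # SNOWMAN

# Punycode-encode the single label and add the ACE prefix by hand.
ace_label = "xn--" + snowman.encode("punycode").decode("ascii")

# Let the idna codec convert the whole host name label by label.
ace_host = f"www.{snowman}.net".encode("idna").decode("ascii")

# Convert the ASCII form back to the Unicode form.
unicode_host = ace_host.encode("ascii").decode("idna")
```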

Re: ISO 2955

2013-07-05 Thread Jukka K. Korpela
2013-07-05 17:01, Dreiheller, Albrecht wrote: A topic that is different but related to the current discussion writing in an alphabet with fewer letters: letter replacements is the question about writing units with limited character sets. This is not a somehow academical question but a real

Re: Arabic quoting characters

2013-06-15 Thread Jukka K. Korpela
2013-06-14 22:30, Stephan Stiller wrote: On 6/14/2013 11:45 AM, Roozbeh Pournader wrote: They are unified with the double angle quotation marks. Persian also uses the round version (and if I remember correctly, Greek too). Where can one find such information? It’s somewhat implicit, but

Re: Arabic quoting characters

2013-06-15 Thread Jukka K. Korpela
2013-06-15 21:24, Michael Fayez wrote: And yes as Doug Ewell said characters U+2E28 and U+2E29 can be used in new data. They have the correct shape and properties though with the wrong size unfortunately. Well, U+2E28 has General Category Ps (Punctuation, Open), not Pi (Punctuation, Initial
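
The General Category values contrasted here can be inspected with a small sketch, assuming Python's standard `unicodedata` module; `categories` is an illustrative variable name, and the double angle quotation marks U+00AB and U+00BB are included for comparison.

```python
import unicodedata

# General Category of the double parentheses and the double angle quotation marks.
categories = {
    f"U+{cp:04X}": unicodedata.category(chr(cp))
    for cp in (0x2E28, 0x2E29, 0x00AB, 0x00BB)
}

for code_point, category in categories.items():
    print(code_point, category)
```

The double parentheses are classified as opening and closing punctuation (Ps, Pe), while the angle quotation marks are initial and final quote punctuation (Pi, Pf).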

Re: Bug?: Not able to type పయోఽంబు.... (Telugu)

2013-05-10 Thread Jukka K. Korpela
2013-05-10 13:54, Kiran Kumar Chava wrote: From one of the books we are trying to Unicodify ... we have below line ।।ఓం పయోఽంబువచ్చేత్ తత్రాపి ఓం।। 3 There must not be that dotted circle before the sunna(zero like Telugu symbol) Am I missing something? Or is this a bug? As

Re: Rendering Raised FULL STOP between Digits

2013-03-10 Thread Jukka K. Korpela
2013-03-10 4:57, Asmus Freytag wrote: 'The Lancet' reportedly insists on the use of the raised decimal point [… That's sensible advice, in a way, because B7 is in 8859-1 and therefore supported in a huge variety of fonts, for practical purposes, the coverage among non-decorative text fonts is

Re: Rendering Raised FULL STOP between Digits

2013-03-09 Thread Jukka K. Korpela
2013-03-09 21:30, Asmus Freytag wrote: I believe the Unicode Standard should be fixed by explicitly removing all suggestions in the text that the raised decimal point is unified with 002E. That would be a good move if agreement can be found on the recommended coding of the middle dot.

Re: Does HYPHEN BULLET have synonyms?

2013-02-22 Thread Jukka K. Korpela
2013-02-22 19:46, Leif Halvard Silli wrote: Questions: Shouldn’t HYPHEN BULLET be on in the NamesList of HYPHEN-MINUS? And shouldn‘t HYPHEN BULLET have HYPHEN-MINUS in its NamesList? The comments at the start of NamesList.txt say that it is “semi-automatically derived from UnicodeData.txt”,

Re: Private Use Area

2013-02-18 Thread Jukka K. Korpela
2013-02-18 17:36, Shriramana Sharma wrote: On Mon, Feb 18, 2013 at 7:13 PM, Erkki I Kolehmainen e...@iki.fi wrote: It may also be the result of a negotiating process within a special purpose user group. I also see no problem with the current definition. Since the whole point of the standard

Re: s-j combination in Unicode?

2013-02-16 Thread Jukka K. Korpela
2013-02-16 11:38, Stephan Stiller wrote: (By the way, for those finding the German rule to write SS unsatisfactory: It's hard to come by an actual minimal pair. Example: Strauss vs. Strauß. Originally the same name, but two spellings make them two names that may need to be distinguished from

Re: s-j combination in Unicode?

2013-02-13 Thread Jukka K. Korpela
2013-02-13 21:31, Andries Brouwer wrote: I wondered how to code an s-j overstrike combination in Unicode. Attached a photograph of some text containing this combination. It looks like something that has not been encoded. The same applies to what seems to be an eth (ð) with a stroke, and

Re: Word reversal from Adobe to Word

2013-02-07 Thread Jukka K. Korpela
2013-02-07 12:21, Raymond Mercier wrote: This problem is not precisely about Unicode - or is it? Directionality of characters is a Unicode issue. If I have a Hebrew text displayed in Adobe Acrobat I can select part of it and can paste it into Word. The trouble is that while individual

2013-01-24 Thread Jukka K. Korpela
2013-01-25 2:41, Richard Wordingham wrote: On Thu, 24 Jan 2013 20:05:41 -0300 Andrés Sanhueza peroyomasli...@gmail.com wrote: Do you think that a end of story symbol may be feasible/useful? One such symbol is already encoded, the Halmos tombstone U+220E END OF PROOF. It is one of the many

Re: bullet-dash

2013-01-22 Thread Jukka K. Korpela
2013-01-23 3:55, h...@tbbs.net wrote: There is a bullet that often is uzed in local advertizing. It separates phrases as em-dash. In the plane writs where it is uzed, it is also equivalent to line-break: DRAIN-CLEANING (O) GENERAL PLUMBING or LAWNMOWER REPAiR (O)

Re: RLI and bdi, and how to get an update of changes

2013-01-15 Thread Jukka K. Korpela
2013-01-16 1:18, James Lin wrote: I have 2 fundamental questions. I’ll address the first one only. HTML5 supports isolation tag bdi, The HTML5 drafts have it, but browser support is still limited. As described at

Re: RLI and bdi, and how to get an update of changes

2013-01-15 Thread Jukka K. Korpela
2013-01-16 3:03, Phillips, Addison wrote: Code points 2066, 2067, and 2068 are unassigned. I presume you mean U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL FORMATTING. As Roozbeh pointed out, he means the characters added that provide bidi isolation. I see. The code

Re: help with an unknown character

2013-01-10 Thread Jukka K. Korpela
2013-01-11 0:28, Elbrecht wrote: any help with an unknown character - very appreciated: elbrecht.com/SW.png [400KB] You probably tried to attach an image, but it was not sent or it was stripped off by the mailing list software. Please upload the image in some

Re: help with an unknown character

2013-01-10 Thread Jukka K. Korpela
2013-01-11 1:04, Elbrecht wrote: the URL is: www.elbrecht.com/SW.png Well, the *URL* is http://www.elbrecht.com/SW.png or http://elbrecht.com/SW.png (I really thought it was just a local filename when I saw your first email.) problem

Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Jukka K. Korpela
2013-01-09 2:55, Leif Halvard Silli wrote: The benefit of doing such a comparison is that we then get to count both the HTML page *plus* all the extra fonts that is included in the romanized Singhala file. Thus, we get a more *real* basis for comparing the relative size of the two pages. Not

Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Jukka K. Korpela
2013-01-09 11:57, Leif Halvard Silli wrote: Not sure which fallacy you have identified - see below. I was referring to comparison between an ad hoc 8-bit encoding and a Unicode encoding so that you count the sizes of font files in the first case only. I’m a bit confused with your comparison, which

Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Jukka K. Korpela
2013-01-08 23:56, Naena Guru wrote: May I ask if the following two are Latin script, English or Singhala? 1. This is written in English. 2. mee laþingaþa síhalayi. For me, both are Latin script and 1 is English and 2 is Singhala (says,' this is romanized Singhala'). Text 2 is

Re: Basic Latin

2013-01-02 Thread Jukka K. Korpela
2013-01-02 8:35, Asmus Freytag wrote: On 1/1/2013 3:53 PM, Naena Guru wrote: (By the way, Unicode is quietly suppressing Basic Latin block by removing it from the Latin group at top of the code block page (http://www.unicode.org/charts/) and hiding it under different names in the lower part of

Re: Basic Latin

2013-01-02 Thread Jukka K. Korpela
2013-01-03 0:22, Markus Scherer wrote: On Wed, Jan 2, 2013 at 1:25 PM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: Then again, Latin is no different from Cyrillic, Greek, or Arabic, for example, in this respect. In an apparent attempt to save space

Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Jukka K. Korpela
2012-12-30 23:22, Costello, Roger L. wrote: I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Where? It seems that this is what *you* are saying. Do you have data to back up the assertion that interoperability is

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Jukka K. Korpela
2012-12-23 18:09, Karl Williamson wrote: As another poster said, this quotation would be considered fair use under USA law. It was not a quotation but an excerpt posted without permission. Quotations are allowed when they are needed to back up your statements or specify what you are

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-22 Thread Jukka K. Korpela
2012-12-22 23:56, Costello, Roger L. wrote: I figure the people on this list can truly appreciate this: I don’t. You are posting an excerpt from a copyrighted book as such, not as a legal quotation for an acceptable purpose. Moreover, you have distorted the text. For example: Homo

Re: UCA and Russian letter Ё

2012-12-21 Thread Jukka K. Korpela
2012-12-21 21:05, Leif Halvard Silli wrote: My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. This appears to be the case in any serious dictionary. The use of

Re: Character name translations

2012-12-20 Thread Jukka K. Korpela
2012-12-20 12:52, Martinho Fernandes wrote: I was wondering if there is a list of character names translated into other languages somewhere. Is there? The standard ISO 10646, which is equivalent to Unicode as regards character names, is published in French, too. According to

Re: Character name translations

2012-12-20 Thread Jukka K. Korpela
2012-12-20 16:41, Andreas Prilop wrote: On Thu, 20 Dec 2012, Jukka K. Korpela wrote: http://www.ling.helsinki.fi/filt/info/mes2/ Unicode names have certain restrictions (capital ASCII letters, etc). This Finnish list even uses non-ASCII characters but sticks to capital letters. Why no small

Re: Character name translations

2012-12-20 Thread Jukka K. Korpela
2012-12-20 14:13, David Starner wrote: It may be useful to try to agree on official or semi-official names for characters in a language. Such a list hardly needs to cover all of the over 100,000 Unicode characters. Why not? Why should an English speaker sticking an arbitrary character into a

Re: Character name translations

2012-12-20 Thread Jukka K. Korpela
2012-12-20 17:59, Asmus Freytag wrote: Character names serve two purposes, which are sometimes at odds. One is to simply act as formal identifiers that are more or less mnemonic (which the hex codes are not). The other is an aid in identifying a character, as an aid in look-up or selection.

Re: Character name translations

2012-12-20 Thread Jukka K. Korpela
2012-12-21 2:45, Asmus Freytag wrote: But when real people, not biologists, want to look up information they have precisely two choices: they can look at a visual index (for species that can be arranged visually) or they can look up the scientific name for the species based on the only thing

Re: StandardizedVariants.txt error?

2012-11-24 Thread Jukka K. Korpela
2012-11-24 8:12, Masatoshi Kimura wrote: According to TUS v6.2 clause 16.4, http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf#page=15 The base character in a variation sequence is never a combining character or a decomposable character. However, the following base characters appearing in

Re: latin1 decoder implementation

2012-11-16 Thread Jukka K. Korpela
2012-11-17 0:20, Michael Everson wrote: On 16 Nov 2012, at 22:12, Buck Golemon b...@yelp.com wrote: That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? It is the basis for most popular 8-bit character sets, including

Re: texteditors that can process and save in different encodings

2012-10-16 Thread Jukka K. Korpela
2012-10-16 13:06, Christopher Fynn wrote: On Windows I use Andrew West's Babel Pad http://www.babelstone.co.uk/Software/BabelPad.html As far as I can see, the “Encoding” menu in “Save As” in BabelPad has just a small set of encodings to choose from, basically just UTF-8 and UTF-16 and

Re: problem with combining diacritcs in HTML5

2012-10-09 Thread Jukka K. Korpela
2012-10-09 20:32, Bill Poser wrote: No, I was contrasting the behaviour of s followed by U+0332, for which there is no precomposed letter, with U+1E95, which is the precomposed equivalent of z followed by U+0332. You meant to write “followed by U+0331” at the end. But in any case, this is a
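
The correction in this message (U+0331 rather than U+0332) can be verified with a minimal sketch, assuming Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

z_line_below = "\u1e95"   # LATIN SMALL LETTER Z WITH LINE BELOW
s_low_line = "s\u0332"    # s followed by COMBINING LOW LINE

# Canonical decomposition of the precomposed letter.
decomposed = unicodedata.normalize("NFD", z_line_below)
mark_name = unicodedata.name(decomposed[1])

# Canonical composition leaves s + U+0332 alone: no precomposed form exists.
composed = unicodedata.normalize("NFC", s_low_line)
```

The precomposed letter decomposes to z plus U+0331 COMBINING MACRON BELOW, so it is not the counterpart of a sequence ending in U+0332 COMBINING LOW LINE.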

Re: problem with combining diacritcs in HTML5

2012-10-08 Thread Jukka K. Korpela
2012-10-08 20:47, Andreas Prilop wrote: I found some DejaVu bug reports where a developer called Ben Laenen suggests the nonzero advance width is intentional I wonder why. Other combining marks in DejaVu Sans Mono do not have such a problem; see

Re: problem with combining diacritcs in HTML5

2012-10-08 Thread Jukka K. Korpela
2012-10-08 21:49, Andreas Prilop wrote: On Mon, 8 Oct 2012, Jukka K. Korpela wrote: http://www.user.uni-hannover.de/nhtcapri/combining-marks.html Your test page is interesting, but it postulates the use of style sheet switching, You are always free to define your preferred font family

Re: problem with combining diacritcs in HTML5

2012-10-07 Thread Jukka K. Korpela
2012-10-07 8:38, Bill Poser wrote: I have a web page that writes into an HTML5 textarea via the javascript dom interface. U+0332 COMBINING LOW LINE is incorrectly rendered as a spacing low line in both Mozilla Firefox and Google Chrome The issue is not limited to textareas but appears in

Re: problem with combining diacritcs in HTML5

2012-10-07 Thread Jukka K. Korpela
2012-10-07 11:51, Michael Everson wrote: The issue is not limited to textareas but appears in normal text too, when the font is set to Courier New. You can also see the problem in Microsoft Word, for example, when using that font. The point is that this is a font problem, and you can see it

Re: Compiling a list of Semitic transliteration characters

2012-09-07 Thread Jukka K. Korpela
2012-09-07 21:16, Richard Wordingham wrote: Some reasons for romanizing: snip 3. Make the language accessible to those who are not familiar with the script The rest of the post is irrelevant. Transliterations from Semitic languages have been established for this reason, and possibly

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Jukka K. Korpela
2012-09-06 23:47, Mark Davis ☕ wrote: The distinction between transliteration and transcription is limited to a few people. Maybe, but I see that distinction clearly made in Finnish national standards, for example, and it is a useful one. It is far better to use unambiguous terms, like

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Jukka K. Korpela
2012-09-07 0:59, Mark Davis ☕ wrote: They might be distinct in Finnish, but in English only in specialized contexts, This is not about everyday language (which is irrelevant in this context) but about the language used in national standards. Best to use terms that will be understood by

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Jukka K. Korpela
2012-09-07 1:54, Mark Davis ☕ wrote: This might come off as a bit snarky, but do you /really/ think the author and every one of the commentators on the thread all really meant the following? Compiling a list of Semitic transliteration characters/, but restricted to only those

Re: Why no combining-character form for U+00F8?

2012-08-17 Thread Jukka K. Korpela
2012-08-17 1:44, Ian Clifton wrote: Andreas Prilop prilop4...@trashmail.net writes: On Thu, 16 Aug 2012, Ian Clifton wrote: Having just been to Norway, and wanting to email my friends all about it, I came across a curiosity: neither of the combining characters U+0337, U+0338 seem to work in

Re: Why no combining‐character form for U+00F8?

2012-08-16 Thread Jukka K. Korpela
2012-08-16 18:31, Ian Clifton wrote: Having just been to Norway, and wanting to email my friends all about it, I came across a curiosity: neither of the combining characters U+0337, U+0338 seem to work in usually‐reliable Emacs, and indeed U+00F8 LATIN SMALL LETTER O WITH STROKE doesn’t seem to

Re: Why no combining‐character form for U+00F8?

2012-08-16 Thread Jukka K. Korpela
2012-08-16 20:53, Cristian Secară wrote: On Thu, 16 Aug 2012 19:32:15 +0300, Erkki I Kolehmainen wrote: Although the stroke is not a diacritic, keyboard drivers can be made to generate atomic characters with stroke by using a dead letter key for stroke together with the base
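That U+00F8 is an atomic character can be seen in the character database. This is a small sketch using Python's `unicodedata`: the decomposition field of U+00F8 is empty, so normalization neither splits it nor builds it from o plus an overlay such as U+0338.

```python
import unicodedata

o_stroke = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE

# Empty string: the character has no decomposition mapping.
decomposition = unicodedata.decomposition(o_stroke)

# NFD leaves the atomic character untouched.
decomposed = unicodedata.normalize("NFD", o_stroke)

# NFC does not turn o + U+0338 COMBINING LONG SOLIDUS OVERLAY into U+00F8.
overlay_sequence = "o\u0338"
composed = unicodedata.normalize("NFC", overlay_sequence)
```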

Re: Apostrophe, and DIN keyboard

2012-08-14 Thread Jukka K. Korpela
2012-08-14 22:56, Robert Wheelock wrote: The _tonos_ (overtick) is a STRAIGHT 90º accent mark, whereas the _oxeia_ (acute) is usually slanted at 45º. It is somewhat tragicomic that you make the mistake of using masculine ordinal indicator U+00BA in place of the degree sign U+00B0, when
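The two look-alike characters mentioned here are distinct in the character database. As an illustration only (using Python's `unicodedata`), U+00BA carries a compatibility decomposition to a superscript "o", whereas U+00B0 DEGREE SIGN has no decomposition at all:

```python
import unicodedata

ordinal = "\u00ba"  # MASCULINE ORDINAL INDICATOR
degree = "\u00b0"   # DEGREE SIGN

ordinal_name = unicodedata.name(ordinal)
degree_name = unicodedata.name(degree)

# The ordinal indicator is a compatibility variant of the letter o.
ordinal_decomposition = unicodedata.decomposition(ordinal)
degree_decomposition = unicodedata.decomposition(degree)

# NFKC folds the ordinal indicator to "o" but keeps the degree sign.
ordinal_folded = unicodedata.normalize("NFKC", ordinal)
degree_folded = unicodedata.normalize("NFKC", degree)
```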

Re: Emoticon seen in the wild!

2012-07-26 Thread Jukka K. Korpela
2012-07-26 13:04, Andre Schappo wrote: Not emoticon but ……. I received an email from Email Insider. Email was written as E✉ail ✉ being U+2079 I thought it quite clever U+2079 is SUPERSCRIPT NINE “⁹”. I suppose you meant U+2709 ENVELOPE “✉”, an old (Unicode 1.0.0) dingbat (which now
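A transposed code point like this is easy to catch by looking up the character name. A short sketch with Python's `unicodedata`:

```python
import unicodedata

# The code point as written in the quoted message.
as_written = unicodedata.name("\u2079")

# The code point that was presumably intended.
as_intended = unicodedata.name("\u2709")

# Going the other way: look the character up by its name.
envelope_code_point = f"U+{ord(unicodedata.lookup('ENVELOPE')):04X}"
```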

Re: (Informational only: UTF-8 BOM and the real life)

2012-07-25 Thread Jukka K. Korpela
2012-07-26 0:19, Steven Atreju wrote: | And that was a Unicode BOM that has been converted to UTF-8 and then been converted to UTF-8 once again. Apparently the problem is that the data has been doubly encoded: first into UTF-8, then interpreting the bytes of UTF-8 data, interpreting
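The double encoding described here can be reproduced in a few lines. This is a sketch under one assumption: the intermediate step misread the UTF-8 bytes as ISO-8859-1 (Latin-1). The truncated message does not say which 8-bit encoding was actually involved.

```python
bom = "\ufeff"  # U+FEFF, used as a byte order mark

# First, correct encoding: three bytes EF BB BF.
once = bom.encode("utf-8")

# Assumed mistake: the three bytes are read as three Latin-1 characters.
misread = once.decode("latin-1")

# Encoding those three characters as UTF-8 again yields six bytes.
twice = misread.encode("utf-8")

# Under the same assumption, the damage can be reversed step by step.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
```

If the intermediate encoding was something other than Latin-1 (windows-1252, for instance), the six-byte result can differ for some characters, so the repair step has to use the same encoding that caused the damage.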

Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Jukka K. Korpela
2012-07-20 19:52, Philippe Verdy wrote: The Subject fi[el]d is subject to special encoding like Quoted-Printable or Base64 using specific prefixes. This is a matter of character encoding. All plain text inevitably has some encoding, and the encoding may vary without changing the plain text
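The point that the encoding may vary without changing the plain text can be illustrated with Python's standard `email.header` module. In this sketch the sample subject "café" and the helper name `decode_subject` are made up for illustration; the same text is carried once as a Quoted-Printable style encoded word and once as a Base64 encoded word.

```python
from email.header import decode_header, make_header


def decode_subject(raw: str) -> str:
    """Decode a raw Subject value containing MIME encoded words."""
    return str(make_header(decode_header(raw)))


# Same plain text, two different transfer encodings (sample data).
q_encoded = "=?utf-8?q?caf=C3=A9?="
b_encoded = "=?utf-8?b?Y2Fmw6k=?="

subject_from_q = decode_subject(q_encoded)
subject_from_b = decode_subject(b_encoded)
```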
