Emoji map of Colorado

2020-04-01 Thread Karl Williamson via Unicode
https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08_eid=0700c8706b

EGYPTIAN HIEROGLYPH MAN WITH A ROLL OF TOILET PAPER

2020-03-11 Thread Karl Williamson via Unicode
On 2/12/20 11:12 AM, Frédéric Grosshans via Unicode wrote: Dear Unicode list members (CC Michel Suignard), the Unicode proposal L2/20-068, “Revised draft for the encoding of an extended Egyptian Hieroglyphs repertoire,

Re: Call for feedback on UTS #18: Unicode Regular Expressions

2020-01-02 Thread Karl Williamson via Unicode
One thing I noticed in reviewing this is the removal of text about loose matching of the name property. But I didn't see an explanation for this removal. Please point me to the explanation, or tell me what it is. Specifically these lines were removed: As with other property values, names

Re: Missing UAX#31 tests?

2018-07-14 Thread Karl Williamson via Unicode
On 07/09/2018 02:11 PM, Karl Williamson via Unicode wrote: On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote: I'm surprised that the tests for 11.0 passed for a 10.0 implementation, because the following should have triggered a difference for WB. Can you check on this particular case? ÷ 0020

Re: Missing UAX#31 tests?

2018-07-09 Thread Karl Williamson via Unicode
, and I should not expect a more complete series than you furnished. Mark // On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <unicode@unicode.org> wrote: I am working on upgrading from Unicode 10 to Unicode 11. I used all the new files. The algo

Re: Missing UAX#31 tests?

2018-07-08 Thread Karl Williamson via Unicode
On 07/08/2018 03:23 AM, Mark Davis ☕️ wrote: PS, although the title was "Missing UAX#31 tests?", I assumed you were talking about http://unicode.org/reports/tr29/ Yes, sorry.

Missing UAX#31 tests?

2018-07-07 Thread Karl Williamson via Unicode
I am working on upgrading from Unicode 10 to Unicode 11. I used all the new files. The algorithms for some of the boundaries, like GCB and WB, have changed so that some of the property values no longer have code points associated with them. I ran the tests furnished in 11.0 for these

Traditional and Simplified Han in UTS 39

2017-12-27 Thread Karl Williamson via Unicode
In UTS 39, it says, that optionally, "Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD]. "The criterion can only be applied if the language of the string is known

Inconsistency between UTS 39 and 24

2017-12-21 Thread Karl Williamson via Unicode
In http://unicode.org/reports/tr39/#Mixed_Script_Detection it says, "For more information on the Script_Extensions property and Jpan, Kore, and Hanb, see UAX #24" In http://www.unicode.org/reports/tr24/, there certainly is more information on scx; however, none of the terms Jpan Kore nor Hanb

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the sequence generate? 0, 1, 2, 3, 4 ? In practice, how many do parsers generate?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 12:22 PM, Ken Whistler wrote: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 04:28 AM, Martin J. Dürst wrote: It may be worth to think about whether the Unicode standard should mention implementations like yours. But there should be no doubt about the fact that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode
On 05/24/2017 12:46 AM, Martin J. Dürst wrote: On 2017/05/24 05:57, Karl Williamson via Unicode wrote: On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: Adding a "recommendation" this late in the game is just bad standards policy. Unless I misunderstand, you a

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode
On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: On 5/23/2017 10:45 AM, Markus Scherer wrote: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode wrote: So, if the proposal for Unicode really was more of a "feels right"

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode
On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political

Re: a character for an unknown character

2016-12-21 Thread Karl Williamson
On 12/21/2016 08:45 AM, David Corbett wrote: One Unicode character specifically for this purpose is U+3013 GETA MARK. It is a Japanese symbol used to replace characters that cannot be read during transcription of manuscripts (source: Japanese Wikipedia). It looks like a bold equals sign: 〓.

Best practices for replacing UTF-8 overlongs

2016-12-19 Thread Karl Williamson
It seems counterintuitive to me that the two byte sequence C0 80 should be replaced by 2 replacement characters under best practices, or that E0 80 80 should also be replaced by 2. Each sequence was legal in early Unicode versions, and it seems that it would be best to treat them as each a
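
One way to see what an existing decoder does with such sequences is to count the U+FFFD characters it emits. The sketch below uses CPython's built-in UTF-8 decoder with `errors="replace"`; the helper name `count_replacements` is illustrative only, and the counts describe that one decoder rather than any recommendation in the Unicode Standard.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(raw: bytes) -> int:
    """Count the U+FFFD characters produced when decoding raw as UTF-8."""
    decoded = raw.decode("utf-8", errors="replace")
    return decoded.count(REPLACEMENT)


if __name__ == "__main__":
    # The two overlong sequences discussed in the message above.
    for seq in (b"\xc0\x80", b"\xe0\x80\x80"):
        print(seq.hex(" ").upper(), "->", count_replacements(seq))
```

Other decoders may segment the same ill-formed input differently, which is the point under discussion in the thread.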

Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Karl Williamson
These are currently GCB Other, but when assigned, don't we know that they will be Extended? So this could be done now.

I'm excited about the proposal to add a brontosaurus emoji codepoint

2016-08-29 Thread Karl Williamson
"I'm excited about the proposal to add a brontosaurus emoji codepoint because it has the potential to bring together a half-dozen different groups of pedantic people together" From http://xkcd.com/1726/ I don't know if this is new, or I just never saw it before.

Re: Where are the tools to generate posix and json from cldr?

2016-08-11 Thread Karl Williamson
data, and still could not find the tools in it. One would think that the tools directory contains it, and I did not look in every sub-directory in it, but none looked likely. I then tried transforms, but came up empty there too. // On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson <

Where are the tools to generate posix and json from cldr?

2016-08-11 Thread Karl Williamson
I can't find these that are mentioned in http://cldr.unicode.org/ "For those interested in the source CLDR data, it is available for each release in the XML format specified by LDML. There are also tools that will convert to JSON and POSIX format. For more information, see CLDR

Re: Release date?

2016-06-21 Thread Karl Williamson
On 06/21/2016 08:43 AM, Doug Ewell wrote: http://opiniojuris.org/2016/06/20/emojis-and-international-law "And tomorrow, June 21, we will have 71 new emojis to play with." Do only bloggers and the press get notified in advance of the release date of Unicode 9.0?

Re: 9.0.0 segmentation and line breaks on the empty string

2016-06-19 Thread Karl Williamson
On 06/19/2016 07:25 AM, Daniel Bünzli wrote: On Sunday, 12 June 2016 at 14:26, Daniel Bünzli wrote: Hello, I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on the empty string while UAX14 still does report a hard line break on it. Is this intended? and what is the

Re: Adopting ZWJ

2016-06-07 Thread Karl Williamson
On 06/07/2016 06:25 PM, Marcel Schneider wrote: On Tue, 7 Jun 2016 14:52:36 -0600, Karl Williamson wrote: On 06/07/2016 02:48 PM, Karl Williamson wrote: I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find

Re: Adopting ZWJ

2016-06-07 Thread Karl Williamson
On 06/07/2016 02:48 PM, Karl Williamson wrote: I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a clear list of criteria. The page that allows one to adopt said that it wasn't available, but that page

Adopting ZWJ

2016-06-07 Thread Karl Williamson
I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a clear list of criteria. The page that allows one to adopt said that it wasn't available, but that page really doesn't make it clear how one can test for

Re: Emoji for subdivision flags

2016-05-25 Thread Karl Williamson
On 05/25/2016 09:27 AM, Doug Ewell wrote: Now that UTR #52 has been suspended, are any *specific* alternative plans for representing subdivision flags being bandied about? -- Doug Ewell | http://ewellic.org | Thornton, CO  What I'd like to know is how does one find out about such

Re: UTC makes the Colbert show

2016-03-30 Thread Karl Williamson
On 03/30/2016 11:54 AM, Mark Davis ☕️ wrote: On Wed, Mar 30, 2016 at 7:42 PM, Jennifer 8. Lee wrote: I thought his "elf exposing self in park" was an amazing (and accurate) facial expression. Right! How does he make his cheeks

Girl, 12, charged for threatening her school with emojis

2016-02-28 Thread Karl Williamson
http://abc27.com/2016/02/27/girl-12-charged-for-threatening-emojis/

Trying to understand Line_Break property apparent discrepancy

2016-01-11 Thread Karl Williamson
It appears that http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is testing a tailoring rather than the default line break algorithm, contrary to its heading "# Default Line Break Test". And http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows

Redundancy in TR14

2016-01-11 Thread Karl Williamson
Example 7 in http://www.unicode.org/reports/tr14/#Examples has these two rules NU × (NU | SY | IS) NU (NU | SY | IS)* × (NU | SY | IS | CL | CP ) It appears to me that the first rule generates a subset of what the 2nd rule generates, and so is useless. It could be hence removed for

Re: Trying to understand Line_Break property apparent discrepancy

2016-01-11 Thread Karl Williamson
On 01/11/2016 03:42 PM, Karl Williamson wrote: It appears that http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is testing a tailoring rather than the default line break algorithm, contrary to its heading "# Default Line Break Test". And http://www.unicode.org/

Re: Redundancy in TR14

2016-01-11 Thread Karl Williamson
On 01/11/2016 10:55 PM, Mark Davis ☕️ wrote: Looks that way to me too. Can you submit this as feedback? will do {phone} On Jan 12, 2016 00:39, "Karl Williamson" <pub...@khwilliamson.com> wrote: Example 7 in http://www.unicode.org/

Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Karl Williamson
On 11/06/2015 01:32 PM, Richard Wordingham wrote: On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote: Richard Wordingham wrote: No-one's claiming it is for a Unicode Transformation Format (UTF). Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8,

Question about Perl5 extended UTF-8 design

2015-11-05 Thread Karl Williamson
that these extra bits are "reserved". So we're wondering what potential use you had thought of for these bits. Thanks Karl Williamson

Re: Dark beer emoji

2015-09-01 Thread Karl Williamson
On 09/01/2015 10:37 AM, Doug Ewell wrote: I have no idea whether my proposal is more or less serious, or more or less likely to be adopted, than the original. When I read this, I wondered if it was April 1 instead of September 1.

\b{wb}

2015-08-22 Thread Karl Williamson
The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that

Re: a mug

2015-07-11 Thread Karl Williamson
On 07/11/2015 10:36 AM, Johannes Bergerhausen wrote: Yes, the mug is funny. It shows not a Unicode problem, it points at a general font problem of operating systems. Dear Apple, Dear Google, Dear Microsoft: please give us *all* missing Unicode glyphs right inside your operating systems!

Re: trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-24 Thread Karl Williamson
://www.unicode.org/policies/stability_policy.html#Encoding or why the applicable version for that stability policy is 2.0+, the answer is that it was a direct reaction to The Korean Mess. --Ken On 6/19/2015 1:29 PM, Karl Williamson wrote: I haven't found any information on this. It can't just

Re: Why aren't the emoji modifiers GCB=Extend?

2015-06-21 Thread Karl Williamson
On 06/20/2015 03:02 AM, Mark Davis ☕️ wrote: On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler <kenwhist...@att.net> wrote: This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color

Why aren't the emoji modifiers GCB=Extend?

2015-06-19 Thread Karl Williamson
Someone writing code using Unicode 8 found that the FITZPATRICK modifiers are considered separate graphemes from what they modify. This is surprising, and seems contrary to not only the concept of a grapheme cluster, but the spirit of tr51 2.2.3 A supported emoji modifier sequence should be

trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-19 Thread Karl Williamson
I haven't found any information on this. It can't just be a transliteration difference, because the number of code points is vastly different between them. Is it the case that the version 1 syllables is a failed abstraction that was replaced by the later versions? Thanks

The Oral History Of The Poop Emoji

2015-06-01 Thread Karl Williamson
https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

Re: FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread Karl Williamson
On 05/12/2015 03:05 PM, Mark Davis ☕️ wrote: http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/ // And a critique: http://languagelog.ldc.upenn.edu/nll/?p=18844

Re: Meroitic cursive fractions numerical values

2015-04-01 Thread Karl Williamson
On 03/31/2015 11:30 AM, Doug Ewell wrote: Karl Williamson public at khwilliamson dot com wrote: It's a small matter to add code to reduce the UCD-specified rational numbers, but it's just one more complication to have to deal with along with the many that the UCD already presents

Re: Meroitic cursive fractions numerical values

2015-03-30 Thread Karl Williamson
On 03/29/2015 03:41 AM, Andrew West wrote: On 28 March 2015 at 20:05, Karl Williamson pub...@khwilliamson.com wrote: Existing software that looks at the numeric values of characters is written expecting that rational numbers will have been reduced to their lowest form. That seems

Re: Meroitic cursive fractions numerical values

2015-03-28 Thread Karl Williamson
of the numerator and denominator Or is that written down somewhere already? A./ On 3/28/2015 1:05 PM, Karl Williamson wrote: In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R6/12

Meroitic cursive fractions numerical values

2015-03-28 Thread Karl Williamson
In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R6/12;N; is not written as 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R1/2;N; given that there is
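
The reduction asked about here is a small computation. As a minimal sketch, assuming numeric values arrive as strings in the `numerator/denominator` form shown in the data line above (the function name `reduce_numeric_value` is hypothetical, not part of any UCD tooling):

```python
from fractions import Fraction


def reduce_numeric_value(value: str) -> str:
    """Return a UCD-style numeric value such as '6/12' in lowest terms."""
    if "/" not in value:
        return value  # plain integers such as '5' are left unchanged
    numerator, denominator = value.split("/")
    # Fraction normalizes to lowest terms on construction.
    reduced = Fraction(int(numerator), int(denominator))
    if reduced.denominator == 1:
        return str(reduced.numerator)
    return f"{reduced.numerator}/{reduced.denominator}"
```

With this, the value `6/12` from the SIX TWELFTHS line would be read as `1/2`, so consumers that expect reduced fractions would not need to special-case the unreduced form.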

Problems in google and yahoo searches

2015-03-28 Thread Karl Williamson
I had thought Tangut was going to be in Unicode 8, but the beta files didn't include it, so I tried simply searching on tangut from http://www.unicode.org/search/ Only bing showed a match on the pipeline page which had the answer. Recently I wanted to review the email correspondence when

Re: Question about the Sentence_Break property

2015-02-21 Thread Karl Williamson
On 02/20/2015 04:56 PM, Philippe Verdy wrote: 2015-02-20 6:14 GMT+01:00 Richard Wordingham <richard.wording...@ntlworld.com>: TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8. One thing that is missing is mention of the convention

Question about the Sentence_Break property

2015-02-19 Thread Karl Williamson
UAX 29 says this: Break after paragraph separators. SB4.Sep | CR | LF Why are CR and LF considered to be paragraph separators? NEL and Line Break are as well. My mental model of plain text has it containing embedded characters, which I'll call \n, to allow it to be displayed in a

Re: UAX 29 questions

2015-01-29 Thread Karl Williamson
But you cannot merge any of these two last rules in a single rule for WB56. 2015-01-25 7:26 GMT+01:00 Karl Williamson <pub...@khwilliamson.com>: I vaguely recall asking something like this before, but if so, I didn't save the answers, and a search of the archives

Re: UAX 29 questions

2015-01-29 Thread Karl Williamson
On 01/29/2015 08:19 PM, Philippe Verdy wrote: 2015-01-29 19:52 GMT+01:00 Karl Williamson <pub...@khwilliamson.com>: Rule WB4 is Ignore Format and Extend characters, except when they appear at the beginning of a region of text. Not clearly stated

Re: New Unicode Emoji draft, available for review

2014-11-07 Thread Karl Williamson
On 11/05/2014 02:48 PM, Rick McGowan wrote: FYI, Posting this on behalf of Mark Davis... Something in his original reply message is apparently toxic to our mail gateway that it can't get through. (Investigating.) May be the literal U+1F4A9, which I have (I'm sorry) redacted below. Rick

Re: What happened to...?

2014-09-22 Thread Karl Williamson
On 09/20/2014 03:32 AM, Mark Davis ☕️ wrote: I agree that we should minute at least some reason for declining. It need only be a sentence or two. I would hope that the requesters get a detailed explanation of the rejection. It would be very wrong not to do so. If so, then the minutes could

Question about WordBreak property rules

2014-07-24 Thread Karl Williamson
http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no break between a Hebrew_Letter and a

Re: Question about WordBreak property rules

2014-07-24 Thread Karl Williamson
On 07/24/2014 01:38 PM, Karl Williamson wrote: http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should

Re: Corrigendum #9

2014-07-14 Thread Karl Williamson
I ran across this in Section 3.7.4 of http://www.unicode.org/reports/tr36/ Use pairs of noncharacter code points in the range FDD0..FDEF. These are super private-use characters, and are discouraged for general interchange. The transformation would take each nibble of a byte Y, and add to FDD0

Re: Corrigendum #9

2014-07-03 Thread Karl Williamson
On 07/03/2014 02:48 PM, Asmus Freytag wrote: On 7/3/2014 11:02 AM, Richard COOK wrote: On Jul 2, 2014, at 8:02 AM, Karl Williamson pub...@khwilliamson.com wrote: Corrigendum #9 has changed this so much that people are coming to me and saying that inputs may very well have non-characters

Unencoded cased scripts and unencoded titlecase letters

2014-07-02 Thread Karl Williamson
It's my sense that there are very few cased scripts in existence that are ever likely to be encoded by Unicode that haven't already been so-encoded. I also suspect that very few new titlecased letters will ever be added to Unicode, as I believe these all come to maintain roundtrip

Re: Corrigendum #9

2014-07-02 Thread Karl Williamson
On 06/12/2014 11:14 PM, Peter Constable wrote: From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson Sent: Wednesday, June 11, 2014 9:30 PM I have something like a library that was written a long time ago (not by me) assuming that noncharacters were illegal in open

Re: Corrigendum #9

2014-06-11 Thread Karl Williamson
On 06/02/2014 09:48 AM, Markus Scherer wrote: On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <d...@ewellic.org> wrote: I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random

Apparent discrepancy between FAQ and Age.txt

2014-06-10 Thread Karl Williamson
The FAQ http://www.unicode.org/faq/private_use.html#sentinels says that the last 2 code points on the planes except BMP were made noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. The conformance wording about U+FFFE and U+ changed somewhat in Unicode 2.0, but these were

Re: Corrigendum #9

2014-06-08 Thread Karl Williamson
On 06/07/2014 10:33 PM, Asmus Freytag wrote: On 6/7/2014 9:19 PM, Karl Williamson wrote: On 06/02/2014 11:00 AM, Shawn Steele wrote: To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like

Re: Corrigendum #9

2014-06-07 Thread Karl Williamson
On 06/02/2014 11:00 AM, Shawn Steele wrote: To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word

Re: Corrigendum #9

2014-06-01 Thread Karl Williamson
On 05/30/2014 12:49 PM, Asmus Freytag wrote: One of the concerns was that people felt that they had to have data pipeline style implementations (tools) go and filter these out - even if there was no intent for the implementation to use them internally in any way. Making clear that the standard

Re: Corrigendum #9

2014-06-01 Thread Karl Williamson
On 06/01/2014 10:07 AM, Markus Scherer wrote: On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson <pub...@khwilliamson.com> wrote: Thanks, I had not thought about that. I'm thinking wording something like this is more appropriate Noncharacters may

Corrigendum #9

2014-05-30 Thread Karl Williamson
I'm having a problem with this http://www.unicode.org/versions/corrigendum9.html Some people now think it means that noncharacters are really no different from private-use characters, and should be treated very similarly if not identically. It seems to me that they should be illegal in open

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Karl Williamson
On 04/24/2014 01:56 PM, Steffen Nurpmeso wrote: Markus Scherer markus@gmail.com wrote: |I strongly recommend you parse the derived properties rather than trying to |follow the derivation formula, because that can change over time. ..this file includes only those core properties that

Re: Is it save to dig into comment contents of PropList.txt?

2013-11-14 Thread Karl Williamson
On 11/07/2013 05:58 AM, Steffen Daode Nurpmeso wrote: Karl Williamson pub...@khwilliamson.com wrote: |On 11/06/2013 03:43 AM, Steffen Daode Nurpmeso wrote: | Philippe Verdy verd...@wanadoo.fr wrote: ||2013/11/5 Steffen Daode sdao...@gmail.com || (The problem i'm facing is that _PRINT

Re: Is it save to dig into comment contents of PropList.txt?

2013-11-06 Thread Karl Williamson
On 11/06/2013 03:43 AM, Steffen Daode Nurpmeso wrote: Philippe Verdy verd...@wanadoo.fr wrote: |2013/11/5 Steffen Daode sdao...@gmail.com | (The problem i'm facing is that _PRINT and _GRAPH cannot be set | for some properties from PropList.txt, say, _PRINT can't be set | for U+0009,

Re: What to backup after corruption of code units?

2013-08-28 Thread Karl Williamson
On 08/28/2013 06:52 PM, Asmus Freytag wrote: On 8/28/2013 5:19 PM, Doug Ewell wrote: Actually 0xC2, according to the rules of UTF-8. Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because of the requirement for minimal length encoding. However, a check for =0xC0 will give

Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Karl Williamson
On 07/19/2013 11:51 AM, Costello, Roger L. wrote: Hi Folks, Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is. Perhaps it is 8859-1, in which case the message consists of four 1-byte characters: C3 = Ã 83 = the “no
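
The competing readings of those four bytes can be compared directly. The sketch below is illustrative only: the ISO-8859-1 reading comes from the message itself, while the UTF-8 and "double-encoded UTF-8" readings are hypotheses added here, and the function name `interpretations` is hypothetical.

```python
RAW = bytes.fromhex("C3 83 C2 B1")  # the four bytes from the message


def interpretations(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under a few candidate encodings."""
    # ISO-8859-1 maps every byte to exactly one code point.
    as_latin1 = raw.decode("latin-1")
    # UTF-8 decoding raises UnicodeDecodeError if the bytes are ill-formed.
    as_utf8 = raw.decode("utf-8")
    # Hypothesis: UTF-8 text was misread as ISO-8859-1 and then
    # re-encoded as UTF-8; undo one round of that.
    undone = as_utf8.encode("latin-1").decode("utf-8")
    return {
        "latin-1": as_latin1,
        "utf-8": as_utf8,
        "double-encoded utf-8": undone,
    }


if __name__ == "__main__":
    for name, text in interpretations(RAW).items():
        print(f"{name}: {[f'U+{ord(ch):04X}' for ch in text]}")
```

All three decodings succeed, which is why the bytes alone cannot settle the question; some outside knowledge about the likely content is needed to pick one.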

Bing now translates to/from Klingon

2013-05-17 Thread Karl Williamson
http://www.bing.com/translator

Re: Rendering Raised FULL STOP between Digits

2013-03-22 Thread Karl Williamson
On 03/21/2013 04:48 PM, Richard Wordingham wrote: For linguistic analysis, you need the normalisation appropriate to the task. This is a case where Unicode normalisation generally throws away information (namely, how the author views the characters), whereas in analysing Burmese you may want to

Re: Rendering Raised FULL STOP between Digits

2013-03-20 Thread Karl Williamson
On 03/09/2013 07:52 PM, Richard Wordingham wrote: On Sat, 09 Mar 2013 16:21:17 -0700 Karl Williamson pub...@khwilliamson.com wrote: Sorry, for the delayed reply; I've been under deadline Rendering is not the only consideration. Processing textual content for 0387 is broken because

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Karl Williamson
nothing wrong with the original post. And I found Jukka's response objectionable. They say that the road to hell is paved with good intentions. Sincerely, Erkki -Original message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On behalf of Karl Williamson

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-23 Thread Karl Williamson
On 12/23/2012 09:56 AM, Jukka K. Korpela wrote: 2012-12-23 18:09, Karl Williamson wrote: As another poster said, this quotation would be considered fair use under USA law. It was not a quotation but an excerpt posted without permission. Quotations are allowed when they are needed to back up

Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-22 Thread Karl Williamson
On 12/22/2012 03:45 PM, Jukka K. Korpela wrote: 2012-12-22 23:56, Costello, Roger L. wrote: I figure the people on this list can truly appreciate this: I don’t. You are posting an excerpt from a copyrighted book as such, not as a legal quotation for an acceptable purpose. Moreover, you have

Re: Are Named sequences always going to be graphemes?

2012-06-21 Thread Karl Williamson
On 06/21/2012 01:45 PM, Asmus Freytag wrote: OK. Will they always be in NFC? To apply Ken's dictum to this case: That seems like a straitjacket looking for an unwilling wearer. ;-) Unless it's excluded from the start, anytime you limit it, when the time comes you need something

Are Named sequences always going to be graphemes?

2012-06-20 Thread Karl Williamson
All current named sequences appear to be each a single grapheme. That seems like it should always be the case. If I'm right, should UAX #34 say this?

Turkic casefolding rules

2012-05-12 Thread Karl Williamson
In CaseFolding.txt, it says the following: Note that the Turkic mappings do not maintain canonical equivalence without additional processing. See the discussions of case mapping in the Unicode Standard for more information. I couldn't find any more detail about these in the 6.1 Unicode

Re: Origins of w

2012-04-16 Thread Karl Williamson
On 04/16/2012 12:04 PM, Asmus Freytag wrote: On 4/16/2012 9:23 AM, arno.s wrote: On 16/04/2012 15:55, Andreas Prilop wrote: On Sun, 15 Apr 2012, David Starner wrote: At Wiktionary, we're looking at (U+1E98) and we can't figure out where it came from. It's from Unicode 1.1, which makes it

Three character canonical decompositions in version 2 releases

2012-04-03 Thread Karl Williamson
http://unicode.org/policies/stability_policy.html says that effective starting in Version 2.0, Canonical mappings (Decomposition_Mapping property values) are always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping. I

Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Karl Williamson
On 11/16/2011 07:25 AM, Asmus Freytag wrote: Peter, in principle, the idea of a provisional status is a useful concept whenever one wants to publish something based on potentially doubtful or possibly incomplete information. And you are correct, that, in principle, such an approach could be

How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Karl Williamson
On 07/06/2011 04:23 PM, Ken Whistler wrote: I'm not sure whether the FB05/FB06 instance is important enough to add or not. Neither of those compatibility ligatures should ordinarily be used in text, anyway ... --Ken I'm wondering what other characters might not ordinarily be used in text,

Re: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Karl Williamson
On 09/09/2011 02:36 PM, Kent Karlsson wrote: On 2011-09-09 21:24, Karl Williamson <pub...@khwilliamson.com> wrote: On 07/06/2011 04:23 PM, Ken Whistler wrote: I'm not sure whether the FB05/FB06 instance is important enough to add or not. Neither of those compatibility ligatures should

Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-31 Thread Karl Williamson
On 08/30/2011 06:27 PM, Philippe Verdy wrote: After looking at the effective reason why this PRI #202 emerged (a request from Perl authors), exposed in UTC document number L2/2011/11281, I think now that even *all* these aliases were not needed. The bug emerged in Perl only because a character

Slide show: Survey of current programming language support for Unicode

2011-07-30 Thread Karl Williamson
Tom Christiansen recently gave a talk at the OSCON conference concerning the varying levels of support for Unicode in some current programming languages. It is accessible via this link http://training.perl.com/OSCON2011/index.html The talk is entitled Unicode Support Shootout, and is one

Re: Questions about UAX #29

2011-07-05 Thread Karl Williamson
On 07/05/2011 09:29 AM, Mark Davis ☕ wrote: Ah, you're right; I wasn't looking carefully enough at what you wrote. Yes, an unassigned code point (Cn) is treated as a base character. Unassigned code points are peculiar beasts, since we don't know really how they should behave until (and if)

Re: Questions about UAX #29

2011-07-04 Thread Karl Williamson
On 07/03/2011 05:52 PM, Mark Davis ☕ wrote: Mark /— Il meglio è l’inimico del bene —/ On Sat, Jul 2, 2011 at 14:58, Karl Williamson <pub...@khwilliamson.com> wrote: I have two questions about this. 1) In UAX #44, it says for information about

Questions about UAX #29

2011-07-02 Thread Karl Williamson
I have two questions about this. 1) In UAX #44, it says for information about the Grapheme_Base property, to see UAX #29, but that document doesn't mention this property. 2) The definition in UAX #29 for both legacy and extended grapheme clusters effectively says that any Gc=Cn code points

Unicode game

2010-11-17 Thread karl williamson
I'm posting this Perl program so the author doesn't have to subscribe to this list. We thought people here might appreciate it. As the sample output shows, it takes input text and reverses and mirrors it. [This is completely silly, just an afternoon programming game.] Witness leo in

Irrational numeric values in TUS

2010-10-12 Thread karl williamson
The Unicode standard only gives numeric values to rational numbers. Is the reason for this merely because of the difficulty of representing irrational ones? In looking through the list of code points, I actually found only one case where a character totally unambiguously refers to a

What should happen with \N{LATIN SMALL LIGATURE IJ} =~ /(i)(j)/i

2010-09-05 Thread karl williamson
\N{LATIN SMALL LIGATURE IJ} =~ /ij/i matches, as the fold of the ligature is 'ij'. But if you simply add capturing parentheses, as in this post's subject line, it becomes somewhat nonsensical, as each captured group should match some part of the indivisible character LATIN SMALL LIGATURE IJ.

Re: Digit/letter variants in the same unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread karl williamson
Mark Davis ☕ wrote: Mark /— Il meglio è l’inimico del bene —/ On Thu, Jul 29, 2010 at 05:57, Philippe Verdy <verd...@wanadoo.fr> wrote: Martin J. Dürst <due...@it.aoyama.ac.jp> wrote: On 2010/07/29 13:33, karl

Re: Digit/letter variants in the same unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread karl williamson
Asmus Freytag wrote: Having Nd be limited to characters that a) are used in decimal radix numbers b) are part of a complete, ordered sequence 0..9 would make this property regular enough to serve implementers. You could script the creation of relevant data for your implementation based on that

Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-28 Thread karl williamson
Asmus Freytag wrote: On 7/25/2010 6:05 PM, Martin J. Dürst wrote: On 2010/07/26 4:37, Asmus Freytag wrote: PPS: a very hypothetical tough case would be a script where letters serve both as letters and as decimal place-value digits, and with modern living practice. Well, there actually is

Why does EULER CONSTANT not have math property and PLANCK CONSTANT does?

2010-07-27 Thread karl williamson
They are U+2107 and U+210E respectively. Chapter 4 of TUS seems to indicate that neither should, since they both are operands, and it says this property applies to mathematical operators.
