Re: a suggestion new emoji .
On 19 Aug 2015 at 20:59, I wrote: On 19 Aug 2015 at 17:18, William_J_G Overington wrote: I suggest to Emma the contacting of Unicode Inc. using the following form. http://www.unicode.org/reporting.html William is right. I strongly recommend that you first use the Contact form, as I did from the beginning, long before e-mailing the List. Using the Contact form you will always get a good answer (not always a *positive* response, but always a *good* answer). I forgot to add that of course you are always welcome on the Mailing List, where you will equally get good answers. But you need to be patient, as the best answers naturally come last in a thread, as happened just six hours before you posted. On 19 Aug 2015 at 20:48, Andrew West wrote: I don't know, I think durian emoji would be quite distinctive, as shown in the examples on this page (I am rather taken with the sad durian which gets no hugs). http://www.cafepress.co.uk/+durian+stickers For a fruit, a vegetable, a cereal, a plant, or an animal, having its emoji encoded in Unicode is like a big hug! So we thank Mrs Haneys for having suggested the DURIAN emoji! Marcel
Thai Word Breaking
I'm trying to work out the meaning of TUS 8.0 Section 23.2. To do Thai word breaking properly, one needs to do a semantic analysis of the text, the equivalent of resolving 'humanevents' into 'human events' rather than 'humane vents'. One also needs to cope with unknown and misspelt words. (A lot of effort has been devoted to avoid going to the extreme of doing semantic analysis.) However, it is possible to read Section 23.2 as prohibiting the use of certain information, and I would like to check whether this is the intended meaning. The opening paragraph seems clear enough on first reading: The effect of layout controls is specific to particular text processes. As much as possible, layout controls are transparent to those text processes for which they were not intended. In other words, their effects are mutually orthogonal. However, my first question is, Are paragraph boundaries directly admissible as evidence for or against word boundaries not adjacent to them? For example, most Thai word breakers would not regard a paragraph boundary as any more significant than a phrase-delimiting space. However, a paragraph boundary often indicates a change of topic. My second question is, Are line breaks admissible as evidence for or against word boundaries not adjacent to them? For example, if a phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce that it is likely that all word boundaries within it are marked explicitly. This example is more useful for Khmer than for Thai, for whereas Cambodians were once taught to mark word boundaries, Thais rarely use ZWSP to mark word boundaries. My third question is, Is the absence of a line break opportunity admissible as evidence for or against a word boundary? Here I see conflicting signals. There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded as the counterpart of ZWSP. 
The understanding was that ZWSP marked a word boundary and provided a line-break opportunity, while WJ denied both. This, however, is no longer the case. To quote the TUS section about WJ: P1: (Ignored) P2S1: The word joiner must not be confused with the zero width joiner or the combining grapheme joiner, which have very different functions. P2S2: In particular, inserting a word joiner between two characters has no effect on their ligating and cursive joining behavior. P2S3: The word joiner should be ignored in contexts other than line breaking. P2S4: Note in particular that the word joiner is ignored for word segmentation. P2S5: (See Unicode Standard Annex #29, “Unicode Text Segmentation.”) Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in word-breaking, but perhaps it does not if line-breaking is being used as evidence for word boundaries. P2S4 has three very different interpretations: (i) This is an assertion of fact, and may therefore be incorrect. (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 contains much sloppier wording, as I have already advised members of the UTC (4 July 2015). (iii) This is a deduction from other parts of the specification. Now, if P2S4 said 'is normally ignored for word segmentation', that would have made sense, for that applies to the default word boundary specification in UAX#29. However, just before Section 4.1, UAX#29 explains that it does not specify what happens for word boundary determination in Thai! (It does constrain what happens, though.) At the end of UAX#29 Section 6.2, there is the provision, The Ignore rules should not be overridden by tailorings, with the possible exception of remapping some of the Format characters to other classes. 
To accord with the user perceptions of Unicode-aware people who work with SE Asian scripts, I am tempted to ask for CLDR to tailor the word-breaking algorithms for the corresponding languages so that the word-breaking classes of WJ (and ZWNBSP) are changed from Format to MidLetter. That would match the widespread old *perception* that there should be no word break in a sequence Thai letter, (Thai mark,)* WJ, Thai letter. However, there are several objections: (a) Perhaps P2S3 and P2S4 prohibit this. (b) If the word-break property of Thai letters falls back to Other, there would still be a word break between them. (c) If the word-break property of Thai letters fell back to ALetter, an old suggestion, WJ would have no effect on the presence of a word break. (d) If Thai word breaking assigns word-break classes to each letter (gc=Lo), then word boundaries can be suppressed by choosing the classes appropriately. Non-spacing Thai vowels are very relevant to Thai word-breaking, but formally are 'ignored'. WJ could be 'ignored' in exactly the same way. Richard.
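The 'humanevents' ambiguity mentioned at the start of this message can be illustrated with a toy dictionary segmenter. This is only a sketch: the function name `segmentations` and the four-word `LEXICON` are made up for illustration, and no real Thai word breaker is claimed to work this way. It shows why a dictionary alone cannot choose between the two readings, which is where semantic analysis or other evidence would have to come in.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Try every segmentation of the remainder after this word.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical lexicon, just large enough to show the ambiguity.
LEXICON = {"human", "humane", "events", "vents"}

for split in segmentations("humanevents", LEXICON):
    print(" ".join(split))
```

Both 'human events' and 'humane vents' come back as valid splits; the dictionary gives no reason to prefer one over the other.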
Re: \b{wb}
On Sat, 22 Aug 2015 14:08:14 -0600 Karl Williamson pub...@khwilliamson.com wrote: But it isn't such a replacement, creating some consternation, and the main reason is that, unlike \b, it treats the boundary between white space characters as a breaking opportunity, so that it doesn't create runs of them. Thus if you have two spaces after a full stop, it treats each as an individual word. My question is Was this intentional, and if so, Why? See below. TR18 says \b{w} is a Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W. Unless I'm being stupid, \b and \b{w} are indeed very different. Consider a sequence U+0020, U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F, U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R, U+0041 LATIN CAPITAL LETTER A, U+0062 LATIN SMALL LETTER B. That has two internal word boundaries, splitting it into a space, a flag, and the word Ab. Is this what you want? Worse, consider a short Thai sentence ผมไม่มีคอมพิวเตอร์ที่ดี. That gets split by ICU into |ผม|ไม่มี|คอมพิวเตอร์|ที่|ดี| - 5 words and 4 internal word boundaries. Note that there's a word or two between each boundary. Is this what you want? My question is Was this intentional, and if so, Why? Take a look at the rules in UAX#29 Section 4.1.1. Apart from the first two and the last, they all identify where word boundaries aren't. This is tidy - the algorithm concentrates on working out where a word continues. In principle, you could, I believe, extend the rules so that characters outside words and regional indicator runs were not divided, but it would make for a more complicated algorithm with plenty of opportunities for error. I think the thought was that word-free runs did not need to be assembled into runs of non-word material. The short answer, of course, is that the regular expression engine could do this final step of post-processing itself. This may get tricky with customised word-breaking. Richard.
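The 'final step of post-processing' suggested above might look something like the following sketch. It assumes the word breaker has already produced a list of segments; the list used here is written by hand to mirror the two-spaces-after-a-full-stop case from the thread, not produced by a real UAX#29 implementation. The helper name `merge_non_word_runs` and the test 'a segment is a word if it contains an alphanumeric character' are assumptions chosen for illustration.

```python
def merge_non_word_runs(segments):
    """Merge adjacent segments that contain no alphanumeric character."""
    merged = []
    prev_is_word = None
    for seg in segments:
        # Assumption: any alphanumeric character makes the segment a word.
        is_word = any(ch.isalnum() for ch in seg)
        if merged and not is_word and prev_is_word is False:
            merged[-1] += seg  # extend the current non-word run
        else:
            merged.append(seg)
        prev_is_word = is_word
    return merged


# Hand-written stand-in for a word breaker's output on "Stop.  Go".
segments = ["Stop", ".", " ", " ", "Go"]
print(merge_non_word_runs(segments))
```

With the non-word segments merged, the remaining boundaries fall only between word and non-word material, which is closer to what users of the old \b expect. As noted above, a customised word breaker could make the 'is this a word' test harder to get right.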
Re: Square Brackets with Tick
On Sat, 22 Aug 2015 10:32:45 -0700 Asmus Freytag (t) asmus-...@ix.netcom.com wrote: On 8/22/2015 9:35 AM, Julian Bradfield wrote: There is no inherent meaning to the order of codepoints, it's just convenience. And for that reason, we have property files to explicitly give the properties rather than asking the user to glean them from code point order. But codepoints are normally orderly until they enter the ISO approval process. Thereafter, disorder creeps in, and becomes ever more likely as blocks fill up. The concern here is that the opening-closing pairing information, which used not to be a property, has been deduced wrongly. The code chart is prima facie evidence that whoever drew the order up conceived of U+298D and U+298E as a pair. I've traced the character as far back as http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning therein is implicitly described as unknown! It looks as though someone somewhere fashioned type for it - or perhaps another of the set of four - but no-one remembers what it was used for! Now, *if* no-one is using it, it doesn't really matter if the pair is wrong. Richard.
\b{wb}
The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially \b is defined to break between runs of word characters versus runs of non-word characters. The latest version of Perl 5 (recently released) has added \b{wb} based on Unicode's definition. The typical expectation of its programmers is that it would be a drop-in replacement for the old \b, with much better results in parsing natural languages. But it isn't such a replacement, creating some consternation, and the main reason is that, unlike \b, it treats the boundary between white space characters as a breaking opportunity, so that it doesn't create runs of them. Thus if you have two spaces after a full stop, it treats each as an individual word. My question is Was this intentional, and if so, Why? TR18 says \b{w} is a Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W. And UAX29 says adjacent spaces are collapsed to a single space in intelligent cut and paste using the WB property.
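For comparison, the classic \b behaviour described above can be observed with a short sketch. It uses Python's `re` module rather than Perl, which is a substitution made purely for illustration; the `re.ASCII` flag is used so that \w means alphanumerics plus underscore, as in the pre-Unicode definition. The helper name `boundaries` is made up.

```python
import re


def boundaries(text):
    """Return the offsets at which the classic word-boundary assertion matches."""
    # re.ASCII restricts \w to [A-Za-z0-9_], the pre-Unicode notion of a word.
    return [m.start() for m in re.finditer(r"\b", text, flags=re.ASCII)]


# A full stop followed by two spaces, as in the question above.
sample = "Stop.  Go"
print(boundaries(sample))
```

No offset between the full stop and the two spaces is reported: classic \b only fires where a word character meets a non-word character, so the run of non-word characters stays in one piece.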
Re: Square Brackets with Tick
On 8/22/2015 2:47 PM, Richard Wordingham wrote: But codepoints are normally orderly until they enter the ISO approval process. Thereafter, disorder creeps in, and becomes ever more likely as blocks fill up. Haha, good one. The concern here is that the opening-closing pairing information, which used not to be a property, has been deduced wrongly. The code chart is prima facie evidence that whoever drew the order up conceived of U+298D and U+298E as a pair. Not necessarily. Code charts are sometimes ordered in mysterious ways. However, read on. I've traced the character as far back as http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning therein is implicitly described as unknown! It looks as though someone somewhere fashioned type for it - or perhaps another of the set of four - but no-one remembers what it was used for! This document doesn't tell you what the pairing is supposed to be, only which ones are opening and closing (so we know that they are intended to be arranged [ ] and not ] [ (ticks omitted), but we don't know which of the two [[ go with which of the two ]], other than the - natural - assumption that pairs are listed adjacently). For the first document that gives the pairing information, see: http://www.unicode.org/L2/L2012/12173r-bidi-paren.pdf There is no note or other indication in this document that shows that any thought was put into the different ordering. However, it is notable that all other bracket pairings follow the bidi mirroring glyph relation, so I would put my money on that file having been used to create the pairs by script, rather than by manual editing. This is corroborated in section 3.2 of that document. Nigel was the first to notice that these were not encoded as left-right glyph pairs, but with the diagonal "tick" (originally called a solidus) having the same orientation in a pair (as if intended to bracket something in either diagonal or anti-diagonal direction). 
Given that L2/12-173 states that the property was derived via an algorithm that is based on left-right mirroring and not via matching open/close pairs based on other factors (including adjacency in the charts), I'm happy to join the growing chorus that declares this to be a bug. Luckily there seems to be no stability policy that would prevent fixing this one. A./
Re: Square Brackets with Tick
On 8/22/2015 9:35 AM, Julian Bradfield wrote: There is no inherent meaning to the order of codepoints, it's just convenience. And for that reason, we have property files to explicitly give the properties rather than asking the user to "glean" them from code point order. A./
Square Brackets with Tick
Hi all
I am looking for clarification on an aspect of Unicode bracket pairing, specifically in relation to the following four characters:
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
These stand out from all other brackets listed in *BidiBrackets.txt* due to an inconsistency in pairing. I have looked for references online on where these brackets are used in the wild as mathematical symbols but have been unable to find anything useful. All other bracket pairs are listed as opener followed by closer, sometimes with several code points in between. According to the code point pairs in the first and second columns of this file, these particular brackets should be paired as the *first and fourth* and the *third and second*. Intuitively however, these would actually be *first and second* and *third and fourth* if one is to expect consistency. My guess is that there are three possibilities here:
1. The current pairing information is correct and the sequence is irregular for some historical reason
2. The pairing information is wrong and the sequence is consistent with other brackets
3. Pairing can be mixed with either left bracket used as a valid opener and either right bracket used as a valid closer; in this case, the pairing information is incomplete
I'd be very grateful if anyone could clarify the situation here or if anyone knows of a resource that describes where such brackets are used in practice.
Many thanks
Nigel Small
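The two candidate pairings in the message above can be made concrete with a small sketch that parses only the four lines quoted there. The names `LISTING`, `parse`, `file_pairs` and `adjacent_pairs` are illustrative. The 'adjacent' pairing is simply the first-and-second, third-and-fourth reading described in the message, not anything taken from the Unicode data files.

```python
# The four entries exactly as quoted in the message.
LISTING = """\
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
"""


def parse(listing):
    """Return [(code point, paired code point, type)] in file order."""
    rows = []
    for line in listing.splitlines():
        data = line.split("#", 1)[0]  # drop the trailing comment
        if not data.strip():
            continue
        cp, paired, kind = (field.strip() for field in data.split(";"))
        rows.append((int(cp, 16), int(paired, 16), kind))
    return rows


rows = parse(LISTING)

# Pairing as stated by the second column of the listing.
file_pairs = [(cp, paired) for cp, paired, kind in rows if kind == "o"]

# Pairing obtained by matching openers and closers in listing order.
openers = [cp for cp, _, kind in rows if kind == "o"]
closers = [cp for cp, _, kind in rows if kind == "c"]
adjacent_pairs = list(zip(openers, closers))

for name, pairs in (("file", file_pairs), ("adjacent", adjacent_pairs)):
    print(name, [f"{o:04X}-{c:04X}" for o, c in pairs])
```

The listing pairs 298D with 2990 and 298F with 298E, whereas the adjacency reading pairs 298D with 298E and 298F with 2990, which is exactly the discrepancy being asked about.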
Re: Square Brackets with Tick
From: Nigel Small ni...@nigelsmall.com Date: Sat, 22 Aug 2015 17:08:48 +0100
I am looking for clarification on an aspect of Unicode bracket pairing, specifically in relation to the following four characters:
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
These stand out from all other brackets listed in BidiBrackets.txt due to an inconsistency in pairing. I have looked for references online on where these brackets are used in the wild as mathematical symbols but have been unable to find anything useful. All other bracket pairs are listed as opener followed by closer, sometimes with several code points in between.
I think the order in the file is by the codepoint in the leftmost column. All the rest is just a coincidence. But I don't speak for the Unicode Consortium, so please wait for a definitive reply.
Re: Square Brackets with Tick
On 2015-08-22, Nigel Small ni...@nigelsmall.com wrote:
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
with several code points in between. According to the code point pairs in the first and second columns of this file, these particular brackets should be paired as the *first and fourth* and the *third and second*. Intuitively however, these would actually be *first and second* and *third and fourth* if one is to expect consistency.
That's a strange intuition! Mathematical brackets are expected to pair with left-right symmetry, not rotational symmetry. As in, for example, floor and ceiling brackets. The pairing in the file is the natural one.
1. The current pairing information is correct and the sequence is irregular for some historical reason
That will be the explanation. There is no inherent meaning to the order of codepoints, it's just convenience. One of the experts here can probably tell us why these four brackets happen to be coded in this order.