Re: a suggestion new emoji .

2015-08-22 Thread Marcel Schneider
On 19 Aug 2015 at 20:59, I wrote:

 On 19 Aug 2015 at 17:18, William_J_G Overington  wrote:

 I suggest to Emma the contacting of Unicode Inc. using the following form.
 http://www.unicode.org/reporting.html

 William is right. I strongly recommend that you first use the Contact form,
 as I did from the very beginning, long before e-mailing the List.
 Using the Contact form you will always get a good answer (not always a 
 *positive* response, but always a *good* answer).

I forgot to add that of course you are always welcome on the Mailing List, 
where you will equally get good answers. But you need to be patient, as the
best answers naturally come last in a thread, as happened just six hours
before you posted.


On 19 Aug 2015 at 20:48, Andrew West wrote:
 
  I don't know, I think durian emoji would be quite distinctive, as
  shown in the examples on this page (I am rather taken with the sad
  durian which gets no hugs).
 http://www.cafepress.co.uk/+durian+stickers

For a fruit, a vegetable, a cereal, a plant, an animal, having its emoji 
encoded in Unicode is like a big hug!
So we thank Mrs Haneys for having suggested the DURIAN emoji!

Marcel


Thai Word Breaking

2015-08-22 Thread Richard Wordingham
I'm trying to work out the meaning of TUS 8.0 Section 23.2.

To do Thai word breaking properly, one needs to do a semantic analysis
of the text, the equivalent of resolving 'humanevents' into
'human events' rather than 'humane vents'.  One also
needs to cope with unknown and misspelt words.  (A lot of effort has
been devoted to avoid going to the extreme of doing semantic analysis.)
However, it is possible to read Section 23.2 as prohibiting the use of
certain information, and I would like to check whether this is the
intended meaning.
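
The 'humanevents' ambiguity can be made concrete with a minimal sketch.
It assumes a toy four-word English lexicon (hypothetical, standing in for
a real Thai dictionary) and simply enumerates every dictionary
segmentation; choosing between the candidates is the part that needs
semantic analysis and is not attempted here.

```python
def segmentations(text, lexicon):
    """Return every way of splitting text into words drawn from lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the word just matched.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"human", "humane", "events", "vents"}

for candidate in segmentations("humanevents", LEXICON):
    print(" ".join(candidate))
```

Both readings come back as equally valid dictionary segmentations, which
is why extra evidence (context, or explicit marks such as ZWSP) matters.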

The opening paragraph seems clear enough on first reading:

The effect of layout controls is specific to particular text processes.
As much as possible, lay-out controls are transparent to those text
processes for which they were not intended. In other words, their
effects are mutually orthogonal.

However, my first question is, Are paragraph boundaries
directly admissible as evidence for or against word boundaries not
adjacent to them?  For example, most Thai word breakers would not
regard a paragraph boundary as any more significant than a
phrase-delimiting space.  However, a paragraph boundary often indicates
a change of topic.

My second question is, Are line breaks admissible as evidence for
or against word boundaries not adjacent to them?  For example, if a
phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce
that it is likely that all word boundaries within it are marked
explicitly. This example is more useful for Khmer than for Thai, for
whereas Cambodians were once taught to mark word boundaries, Thais
rarely use ZWSP to mark word boundaries.

My third question is, Is the absence of a line break opportunity
admissible as evidence for or against a word boundary?  Here I
see conflicting signals.

There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded
as the counterpart of ZWSP.  The understanding was that ZWSP marked a
word boundary and provided a line-break opportunity, while WJ denied
both.  This, however, is no longer the case.  To quote the TUS section
about WJ:

P1: (Ignored)

P2S1: The word joiner must not be confused with the zero width joiner
or the combining grapheme joiner, which have very different functions.

P2S2: In particular, inserting a word joiner between two characters has
no effect on their ligating and cursive joining behavior.

P2S3: The word joiner should be ignored in contexts other than line
breaking.

P2S4: Note in particular that the word joiner is ignored for word
segmentation.

P2S5: (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)

Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in
word-breaking, but perhaps it does not if line-breaking is being used
as evidence for word boundaries.

P2S4 has three very different interpretations:

(i) This is an assertion of fact, and may therefore be incorrect.

(ii) The word 'is' is sloppy wording for 'should be'.  Section 23.2
contains much sloppier wording, as I have already advised members of
the UTC (4 July 2015).

(iii) This is a deduction from other parts of the specification.  Now,
if P2S4 said 'is normally ignored for word segmentation', that would
have made sense, for that applies to the default word boundary
specification in UAX#29.  However, just before Section 4.1, UAX#29
explains that it does not specify what happens for word boundary
determination in Thai!  (It does constrain what happens, though.)
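
For reference, the characters under discussion can be inspected with
Python's standard unicodedata module. This is only a sketch: the module
exposes General_Category but not the UAX#29 Word_Break property, so it
shows merely that ZWSP, WJ and ZWNBSP are all gc=Cf, the category from
which the Format word-break class is largely drawn. The helper name
describe is made up for this illustration.

```python
import unicodedata


def describe(code_point):
    """Return (character name, General_Category) for a code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.category(ch)


# ZWSP, WJ and ZWNBSP, as named in the message above.
for cp in (0x200B, 0x2060, 0xFEFF):
    name, gc = describe(cp)
    print(f"U+{cp:04X} {name} gc={gc}")
```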

At the end of UAX#29 Section 6.2, there is the provision, 'The Ignore
rules should not be overridden by tailorings, with the possible
exception of remapping some of the Format characters to other
classes.'  To accord with the user perceptions of Unicode-aware
people who work with SE Asian scripts, I am tempted to ask for CLDR
to tailor the word-breaking algorithms for the corresponding languages
so that the word-breaking classes of WJ (and ZWNBSP) are changed from
Format to MidLetter.  That would match the widespread old *perception*
that there should be no word break in a sequence Thai letter, (Thai
mark,)* WJ, Thai letter. However, there are several objections:

(a) Perhaps P2S3 and P2S4 prohibit this.

(b) If the word-break property of Thai letters falls back to Other,
there would still be a word break between them.

(c) If the word-break property of Thai letters fell back to ALetter,
an old suggestion, WJ would have no effect on the presence of a word
break.

(d) If Thai word breaking assigns word-break classes to each letter
(gc=Lo), then word boundaries can be suppressed by choosing the classes
appropriately.  Non-spacing Thai vowels are very relevant to Thai
word-breaking, but formally are 'ignored'.  WJ could be 'ignored' in
exactly the same way.
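
Objections (b) and (c) can be checked with a deliberately tiny model of
the default rules. The sketch below assumes only four word-break classes
and only rules WB4 (skip Format), WB5, WB6 and WB7, plus the final
'otherwise break' rule; it is not an implementation of UAX#29, and the
function name word_breaks is invented for this illustration. The input is
the sequence Thai letter, WJ, Thai letter, with the class of each varied.

```python
def word_breaks(classes):
    """Return the indices before which the toy rules place a word break."""
    # WB4, simplified: Format characters are skipped when applying rules.
    kept = [(i, c) for i, c in enumerate(classes) if c != "Format"]
    breaks = []
    for k in range(1, len(kept)):
        prev = kept[k - 1][1]
        index, cur = kept[k]
        before_prev = kept[k - 2][1] if k >= 2 else None
        after_cur = kept[k + 1][1] if k + 1 < len(kept) else None
        if prev == "ALetter" and cur == "ALetter":
            continue  # WB5: no break between letters
        if prev == "ALetter" and cur == "MidLetter" and after_cur == "ALetter":
            continue  # WB6: no break before a MidLetter between letters
        if before_prev == "ALetter" and prev == "MidLetter" and cur == "ALetter":
            continue  # WB7: no break after a MidLetter between letters
        breaks.append(index)  # final rule: otherwise, break
    return breaks


for letter in ("Other", "ALetter"):
    for wj in ("Format", "MidLetter"):
        print(letter, wj, word_breaks([letter, wj, letter]))
```

In this toy model, letters classed as Other always produce a break
whichever class WJ has, and letters classed as ALetter never do, which is
the point of objections (b) and (c).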

Richard.




Re: \b{wb}

2015-08-22 Thread Richard Wordingham
On Sat, 22 Aug 2015 14:08:14 -0600
Karl Williamson pub...@khwilliamson.com wrote:

 But it isn't such a replacement, creating some consternation, and the 
 main reason is that, unlike \b, it treats the boundary between white 
 space characters as a breaking opportunity, so that it doesn't create 
 runs of them.  Thus if you have two spaces after a full stop, it
 treats each as an individual word.
 
 My question is: Was this intentional, and if so, why?

See below.

 TR18 says \b{w} is a Zero-width match at a Unicode word boundary.
 Note that this is different than \b alone, which corresponds to \w
 and \W.

Unless I'm being stupid, \b and \b{w} are indeed very different.
Consider a sequence U+0020, U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F,
U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R, U+0041 LATIN CAPITAL
LETTER A, U+0062 LATIN SMALL LETTER B

That has two internal word boundaries, splitting it into a space, a
flag, and the word Ab.  Is this what you want?
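
For comparison, here is where a classic \b matches in that same sequence.
This is a sketch using Python's re module as a stand-in for the
traditional \w/\W definition (Python's \w is Unicode-aware by default,
which does not matter for this string); the UAX#29 offsets are copied
from the description above, not computed.

```python
import re

# SPACE, REGIONAL INDICATOR F, REGIONAL INDICATOR R, 'A', 'b'
s = "\u0020\U0001F1EB\U0001F1F7\u0041\u0062"

# Classic \b: positions where a \w character meets a non-\w character
# or the edge of the string.
classic = [m.start() for m in re.finditer(r"\b", s)]

# Internal boundaries as described in the message: space | flag | Ab.
uax29_internal = [1, 3]

print("classic \\b offsets:", classic)
print("UAX#29 internal offsets (from the text):", uax29_internal)
```

The classic \b sees nothing between the space and the flag, because
neither is a \w character.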

Worse, consider a short Thai sentence ผมไม่มีคอมพิวเตอร์ที่ดี.  That
gets split by ICU into |ผม|ไม่มี|คอมพิวเตอร์|ที่|ดี| - 5 words and
4 internal word boundaries.  Note that there's a word or two between
each boundary.  Is this what you want?

 My question is: Was this intentional, and if so, why?

Take a look at the rules in UAX#29 Section 4.1.1.  Apart from the first
two and the last, they all identify where word boundaries aren't.  This
is tidy - the algorithm concentrates on working out where a word
continues.

In principle, you could, I believe, extend the rules so that characters
outside words and regional indicator runs were not divided, but it
would make for a more complicated algorithm with plenty of
opportunities for error.  I think the thought was that word-free runs
did not need to be assembled into runs of non-word material.

The short answer, of course, is that the regular expression engine
could do this final step of post-processing itself.  This may get
tricky with customised word-breaking.
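
A minimal sketch of such a post-processing step follows. It assumes the
word breaker has already produced a list of segments, and it uses a crude
stand-in test (no alphanumeric character) for 'non-word material'; the
function name merge_nonword_runs is hypothetical. Note that this
heuristic would also glue a flag to neighbouring spaces, so a real engine
would need a better test.

```python
def merge_nonword_runs(segments):
    """Join adjacent segments that contain no alphanumeric character."""
    merged = []
    prev_nonword = False
    for seg in segments:
        nonword = not any(ch.isalnum() for ch in seg)
        if nonword and prev_nonword:
            merged[-1] += seg  # extend the current non-word run
        else:
            merged.append(seg)
        prev_nonword = nonword
    return merged


# Two spaces after a full stop, each delivered as its own segment.
print(merge_nonword_runs(["stop", ".", " ", " ", "Next"]))
```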

Richard.



Re: Square Brackets with Tick

2015-08-22 Thread Richard Wordingham
On Sat, 22 Aug 2015 10:32:45 -0700
Asmus Freytag (t) asmus-...@ix.netcom.com wrote:

 On 8/22/2015 9:35 AM, Julian Bradfield wrote:

 There is no inherent meaning to the
 order of codepoints, it's just convenience.

 And for that reason, we have property files to explicitly give the
 properties rather than asking the user to glean them from code
 point order.

But codepoints are normally orderly until they enter the ISO approval
process.  Thereafter, disorder creeps in, and becomes ever more likely
as blocks fill up.  The concern here is that the opening-closing
pairing information, which used not to be a property, has been deduced
wrongly.  The code chart is prima facie evidence that whoever drew the
order up conceived of U+298D and U+298E as a pair.

I've traced the character as far back as
http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning
therein is implicitly described as unknown! It looks as though someone
somewhere fashioned type for it - or perhaps another of the set of four
- but no-one remembers what it was used for!

Now, *if* no-one is using it, it doesn't really matter if the pair is
wrong.

Richard.


\b{wb}

2015-08-22 Thread Karl Williamson
The concept of \b in a regular expression meaning to match the boundary 
between a word and non-word was invented by Larry Wall, for the Perl 
programming language.  This was before Unicode, and a word was defined 
as alphanumerics plus the underscore, which fit well with how 
identifiers in that computer language (and many others) were defined. 
Essentially \b is defined to break between runs of word characters 
versus runs of non-word characters.
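
As a rough illustration of that run-based behaviour, the sketch below
uses Python's re module rather than Perl (an assumption made for
illustration; Python's \w also accepts non-ASCII letters by default,
unlike the original pre-Unicode definition).

```python
import re


def classic_runs(text):
    """Split text into alternating runs of \\w and \\W characters."""
    return re.findall(r"\w+|\W+", text)


# The full stop and both spaces stay together in one non-word run.
print(classic_runs("Stop.  Go_2"))
```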


The latest version of Perl 5 (recently released) has added \b{w} based 
on Unicode's definition.  The typical expectation of its programmers is 
that it would be a drop-in replacement for the old \b, with much better 
results in parsing natural languages.


But it isn't such a replacement, creating some consternation, and the 
main reason is that, unlike \b, it treats the boundary between white 
space characters as a breaking opportunity, so that it doesn't create 
runs of them.  Thus if you have two spaces after a full stop, it treats 
each as an individual word.


My question is: Was this intentional, and if so, why?

TR18 says \b{w} is a Zero-width match at a Unicode word boundary. Note 
that this is different than \b alone, which corresponds to \w and \W.


And UAX29 says adjacent spaces are collapsed to a single space in 
intelligent cut and paste using the WB property.




Re: Square Brackets with Tick

2015-08-22 Thread Asmus Freytag

  
  
On 8/22/2015 2:47 PM, Richard Wordingham wrote:

 But codepoints are normally orderly until they enter the ISO approval
 process.  Thereafter, disorder creeps in, and becomes ever more likely
 as blocks fill up.

Haha, good one.

 The concern here is that the opening-closing
 pairing information, which used not to be a property, has been deduced
 wrongly.  The code chart is prima facie evidence that whoever drew the
 order up conceived of U+298D and U+298E as a pair.


Not necessarily. Code charts are sometimes ordered in mysterious
ways. However, read on.


  

 I've traced the character as far back as
 http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning
 therein is implicitly described as unknown! It looks as though someone
 somewhere fashioned type for it - or perhaps another of the set of four
 - but no-one remembers what it was used for!


This document doesn't tell you what the pairing is supposed to be,
only which ones are opening and which are closing (so we know that they
are intended to be arranged [ ] and not ] [ (ticks omitted), but we
don't know which of the two [[ go with which of the two ]], other than
the - natural - assumption that pairs are listed adjacently).

For the first document that gives the pairing information, see:

http://www.unicode.org/L2/L2012/12173r-bidi-paren.pdf

There is no note or other indication in this document that shows
that any thought was put into the different ordering.

However, it is notable that all other bracket pairings follow the
bidi mirroring glyph relation, so I would put my money on the mirroring
file having been used to create the pairs by script, rather than by
manual editing.

This is corroborated in section 3.2 of that document.

Nigel was the first to notice that these were not encoded as
left-right glyph pairs, but with the diagonal "tick" (originally called
a solidus) having the same orientation in a pair (as if intended to
bracket something in either diagonal or anti-diagonal direction).

Given that L2/12-173 states that the property was derived via an
algorithm that is based on left-right mirroring and not via matching
open/close pairs based on other factors (including adjacency in the
charts), I'm happy to join the growing chorus that declares this to
be a bug.

Luckily there seems to be no stability policy that would prevent
fixing this one.

A./
  



Re: Square Brackets with Tick

2015-08-22 Thread Asmus Freytag (t)

  
  
On 8/22/2015 9:35 AM, Julian Bradfield wrote:

 There is no inherent meaning to the
 order of codepoints, it's just convenience.

And for that reason, we have property files to explicitly give the
properties rather than asking the user to "glean" them from code
point order.

A./

  



Square Brackets with Tick

2015-08-22 Thread Nigel Small
Hi all

I am looking for clarification on an aspect of Unicode bracket pairing,
specifically in relation to the following four characters:

298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER

These stand out from all other brackets listed in *BidiBrackets.txt* due to
an inconsistency in pairing. I have looked for references online on where
these brackets are used in the wild as mathematical symbols but have been
unable to find anything useful.

All other bracket pairs are listed as opener followed by closer, sometimes
with several code points in between. According to the code point pairs in
the first and second columns of this file, these particular brackets should
be paired as the *first and fourth* and the *third and second*. Intuitively
however, these would actually be *first and second* and *third and fourth*
if one is to expect consistency.
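
The pairing can be read off mechanically. The sketch below parses only
the four lines quoted above (not the real BidiBrackets.txt, whose full
format is not assumed here) and prints which closer each opener is
paired with; the function names are made up for this illustration.

```python
DATA = """\
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
"""


def parse(data):
    """Map each code point to (paired code point, 'o' or 'c', name)."""
    entries = {}
    for line in data.splitlines():
        fields, _, name = line.partition("#")
        cp, paired, kind = (f.strip() for f in fields.split(";"))
        entries[int(cp, 16)] = (int(paired, 16), kind, name.strip())
    return entries


def opener_to_closer(entries):
    """Keep only the opening brackets and the closer each points at."""
    return {cp: paired for cp, (paired, kind, _) in entries.items() if kind == "o"}


entries = parse(DATA)
pairs = opener_to_closer(entries)
# Every entry's partner should point back at it.
symmetric = all(entries[paired][0] == cp for cp, (paired, _, _) in entries.items())

for opener, closer in sorted(pairs.items()):
    print(f"U+{opener:04X} -> U+{closer:04X} (offset {closer - opener:+d})")
print("symmetric:", symmetric)
```

The data is internally consistent, so the open question is only which
of the two possible pairings was intended.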

My guess is that there are three possibilities here:
1. The current pairing information is correct and the sequence is irregular
for some historical reason
2. The pairing information is wrong and the sequence is consistent with
other brackets
3. Pairing can be mixed with either left bracket used as a valid opener and
either right bracket used as a valid closer; in this case, the pairing
information is incomplete

I'd be very grateful if anyone could clarify the situation here or if
anyone knows of a resource that describes where such brackets are used in
practice.

Many thanks
Nigel Small


Re: Square Brackets with Tick

2015-08-22 Thread Eli Zaretskii
 From: Nigel Small ni...@nigelsmall.com
 Date: Sat, 22 Aug 2015 17:08:48 +0100
 
 I am looking for clarification on an aspect of Unicode bracket pairing,
 specifically in relation to the following four characters:
 
 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
 
 These stand out from all other brackets listed in BidiBrackets.txt due to an
 inconsistency in pairing. I have looked for references online on where these
 brackets are used in the wild as mathematical symbols but have been unable to
 find anything useful.
 
 All other bracket pairs are listed as opener followed by closer, sometimes
 with several code points in between.

I think the order in the file is by the codepoint in the leftmost
column.  All the rest is just a coincidence.

But I don't speak for the Unicode Consortium, so please wait for a
definitive reply.


Re: Square Brackets with Tick

2015-08-22 Thread Julian Bradfield
On 2015-08-22, Nigel Small ni...@nigelsmall.com wrote:
 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER

 with several code points in between. According to the code point pairs in
 the first and second columns of this file, these particular brackets should
 be paired as the *first and fourth* and the *third and second*. Intuitively
 however, these would actually be *first and second* and *third and fourth*
 if one is to expect consistency.

That's a strange intuition! Mathematical brackets are expected to pair
with left-right symmetry, not rotational symmetry. As in, for example,
floor and ceiling brackets. The pairing in the file is the natural one.

 1. The current pairing information is correct and the sequence is irregular
 for some historical reason

That will be the explanation. There is no inherent meaning to the
order of codepoints, it's just convenience.
One of the experts here can probably tell us why these four brackets
happen to be coded in this order.
