Re: emoji props in the ucdxml ?

2017-07-06 Thread Daniel Bünzli via Unicode
Ken,  Thanks for your explanations.  I would just like to note that UAX42 expresses a general xml data format to associate properties to code points. So it would be possible for the standard maintainers to publish, independently from the UCD, alongside the ad-hoc text files, xml files that

emoji props in the ucdxml ?

2017-07-05 Thread Daniel Bünzli via Unicode
Hello,  I know the emoji properties [1] are not formally part of the UCD (not sure exactly why though), but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files) ?  Thanks,  Daniel [1] http://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files [2]

Re: Unicode String Models

2018-09-09 Thread Daniel Bünzli via Unicode
Hello,  I find your notion of "model" and presentation a bit confusing since it conflates what I would call the internal representation and the API.  The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure.

UAX #42 update for 11.0.0 & \p{Extended_Pictographic}

2018-04-04 Thread Daniel Bünzli via Unicode
Hello,  Is there any ETA for an update to the ucdxml for 11.0.0 ?  Also while reviewing the proposed update to UAX #29, I noticed it refers to a property (\p{Extended_Pictographic}) that doesn't seem to be formally part of the UCD but to be found in UTS #51. Is there any chance for this

Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > There are two main choices for a scalar-value API: > > 1. Guarantee that the storage never contains surrogates. This is the > simplest model. > 2. Substitute U+FFFD for surrogates when the API returns code
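
For illustration, the second option quoted above (substituting U+FFFD for surrogates) might be sketched as follows in Python, whose `str` can hold lone surrogate code points. The name `scalar_values` is hypothetical and does not come from the thread; this is only a sketch of the substitution idea, not an API proposed by anyone in the discussion.

```python
from typing import Iterator

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def scalar_values(s: str) -> Iterator[int]:
    """Yield the code points of s, replacing any surrogate with U+FFFD."""
    for ch in s:
        cp = ord(ch)
        # U+D800..U+DFFF are surrogate code points, not Unicode scalar values.
        yield REPLACEMENT if 0xD800 <= cp <= 0xDFFF else cp
```

With this shape the storage may still contain surrogates; only what the iteration returns is guaranteed to be a scalar value.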

Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote:   > Let me clear that up; I meant that "the underlying storage never contains > something that would need to be represented as a surrogate code point." Of > course, UTF-16 does need surrogate code units. What #1

Re: Unicode String Models

2018-10-02 Thread Daniel Bünzli via Unicode
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > Because of performance and storage consideration, you need to consider the > possible internal data structures when you are looking at something as > low-level as strings. But most of the 'model's in the

Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Daniel Bünzli via Unicode
On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode (unicode@unicode.org) wrote: > When it comes to the second sentence of the text of Slide 7 'Grapheme > Clusters', my overwhelming reaction is one of extreme anger. Slide 8 > does nothing to lessen the offence. The problem is that it

Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Daniel Bünzli via Unicode
Thanks for your answer. > The compromise that has generally been reached is that 'delete' deletes > a grapheme cluster and 'backspace' deletes a scalar value. (There are > good editors like Emacs that delete only a single character.) Just to make things clear. When you say character in your

Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji))

2019-10-12 Thread Daniel Bünzli via Unicode
On 12 October 2019 at 02:05:23, Martin J. Dürst via Unicode (unicode@unicode.org) wrote: > I think it's less the format and much more the split personality of the > Unicode Web site(s?) that I have problems with. I also do.  One thing that is particularly annoying is the fact that the "home"

Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch) wrote: > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote: > > Re-reading the text I suspect I should not restart the rules from the first > > one when a WB4 rewrite occurs but

Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote: > Re-reading the text I suspect I should not restart the rules from the first > one when a WB4 > rewrite occurs but only apply the subsequent rules. Is that correct ? However even if that's correct I don't

UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
Hello,  My implementation of word break chokes only on the following test case from the file [1]:  ÷ 0020 × 0308 ÷ 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]  I find:  ÷ 0020 × 0308 × 0020 ÷ Basically my implementation
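
As a hedged aside, the two break sequences quoted above can be compared mechanically. In the Unicode break test files, `÷` marks a break opportunity and `×` marks no break, with code points given in hexadecimal and a trailing `#` comment. The helper name `parse_break_test_line` below is hypothetical; it is a minimal sketch of reading one such line, not the code of the implementation discussed in the message.

```python
from typing import List, Tuple


def parse_break_test_line(line: str) -> Tuple[List[int], List[bool]]:
    """Split a break test line into code points and break flags."""
    data = line.split("#", 1)[0].split()  # drop the trailing comment
    code_points: List[int] = []
    breaks: List[bool] = []
    for token in data:
        if token == "÷":
            breaks.append(True)   # break opportunity
        elif token == "×":
            breaks.append(False)  # no break
        else:
            code_points.append(int(token, 16))
    return code_points, breaks


expected = parse_break_test_line("÷ 0020 × 0308 ÷ 0020 ÷ # expected by the test file")
found = parse_break_test_line("÷ 0020 × 0308 × 0020 ÷")
# Positions at which the two results disagree about a break.
mismatches = [i for i, (e, f) in enumerate(zip(expected[1], found[1])) if e != f]
```

Here the only disagreement is at the boundary between U+0308 and the second U+0020.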

UAX #14 for 13.0.0: LB27's first line is obsolete

2020-03-03 Thread Daniel Bünzli via Unicode
Hello,  I think (more precisely my compiler thinks [1]) the first line of LB27 is already handled by the new LB22 rule and can be removed.  Best,  Daniel [1] File "uuseg_line_break.ml", line 206, characters 38-40: 206 |   | (* LB27 *)  _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s