Re: Bengla syllables <... 09BF 09BE> and <... 09BF 09C0>

2017-02-07 Thread Manish Goregaokar
Not a Bangla speaker, but they look like typos to me too. Only certain vowel diacritics double up in Indic languages (e.g. anusvaras). I'm not sure how you would even pronounce such sounds. I suppose such combinations of diacritics could be used to represent diphthongs in words from other

Re: Bengla syllables <... 09BF 09BE> and <... 09BF 09C0>

2017-02-08 Thread Manish Goregaokar
OCR algorithm isn't aware of Assamese-only characters in this case. -Manish On Tue, Feb 7, 2017 at 9:38 PM, Manish Goregaokar <man...@mozilla.com> wrote: > > The very first one কিী‎ (0995 09BF 09C0) had 1090 hits and shows up in a > book of short stories: > > That's bad OCR, th

Re: Bengla syllables <... 09BF 09BE> and <... 09BF 09C0>

2017-02-07 Thread Manish Goregaokar
> The very first one কিী‎ (0995 09BF 09C0) had 1090 hits and shows up in a book of short stories: That's bad OCR, that's an apostrophe, a Ka, and an E, with the apostrophe being interpreted as a matra somehow. I bet there are only a couple of OCR algorithms out there handling Bangla. Indic
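To see what a suspicious sequence such as 0995 09BF 09C0 actually contains, it can help to list its code points by name. This is a minimal Python sketch; the helper names `describe` and `has_doubled_vowel_sign` are made up for illustration, and matching "VOWEL SIGN" in the character name is a rough heuristic, not a real Indic syllable check.

```python
import unicodedata


def describe(text):
    """Return (code point, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def has_doubled_vowel_sign(text):
    """Heuristic (assumption): two adjacent code points both named '... VOWEL SIGN ...'."""
    flags = ["VOWEL SIGN" in unicodedata.name(ch, "") for ch in text]
    return any(a and b for a, b in zip(flags, flags[1:]))


if __name__ == "__main__":
    seq = "\u0995\u09BF\u09C0"  # the sequence quoted in the message above
    for cp, name in describe(seq):
        print(cp, name)
    print(has_doubled_vowel_sign(seq))
```

A filter like this could flag candidate OCR errors in a corpus for manual review, but it says nothing about whether a given hit is a typo or a deliberate spelling.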

UAX #29: Ambiguities in WB4, and contributing back testcases

2016-12-21 Thread Manish Goregaokar
ng, the algorithm is suddenly dependent on the order and fashion in which WB4 is applied. Could this be clarified? Thanks, -Manish Goregaokar [1]: https://github.com/unicode-rs/unicode-segmentation/pull/10 [2]: http://www.unicode.org/reports/tr29/ (permalink: http://www.unicode.org/reports/

Another UAX #29 bug: property tables need updating

2016-12-22 Thread Manish Goregaokar
The spec lists GraphemeBreakProperty.txt[1] and WordBreakProperty.txt[2] as the normative source for grapheme and word categorization respectively. However, the spec also gives non-normative definitions of these properties. In particular, it defines Glue_After_Zwj[3] as > Emoji characters that

Re: UAX #29: Ambiguities in WB4, and contributing back testcases

2016-12-22 Thread Manish Goregaokar
orld.com> wrote: > On Wed, 21 Dec 2016 15:24:21 -0800 > Manish Goregaokar <man...@mozilla.com> wrote: > > >> Aside from that, WB4's[6] greediness is underspecified. In previous >> versions, the rule was > > >> However, now the rule is >>

Re: Another UAX #29 bug: property tables need updating

2016-12-22 Thread Manish Goregaokar
Will do, thanks! -Manish On Thu, Dec 22, 2016 at 11:16 AM, Ken Whistler <kenwhist...@att.net> wrote: > Manish, > > > On 12/22/2016 10:35 AM, Manish Goregaokar wrote: >> >> The property table should include all role and gender modifiers as GAZ. >> >> Cou

Re: New tool unidump

2017-03-17 Thread Manish Goregaokar
https://r12a.github.io/uniview/ https://r12a.github.io/apps/conversion/ are excellent tools for this, as well, if you're in a situation where you can copy into a web form. This looks useful for commandline stuff, though, thanks! -Manish On Fri, Mar 17, 2017 at 1:44 PM, Manuel Strehl

Re: "A Programmer's Introduction to Unicode"

2017-03-12 Thread Manish Goregaokar
> This is just another confirmation that the present Unicode terminology is confusing. I find this to be a symptom of our pedagogy around "characters" in programming; most folks get taught that characters are bytes are code points, especially because many languages try to make this the case. The
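The gap between "characters", bytes, and code points is easy to see by counting each unit for the same string. The sketch below is only an illustration in Python; `unit_counts` is a hypothetical helper name, and the sample string is my own choice rather than one from the thread.

```python
def unit_counts(text):
    """Count the same string in three different units."""
    return {
        "code_points": len(text),                            # Python str is a code point sequence
        "utf8_bytes": len(text.encode("utf-8")),             # bytes in UTF-8
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # 16-bit code units in UTF-16
    }


if __name__ == "__main__":
    # Devanagari KA + VIRAMA + SSA + VOWEL SIGN I, typically drawn as one conjunct.
    print(unit_counts("\u0915\u094D\u0937\u093F"))
    # U+1F58E LEFT WRITING HAND, a single code point outside the BMP.
    print(unit_counts("\U0001F58E"))
```

None of the three numbers is guaranteed to match what a reader would count as one "character"; that needs grapheme cluster segmentation (UAX #29), which the Python standard library does not provide.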

Re: "A Programmer's Introduction to Unicode"

2017-03-10 Thread Manish Goregaokar
I recently wrote http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ , which sort of addresses the whole hangup programmers have with treating code points as "characters". I also wrote

Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Manish Goregaokar
Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it. -Manish On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote: > On Mon, 13 Mar 2017 23:10:11 +0200 >

Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Manish Goregaokar
Ah, it was what I thought you were talking about -- I wasn't aware they were considered word boundaries :) Thanks for the links! On Mar 13, 2017 4:54 PM, "Richard Wordingham" < richard.wording...@ntlworld.com> wrote: On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar <

Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
see how intra-conjunct selection would be useful otherwise. -Manish On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode <unicode@unicode.org> wrote: > On Thu, 20 Apr 2017 11:17:05 -0700 > Manish Goregaokar via Unicode <unicode@unicode.org> wrote: > >

Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
I don't think there's consensus. When given a rendered representation people seem to uniformly count conjuncts as multiple aksharas if rendered with visible halant, and as a single akshara if they are rendered conjoined. Most fonts for Devanagari these days are pretty good at conjoining

Re: Counting Devanagari Aksharas

2017-04-22 Thread Manish Goregaokar via Unicode
> You cannot even > meaningfully move by single characters in most clusters, because > composing characters generally completely changes how the original > characters looked, so there's nowhere you can display the cursor. Yes, and this is one of the reasons it feels broken in Devanagari, you get

Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
That seems like a relatively niche use case (especially with Vedic Sanskrit) compared to having weird selection for everything else. I'm not convinced. When I use a romanized Devanagari input method (I typically do on my laptop), deleting the whole cluster is necessary anyway for things to work

Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
ble. -Manish On Fri, Apr 21, 2017 at 4:04 PM, Richard Wordingham via Unicode <unicode@unicode.org> wrote: > On Thu, 20 Apr 2017 11:17:05 -0700 > Manish Goregaokar via Unicode <unicode@unicode.org> wrote: > >> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via U

Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2017-12-10 Thread Manish Goregaokar via Unicode
> GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant You can also explicitly request ligatureification with a ZWJ, so perhaps this rule should be something like (Virama ZWJ? | ZWJ) × Extend* LinkingConsonant -Manish On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
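The suggested form of the rule can be tried out on small Devanagari strings. The following is a toy Python sketch, not the UAX #29 algorithm: the property classes are hard-coded assumptions (only U+094D as Virama, only U+093C as Extend, U+0915..U+0939 as LinkingConsonant), and `gb9c_forbids_break` is a made-up name.

```python
import re


def classify(ch):
    """Map a character to a one-letter class (toy, Devanagari-only assumptions)."""
    if ch == "\u094D":
        return "V"  # Virama
    if ch == "\u200D":
        return "Z"  # ZWJ
    if ch == "\u093C":
        return "E"  # Extend (nukta only, as a stand-in)
    if "\u0915" <= ch <= "\u0939":
        return "L"  # LinkingConsonant
    return "X"      # anything else


def gb9c_forbids_break(text, i):
    """True if (Virama ZWJ? | ZWJ) × Extend* LinkingConsonant forbids a break before text[i]."""
    if not 0 < i < len(text):
        return False
    classes = "".join(classify(ch) for ch in text)
    left, right = classes[:i], classes[i:]
    return bool(re.search(r"(VZ?|Z)$", left)) and bool(re.match(r"E*L", right))


if __name__ == "__main__":
    plain = "\u0915\u094D\u0937"           # KA VIRAMA SSA
    joined = "\u0915\u094D\u200D\u0937"    # KA VIRAMA ZWJ SSA
    print([gb9c_forbids_break(plain, i) for i in range(1, len(plain))])
    print([gb9c_forbids_break(joined, i) for i in range(1, len(joined))])
```

In this sketch the boundary between the virama and a following ZWJ is not covered by the rule itself; in the real specification that position is handled by the existing rule that forbids breaking before Extend and ZWJ.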

Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Manish Goregaokar via Unicode
Hi, The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC. Are
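One way to probe this question is to normalize a candidate identifier and test it before and after. The Python sketch below is only an approximation of the Rust proposal: `str.isidentifier()` follows Python's own identifier definition, which is based on XID_Start and XID_Continue but also allows a leading underscore, and `check_nfkc_identifier` is a hypothetical helper name.

```python
import unicodedata


def check_nfkc_identifier(name):
    """Return (valid before NFKC, NFKC form, valid after NFKC)."""
    normalized = unicodedata.normalize("NFKC", name)
    return name.isidentifier(), normalized, normalized.isidentifier()


if __name__ == "__main__":
    # "ﬁle" starts with U+FB01 LATIN SMALL LIGATURE FI, which NFKC maps to "fi".
    print(check_nfkc_identifier("\ufb01le"))
```

Running the same check over every assigned code point would be a quick empirical test of closure under NFKC, though the result depends on the Unicode version bundled with the interpreter.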

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Manish Goregaokar via Unicode
Oh, looks like UAX 31 has info on how to be closed under NFKC http://www.unicode.org/reports/tr31/#NFKC_Modifications -Manish On Mon, Jun 4, 2018 at 12:49 PM Manish Goregaokar wrote: > Hi, > > The Rust community is considering > <https://github.com/rust-lang/rfcs/pull/2457>

Requiring typed text to be NFKC (was: Can NFKC turn valid UAX 31 identifiers into non-identifiers?)

2018-06-05 Thread Manish Goregaokar via Unicode
Following up from my previous email, one of the ideas that was brought up was that if we're going to consider NFKC forms equivalent, we should require things to be typed in NFKC. I'm a bit wary of this. As Richard brought up in

Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2017-12-21 Thread Manish Goregaokar via Unicode
> When deleting by backspace, the usual practice is to delete one Unicode character for each key press. This seems to depend on the operating system and program involved. For example, on OSX any native text input field (Spotlight, TextEdit, etc) will delete by extended grapheme cluster. Chrome

Re: Emoji’s

2018-01-11 Thread Manish Goregaokar via Unicode
I submitted a proposal to emojify the left writing hand code point. -Manish On Thu, Jan 11, 2018 at 5:00 PM, Christoph Päper via Unicode < unicode@unicode.org> wrote: > jillian mestel: > > > > I was very disappointed to learn that there are no emojis of portraying > a dominant left hand. > >

Re: Unicode of Death 2.0

2018-02-17 Thread Manish Goregaokar via Unicode
ific to the renderers implemented by Apple in iOS and > MacOS). This bug does not occur if another text rendering engine is used > (e.g. in non-Apple web browsers). > > > 2018-02-16 19:44 GMT+01:00 Manish Goregaokar <man...@mozilla.com>: > >> FWIW I dissected

Re: Unicode of Death 2.0

2018-02-18 Thread Manish Goregaokar via Unicode
hich may have been reordered in that buffer). >> >> Microsoft's text renderer, or other engines use do not delay the >> construction of the glyphs buffer, which can be reused for processing the >> rest of the text, provided it is correctly reset after processing a cluster.

Re: Unicode of Death 2.0

2018-02-18 Thread Manish Goregaokar via Unicode
Oh, also vatu. Seems like that ordering algorithm is indeed relevant. -Manish On Sat, Feb 17, 2018 at 11:57 PM, Manish Goregaokar <man...@mozilla.com> wrote: > Ah, looking at that the OpenType `pstf` feature seems relevant, though I > cannot get it to crash with Gurmukhi (where t

Re: Unicode of Death 2.0

2018-02-16 Thread Manish Goregaokar via Unicode
FWIW I dissected the crashing strings, it's basically all sequences in Telugu, Bengali, Devanagari where the consonant is suffix-joining (ra in Devanagari, jo and ro in Bengali, and all Telugu consonants), the vowel is not Bengali au or o / Telugu ai,

Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-02 Thread Manish Goregaokar via Unicode
unicode.org/reports/tr41/tr41-21.html#UTS51>]. > *and not* GCB = Virama > > Note: we are already planning to get rid of the GAZ/EBG distinction ( > http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. > > Mark > > On Mon, Jan 1, 2018 at 3:52 PM, Richard Wording

Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-02 Thread Manish Goregaokar via Unicode
:02 PM, Manish Goregaokar <man...@mozilla.com> wrote: > > Note: we are already planning to get rid of the GAZ/EBG distinction ( > http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. > > > This is great! I hadn't noticed this when I last saw that

Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2017-12-31 Thread Manish Goregaokar via Unicode
In UAX 29, the GB10 rule[1] (and the WB14 rule[2]) states that we should not break before E_Modifier characters when they follow an emoji base (with optional Extend characters in between). Given that the spec is allowed to ignore degenerates, is there any value lost by merging E_Modifier and

Re: Submissions open for 2020 Emoji

2018-04-20 Thread Manish Goregaokar via Unicode
It would also be useful if "Added to larger set" mentioned which proposal it was added to. Last December I proposed emojification for U+1F58E LEFT WRITING HAND, and that's marked as merged but it's unclear which proposal it was merged with. (Also the document isn't on L2 yet, I'm not sure why)