Re: [HarfBuzz] script segmentation

2018-02-15 Thread Martin Hosken
Dear Richard,

I would reply to my own message, but that never comes back to me. Thank you 
(NOT) Google. So here goes.

I've started a discussion document here:
https://github.com/OpenType/opentype-layout/blob/master/docs/script_segmentation.md
Please feel free to interact with it. If you are one of the implementers of 
the segmentation algorithms summarised, please feel free to correct me. I'm 
certain I've made lots of mistakes in describing this.

Yours,
Martin

> > 1. Do we have a standard algorithm for this?  
> Well, the obvious fix is a per-block default script, just as some
> unassigned characters have a default property of AL or R.  The problem
> comes with Indic scripts, though a default of consonant will often work.
> 
> > 2. Do we want one?  
> I suspect you're the expert.  How well does MultiScribe work on
> Windows?  On Apple systems, the answer for ordinary users is to use
> AAT, and I suspect that will soon extend to Linux applications courtesy
> of HarfBuzz.  I don't know if that would work on ChromeOS.
> 
> On the other hand, in the free world it would be nice to test out
> OpenType fonts.  Several applications already use a Linux sharable
> object for HarfBuzz, and one could in principle replace them with a
> version that already included the new characters.  LibreOffice is one
> such application.
> 
> > 3. How can we make it more future resilient?  
> 
> A mechanism that ascribes properties to PUA points could be extended to
> unassigned characters in general.
> 
> In principle, the USE grammar policeman is a problem.  Combining marks
> can usually be identified by an OpenType glyph category of 'mark', but
> unassigned combining marks are unlikely to get a security clearance, so
> the obvious relaxation will not work.
> 
> Richard.
___
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz


Re: [HarfBuzz] script segmentation

2018-02-15 Thread Richard Wordingham
On Wed, 14 Feb 2018 11:01:55 +0700
Martin Hosken wrote:

> 1. Do we have a standard algorithm for this?
Well, the obvious fix is a per-block default script, just as some
unassigned characters have a default property of AL or R.  The problem
comes with Indic scripts, though a default of consonant will often work.
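A minimal sketch of the per-block default idea, for illustration only. The
table BLOCK_DEFAULT_SCRIPT and the functions default_script and guess_script
are hypothetical names, not part of HarfBuzz or any existing library; a real
table would be generated from the Unicode data files rather than written by
hand. The block ranges and ISO 15924 tags shown are just three examples.

```python
# Hypothetical hand-written table: (block start, block end, default script).
# Only three blocks are listed here as examples.
BLOCK_DEFAULT_SCRIPT = [
    (0x0900, 0x097F, "Deva"),  # Devanagari block
    (0x1000, 0x109F, "Mymr"),  # Myanmar block
    (0x1A20, 0x1AAF, "Lana"),  # Tai Tham block
]


def default_script(cp):
    """Return the default script for the block containing code point cp."""
    for start, end, script in BLOCK_DEFAULT_SCRIPT:
        if start <= cp <= end:
            return script
    return "Zzzz"  # ISO 15924 code for an unknown script


def guess_script(ch, known_scripts):
    """Return the script of ch.

    known_scripts is an assumed dict mapping code point -> script tag for
    the characters the installed library already knows about. Anything
    missing from it falls back to the default for its block.
    """
    cp = ord(ch)
    if cp in known_scripts:
        return known_scripts[cp]
    return default_script(cp)
```

With an empty known_scripts, a code point inside the Tai Tham block resolves
to "Lana" and a code point outside every listed block resolves to "Zzzz".
This says nothing about the Indic problem above: a default script does not
tell the shaper whether the character is a consonant, vowel sign or mark.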

> 2. Do we want one?
I suspect you're the expert.  How well does MultiScribe work on
Windows?  On Apple systems, the answer for ordinary users is to use
AAT, and I suspect that will soon extend to Linux applications courtesy
of HarfBuzz.  I don't know if that would work on ChromeOS.

On the other hand, in the free world it would be nice to test out
OpenType fonts.  Several applications already use a Linux sharable
object for HarfBuzz, and one could in principle replace them with a
version that already included the new characters.  LibreOffice is one
such application.

> 3. How can we make it more future resilient?

A mechanism that ascribes properties to PUA points could be extended to
unassigned characters in general.

In principle, the USE grammar policeman is a problem.  Combining marks
can usually be identified by an OpenType glyph category of 'mark', but
unassigned combining marks are unlikely to get a security clearance, so
the obvious relaxation will not work.

Richard.


[HarfBuzz] script segmentation

2018-02-13 Thread Martin Hosken
Dear All,

One problem I am facing as we add characters to Unicode is that when a character
is added to a block, an existing application will not necessarily keep that
character in the same run as other characters of the same script in that block.
This means the app is broken until the character is published in a future
Unicode standard, a library is updated, and the application is updated to use
the new version of the library. It also makes it impossible to test out
proposed changes to Unicode. It would be great if we could come up with a
standard script segmentation algorithm for runs of text that is also somewhat
future proof, even if it is not perfect and changes in the future. A best guess
at what script an unknown character may take has a much higher probability of
being correct than giving it a special script category of unknown, which is
always going to be wrong.
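
To make the "best guess" concrete, here is a rough sketch of one possible
policy: a character whose script is unknown joins the run of the character
before it instead of forming an "unknown" run of its own. This is an
assumption for illustration, not an existing or agreed algorithm. The function
name segment and the script_of callback are hypothetical; script_of stands in
for whatever script lookup the application already has and returns None when
the character is not known to it.

```python
from itertools import groupby


def segment(text, script_of):
    """Split text into a list of (script, substring) runs.

    script_of(ch) returns a script tag, or None for a character the
    lookup does not know. Unknown characters take the script of the
    preceding character; leading unknowns take the script of the first
    known character, or "Zzzz" if the text has no known character.
    """
    resolved = []
    last = None
    for ch in text:
        script = script_of(ch)
        if script is None:
            script = last  # best guess: same script as the previous character
        resolved.append(script)
        last = script

    # Only leading entries can still be None at this point.
    first_known = next((s for s in resolved if s is not None), "Zzzz")
    resolved = [s if s is not None else first_known for s in resolved]

    # Group consecutive characters that share a resolved script.
    runs = []
    for script, group in groupby(zip(resolved, text), key=lambda pair: pair[0]):
        runs.append((script, "".join(ch for _, ch in group)))
    return runs
```

Under this policy a new, unrecognised character in the middle of Myanmar text
stays in the Myanmar run, which is the behaviour asked for above. It does not
handle cases where the unknown character sits on a script boundary and
belongs with the following run.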

So.

1. Do we have a standard algorithm for this?
2. Do we want one?
3. How can we make it more future resilient?

TIA,
Yours,
Martin