Re: Proposed Phobos equivalent of wcswidth()

2018-01-20 Thread Dmitry Olshansky via Digitalmars-d

On Friday, 19 January 2018 at 19:33:28 UTC, H. S. Teoh wrote:
> On Thu, Jan 18, 2018 at 06:42:26PM +, Dmitry Olshansky via
> Digitalmars-d wrote: [...]
> > Also forgot to mention that you can pass BitPacked!(ubyte,2) to
> > Trie template as value type to use 2 bits per value. Should
> > reduce your width table 4-fold.  Just saying;)
>
> Thanks for the tip!  Indeed, the table size was reduced 4-fold.
> Awesome.
>
> However, now I'm finding that it no longer works properly when
> loaded from the precompiled data.  It appears to have something
> to do with the default value for the width table being 1 rather
> than ubyte.init, and so far I couldn't figure out how to get
> the Trie ctor that takes .offsets, .sizes, .data to specify a
> default value.

Why would you need a default in a low-level construction? I think
it naturally takes the tables with whatever was stored in there.
There is no processing.

So the default has to be explicitly stored during building of
trie.

> So now the trie is returning the wrong value for certain dchar
> ranges. :-(


Re: Proposed Phobos equivalent of wcswidth()

2018-01-19 Thread H. S. Teoh via Digitalmars-d
On Thu, Jan 18, 2018 at 06:42:26PM +, Dmitry Olshansky via Digitalmars-d 
wrote:
[...]
> Also forgot to mention that you can pass BitPacked!(ubyte,2) to Trie
> template as value type to use 2 bits per value. Should reduce your
> width table 4-fold.  Just saying;)

Thanks for the tip!  Indeed, the table size was reduced 4-fold.
Awesome.

However, now I'm finding that it no longer works properly when loaded
from the precompiled data.  It appears to have something to do with the
default value for the width table being 1 rather than ubyte.init, and so
far I couldn't figure out how to get the Trie ctor that takes .offsets,
.sizes, .data to specify a default value.  So now the trie is returning
the wrong value for certain dchar ranges. :-(


T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George 
Orwell


Re: Proposed Phobos equivalent of wcswidth()

2018-01-18 Thread Dmitry Olshansky via Digitalmars-d

On Wednesday, 17 January 2018 at 22:59:58 UTC, H. S. Teoh wrote:
> I took a first stab at integrating this into dlang/tools:
>
> https://github.com/quickfur/tools/tree/unicode_gen
>
> So far, I can get the 64-bit generator to run and produce the
> generated unicode_*.d files. Unfortunately they are missing the
> 32-bit data, because I couldn't get a 32-bit dmd toolchain
> working on my PC.
>
> Maybe you could take a look and submit PRs against that branch
> for any fixes you'd like to get in?  I'll see if I can somehow
> get 32-bit working on my PC.
>
> Alternatively, maybe the solution is to hack the Trie code so
> that it uses explicit int sizes rather than size_t, then we can
> use it to generate both 32-bit and 64-bit tables without
> requiring the host platform to support both.

Yes, I guess we have to allow word size to be redefined. I just
wanted fastest version by default w/o possibility to screw up on
the user side of things.


Also forgot to mention that you can pass BitPacked!(ubyte,2) to Trie
template as value type to use 2 bits per value. Should reduce your
width table 4-fold. Just saying;)
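
To make the 4-fold figure concrete, here is a minimal sketch of 2-bit packing. It is an illustration only and is not how std.uni stores BitPacked values internally; Packed2 and its members are hypothetical names. Since a width is always 0, 1 or 2, four entries fit into one byte instead of one entry per ubyte.

```d
// Illustration only: NOT std.uni's internal representation.
// Packed2 is a hypothetical helper that stores 2-bit values, four per byte.
struct Packed2
{
    ubyte[] data;

    // Allocate room for n two-bit entries, all initially 0.
    this(size_t n)
    {
        data = new ubyte[(n + 3) / 4];
    }

    // Read entry i.
    ubyte opIndex(size_t i) const
    {
        immutable shift = cast(uint)(i % 4) * 2;
        return cast(ubyte)((data[i / 4] >> shift) & 3);
    }

    // Store the low two bits of value into entry i.
    void opIndexAssign(ubyte value, size_t i)
    {
        immutable shift = cast(uint)(i % 4) * 2;
        immutable cleared = data[i / 4] & ~(3 << shift);
        data[i / 4] = cast(ubyte)(cleared | ((value & 3) << shift));
    }
}

unittest
{
    auto table = Packed2(1024);
    table[5] = 2;
    table[6] = 1;
    assert(table[5] == 2 && table[6] == 1 && table[4] == 0);
    // 1024 entries take 256 bytes here, versus 1024 bytes as plain ubytes.
    assert(table.data.length == 256);
}
```

Note that entries which were never written read back as 0 in this sketch, so a table whose default is 1 would have to store that value explicitly for every entry.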


Re: Proposed Phobos equivalent of wcswidth()

2018-01-17 Thread H. S. Teoh via Digitalmars-d
On Wed, Jan 17, 2018 at 05:06:05AM +, Dmitry Olshansky via Digitalmars-d 
wrote:
> On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote:
[...]
> > One thing, though: I think it would benefit us all if we could
> > import at least gen_uni into Phobos, so that in the future when we
> > need to update std.uni to a new version of Unicode, it can be
> > (mostly) automated.  It's better to have the tools to generate the
> > tables in Phobos itself, than to be dependent on an external repo
> > that may go out-of-sync eventually.
> 
> Yes but it’s non-trivial at the moment, if you take a look at script
> to generate stuff it takes both 32-bit and 64-bit executables to
> populate tables.
> 
> I think having it in tools repo should be fine though. Last time I
> tried to update to Unicode 10, I found one table in Phobos that is
> missing from generator (ooops!).

I took a first stab at integrating this into dlang/tools:

https://github.com/quickfur/tools/tree/unicode_gen

So far, I can get the 64-bit generator to run and produce the generated
unicode_*.d files. Unfortunately they are missing the 32-bit data,
because I couldn't get a 32-bit dmd toolchain working on my PC.

Maybe you could take a look and submit PRs against that branch for any
fixes you'd like to get in?  I'll see if I can somehow get 32-bit
working on my PC.

Alternatively, maybe the solution is to hack the Trie code so that it
uses explicit int sizes rather than size_t, then we can use it to
generate both 32-bit and 64-bit tables without requiring the host
platform to support both.  I imagine we may have problems getting the
tools repo to build on the autotester once we integrate gen_uni into the
makefile, unless we do something like this.


> > When I get around to making a PR for strwidth AKA displayWidth, the
> > plan is to check-in compileWidth.d in some form into Phobos
> > somewhere, so that somebody else can pick it up and improve the
> > implementation in the future if I'm not around / unavailable.
> > 
> > If we can get gen_uni into Phobos, perhaps we can even include the
> > displayWidth table generation in gen_uni too, so that all the table
> > generation code is in one place.
> 
> Right. A good step would be to move it to tools, then add your code.
[...]

Good idea.  Well, I started with the branch linked above in my fork of
dlang/tools.  If I can get it off the ground, I'll add the displayWidth
stuff in as well, then formulate a PR to add displayWidth to std.uni.

Well, technically I don't need to wait for that, since I could just add
the precomputed table directly into std/internal/unicode_tables.d. But
it's probably better to let the generator do the job instead.  A
precomputed table is rather hard to review for correctness when it comes
PR review time. :-D


T

-- 
Don't get stuck in a closet---wear yourself out.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-16 Thread Dmitry Olshansky via Digitalmars-d

On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote:
> On Tue, Jan 16, 2018 at 05:49:11PM +, Dmitry Olshansky via
> Digitalmars-d wrote:
> > On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> > [...]
> > > One thing I'm seeking help with, and this is mainly directed
> > > at Dmitry Olshansky but can be anyone here who knows the
> > > internal workings of std.uni well enough, is how to
> > > transform the Trie generated by the static ctor into
> > > compile-time TrieNode declarations.  This is one blocker for
> > > my turning this code into a Phobos PR, because I don't want
> > > to incur the cost of initializing this trie at runtime.
> >
> > Check out my horribly named repo gsoc-uni-benchmark:
> >
> > https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
> >
> > This is what generates unicode tables.
> > Need to revise it, as folks were delicate enough to hand-patch
> > auto-generated code in Phobos.
> >
> > Maybe make some of that user-accessible.
> [...]
>
> Whoa. There's some pretty cool stuff in there!  Thanks, I've
> started experimenting with pre-generating the width table.
> Pretty neat. There's a lot of hidden gems in std.uni that I
> never knew existed, hidden away under `private`. :-D

The intent is to open that up somehow, to allow folks to make
their own extended versions of std.uni. Unicode is all about
“tailoring” - adjusting algorithm to your specific regional
preferences by messing with tables.

I think there is at least 1 bug in Bugzilla on this.

> One thing, though: I think it would benefit us all if we could
> import at least gen_uni into Phobos, so that in the future when
> we need to update std.uni to a new version of Unicode, it can
> be (mostly) automated.  It's better to have the tools to
> generate the tables in Phobos itself, than to be dependent on
> an external repo that may go out-of-sync eventually.

Yes but it’s non-trivial at the moment, if you take a look at
script to generate stuff it takes both 32-bit and 64-bit
executables to populate tables.

I think having it in tools repo should be fine though. Last time
I tried to update to Unicode 10, I found one table in Phobos that
is missing from generator (ooops!).

> When I get around to making a PR for strwidth AKA displayWidth,
> the plan is to check-in compileWidth.d in some form into Phobos
> somewhere, so that somebody else can pick it up and improve the
> implementation in the future if I'm not around / unavailable.
>
> If we can get gen_uni into Phobos, perhaps we can even include
> the displayWidth table generation in gen_uni too, so that all
> the table generation code is in one place.

Right. A good step would be to move it to tools, then add your
code.

> T


Re: Proposed Phobos equivalent of wcswidth()

2018-01-16 Thread H. S. Teoh via Digitalmars-d
On Tue, Jan 16, 2018 at 05:49:11PM +, Dmitry Olshansky via Digitalmars-d 
wrote:
> On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
[...]
> > One thing I'm seeking help with, and this is mainly directed at
> > Dmitry Olshansky but can be anyone here who knows the internal
> > workings of std.uni well enough, is how to transform the Trie
> > generated by the static ctor into compile-time TrieNode
> > declarations.  This is one blocker for my turning this code into a
> > Phobos PR, because I don't want to incur the cost of initializing
> > this trie at runtime.
> 
> 
> Check out my horribly named repo gsoc-uni-benchmark:
> 
> https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
> 
> This is what generates unicode tables.
> Need to revise it, as folks were delicate enough to hand-patch
> auto-generated code in Phobos.
> 
> Maybe make some of that user-accessible.
[...]

Whoa. There's some pretty cool stuff in there!  Thanks, I've started
experimenting with pre-generating the width table.  Pretty neat. There's
a lot of hidden gems in std.uni that I never knew existed, hidden away
under `private`. :-D

One thing, though: I think it would benefit us all if we could import at
least gen_uni into Phobos, so that in the future when we need to update
std.uni to a new version of Unicode, it can be (mostly) automated.  It's
better to have the tools to generate the tables in Phobos itself, than
to be dependent on an external repo that may go out-of-sync eventually.

When I get around to making a PR for strwidth AKA displayWidth, the plan
is to check-in compileWidth.d in some form into Phobos somewhere, so
that somebody else can pick it up and improve the implementation in the
future if I'm not around / unavailable.

If we can get gen_uni into Phobos, perhaps we can even include the
displayWidth table generation in gen_uni too, so that all the table
generation code is in one place.


T

-- 
Having a smoking section in a restaurant is like having a peeing section in a 
swimming pool. -- Edward Burr 


Re: Proposed Phobos equivalent of wcswidth()

2018-01-16 Thread Dmitry Olshansky via Digitalmars-d

On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via
> Digitalmars-d wrote: [...]
> > https://github.com/quickfur/strwidth
> [...]
>
> One thing I'm seeking help with, and this is mainly directed at
> Dmitry Olshansky but can be anyone here who knows the internal
> workings of std.uni well enough, is how to transform the Trie
> generated by the static ctor into compile-time TrieNode
> declarations.  This is one blocker for my turning this code
> into a Phobos PR, because I don't want to incur the cost of
> initializing this trie at runtime.

Check out my horribly named repo gsoc-uni-benchmark:

https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d

This is what generates unicode tables.
Need to revise it, as folks were delicate enough to hand-patch
auto-generated code in Phobos.

Maybe make some of that user-accessible.

> T


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread WhatMeForget via Digitalmars-d

On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:

On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:

...


Thanks for taking the time to do this.

And now the obligatory bikeshed: what should the Phobos 
equivalent of wcswidth be called?


std.utf.displayWidth


std.utf.bikeshed

Never heard that phrase before. Nice one :)


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread H. S. Teoh via Digitalmars-d
On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
>   https://github.com/quickfur/strwidth
[...]

One thing I'm seeking help with, and this is mainly directed at Dmitry
Olshansky but can be anyone here who knows the internal workings of
std.uni well enough, is how to transform the Trie generated by the
static ctor into compile-time TrieNode declarations.  This is one
blocker for my turning this code into a Phobos PR, because I don't want
to incur the cost of initializing this trie at runtime.

Also, on a related note, there exist nicer interfaces in std.uni for
constructing Tries that map ranges of codepoints to non-boolean values,
but none of these are available publicly.  The current implementation in
strwidth only uses the public API of std.uni, so the construction of the
trie is pretty horrendous (looping over individual codepoints and
creating an AA of individual codepoints -- including very large ranges
like the entire Unicode plane 2).  I wonder if some of these facilities
should be made public so that user code that needs to construct
codepoint tries that include large ranges of codepoints can do so more
efficiently.


T

-- 
This sentence is false.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Jonathan M Davis via Digitalmars-d
On Monday, January 15, 2018 10:37:14 H. S. Teoh via Digitalmars-d wrote:
> On Mon, Jan 15, 2018 at 06:20:16PM +, Jack Stouffer via Digitalmars-d 
wrote:
> > On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
> > > On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via
> > > Digitalmars-d
> > >
> > > wrote:
> > > > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
> > > > > std.utf.displayWidth
> > > >
> > > > +1
> > >
> > > [...]
> > >
> > > Why std.utf rather than std.uni, though?
> >
> > The way I understand it is that std.uni is (supposed to be) for
> > functions on individual unicode units (be they code units/points or
> > graphemes) and std.utf is for functions which handle operating on
> > unicode strings.
>
> Are you sure?  I thought std.utf was specifically dealing with UTF-*
> encodings, i.e., code units and conversions to/from code points, and
> std.uni was supposed to be for implementing Unicode algorithms and
> Unicode compliance in general, i.e., stuff that works at the code point
> level.

Your understanding of the division more or less matches mine, though I'm not
sure that the line is entirely clearcut. I would definitely think that
std.uni was the more appropriate place for such a function.

- Jonathan M Davis




Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread H. S. Teoh via Digitalmars-d
On Mon, Jan 15, 2018 at 06:20:16PM +, Jack Stouffer via Digitalmars-d wrote:
> On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
> > On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via Digitalmars-d
> > wrote:
> > > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
> > > > std.utf.displayWidth
> > > 
> > > +1
> > [...]
> > 
> > Why std.utf rather than std.uni, though?
> 
> The way I understand it is that std.uni is (supposed to be) for
> functions on individual unicode units (be they code units/points or
> graphemes) and std.utf is for functions which handle operating on
> unicode strings.

Are you sure?  I thought std.utf was specifically dealing with UTF-*
encodings, i.e., code units and conversions to/from code points, and
std.uni was supposed to be for implementing Unicode algorithms and
Unicode compliance in general, i.e., stuff that works at the code point
level.


> Obviously there are exceptions. I think "they" put graphemeStride in
> std.uni because Grapheme was defined there and it seemed reasonable at
> the time.  But, generally I think utf stuff should go into std.utf.

But displayWidth isn't really directly related to UTF (i.e., the
encoding of Unicode code points).  It seems to me to be more to do with
processing Unicode in general, though, granted, the optimizations I
implemented are kinda in a grey zone between dealing with Unicode proper
(i.e., with code points) vs. working with code units.


T

-- 
Klein bottle for rent ... inquire within. -- Stephen Mulraney


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Jack Stouffer via Digitalmars-d

On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
> On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via
> Digitalmars-d wrote:
> > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer
> > wrote:
> > > std.utf.displayWidth
> >
> > +1
> [...]
>
> Why std.utf rather than std.uni, though?


The way I understand it is that std.uni is (supposed to be) for 
functions on individual unicode units (be they code units/points 
or graphemes) and std.utf is for functions which handle operating 
on unicode strings. Obviously there are exceptions. I think 
"they" put graphemeStride in std.uni because Grapheme was defined 
there and it seemed reasonable at the time. But, generally I 
think utf stuff should go into std.utf.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread H. S. Teoh via Digitalmars-d
On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via Digitalmars-d wrote:
> On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
> > std.utf.displayWidth
> 
> +1
[...]

Why std.utf rather than std.uni, though?


T

-- 
ASCII stupid question, getty stupid ANSI.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Dominikus Dittes Scherkl via Digitalmars-d

On Monday, 15 January 2018 at 15:08:24 UTC, Kagamin wrote:
> columnWidth as it only makes sense for column-oriented text
> display.

I think displayWidth is better, because "width" is directly
linked to the horizontal direction (else it would be called height),
and setting text in columns would still take additional steps to
be set correctly.
Also "display" indicates that it has nothing to do with the
string length, which is good to avoid confusion.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Kagamin via Digitalmars-d
columnWidth as it only makes sense for column-oriented text 
display.


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Simen Kjærås via Digitalmars-d

On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
> std.utf.displayWidth

+1

--
  Simen


Re: Proposed Phobos equivalent of wcswidth()

2018-01-15 Thread Jack Stouffer via Digitalmars-d

On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:

...


Thanks for taking the time to do this.

And now the obligatory bikeshed: what should the Phobos 
equivalent of wcswidth be called?


std.utf.displayWidth


Proposed Phobos equivalent of wcswidth()

2018-01-13 Thread H. S. Teoh via Digitalmars-d
This past week, while reviewing Phobos PR #6008, I started experimenting
with an optimized D equivalent of wcswidth().

For more details, see:

https://issues.dlang.org/show_bug.cgi?id=7054
https://issues.dlang.org/show_bug.cgi?id=17810

as well as the discussion on:

https://github.com/dlang/phobos/pull/6008

Anyway, the TL;DR summary is this: given a format() spec like "%20s", in
order to insert the correct number of spaces to pad the string to 20
characters (or rather, 20 spaces in the output), we need to compute the
displayed length of the string in monospace font.  Unfortunately, given
the complexities of Unicode, this is far from trivial:

- In C, the C library doesn't even pretend to know Unicode, so the
  padding is just based on the number of bytes the string occupies.
  Obviously, for anything non-ASCII the output will be wrong
  (misaligned).

- In D, in the original naïve implementation, we try to be a little
  smarter by counting the number of dchars. Unfortunately, this is also
  wrong, because of combining diacritics like U+0301 which modify the
  preceding character and do not advance the cursor. 

- In Phobos PR #6008, we improved this to count grapheme clusters
  instead. However, this is *still* wrong, because of the existence of
  zero-width characters (don't you just love Unicode?!), and also
  because of "wide" or "full-width" East Asian block characters as
  specified by Unicode TR11 (and scarily enough, the new Emoji blocks
  are included in this "wide" category), which on a text console
  generally occupy 2 positions per grapheme rather than 1.
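
As a small illustration of the three counts above, the following sketch compares bytes, code points and grapheme clusters for two short strings. The helper names codePointCount and graphemeCount are made up for this example; walkLength and byGrapheme are the public Phobos functions.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Number of code points (dchars) in the string.
size_t codePointCount(string s) { return s.walkLength; }

// Number of grapheme clusters in the string.
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

unittest
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one column on screen.
    string accented = "e\u0301";
    assert(accented.length == 3);           // UTF-8 bytes
    assert(codePointCount(accented) == 2);  // base letter + diacritic
    assert(graphemeCount(accented) == 1);

    // U+65E5 U+672C, two wide CJK ideographs: a terminal typically
    // shows them in four columns.
    string cjk = "\u65E5\u672C";
    assert(cjk.length == 6);
    assert(codePointCount(cjk) == 2);
    assert(graphemeCount(cjk) == 2);        // still not the display width
}
```

None of the three counts gives the column count for both strings, which is why a separate width lookup is needed.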

Eventually, the solution boils down to implementing the equivalent of
POSIX wcswidth().

But a naïve implementation of this is extremely inefficient, because
segmenting a Unicode string by grapheme and *then* computing its width
is non-trivial. So inefficient that it's just too slow to use in
format(), especially if most strings you'd pass to format() are
ASCII-only or mostly ASCII.

Thankfully, std.uni provides (some of) the tools to optimize this. The
basic idea is this: we don't actually care to segment graphemes; all we
want to do is to know, given some string s, how many display positions
it will occupy, so that we can insert the right number of spaces. The
actual grapheme segmentation and typesetting is the terminal's job, and
none of format()'s business.  So we can cut some corners while still
producing the right results.

Basically my current solution consists of:

- Parsing EastAsianWidth.txt published by the Unicode Consortium to
  precompute a table of wide/full-width characters (W and F) -- this is
  not done at runtime or compile-time, but as a separate step to
  generate the source code of the table, since otherwise it's either too
  slow at runtime or would slow down Phobos compilation too much, plus
  it depends on an external file which is not practical;

- Combining this table with Unicode category Grapheme_Extend, plus a
  bunch of hand-coded zero-width characters to produce a mapping of
  every dchar to 0, 1, or 2. All characters that extend a grapheme, like
  a combining diacritic, map to 0. All characters designated as Wide or
  Full-width (excluding grapheme extenders) map to 2. Everything else
  maps to 1.

- Compiling this table into a 3-level Trie (std.uni.Trie) for O(1)
  runtime lookup per dchar.

- Computing the display width, then, is just a matter of iterating over
  dchars in the string and summing the values looked up in the trie.
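
The last step can be sketched roughly as follows. This is not the code from the strwidth repo: charWidth is a hypothetical stand-in for the trie lookup, and its ranges are a tiny hand-picked subset chosen for illustration, not the table generated from EastAsianWidth.txt.

```d
// Hypothetical stand-in for the trie lookup: maps a code point to 0, 1 or 2.
// The ranges below are an illustrative subset only.
int charWidth(dchar c)
{
    // Zero width: combining diacritical marks, zero width space.
    if ((c >= 0x0300 && c <= 0x036F) || c == 0x200B)
        return 0;
    // Wide: CJK unified ideographs, Hangul syllables, fullwidth forms.
    if ((c >= 0x4E00 && c <= 0x9FFF) ||
        (c >= 0xAC00 && c <= 0xD7A3) ||
        (c >= 0xFF01 && c <= 0xFF60))
        return 2;
    return 1;
}

// Sum the per-character widths; foreach over dchar decodes UTF-8 on the fly.
size_t displayWidth(const(char)[] s)
{
    size_t width = 0;
    foreach (dchar c; s)
        width += charWidth(c);
    return width;
}

unittest
{
    assert(displayWidth("abc") == 3);
    assert(displayWidth("e\u0301") == 1);       // diacritic adds nothing
    assert(displayWidth("\u65E5\u672C") == 4);  // two wide ideographs
}
```

With a real trie in place of charWidth, each lookup is a fixed number of array accesses, so the whole computation stays linear in the length of the string.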

Of course, no matter how optimized a width lookup is, it's still pretty
slow for an ASCII-only string, which is 90% of the use cases of
format(). So to improve this common case, the additional optimization is
to scan the string for ASCII-only bytes, and just increment the width
since we know ASCII characters are always 1 column wide.  Only when we
encounter a non-ASCII byte do we bother with UTF-8 decoding and the
table lookup.
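
A minimal sketch of that fast path, under stated assumptions: displayWidthFast is a hypothetical name, and the width lookup is passed in as a template alias so that any table (or the real trie) can be plugged in. std.utf.decode is the public Phobos function that decodes one code point and advances the index; it throws on invalid UTF-8.

```d
import std.utf : decode;

// Hypothetical sketch: lookup is any callable mapping a dchar to its width.
size_t displayWidthFast(alias lookup)(const(char)[] s)
{
    size_t width = 0;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
        {
            // ASCII byte: counted as one column, no decoding needed.
            ++width;
            ++i;
        }
        else
        {
            // Non-ASCII lead byte: decode one code point (this advances i)
            // and fall back to the table lookup.
            immutable dchar c = decode(s, i);
            width += lookup(c);
        }
    }
    return width;
}

unittest
{
    // Toy lookup for the example: CJK unified ideographs are wide.
    alias w = displayWidthFast!(c => (c >= 0x4E00 && c <= 0x9FFF) ? 2 : 1);
    assert(w("hello") == 5);
    assert(w("a\u65E5b") == 4);
}
```

For an ASCII-only string the loop never leaves the first branch, so the cost is one comparison and two increments per byte. Note that this sketch also counts ASCII control characters as one column.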

Here's my current implementation:

https://github.com/quickfur/strwidth

Here are my current benchmark results:

- walkLength is literally passing the string to std.range.walkLength,
  which is basically counting the number of code points in the string.
  As mentioned before, this does not produce the correct width.

- byGraphemeWalk is the next step up, to count the number of graphemes
  using std.uni.byGrapheme.  Unfortunately, this is still not fully
  correct.

- graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not
  actually decoding the grapheme, but just computing the stride. It also
  has the virtue of being usable in CTFE. Performance-wise, it's not
  that much different from byGraphemeWalk.

- width0 is the first "correct" string width computation, but with a
  naïve, slow implementation. It serves as a baseline to compare the
  next implementations.

- width1 is the trie-optimized version of width0. It shows significant
  improvement over width0, but is still very slow for ASCII strings