Re: Proposed Phobos equivalent of wcswidth()
On Friday, 19 January 2018 at 19:33:28 UTC, H. S. Teoh wrote:
> On Thu, Jan 18, 2018 at 06:42:26PM +, Dmitry Olshansky via Digitalmars-d wrote:
> [...]
> > Also forgot to mention that can pass BitPacked!(ubyte,2) to Trie
> > template as value type to use 2 bit per value. Should reduce your
> > width table 4-fold. Just saying;)
>
> Thanks for the tip! Indeed, the table size was reduced 4-fold. Awesome.
>
> However, now I'm finding that it no longer works properly when loaded from the precompiled data. It appears to have something to do with the default value for the width table being 1 rather than ubyte.init, and so far I couldn't figure out how to get the Trie ctor that takes .offsets, .sizes, .data to specify a default value.

Why would you need a default in a low-level construction? I think it naturally takes the tables with whatever was stored in there. There is no processing. So the default has to be explicitly stored during building of trie.

> So now the trie is returning the wrong value for certain dchar ranges. :-(
Re: Proposed Phobos equivalent of wcswidth()
On Thu, Jan 18, 2018 at 06:42:26PM +, Dmitry Olshansky via Digitalmars-d wrote: [...] > Also forgot to mention that can pass BitPacked!(ubyte,2) to Trie > template as value type to use 2 bit per value. Should reduce your > width table 4-fold. Just saying;) Thanks for the tip! Indeed, the table size was reduced 4-fold. Awesome. However, now I'm finding that it no longer works properly when loaded from the precompiled data. It appears to have something to do with the default value for the width table being 1 rather than ubyte.init, and so far I couldn't figure out how to get the Trie ctor that takes .offsets, .sizes, .data to specify a default value. So now the trie is returning the wrong value for certain dchar ranges. :-( T -- Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
Re: Proposed Phobos equivalent of wcswidth()
On Wednesday, 17 January 2018 at 22:59:58 UTC, H. S. Teoh wrote:
> I took a first stab at integrating this into dlang/tools:
>
> https://github.com/quickfur/tools/tree/unicode_gen
>
> So far, I can get the 64-bit generator to run and produce the generated unicode_*.d files. Unfortunately they are missing the 32-bit data, because I couldn't get a 32-bit dmd toolchain working on my PC. Maybe you could take a look and submit PRs against that branch for any fixes you'd like to get in? I'll see if I can somehow get 32-bit working on my PC.
>
> Alternatively, maybe the solution is to hack the Trie code so that it uses explicit int sizes rather than size_t, then we can use it to generate both 32-bit and 64-bit tables without requiring the host platform to support both.

Yes, I guess we have to allow the word size to be redefined. I just wanted the fastest version by default, without the possibility of screwing up on the user side of things.

Also forgot to mention that you can pass BitPacked!(ubyte,2) to the Trie template as the value type to use 2 bits per value. Should reduce your width table 4-fold. Just saying;)
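The 4-fold figure follows from simple arithmetic: the width values 0, 1 and 2 fit in 2 bits, so four of them share one byte instead of each taking a whole ubyte. A minimal sketch of that packing idea is below. `Packed2` is a hypothetical type written only for illustration; it is not std.uni's BitPacked, whose internals may well differ.

```d
/// Hypothetical illustration type (NOT std.uni's BitPacked):
/// stores values in the range 0..3, four per byte.
struct Packed2
{
    ubyte[] data;

    this(size_t count)
    {
        // Round up so that every group of four values gets one byte.
        data = new ubyte[(count + 3) / 4];
    }

    /// Read the 2-bit value at logical index i.
    ubyte opIndex(size_t i) const
    {
        immutable shift = (i % 4) * 2;
        return cast(ubyte)((data[i / 4] >> shift) & 0b11);
    }

    /// Write a 2-bit value at logical index i, leaving its neighbours intact.
    void opIndexAssign(ubyte value, size_t i)
    {
        immutable shift = (i % 4) * 2;
        immutable cleared = data[i / 4] & ~(0b11 << shift);
        data[i / 4] = cast(ubyte)(cleared | ((value & 0b11) << shift));
    }
}

void main()
{
    auto table = Packed2(1000);
    table[4] = 3;
    table[5] = 2;
    table[6] = 1;

    assert(table[4] == 3 && table[5] == 2 && table[6] == 1);
    assert(table[7] == 0);            // untouched entries stay 0
    assert(table.data.length == 250); // 1000 values in 250 bytes
}
```

Note that untouched entries read back as 0 here, which mirrors the default-value concern raised later in the thread: a default other than 0 has to be written into the table explicitly.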
Re: Proposed Phobos equivalent of wcswidth()
On Wed, Jan 17, 2018 at 05:06:05AM +, Dmitry Olshansky via Digitalmars-d wrote: > On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote: [...] > > One thing, though: I think it would benefit us all if we could > > import at least gen_uni into Phobos, so that in the future when we > > need to update std.uni to a new version of Unicode, it can be > > (mostly) automated. It's better to have the tools to generate the > > tables in Phobos itself, than to be dependent on an external repo > > that may go out-of-sync eventually. > > Yes but it’s non-trivial at the moment, if you take a look at script > to generate stuff it takes both 32-bit and 64-bit executables to > populate tables. > > I think having it in tools repo should be fine though. Last time I > tried to update to Unicode 10, I found one table in Phobos that is > missing from generator (ooops!). I took a first stab at integrating this into dlang/tools: https://github.com/quickfur/tools/tree/unicode_gen So far, I can get the 64-bit generator to run and produce the generated unicode_*.d files. Unfortunately they are missing the 32-bit data, because I couldn't get a 32-bit dmd toolchain working on my PC. Maybe you could take a look and submit PRs against that branch for any fixes you'd like to get in? I'll see if I can somehow get 32-bit working on my PC. Alternatively, maybe the solution is to hack the Trie code so that it uses explicit int sizes rather than size_t, then we can use it to generate both 32-bit and 64-bit tables without requiring the host platform to support both. I imagine we may have problems getting the tools repo to build on the autotester once we integrate gen_uni into the makefile, unless we do something like this. > > When I get around to making a PR for strwidth AKA displayWidth, the > > plan is to check-in compileWidth.d in some form into Phobos > > somewhere, so that somebody else can pick it up and improve the > > implementation in the future if I'm not around / unavailable. 
> > > > If we can get gen_uni into Phobos, perhaps we can even include the > > displayWidth table generation in gen_uni too, so that all the table > > generation code is in one place. > > Right. A good step would be to move it to tools, then add your code. [...] Good idea. Well, I started with the branch linked above in my fork of dlang/tools. If I can get it off the ground, I'll add the displayWidth stuff in as well, then formulate a PR to add displayWidth to std.uni. Well, technically I don't need to wait for that, since I could just add the precomputed table directly into std/internal/unicode_tables.d. But it's probably better to let the generator do the job instead. A precomputed table is rather hard to review for correctness when it comes PR review time. :-D T -- Don't get stuck in a closet---wear yourself out.
Re: Proposed Phobos equivalent of wcswidth()
On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote:
> On Tue, Jan 16, 2018 at 05:49:11PM +, Dmitry Olshansky via Digitalmars-d wrote:
> > On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> [...]
> > > One thing I'm seeking help with, and this is mainly directed
> > > at Dmitry Olshansky but can be anyone here who knows the
> > > internal workings of std.uni well enough, is how to
> > > transform the Trie generated by the static ctor into
> > > compile-time TrieNode declarations. This is one blocker for
> > > my turning this code into a Phobos PR, because I don't want
> > > to incur the cost of initializing this trie at runtime.
> >
> > Check out my horribly named repo gsoc-uni-benchmark:
> >
> > https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
> >
> > This is what generates unicode tables. Need to revise it, as folks were delicate enough to hand-patch auto-generated code in Phobos.
> >
> > Maybe make some of that user-accessible.
> [...]
>
> Whoa. There's some pretty cool stuff in there! Thanks, I've started experimenting with pre-generating the width table. Pretty neat. There's a lot of hidden gems in std.uni that I never knew existed, hidden away under `private`. :-D

The intent is to open that up somehow, to allow folks to make their own extended versions of std.uni. Unicode is all about “tailoring”: adjusting an algorithm to your specific regional preferences by messing with tables. I think there is at least 1 bug in Bugzilla on this.

> One thing, though: I think it would benefit us all if we could import at least gen_uni into Phobos, so that in the future when we need to update std.uni to a new version of Unicode, it can be (mostly) automated. It's better to have the tools to generate the tables in Phobos itself, than to be dependent on an external repo that may go out-of-sync eventually.

Yes, but it's non-trivial at the moment: if you take a look at the script that generates stuff, it takes both 32-bit and 64-bit executables to populate the tables.

I think having it in the tools repo should be fine though. Last time I tried to update to Unicode 10, I found one table in Phobos that is missing from the generator (ooops!).

> When I get around to making a PR for strwidth AKA displayWidth, the plan is to check-in compileWidth.d in some form into Phobos somewhere, so that somebody else can pick it up and improve the implementation in the future if I'm not around / unavailable.
>
> If we can get gen_uni into Phobos, perhaps we can even include the displayWidth table generation in gen_uni too, so that all the table generation code is in one place.

Right. A good step would be to move it to tools, then add your code.

> T
Re: Proposed Phobos equivalent of wcswidth()
On Tue, Jan 16, 2018 at 05:49:11PM +, Dmitry Olshansky via Digitalmars-d wrote:
> On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
[...]
> > One thing I'm seeking help with, and this is mainly directed at
> > Dmitry Olshansky but can be anyone here who knows the internal
> > workings of std.uni well enough, is how to transform the Trie
> > generated by the static ctor into compile-time TrieNode
> > declarations. This is one blocker for my turning this code into a
> > Phobos PR, because I don't want to incur the cost of initializing
> > this trie at runtime.
>
> Check out my horribly named repo gsoc-uni-benchmark:
>
> https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
>
> This is what generates unicode tables.
> Need to revise it, as folks were delicate enough to hand-patch
> auto-generated code in Phobos.
>
> Maybe make some of that user-accessible.
[...]

Whoa. There's some pretty cool stuff in there! Thanks, I've started experimenting with pre-generating the width table. Pretty neat. There's a lot of hidden gems in std.uni that I never knew existed, hidden away under `private`. :-D

One thing, though: I think it would benefit us all if we could import at least gen_uni into Phobos, so that in the future when we need to update std.uni to a new version of Unicode, it can be (mostly) automated. It's better to have the tools to generate the tables in Phobos itself, than to be dependent on an external repo that may go out-of-sync eventually.

When I get around to making a PR for strwidth AKA displayWidth, the plan is to check-in compileWidth.d in some form into Phobos somewhere, so that somebody else can pick it up and improve the implementation in the future if I'm not around / unavailable.

If we can get gen_uni into Phobos, perhaps we can even include the displayWidth table generation in gen_uni too, so that all the table generation code is in one place.
T -- Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
Re: Proposed Phobos equivalent of wcswidth()
On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via Digitalmars-d wrote:
> [...]
> > https://github.com/quickfur/strwidth
> [...]
>
> One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations. This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime.

Check out my horribly named repo gsoc-uni-benchmark:

https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d

This is what generates unicode tables. Need to revise it, as folks were delicate enough to hand-patch auto-generated code in Phobos.

Maybe make some of that user-accessible.

> T
Re: Proposed Phobos equivalent of wcswidth()
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote: ... Thanks for taking the time to do this. And now the obligatory bikeshed: what should the Phobos equivalent of wcswidth be called? std.utf.displayWidth std.utf.bikeshed Never heard that phrase before. Nice one :)
Re: Proposed Phobos equivalent of wcswidth()
On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via Digitalmars-d wrote: [...] > https://github.com/quickfur/strwidth [...] One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations. This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime. Also, on a related note, there exist nicer interfaces in std.uni for constructing Tries that map ranges of codepoints to non-boolean values, but none of these are available publicly. The current implementation in strwidth only uses the public API of std.uni, so the construction of the trie is pretty horrendous (looping over individual codepoints and creating an AA of individual codepoints -- including very large ranges like the entire Unicode plane 2). I wonder if some of these facilities should be made public so that user code that needs to construct codepoint tries that include large ranges of codepoints can do so more efficiently. T -- This sentence is false.
Re: Proposed Phobos equivalent of wcswidth()
On Monday, January 15, 2018 10:37:14 H. S. Teoh via Digitalmars-d wrote: > On Mon, Jan 15, 2018 at 06:20:16PM +, Jack Stouffer via Digitalmars-d wrote: > > On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote: > > > On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via > > > Digitalmars-d > > > > > > wrote: > > > > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: > > > > > std.utf.displayWidth > > > > > > > > +1 > > > > > > [...] > > > > > > Why std.utf rather than std.uni, though? > > > > The way I understand it is that std.uni is (supposed to be) for > > functions on individual unicode units (be they code units/points or > > graphemes) and std.utf is for functions which handle operating on > > unicode strings. > > Are you sure? I thought std.utf was specifically dealing with UTF-* > encodings, i.e., code units and conversions to/from code points, and > std.uni was supposed to be for implementing Unicode algorithms and > Unicode compliance in general, i.e., stuff that works at the code point > level. Your understanding of the division more or less matches mine, though I'm not sure that the line is entirely clearcut. I would definitely think that std.uni was the more appropriate place for such a function. - Jonathan M Davis
Re: Proposed Phobos equivalent of wcswidth()
On Mon, Jan 15, 2018 at 06:20:16PM +, Jack Stouffer via Digitalmars-d wrote: > On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote: > > On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via Digitalmars-d > > wrote: > > > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: > > > > std.utf.displayWidth > > > > > > +1 > > [...] > > > > Why std.utf rather than std.uni, though? > > The way I understand it is that std.uni is (supposed to be) for > functions on individual unicode units (be they code units/points or > graphemes) and std.utf is for functions which handle operating on > unicode strings. Are you sure? I thought std.utf was specifically dealing with UTF-* encodings, i.e., code units and conversions to/from code points, and std.uni was supposed to be for implementing Unicode algorithms and Unicode compliance in general, i.e., stuff that works at the code point level. > Obviously there are exceptions. I think "they" put graphemeStride in > std.uni because Grapheme was defined there and it seemed reasonable at > the time. But, generally I think utf stuff should go into std.utf. But displayWidth isn't really directly related to UTF (i.e., the encoding of Unicode code points). It seems to me to be more to do with processing Unicode in general, though, granted, the optimizations I implemented are kinda in a grey zone between dealing with Unicode proper (i.e., with code points) vs. working with code units. T -- Klein bottle for rent ... inquire within. -- Stephen Mulraney
Re: Proposed Phobos equivalent of wcswidth()
On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote: On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via Digitalmars-d wrote: On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: > std.utf.displayWidth +1 [...] Why std.utf rather than std.uni, though? The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings. Obviously there are exceptions. I think "they" put graphemeStride in std.uni because Grapheme was defined there and it seemed reasonable at the time. But, generally I think utf stuff should go into std.utf.
Re: Proposed Phobos equivalent of wcswidth()
On Mon, Jan 15, 2018 at 02:14:56PM +, Simen Kjærås via Digitalmars-d wrote: > On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: > > std.utf.displayWidth > > +1 [...] Why std.utf rather than std.uni, though? T -- ASCII stupid question, getty stupid ANSI.
Re: Proposed Phobos equivalent of wcswidth()
On Monday, 15 January 2018 at 15:08:24 UTC, Kagamin wrote:
> columnWidth as it only makes sense for column-oriented text display.

I think displayWidth is better, because "width" is directly linked to the horizontal direction (otherwise it would be called height), and setting text in columns would still take additional steps to get right. Also, "display" indicates that it has nothing to do with the string length, which helps avoid confusion.
Re: Proposed Phobos equivalent of wcswidth()
columnWidth as it only makes sense for column-oriented text display.
Re: Proposed Phobos equivalent of wcswidth()
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote: std.utf.displayWidth +1 -- Simen
Re: Proposed Phobos equivalent of wcswidth()
On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote: ... Thanks for taking the time to do this. And now the obligatory bikeshed: what should the Phobos equivalent of wcswidth be called? std.utf.displayWidth
Proposed Phobos equivalent of wcswidth()
This past week, while reviewing Phobos PR #6008, I started experimenting with an optimized D equivalent of wcswidth(). For more details, see:

https://issues.dlang.org/show_bug.cgi?id=7054
https://issues.dlang.org/show_bug.cgi?id=17810

as well as the discussion on:

https://github.com/dlang/phobos/pull/6008

Anyway, the TL;DR summary is this: given a format() spec like "%20s", in order to insert the correct number of spaces to pad the string to 20 characters (or rather, 20 spaces in the output), we need to compute the displayed length of the string in monospace font. Unfortunately, given the complexities of Unicode, this is far from trivial:

- In C, the C library doesn't even pretend to know Unicode, so the padding is just based on the number of bytes the string occupies. Obviously, for anything non-ASCII the output will be wrong (misaligned).

- In D, in the original naïve implementation, we try to be a little smarter by counting the number of dchars. Unfortunately, this is also wrong, because of combining diacritics like U+0301, which modify the preceding character and do not advance the cursor.

- In Phobos PR #6008, we improved this to count grapheme clusters instead. However, this is *still* wrong, because of the existence of zero-width characters (don't you just love Unicode?!), and also because of "wide" or "full-width" East Asian block characters as specified by Unicode TR11 (and scarily enough, the new Emoji blocks are included in this "wide" category), which on a text console generally occupy 2 positions per grapheme rather than 1.

Eventually, the solution boils down to implementing the equivalent of Posix wcswidth(). But a naïve implementation of this is extremely inefficient, because segmenting a Unicode string by grapheme and *then* computing its width is non-trivial. It is so inefficient that it's just too slow to use in format(), especially if most strings you'd pass to format() are ASCII-only or mostly ASCII.
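To make the three counting strategies above concrete, here is a small D sketch that compares bytes, code points and grapheme clusters for two sample strings. It is not taken from PR #6008 or from strwidth; it only uses the public std.range.walkLength and std.uni.byGrapheme, and the sample strings are picked purely for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // rendered as a single accented letter, i.e. one column.
    string accented = "e\u0301";
    assert(accented.length == 3);                // UTF-8 code units (bytes)
    assert(accented.walkLength == 2);            // code points (dchars)
    assert(accented.byGrapheme.walkLength == 1); // grapheme clusters

    // U+65E5, a CJK ideograph: one grapheme, but classified as Wide,
    // so a terminal typically gives it two columns.
    string wide = "\u65E5";
    assert(wide.length == 3);
    assert(wide.walkLength == 1);
    assert(wide.byGrapheme.walkLength == 1);
}
```

None of the three counts yields 2 for the ideograph, which is the gap a wcswidth() equivalent is meant to close.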
Thankfully, std.uni provides (some of) the tools to optimize this. The basic idea is this: we don't actually care to segment graphemes; all we want to do is to know, given some string s, how many display positions it will occupy, so that we can insert the right number of spaces. The actual grapheme segmentation and typesetting is the terminal's job, and none of format()'s business. So we can cut some corners while still producing the right results.

Basically my current solution consists of:

- Parsing EastAsianWidth.txt published by the Unicode Consortium to precompute a table of wide/full-width characters (W and F) -- this is not done at runtime or compile-time, but as a separate step to generate the source code of the table, since otherwise it's either too slow at runtime or would slow down Phobos compilation too much, plus it depends on an external file, which is not practical;

- Combining this table with Unicode category Grapheme_extend, plus a bunch of hand-coded zero-width characters, to produce a mapping of every dchar to 0, 1, or 2. All characters that extend a grapheme, like a combining diacritic, map to 0. All characters designated as Wide or Full-width (excluding grapheme extenders) map to 2. Everything else maps to 1.

- Compiling this table into a 3-level Trie (std.uni.Trie) for O(1) runtime lookup per dchar.

- Computing the display width, then, is just a matter of iterating over dchars in the string and summing the values looked up in the trie.

Of course, no matter how optimized a width lookup is, it's still pretty slow for an ASCII-only string, which is 90% of the use cases of format(). So to improve this common case, the additional optimization is to scan the string for ASCII-only bytes and just increment the width, since we know ASCII characters are always 1 column wide. Only when we encounter a non-ASCII byte do we bother with UTF-8 decoding and the table lookup.
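The scheme above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the strwidth code: `widthOf` is a hypothetical stand-in for the precomputed trie lookup and hard-codes only a few sample ranges instead of the full generated table, and `displayWidthSketch` is a made-up name.

```d
import std.utf : decode;

/// Hypothetical stand-in for the trie lookup: maps a code point to 0, 1 or 2.
/// Only a handful of sample ranges are covered; the real table is generated
/// from EastAsianWidth.txt and the Grapheme_Extend property.
ubyte widthOf(dchar ch) pure nothrow @nogc @safe
{
    // Combining Diacritical Marks and ZERO WIDTH SPACE: no cursor advance.
    if ((ch >= 0x0300 && ch <= 0x036F) || ch == 0x200B)
        return 0;

    // A few Wide / Fullwidth ranges: Hiragana letters, CJK Unified
    // Ideographs, Hangul Syllables, Fullwidth ASCII variants.
    if ((ch >= 0x3041 && ch <= 0x3096) ||
        (ch >= 0x4E00 && ch <= 0x9FFF) ||
        (ch >= 0xAC00 && ch <= 0xD7A3) ||
        (ch >= 0xFF01 && ch <= 0xFF60))
        return 2;

    return 1;
}

/// Sums per-code-point widths, with a fast path for ASCII bytes.
size_t displayWidthSketch(const(char)[] s) pure @safe
{
    size_t width = 0;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
        {
            // ASCII fast path: one byte, one column, no decoding.
            ++width;
            ++i;
        }
        else
        {
            // Slow path: decode one code point (advances i) and look it up.
            width += widthOf(decode(s, i));
        }
    }
    return width;
}

void main()
{
    assert(displayWidthSketch("hello") == 5);
    assert(displayWidthSketch("e\u0301") == 1);            // base + combining mark
    assert(displayWidthSketch("\u65E5\u672C\u8A9E") == 6); // three wide ideographs
    assert(displayWidthSketch("a\u65E5b") == 4);           // mixed ASCII and wide
}
```

The fast path never calls the decoder, so an ASCII-only string costs one comparison and one increment per byte, which is the common case the optimization targets.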
Here's my current implementation:

https://github.com/quickfur/strwidth

Here's my current benchmark results:

- walkLength is literally passing the string to std.range.walkLength, which is basically counting the number of code points in the string. As mentioned before, this does not produce the correct width.

- byGraphemeWalk is the next step up, to count the number of graphemes using std.uni.byGrapheme. Unfortunately, this is still not fully correct.

- graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not actually decoding the grapheme, but just computing the stride. It also has the virtue of being usable in CTFE. Performance-wise, it's not that much different from byGraphemeWalk.

- width0 is the first "correct" string width computation, but with a naïve, slow implementation. It serves as a baseline to compare the next implementations.

- width1 is the trie-optimized version of width0. It shows significant improvement over width0, but is still very slow for ASCII strings