Re: [dev] [libgrapheme] Some questions about libgrapheme
> If efficiency is not a concern, then you can easily use something like this (just a quick prototype, didn't verify if it's correct or not):
> [...]

Thanks for the free code :) I think that will be the way to go in my case, since most input will be ascii and moving the cursor will be quite rare.

> If I was expecting a decent amount of non-ascii input, I would use the bitvector approach described by Thomas Oltmann. 1bit per byte overhead should be fine for most use-cases.

I think it is very good too, the only problem is the overhead of having to preprocess everything.

Thank you a lot for helping!

~ Arthur Bacci
Re: [dev] [libgrapheme] Some questions about libgrapheme
Hi!

This is a really good suggestion, but I think it may add a lot of overhead, since it would need to go through the entire buffer. Since moving the cursor is not very frequent (not more than changing your position or opening a new buffer), I think it would be better to do it the "lazy" way. However, thanks for pointing out a solution; I guess it would be really good for some other situations.

> 1. Regarding stepping backwards through the graphemes:
> As Laslo explained, trying to find the starting point of the previous grapheme is simply not possible. In your situation, if scanning from the front of the string is too inefficient for you, you could try keeping a bitfield in addition to the string, with one bit for each char of the string. A 1 in the bitfield means 'this char is the start of a new grapheme', 0 is the opposite. Every time the string changes, the bitfield is recomputed. This way, moving the cursor left or right in a text editor is just a matter of finding the next or previous set bit in the bitfield, which is extremely cheap.

https://github.com/vim/vim/blob/master/src/libvterm/find-wide-chars.pl
https://github.com/vim/vim/blob/master/src/libvterm/src/fullwidth.inc

I am not 100% sure, but it looks like vim goes by the old way. There are also some comments about it in this file:
https://github.com/vim/vim/blob/master/src/libvterm/src/unicode.c

https://github.com/tmux/tmux/blob/master/utf8.c

tmux seems to go even lazier by using `wcwidth` itself and, by the way, they seem to have dropped support for systems that don't support it too:
https://github.com/tmux/tmux/pull/3003

Even neovim seems to use the hack:
https://github.com/neovim/neovim/blob/master/src/unicode/EastAsianWidth.txt

> I guess the only robust approach is to render the character on the terminal, and then read back by how much the cursor was advanced.

This looks like a good idea. The problem is that I'm not sure whether most terminals will return the actual position in the grid or the number of graphemes or code points, since it does not seem to be specified in VT* or in xterm. But as long as this applies to /most/ terminals I think it's fine, or at least better than wcwidth.

> 2. Regarding the avoidance of terminal linewrap:
> AFAIK there's no proper way to query the display width of a character. It definitely depends on the font though. I guess the only robust approach is to render the character on the terminal, and then read back by how much the cursor was advanced. So perhaps you could try to render the whole line, detect when a line overflow happens in the terminal based on the cursor position, and then react accordingly. It would be interesting to know how (or even if!) other software such as tmux or vim has solved this issue.

Thank you a lot for helping me!
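For what it's worth, the "read back the cursor" idea could be sketched roughly like this. It assumes the terminal answers the usual xterm-style cursor position request ("\033[6n") with a report of the form "\033[<row>;<col>R"; whether the reported column really counts grid cells on a given terminal is exactly the open question above, so treat that as an assumption. `parse_cpr` and `columns_advanced` are made-up names for illustration, and actually writing the request and reading the reply (raw mode, timeouts) is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Request to write to the terminal; the reply arrives on terminal input. */
#define CPR_REQUEST "\033[6n"

/*
 * Parse a cursor position report of the form "\033[<row>;<col>R"
 * (1-based row and column). Returns 0 on success, -1 if the reply
 * is malformed. *row and *col are only written on success.
 */
static int
parse_cpr(const char *reply, int *row, int *col)
{
	int r, c;
	char final;

	assert(reply != NULL && row != NULL && col != NULL);

	if (sscanf(reply, "\033[%d;%d%c", &r, &c, &final) != 3)
		return -1;
	if (final != 'R' || r < 1 || c < 1)
		return -1;

	*row = r;
	*col = c;
	return 0;
}

/*
 * Given the reports taken before and after printing something, return
 * how many columns the cursor advanced, or -1 if the row changed
 * (which this sketch treats as "the terminal wrapped the line").
 */
static int
columns_advanced(int row0, int col0, int row1, int col1)
{
	return row1 == row0 ? col1 - col0 : -1;
}
```

Doing this after every printed grapheme means one round trip to the terminal each time, so it probably only makes sense when a line gets close to the right margin.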
Re: [dev] [libgrapheme] Some questions about libgrapheme
On Fri, Sep 02, 2022 at 02:08:03PM -0300, atrtar...@cock.li wrote:
> Quite inefficient really, but I guess it's fine since my usage would be
> only user input (left arrow)

If efficiency is not a concern, then you can easily use something like this (just a quick prototype, didn't verify if it's correct or not):

	/* returns an offset into `s` */
	static size_t
	prev_char_offset(const char *s, size_t slen, size_t off)
	{
		assert(s != NULL);
		assert(slen > 0);
		assert(off <= slen);

		size_t ret = 0;
		const char *const end = s + slen;
		while (s < end) {
			size_t n = grapheme_next_character_break_utf8(s, end - s);
			if (ret + n >= off)
				return ret;
			ret += n;
			s += n;
		}
		return 0; /* unreachable (?) */
	}

If I was expecting a decent amount of non-ascii input, I would use the bitvector approach described by Thomas Oltmann. 1bit per byte overhead should be fine for most use-cases.

- NRK
Re: [dev] [libgrapheme] Some questions about libgrapheme
Hi atrtarget,

I thought I'd chip in my two cents.

1. Regarding stepping backwards through the graphemes:
As Laslo explained, trying to find the starting point of the previous grapheme is simply not possible. In your situation, if scanning from the front of the string is too inefficient for you, you could try keeping a bitfield in addition to the string, with one bit for each char of the string. A 1 in the bitfield means 'this char is the start of a new grapheme', 0 is the opposite. Every time the string changes, the bitfield is recomputed. This way, moving the cursor left or right in a text editor is just a matter of finding the next or previous set bit in the bitfield, which is extremely cheap.

2. Regarding the avoidance of terminal linewrap:
AFAIK there's no proper way to query the display width of a character. It definitely depends on the font though. I guess the only robust approach is to render the character on the terminal, and then read back by how much the cursor was advanced. So perhaps you could try to render the whole line, detect when a line overflow happens in the terminal based on the cursor position, and then react accordingly. It would be interesting to know how (or even if!) other software such as tmux or vim has solved this issue.

Cheers,
Thomas

On Fri, Sep 2, 2022 at 7:08 PM wrote:
>
> Thank you a lot for spending some time answering!
>
> > The problem with this heuristic is that the algorithm can become very inefficient, especially when you have long preceding segments. If n is the offset-length, the worst-case runtime could be O((n-1)!) for a segment that is in fact of length n-1, because of the single backsteps it has to take.
>
> Quite inefficient really, but I guess it's fine since my usage would be only user input (left arrow)
>
> > The proper way to solve the column-problem is to render each grapheme cluster and see how wide the font-rendering-library renders it, given it depends on the font. I know that this isn't satisfactory, but that's how it is.
>
> In the case of a terminal would this mean asking for the position of the cursor after every character I print? My usage would be to avoid terminal induced soft-wraps in a text editor.
>
> Anyway, thanks again for the help!
>
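The bitfield idea from point 1 could be sketched along these lines. All names here are made up for illustration. To keep the sketch free of dependencies, `next_codepoint_break` is only a stand-in that splits at UTF-8 code point boundaries, so it is NOT grapheme-aware; judging by the call in NRK's prototype, `grapheme_next_character_break_utf8` has the same shape and would presumably be what you pass in instead.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns the byte length of the first segment in s[0..len). */
typedef size_t (*next_break_fn)(const char *s, size_t len);

/* STAND-IN (assumption): breaks after every UTF-8 code point. */
static size_t
next_codepoint_break(const char *s, size_t len)
{
	size_t n = 1;

	if (len == 0)
		return 0;
	/* swallow continuation bytes (10xxxxxx) */
	while (n < len && ((unsigned char)s[n] & 0xC0) == 0x80)
		n++;
	return n;
}

/* One bit per byte of s; a set bit marks the start of a segment. */
static unsigned char *
build_breaks(const char *s, size_t len, next_break_fn next)
{
	unsigned char *bits = calloc(len / CHAR_BIT + 1, 1);
	size_t off = 0, n;

	if (bits == NULL)
		return NULL;
	while (off < len) {
		bits[off / CHAR_BIT] |= (unsigned char)(1u << (off % CHAR_BIT));
		n = next(s + off, len - off);
		if (n == 0) /* defensive: never loop forever */
			break;
		off += n;
	}
	return bits;
}

static int
is_break(const unsigned char *bits, size_t off)
{
	return (bits[off / CHAR_BIT] >> (off % CHAR_BIT)) & 1;
}

/* Cursor left: nearest segment start strictly before off (0 at the front). */
static size_t
prev_break(const unsigned char *bits, size_t off)
{
	assert(bits != NULL);
	while (off > 0) {
		off--;
		if (is_break(bits, off))
			return off;
	}
	return 0;
}

/* Cursor right: nearest segment start strictly after off (len at the end). */
static size_t
next_break(const unsigned char *bits, size_t len, size_t off)
{
	assert(bits != NULL);
	for (off++; off < len; off++)
		if (is_break(bits, off))
			return off;
	return len;
}
```

The bitfield has to be rebuilt (or at least patched from the last unchanged break onwards) whenever the string changes, which is the preprocessing overhead mentioned elsewhere in the thread. The lookups above walk bit by bit; scanning whole bytes or words at a time would be an obvious refinement.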
Re: [dev] [libgrapheme] Some questions about libgrapheme
Thank you a lot for spending some time answering!

> The problem with this heuristic is that the algorithm can become very inefficient, especially when you have long preceding segments. If n is the offset-length, the worst-case runtime could be O((n-1)!) for a segment that is in fact of length n-1, because of the single backsteps it has to take.

Quite inefficient really, but I guess it's fine since my usage would be only user input (left arrow)

> The proper way to solve the column-problem is to render each grapheme cluster and see how wide the font-rendering-library renders it, given it depends on the font. I know that this isn't satisfactory, but that's how it is.

In the case of a terminal would this mean asking for the position of the cursor after every character I print? My usage would be to avoid terminal induced soft-wraps in a text editor.

Anyway, thanks again for the help!
Re: [dev] [libgrapheme] Some questions about libgrapheme
On Thu, 01 Sep 2022 21:43:06 -0300 atrtar...@cock.li wrote:

Dear atrtarget,

thanks for reaching out!

> libgrapheme looks really useful, but I still don't get some things
> from it. For example, if I need to get back one grapheme, how should
> I do it since there's no `grapheme_prev_character_break`?

This is difficult to achieve, given the Unicode standard pretty much gave up on specifying it (see [0]). Going backwards would first require you to know that you're in a "safe spot" and not in the middle of nowhere. Going back to a safe spot, though, is unspecified.

C as a language also makes it a bit inelegant to go "backwards" in a string. The only way I could think of is a function prototype of the form

	size_t grapheme_prev_character_break(const uint_least32_t *str,
	                                     size_t strlen, size_t offset);

where the offset is the "starting" point you want to go back from, returning the offset of the previous breakpoint. A trivial heuristic could be to go backwards until the breakpoint-detector stops _before_ the specified offset. However, to be on the completely safe side, I imagine that it must be done such that you go back even further to "see" two breakpoints (i.e. including the breakpoint before the desired previous breakpoint, to force self-synchronization).

The problem with this heuristic is that the algorithm can become very inefficient, especially when you have long preceding segments. If n is the offset-length, the worst-case runtime could be O((n-1)!) for a segment that is in fact of length n-1, because of the single backsteps it has to take.

> And to get the number of columns a character takes up, should I
> convert everything to wchar and use `wcswidth`? In my case that would
> be very inefficient :( Thanks for reading

Unicode explicitly warns against using the EastAsianWidth-property (which is what wcswidth uses behind the scenes) to determine the column-size of a string (see [1]):

	[...] the guidelines on use of this property should be considered
	recommendations based on a particular legacy practice that may be
	overridden by implementations as necessary.

and

	Note: The East_Asian_Width property is not intended for use by
	modern terminal emulators without appropriate tailoring on a
	case-by-case basis. Such terminal emulators need a way to resolve
	the halfwidth/fullwidth dichotomy that is necessary for such
	environments, but the East_Asian_Width property does not provide
	an off-the-shelf solution for all situations. The growing
	repertoire of the Unicode Standard has long exceeded the bounds of
	East Asian legacy character encodings, and terminal emulations
	often need to be customized to support edge cases and for changes
	in typographical behavior over time.

So, in other words: EAW was added to the standard decades ago and now they're stuck with it. They don't recommend it without tailoring. What does the tailoring look like? Unspecified, because this is a text-rendering thing and impossible to solve on a "logical" basis.

I have the goal that libgrapheme only offers interfaces that work as intended and not hacks that "usually" work or "have always worked" based on a misinterpretation. This half-assed approach has already led to many problems before in the context of text-handling in software.

The proper way to solve the column-problem is to render each grapheme cluster and see how wide the font-rendering-library renders it, given it depends on the font. I know that this isn't satisfactory, but that's how it is.

I hope that this answers your questions.

With best regards

Laslo

[0]:https://unicode.org/reports/tr29/#Random_Access
[1]:https://www.unicode.org/reports/tr11/tr11-39.html#Scope
[dev] [libgrapheme] Some questions about libgrapheme
libgrapheme looks really useful, but I still don't get some things from it. For example, if I need to get back one grapheme, how should I do it since there's no `grapheme_prev_character_break`?

And to get the number of columns a character takes up, should I convert everything to wchar and use `wcswidth`? In my case that would be very inefficient :(

Thanks for reading