Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread atrtarget

> If efficiency is not a concern, then you can easily use something like
> this (just a quick prototype, didn't verify if it's correct or not):
>
> [...]


Thanks for the free code :)
I think that will be the way to go in my case, since most input will be
ASCII and moving the cursor will be quite rare.



> If I was expecting a decent amount of non-ascii input, I would use the
> bitvector approach described by Thomas Oltmann. 1bit per byte overhead
> should be fine for most use-cases.


I think it is very good too; the only problem is the overhead of having
to preprocess everything.


Thank you a lot for helping!
~ Arthur Bacci



Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread atrtarget

Hi!


This is a really good suggestion, but I think it may add a lot of
overhead, since it would need to go through the entire buffer. Since
moving the cursor is not very frequent (not more than changing your
position or opening a new buffer), I think it would be better to do it
the "lazy" way.

However, thanks for pointing out a solution. I guess it would be really
good for some other situations.

> 1. Regarding stepping backwards through the graphemes:
>
> As Laslo explained, trying to find the starting point of the previous
> grapheme is simply not possible.
> In your situation, if scanning from the front of the string is too
> inefficient for you, you could try keeping
> a bitfield in addition to the string, with one bit for each char of the
> string.
>
> A 1 in the bitfield means 'this char is the start of a new grapheme',
> 0 is the opposite.
> Every time the string changes, the bitfield is recomputed.
> This way, moving the cursor left or right in a text editor is just a
> matter of finding the next
> or previous set bit in the bitfield, which is extremely cheap.



https://github.com/vim/vim/blob/master/src/libvterm/find-wide-chars.pl
https://github.com/vim/vim/blob/master/src/libvterm/src/fullwidth.inc

I am not 100% sure but it looks like vim goes by the old way. There are
also some comments on this file about it:

https://github.com/vim/vim/blob/master/src/libvterm/src/unicode.c


https://github.com/tmux/tmux/blob/master/utf8.c

tmux seems to go even lazier by using `wcwidth` itself and, by the way,
they seem to have dropped support for systems that don't provide it:

https://github.com/tmux/tmux/pull/3003


Even neovim seems to use the hack:

https://github.com/neovim/neovim/blob/master/src/unicode/EastAsianWidth.txt



> I guess the only robust approach is to render the character on the
> terminal, and then read back by how much the
> cursor was advanced.


This looks like a good idea. The problem is that I'm not sure whether
most terminals will return the actual position in the grid or the number
of graphemes or code points, since it does not seem to be specified
in VT* or in xterm. But as long as this applies to /most/ terminals I
think it's fine, or at least better than wcwidth.
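
For what it's worth, the request involved would be the Device Status
Report `ESC [ 6 n`, which VT100-style terminals answer with a Cursor
Position Report `ESC [ row ; col R` carrying 1-based row and column
numbers. How a particular terminal counts columns for wide or combining
characters would still need to be checked on that terminal. Below is a
minimal sketch of parsing such a reply. `parse_cursor_report` is a
hypothetical helper name, and writing the request and reading the reply
from the tty in raw mode are left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse a cursor position report of the form "\033[<row>;<col>R".
 * Returns 0 on success and stores the 1-based row and column,
 * or -1 if the buffer does not hold a well-formed report.
 */
static int
parse_cursor_report(const char *buf, int *row, int *col)
{
	int r, c;
	char final;

	assert(buf != NULL && row != NULL && col != NULL);

	/* all three fields must match: row, column, final byte */
	if (sscanf(buf, "\033[%d;%d%c", &r, &c, &final) != 3)
		return -1;
	if (final != 'R' || r < 1 || c < 1)
		return -1;
	*row = r;
	*col = c;
	return 0;
}
```

Comparing the column reported before and after printing a grapheme
cluster would give the advance as that terminal sees it.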


> 2. Regarding the avoidance of terminal linewrap:
>
> AFAIK there's no proper way to query the display width of a character.
> It definitely depends on the font though.
> I guess the only robust approach is to render the character on the
> terminal, and then read back by how much the
> cursor was advanced.
> So perhaps you could try to render the whole line, detect when a line
> overflow happens in the terminal based on
> the cursor position, and then react accordingly.
> It would be interesting to know how (or even if!) other software such
> as tmux or vim has solved this issue.



Thank you a lot for helping me!



Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread NRK
On Fri, Sep 02, 2022 at 02:08:03PM -0300, atrtar...@cock.li wrote:
> Quite inefficient really, but I guess it's fine since my usage would be
> only user input (left arrow)

If efficiency is not a concern, then you can easily use something like
this (just a quick prototype, didn't verify if it's correct or not):

#include <assert.h>
#include <stddef.h>

#include <grapheme.h>

/* returns the offset into `s` of the character break preceding `off` */
static size_t
prev_char_offset(const char *s, size_t slen, size_t off)
{
	assert(s != NULL);
	assert(slen > 0);
	assert(off <= slen);

	size_t ret = 0;
	const char *const end = s + slen;
	while (s < end) {
		/* length in bytes of the grapheme cluster starting at `s` */
		size_t n = grapheme_next_character_break_utf8(s, end - s);
		if (ret + n >= off)
			return ret;
		ret += n;
		s += n;
	}
	return 0; /* unreachable (?) */
}

If I was expecting a decent amount of non-ascii input, I would use the
bitvector approach described by Thomas Oltmann. 1bit per byte overhead
should be fine for most use-cases.

- NRK



Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread Thomas Oltmann
Hi atrtarget,

I thought I'd chip in my two cents.

1. Regarding stepping backwards through the graphemes:

As Laslo explained, trying to find the starting point of the previous
grapheme is simply not possible.
In your situation, if scanning from the front of the string is too
inefficient for you, you could try keeping
a bitfield in addition to the string, with one bit for each char of the string.
A 1 in the bitfield means 'this char is the start of a new grapheme',
0 is the opposite.
Every time the string changes, the bitfield is recomputed.
This way, moving the cursor left or right in a text editor is just a
matter of finding the next
or previous set bit in the bitfield, which is extremely cheap.
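
A minimal sketch of such a bitfield, assuming a break function with the
same shape as `grapheme_next_character_break_utf8`, passed in as a
function pointer so that the sketch does not depend on the library. The
names `mark_starts`, `prev_start`, `next_start` and the stand-in
`toy_break` are hypothetical. `toy_break` only splits on UTF-8 code
point boundaries and is no substitute for real grapheme segmentation:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*break_fn)(const char *s, size_t len);

/* Stand-in segmenter: one UTF-8 code point per segment (NOT graphemes). */
static size_t
toy_break(const char *s, size_t len)
{
	size_t n = (len > 0);

	while (n < len && ((unsigned char)s[n] & 0xC0) == 0x80)
		n++;
	return n;
}

/* Recompute the bitfield: bit i is set if a segment starts at byte i. */
static void
mark_starts(unsigned char *bits, const char *s, size_t len, break_fn brk)
{
	size_t off = 0, n;

	memset(bits, 0, (len + CHAR_BIT - 1) / CHAR_BIT);
	while (off < len && (n = brk(s + off, len - off)) > 0) {
		bits[off / CHAR_BIT] |= 1u << (off % CHAR_BIT);
		off += n;
	}
}

static int
is_start(const unsigned char *bits, size_t i)
{
	return (bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

/* Offset of the nearest segment start before `off`, or 0. */
static size_t
prev_start(const unsigned char *bits, size_t off)
{
	while (off > 0 && !is_start(bits, --off))
		;
	return off;
}

/* Offset of the nearest segment start after `off`, or `len`. */
static size_t
next_start(const unsigned char *bits, size_t len, size_t off)
{
	assert(off <= len);
	while (off < len && ++off < len && !is_start(bits, off))
		;
	return off;
}
```

The bitfield needs (len + 7) / 8 bytes on common platforms, and
`mark_starts` has to be called again whenever the string changes.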

2. Regarding the avoidance of terminal linewrap:

AFAIK there's no proper way to query the display width of a character.
It definitely depends on the font though.
I guess the only robust approach is to render the character on the
terminal, and then read back by how much the
cursor was advanced.
So perhaps you could try to render the whole line, detect when a line
overflow happens in the terminal based on
the cursor position, and then react accordingly.
It would be interesting to know how (or even if!) other software such
as tmux or vim has solved this issue.

Cheers,
  Thomas


On Fri, Sep 2, 2022 at 7:08 PM  wrote:
>
> Thank you a lot for spending some time answering!
>
> > The problem with this heuristic is that the algorithm can become very
> > inefficient, especially when you have long preceding segments. If n is
> > the offset-length, the worst-case runtime could be O((n-1)!) for a
> > segment that is in fact of length n-1, because of the single backsteps
> > it has to take.
>
> Quite inefficient really, but I guess it's fine since my usage would be
> only user input (left arrow)
>
> > The proper way to solve the column-problem is to render each grapheme
> > cluster and see how wide the font-rendering-library renders it, given
> > it depends on the font. I know that this isn't satisfactory, but that's
> > how it is.
>
> In the case of a terminal would this mean asking for the position of the
> cursor after every character I print? My usage would be to avoid
> terminal-induced soft-wraps in a text editor.
>
> Anyway, thanks again for the help!
>



Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread atrtarget

Thank you a lot for spending some time answering!


> The problem with this heuristic is that the algorithm can become very
> inefficient, especially when you have long preceding segments. If n is
> the offset-length, the worst-case runtime could be O((n-1)!) for a
> segment that is in fact of length n-1, because of the single backsteps
> it has to take.


Quite inefficient really, but I guess it's fine since my usage would be
only user input (left arrow)


> The proper way to solve the column-problem is to render each grapheme
> cluster and see how wide the font-rendering-library renders it, given
> it depends on the font. I know that this isn't satisfactory, but that's
> how it is.


In the case of a terminal would this mean asking for the position of the
cursor after every character I print? My usage would be to avoid
terminal-induced soft-wraps in a text editor.

Anyway, thanks again for the help!



Re: [dev] [libgrapheme] Some questions about libgrapheme

2022-09-02 Thread Laslo Hunhold
On Thu, 01 Sep 2022 21:43:06 -0300
atrtar...@cock.li wrote:

Dear atrtarget,

thanks for reaching out!

> libgrapheme looks really useful, but I still don't get some things
> from it. For example, if I need to get back one grapheme, how should
> I do it since there's no `grapheme_prev_character_break`?

This is difficult to achieve, given the Unicode standard pretty much
gave up on specifying it (see [0]). Going backwards would first require
you to know that you're in a "safe spot" and not in the middle of
nowhere. Going back to a safe spot, though, is unspecified.

C as a language also makes it a bit inelegant to go "backwards" in a
string. The only way I could think of is a function prototype of the
form

size_t grapheme_prev_character_break(const uint_least32_t *str,
 size_t strlen, size_t offset);

where the offset is the "starting" point you want to go back from,
returning the offset of the previous breakpoint.

A trivial heuristic could be to go backwards until the
breakpoint-detector stops _before_ the specified offset, however, to be
on the completely safe side I imagine that it must be done such that
you even go back further to "see" two breakpoints (i.e. including the
breakpoint before the desired previous breakpoint to force
self-synchronization).

The problem with this heuristic is that the algorithm can become very
inefficient, especially when you have long preceding segments. If n is
the offset-length, the worst-case runtime could be O((n-1)!) for a
segment that is in fact of length n-1, because of the single backsteps
it has to take.
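
A rough sketch of the simple variant of this heuristic, only to make
the backstepping concrete. It trusts the first breakpoint it finds
before the offset and does not implement the extra "two breakpoints"
safety margin. The break function is passed in as a pointer with the
same shape as `grapheme_next_character_break_utf8`. The names
`prev_break_backstep` and `toy_break` are hypothetical, and `toy_break`
merely splits on UTF-8 code point boundaries as a stand-in for real
segmentation:

```c
#include <assert.h>
#include <stddef.h>

typedef size_t (*break_fn)(const char *s, size_t len);

/* Stand-in segmenter: one UTF-8 code point per segment (NOT graphemes). */
static size_t
toy_break(const char *s, size_t len)
{
	size_t n = (len > 0);

	while (n < len && ((unsigned char)s[n] & 0xC0) == 0x80)
		n++;
	return n;
}

/* Returns the offset of the last break found before `off`, or 0. */
static size_t
prev_break_backstep(const char *s, size_t slen, size_t off, break_fn brk)
{
	size_t start, pos, n;

	assert(s != NULL && off <= slen);

	/* step back one byte at a time, rescanning forward each time */
	for (start = off; start-- > 0; ) {
		n = brk(s + start, slen - start);
		if (n == 0 || start + n >= off)
			continue; /* detector did not stop before `off` */

		/* a break lies before `off`; walk forward to the last one */
		pos = start + n;
		while ((n = brk(s + pos, slen - pos)) > 0 && pos + n < off)
			pos += n;
		return pos;
	}
	return 0;
}
```

Every backstep restarts a forward scan, so a long segment right before
the offset is rescanned many times.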

> And to get the number of columns a character takes up, should I
> convert everything to wchar and use `wcswidth`? In my case that would
> be very inefficient :( Thanks for reading

Unicode explicitly warns against using the EastAsianWidth-property
(which is what wcswidth uses behind the scenes) to determine the
column-size of a string (see [1]):

    [...] the guidelines on use of this property should be
    considered recommendations based on a particular legacy
    practice that may be overridden by implementations as necessary.

and

    Note: The East_Asian_Width property is not intended for use by
    modern terminal emulators without appropriate tailoring on a
    case-by-case basis. Such terminal emulators need a way to
    resolve the halfwidth/fullwidth dichotomy that is necessary for
    such environments, but the East_Asian_Width property does not
    provide an off-the-shelf solution for all situations. The
    growing repertoire of the Unicode Standard has long exceeded
    the bounds of East Asian legacy character encodings, and
    terminal emulations often need to be customized to support edge
    cases and for changes in typographical behavior over time.

So, in other words: EAW was added to the standard decades ago and now
they're stuck with it. They don't recommend it without tailoring.
What does the tailoring look like? Unspecified, because this is a
text-rendering thing and impossible to solve on a "logical" basis.

I have the goal that libgrapheme only offers interfaces that work as
intended and not hacks that "usually" work or "have always worked"
based on a misinterpretation. This half-assed approach has already led
to many problems before in the context of text-handling in software.

The proper way to solve the column-problem is to render each grapheme
cluster and see how wide the font-rendering-library renders it, given
it depends on the font. I know that this isn't satisfactory, but that's
how it is.

I hope that this answers your questions.

With best regards

Laslo

[0]:https://unicode.org/reports/tr29/#Random_Access
[1]:https://www.unicode.org/reports/tr11/tr11-39.html#Scope



[dev] [libgrapheme] Some questions about libgrapheme

2022-09-01 Thread atrtarget
libgrapheme looks really useful, but I still don't get some things from 
it. For example, if I need to get back one grapheme, how should I do it 
since there's no `grapheme_prev_character_break`? And to get the number 
of columns a character takes up, should I convert everything to wchar 
and use `wcswidth`? In my case that would be very inefficient :(

Thanks for reading