On Thu, May 14, 2026 at 3:09 PM Branko Čibej <[email protected]> wrote:

> On 14. 5. 26 14:12, Timofei Zhakov wrote:
>
> This function counts the printable UTF-8 characters in a string. It
> currently contains a manually maintained table of patterns that it
> checks against. I believe it was borrowed from elsewhere a long time
> ago, before we had utf8proc as a required dependency.
>
> I have a few reasons to rewrite it to use the library instead:
>
> 1. I'm pretty sure nobody would ever care to update that dataset. On
> the other hand, utf8proc bundles complete data for the latest Unicode
> version supported by the library installed on the current platform.
>
> 2. utf8proc also exposes a property that defines *display* width,
> which makes symbols like emoji wider than normal characters even in
> monospace fonts.
>
> (For context: I want to fix column alignment in places throughout our
> command line client, such as the author column in 'svn list -v', which
> currently messes up the tables. This is where a function like that
> will be useful.)
>
> 3. It would clean up redundant code.
>
> 4. It might be slightly faster to use their dataset, because utf8proc
> only accesses a static in-memory table twice (once for the address,
> once to retrieve the properties) instead of binary-searching and
> checking all the ranges. It might also be slower, though; I haven't
> measured.
>
> Thoughts?
>
>
> I've had such thoughts before. The problem with our home-grown Unicode
> metadata is that it's very much out of date. It might even predate Emoji...
> Using utf8proc would fix that in most cases, since most distros use the
> system-provided library.
>
> So +1 in principle and there are other functions that could do with an
> overhaul.
>

Good.

Then I'll wait a while for others to speak up, and rewrite that part
afterwards...

There is also an argument for using utf8proc_iterate() instead of
traversing the string manually. That might be overkill, though; I'm not
sure what the right approach is.

-- 
Timofei Zhakov
