On Thu, May 14, 2026 at 3:09 PM Branko Čibej <[email protected]> wrote:
> On 14. 5. 26 14:12, Timofei Zhakov wrote:
> > This function counts real printable UTF characters in a string. It
> > currently contains a table of all patterns that is manually checked.
> > I believe it was stolen from elsewhere a long time ago, before we had
> > utf8proc as a required dependency.
> >
> > I have a few reasons to rewrite it to use the library instead:
> >
> > 1. I'm pretty sure nobody would ever care to update the dataset. On
> > the other hand, utf8proc bundles all available information about the
> > latest Unicode version that is supported on the current platform.
> >
> > 2. There is also a property that defines *display* width, which
> > basically makes symbols like emojis wider than normal characters
> > even in monospace fonts.
> >
> > (For context: I want to fix indentation in places throughout our
> > cmdline, like the authors column in 'svn list -v' that messes up
> > the tables. This is where a function like that will be useful.)
> >
> > 3. Clean up redundant code.
> >
> > 4. It might be slightly faster to use their dataset, because
> > utf8proc only accesses a table in static memory twice (once for the
> > address, then to retrieve the properties) instead of binary
> > searching and checking all ranges. Maybe it's slower, though; I
> > don't know.
> >
> > Thoughts?
>
> I've had such thoughts before. The problem with our home-grown Unicode
> metadata is that it's very much out of date. It might even predate
> Emoji... Using utf8proc would fix that in most cases, since most
> distros use the system-provided library.
>
> So +1 in principle, and there are other functions that could do with
> an overhaul.

Good. Then I will wait some time for others to potentially speak up,
and will then rewrite that part...

There is also an argument for using utf8proc_iterate() instead of
manually traversing the string. This might be a little extra; I'm not
sure what the right approach is.

-- 
Timofei Zhakov

