On 14. 5. 26 14:12, Timofei Zhakov wrote:
This function counts real printable UTF characters in a string. It
currently contains its own table of all patterns, which it checks
manually. I believe it was stolen from elsewhere a long time ago, before
we had utf8proc as a required dependency.
I have a few reasons to rewrite it to use the library instead:
1. I'm pretty sure nobody would ever care to update the dataset. On
the other hand, utf8proc bundles all available information about the
latest Unicode version that is supported on the current platform.
2. There is also a property that defines *display* width, which
basically makes symbols like emoji wider than normal characters, even
in monospace fonts.
(For context, I want to fix the indentation in places throughout our
command-line client, like the author names in 'svn list -v' that mess
up the tables. This is where a function like that will be useful.)
3. Clean up redundant code.
4. It might be slightly faster to use their dataset, because utf8proc
only accesses a table in static memory twice (once to find the address,
then once to retrieve the properties) instead of binary-searching and
checking all the ranges. Maybe it's slower though, I don't know.
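The two lookup strategies compared in point 4 can be sketched as
follows. This is only an illustration under stated assumptions: the
three ranges are toy data (not a complete Unicode table), all names
here are hypothetical, and the two-stage table is built at run time
from the toy ranges, whereas a library would normally ship such tables
precomputed. It is not a description of utf8proc's actual internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy data for illustration only: NOT a complete Unicode width table. */
typedef struct { uint32_t first, last; int width; } width_range_t;

static const width_range_t ranges[] = {   /* sorted by code point */
  { 0x0300,  0x036F,  0 },  /* combining diacritical marks */
  { 0x4E00,  0x9FFF,  2 },  /* CJK unified ideographs */
  { 0x1F600, 0x1F64F, 2 },  /* emoticons */
};

/* Strategy 1: binary search over the sorted ranges; default width is 1. */
static int width_by_search(uint32_t cp)
{
  size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (cp < ranges[mid].first)
      hi = mid;
    else if (cp > ranges[mid].last)
      lo = mid + 1;
    else
      return ranges[mid].width;
  }
  return 1;
}

/* Strategy 2: two-stage table. stage1 maps a 256-code-point block to a
   row of stage2; identical rows are shared to keep the tables small. */
#define BLOCK_BITS 8
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define MAX_CP     0x110000u
#define MAX_ROWS   8

static uint8_t stage1[MAX_CP >> BLOCK_BITS];
static uint8_t stage2[MAX_ROWS][BLOCK_SIZE];
static size_t rows_used;

/* Fill both stages from the toy ranges, deduplicating identical rows. */
static void build_tables(void)
{
  uint8_t row[BLOCK_SIZE];
  rows_used = 0;
  for (uint32_t b = 0; b < (MAX_CP >> BLOCK_BITS); b++) {
    for (uint32_t i = 0; i < BLOCK_SIZE; i++)
      row[i] = (uint8_t)width_by_search((b << BLOCK_BITS) | i);
    size_t r = 0;
    while (r < rows_used && memcmp(stage2[r], row, BLOCK_SIZE) != 0)
      r++;
    if (r == rows_used) {
      assert(rows_used < MAX_ROWS);
      memcpy(stage2[rows_used++], row, BLOCK_SIZE);
    }
    stage1[b] = (uint8_t)r;
  }
}

/* Exactly two table reads per lookup, with no searching. */
static int width_by_table(uint32_t cp)
{
  assert(cp < MAX_CP);
  return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

With real Unicode data the range list is much longer, so the binary
search takes more steps per lookup, while the table lookup stays at two
reads. Which one is faster in practice would still need measuring.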
Thoughts?
I've had such thoughts before. The problem with our home-grown Unicode
metadata is that it's very much out of date. It might even predate
Emoji... Using utf8proc would fix that in most cases, since most distros
use the system-provided library.
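To make the emoji point concrete, here is a minimal sketch of how a
per-character display width could feed into column padding for table
output such as 'svn list -v'. Everything named here is hypothetical:
the width callback is a stand-in that in practice might be backed by
something like utf8proc_charwidth(), toy_width() covers only three
ranges for demonstration, and the decoder assumes valid,
NUL-terminated UTF-8 with no error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical callback: column width of a single code point. */
typedef int (*width_fn_t)(uint32_t cp);

/* Decode one code point from VALID UTF-8; returns the bytes consumed.
   Assumption: no validation, the caller guarantees well-formed input. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
  if (s[0] < 0x80) {
    *cp = s[0];
    return 1;
  }
  if ((s[0] & 0xE0) == 0xC0) {
    *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    return 2;
  }
  if ((s[0] & 0xF0) == 0xE0) {
    *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
    return 3;
  }
  *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12)
        | ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
  return 4;
}

/* Sum the display widths of all code points in STR. */
static int display_width(const char *str, width_fn_t width_of)
{
  const unsigned char *p = (const unsigned char *)str;
  int total = 0;
  assert(str != NULL && width_of != NULL);
  while (*p) {
    uint32_t cp;
    p += decode_utf8(p, &cp);
    total += width_of(cp);
  }
  return total;
}

/* Toy width data for illustration only, not real Unicode coverage. */
static int toy_width(uint32_t cp)
{
  if (cp >= 0x0300 && cp <= 0x036F)
    return 0;                         /* combining marks */
  if ((cp >= 0x4E00 && cp <= 0x9FFF) || (cp >= 0x1F600 && cp <= 0x1F64F))
    return 2;                         /* CJK ideographs, emoticons */
  return 1;
}

/* Write STR, then pad with spaces up to COLUMNS display columns. */
static void print_padded(FILE *out, const char *str, int columns,
                         width_fn_t width_of)
{
  int pad = columns - display_width(str, width_of);
  fputs(str, out);
  while (pad-- > 0)
    fputc(' ', out);
}
```

Padding by byte count or by code point count would misalign a column
as soon as a name contains a wide or combining character; padding by
display width keeps the next column in place, provided the width data
is current.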
So +1 in principle, and there are other functions that could do with an
overhaul.
-- Brane