Den lör 16 maj 2026 kl 17:08 skrev Timofei Zhakov <[email protected]>:
>
> On Thu, May 14, 2026 at 7:40 PM Ivan Zhakov <[email protected]> wrote:
>>
>> On Thu, 14 May 2026 at 14:13, Timofei Zhakov <[email protected]> wrote:
>>>
>>> This function counts real printable UTF characters in a string. It
>>> currently contains a table of all patterns that is manually checked. I
>>> believe it was stolen from elsewhere a long time ago. Before we had
>>> utf8proc as a required dependency.
>>>
>>> I have a few reasons to rewrite it to use the library instead;
>>>
>>> 1. I'm pretty sure nobody would ever care to update the dataset. On
>>> the other hand, utf8proc bundles all available information about the
>>> latest Unicode version that is supported on the current platform.
>>>
>>> 2. There is also a property that defines *display* width, that
>>> basically makes symbols like emojis wider than normal characters even
>>> on monospace fonts.
>>>
>>> (For context I want to fix indentation in places throughout our
>>> cmdline like the authors in 'svn list -v' that mess up the tables.
>>> This is where a function like that will be useful.)
>>>
>>> 3. Cleanup redundant code.
>>>
>>> 4. It might be slightly faster to use their dataset because utf8proc
>>> only accesses a table in static memory twice (for address and then
>>> retrieves properties) instead of binary searching and checking all
>>> ranges. Maybe it's slower though idk.
>>>
>>> Thoughts?
>>>
>> Sounds good to me.
>>
>> Regarding potential performance regression: is it something we can measure? 
>> As far as I understand svn_utf_cstring_utf8_width() is not used for 
>> performance critical code, but it would be nice to know if there is 
>> significant performance regression anyway.
>>
>
> Well, I guess it was a bad argument because the utf8proc version looks well 
> optimised anyway and we already use in multiple places.
>
> I'm done with the first version of a patch to make this switch. Please find 
> it in the attachments. I also added a unit test to confirm.
>
> I think since this function depends on utf8proc it should be implemented in 
> utf8proc.c file. Hence utf_width.c can be removed entirely. I guess the main 
> reason it was in a separate file was to keep the dataset away from other code.
>
> By the way, I also found that before traversing the string, this function 
> calls svn_utf__cstring_is_valid() to check validity. I think 
> utf8proc_iterate() does the same check, but I really am not sure. 
> svn_utf__cstring_is_valid() skips all first octets with values >0x80. Then it 
> basically does the same work utf8proc_iterate() would.

Thank you for preparing the patch! I have reviewed it and it looks good to me.

I agree with your analysis of utf8proc_interate: it is not neccessary
to check the string first since utf8proc_iterate also return if the
string is invalid.

I made a very simple test to see the performance: Calling
svn_utf_cstring_utf8_width() 1000 times with the fat_emojis string.
The basic version of the patch is around 25% faster compared to the
original implementation. If I remove the call to
svn_utf__cstring_is_valid, the patch is 33% faster (with a longer
string, I expect this difference to be larger).

Cheers,
Daniel

Reply via email to