Re: [PATCH] libstdc++: Add comments to __unicode::_Utf_iterator

Jonathan Wakely Wed, 16 Jul 2025 05:06:29 -0700

On Wed, 16 Jul 2025 at 12:30, Tomasz Kaminski <tkami...@redhat.com> wrote:
>
>
>
>
> On Wed, Jul 16, 2025 at 12:05 PM Jonathan Wakely <jwak...@redhat.com> wrote:
>>
>> Add comments documenting what it does and how it does it.
>>
>> Also reorder the if-else in operator++ so that we check whether to
>> iterate over code units in the local buffer before checking whether to
>> refill that buffer. That seems the more natural way to structure the
>> function.
>>
>> libstdc++-v3/ChangeLog:
>>
>>         * include/bits/unicode.h (__unicode::_Utf_iterator): Add
>>         comments.
>>         (__unicode:_Utf_iterator::operator++()): Check whether to
>>         iterate over the buffer first.
>> ---
>>
>> Tested x86_64-linux. Are these comments clear? Helpful enough?
>
> I have used the  iterator without the comments, so my option is biased.
> They are clear for me, except one, as noted below.
>>
>>
>>  libstdc++-v3/include/bits/unicode.h | 48 +++++++++++++++++++++++------
>>  1 file changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/libstdc++-v3/include/bits/unicode.h 
>> b/libstdc++-v3/include/bits/unicode.h
>> index f1b6bf49c54c..d4e379938c20 100644
>> --- a/libstdc++-v3/include/bits/unicode.h
>> +++ b/libstdc++-v3/include/bits/unicode.h
>> @@ -86,6 +86,25 @@ namespace __unicode
>>        { return *__it == iter_value_t<_It>{}; }
>>    };
>>
>> +  // An iterator over an input range of FromFmt code units that yields 
>> either
>> +  // UTF-8, UTF-16, or UTF-32, as a range of ToFmt code units.
>> +  // The code units from the input range are interpreted as Unicode code 
>> points
>> +  // and the iterator produces the individual code unit for each code point.
>> +  // Invalid sequences in the input are replaced with U+FFDD so that the 
>> result
>> +  // is always valid UTF-8, UTF-16, or UTF-32.
>> +  //
>> +  // The iterator knows the bounds of the underlying input range and will 
>> not
>> +  // read outside those bounds (incrementing or decrementing at the boundary
>> +  // is erroneously idempotent).
>> +  //
>> +  // On construction, the iterator attemps to decode a single code point 
>> from
>> +  // the input range and then encode it into an internal buffer in the 
>> output
>> +  // format, e.g. if the input is UTF-8 and the output is UTF-16, it might 
>> read
>> +  // three char8_t code units from the input and store two char16_t code 
>> units
>> +  // in its buffer. Incrementing the iterator will first iterate over 
>> buffer,
>> +  // yielding each code unit in turn, and then extract another code point 
>> from
>> +  // the input. Failure to extract a valid code point from the input will 
>> store
>> +  // U+FFFD in the buffer, encoded as the appropriate code units of type 
>> ToFmt.
>>    template<typename _FromFmt, typename _ToFmt,
>>            input_iterator _Iter, sentinel_for<_Iter> _Sent = _Iter,
>>            typename _ErrorHandler = _Repl>
>> @@ -162,17 +181,20 @@ namespace __unicode
>>        constexpr _Utf_iterator&
>>        operator++()
>>        {
>> -       if (_M_buf_index + 1 == _M_buf_last && _M_curr() != _M_last)
>> +       if (_M_buf_index + 1 < _M_buf_last)
>> +         ++_M_buf_index; // Move to the next code unit in the buffer.
>> +       else if (_M_curr() != _M_last)
>>           {
>> +           // Advance past the current code point (for non-forward iterators
>> +           // we already moved there after decoding the last code point).
>>             if constexpr (forward_iterator<_Iter>)
>>               std::advance(_M_curr(), _M_to_increment);
>>             if (_M_curr() == _M_last)
>>               _M_buf_index = 0;
>> -           else
>> +           else // Decode next code point from the input and update buffer.
>>               _M_read();
>>           }
>> -       else if (_M_buf_index + 1 < _M_buf_last)
>> -         ++_M_buf_index;
>> +       // else erroneous, but ignored for now.
>>         return *this;
>>        }
>>
>> @@ -487,8 +509,11 @@ namespace __unicode
>>        constexpr _Iter
>>        _M_curr() const { return _M_first_and_curr._M_curr; }
>>
>> +      // Buffer holding the individual code units of the current code point.
>>        array<value_type, 4 / sizeof(_ToFmt)> _M_buf;
>>
>> +      // As an optimization we do not store _M_first for input ranges,
>> +      // because there is no need for it in a single-pass range.
>>        template<typename _It>
>>         struct _First_and_curr
>>         {
>> @@ -502,7 +527,7 @@ namespace __unicode
>>             _First_and_curr(const _First_and_curr<_It2>& __other)
>>             : _M_curr(__other._M_curr) { }
>>
>> -         _It _M_curr;
>> +         _It _M_curr;  // First code unit of the current code point.
>
> This is the iterator after the last code unit of the current_code point
> for input_iterators.



Good catch, thanks.

So how about this for that primary template (for non-forward iterators):

      _It _M_curr;  // Points past-the-end of the current code point.

and this where it's used:

      // Iterators pointing to the start of the underlying range and to the
      // start (or end, for non-forward iterators) of the current code point.
      _First_and_curr<_Iter> _M_first_and_curr;

Re: [PATCH] libstdc++: Add comments to __unicode::_Utf_iterator

Reply via email to