On Wed, Jul 16, 2025 at 2:06 PM Jonathan Wakely <jwak...@redhat.com> wrote:
> On Wed, 16 Jul 2025 at 12:30, Tomasz Kaminski <tkami...@redhat.com> wrote: > > > > > > > > > > On Wed, Jul 16, 2025 at 12:05 PM Jonathan Wakely <jwak...@redhat.com> > wrote: > >> > >> Add comments documenting what it does and how it does it. > >> > >> Also reorder the if-else in operator++ so that we check whether to > >> iterate over code units in the local buffer before checking whether to > >> refill that buffer. That seems the more natural way to structure the > >> function. > >> > >> libstdc++-v3/ChangeLog: > >> > >> * include/bits/unicode.h (__unicode::_Utf_iterator): Add > >> comments. > >> (__unicode:_Utf_iterator::operator++()): Check whether to > >> iterate over the buffer first. > >> --- > >> > >> Tested x86_64-linux. Are these comments clear? Helpful enough? > > > > I have used the iterator without the comments, so my option is biased. > > They are clear for me, except one, as noted below. > >> > >> > >> libstdc++-v3/include/bits/unicode.h | 48 +++++++++++++++++++++++------ > >> 1 file changed, 38 insertions(+), 10 deletions(-) > >> > >> diff --git a/libstdc++-v3/include/bits/unicode.h > b/libstdc++-v3/include/bits/unicode.h > >> index f1b6bf49c54c..d4e379938c20 100644 > >> --- a/libstdc++-v3/include/bits/unicode.h > >> +++ b/libstdc++-v3/include/bits/unicode.h > >> @@ -86,6 +86,25 @@ namespace __unicode > >> { return *__it == iter_value_t<_It>{}; } > >> }; > >> > >> + // An iterator over an input range of FromFmt code units that yields > either > >> + // UTF-8, UTF-16, or UTF-32, as a range of ToFmt code units. > >> + // The code units from the input range are interpreted as Unicode > code points > >> + // and the iterator produces the individual code unit for each code > point. > >> + // Invalid sequences in the input are replaced with U+FFDD so that > the result > >> + // is always valid UTF-8, UTF-16, or UTF-32. > >> + // > >> + // The iterator knows the bounds of the underlying input range and > will not > >> + // read outside those bounds (incrementing or decrementing at the > boundary > >> + // is erroneously idempotent). > >> + // > >> + // On construction, the iterator attemps to decode a single code > point from > >> + // the input range and then encode it into an internal buffer in the > output > >> + // format, e.g. if the input is UTF-8 and the output is UTF-16, it > might read > >> + // three char8_t code units from the input and store two char16_t > code units > >> + // in its buffer. Incrementing the iterator will first iterate over > buffer, > >> + // yielding each code unit in turn, and then extract another code > point from > >> + // the input. Failure to extract a valid code point from the input > will store > >> + // U+FFFD in the buffer, encoded as the appropriate code units of > type ToFmt. > >> template<typename _FromFmt, typename _ToFmt, > >> input_iterator _Iter, sentinel_for<_Iter> _Sent = _Iter, > >> typename _ErrorHandler = _Repl> > >> @@ -162,17 +181,20 @@ namespace __unicode > >> constexpr _Utf_iterator& > >> operator++() > >> { > >> - if (_M_buf_index + 1 == _M_buf_last && _M_curr() != _M_last) > >> + if (_M_buf_index + 1 < _M_buf_last) > >> + ++_M_buf_index; // Move to the next code unit in the buffer. > >> + else if (_M_curr() != _M_last) > >> { > >> + // Advance past the current code point (for non-forward > iterators > >> + // we already moved there after decoding the last code > point). > >> if constexpr (forward_iterator<_Iter>) > >> std::advance(_M_curr(), _M_to_increment); > >> if (_M_curr() == _M_last) > >> _M_buf_index = 0; > >> - else > >> + else // Decode next code point from the input and update > buffer. > >> _M_read(); > >> } > >> - else if (_M_buf_index + 1 < _M_buf_last) > >> - ++_M_buf_index; > >> + // else erroneous, but ignored for now. > >> return *this; > >> } > >> > >> @@ -487,8 +509,11 @@ namespace __unicode > >> constexpr _Iter > >> _M_curr() const { return _M_first_and_curr._M_curr; } > >> > >> + // Buffer holding the individual code units of the current code > point. > >> array<value_type, 4 / sizeof(_ToFmt)> _M_buf; > >> > >> + // As an optimization we do not store _M_first for input ranges, > >> + // because there is no need for it in a single-pass range. > >> template<typename _It> > >> struct _First_and_curr > >> { > >> @@ -502,7 +527,7 @@ namespace __unicode > >> _First_and_curr(const _First_and_curr<_It2>& __other) > >> : _M_curr(__other._M_curr) { } > >> > >> - _It _M_curr; > >> + _It _M_curr; // First code unit of the current code point. > > > > This is the iterator after the last code unit of the current_code point > > for input_iterators. > > > Good catch, thanks. > > So how about this for that primary template (for non-forward iterators): > > _It _M_curr; // Points past-the-end of the current code point. > Primary template is used for non-bidirectional iterators, so also for forward ones, so we need to describe two cases here. > > and this where it's used: > > // Iterators pointing to the start of the underlying range and to the > // start (or end, for non-forward iterators) of the current code > point. > _First_and_curr<_Iter> _M_first_and_curr; > >