[Perl/perl5] 0401b9: utf8_to_uv_msgs: Move code to ensure initialization

Karl Williamson via perl5-changes Mon, 02 Dec 2024 09:48:34 -0800

  Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: 0401b9c802499120e98cb9792159086a07fefb79
      
https://github.com/Perl/perl5/commit/0401b9c802499120e98cb9792159086a07fefb79
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)


  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Move code to ensure initialization

There was a path through this function in which the caller's parameter
it asked to be set, &msgs, did not get set.  And doing it at the
beginning means not needing a second place.

Similarly for &errors.  There is no path where it didn't get set, but it
is cleaner to do it at the same time as msgs.
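
The pattern described above can be sketched in isolation.  This is not
the code from utf8.c: sketch_decode_ascii() and its parameters are
hypothetical, and it only understands ASCII.  It just shows optional
out-parameters being set once at the top so that no return path can
leave them uninitialized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoder with optional out-parameters.
 * Only ASCII is accepted; anything else is reported as a failure. */
bool
sketch_decode_ascii(const unsigned char *s, const unsigned char *e,
                    unsigned long *cp, unsigned *errors, const char **msgs)
{
    assert(cp != NULL);

    /* Initialize the caller's optional outputs first, in one place, so
     * every later return leaves them defined. */
    if (errors) *errors = 0;
    if (msgs)   *msgs = NULL;

    if (s >= e) {                       /* empty input */
        if (errors) *errors = 1;
        if (msgs)   *msgs = "empty input";
        *cp = 0xFFFD;
        return false;
    }

    if (*s < 0x80) {                    /* success path sets nothing more */
        *cp = *s;
        return true;
    }

    if (errors) *errors = 2;
    if (msgs)   *msgs = "non-ASCII byte (not handled in this sketch)";
    *cp = 0xFFFD;
    return false;
}
```

A caller that passes stale values in errors and msgs gets them
overwritten even on the success path, which is the property the commit
is after.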


  Commit: bb07e051599ac6ef0b043d378e133aa19156e7ec
      
https://github.com/Perl/perl5/commit/bb07e051599ac6ef0b043d378e133aa19156e7ec
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M utf8.c

  Log Message:
  -----------
  utf8_to_uv_msgs: Add branch predictions

These two input parameters are for very specialized uses.
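
As a rough illustration of what a branch prediction on a rarely used
parameter looks like, here is a self-contained sketch.  The macro
SKETCH_UNLIKELY and the function sketch_count_ascii() are made up for
this example; the real change is in utf8.c and is not reproduced here.
The hint relies on the GCC/Clang builtin __builtin_expect and falls back
to a plain test elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#  define SKETCH_UNLIKELY(cond) (cond)
#endif

/* Hypothetical helper: counts ASCII bytes in [s, e).  The non_ascii
 * out-parameter is assumed to be NULL for nearly all callers. */
size_t
sketch_count_ascii(const unsigned char *s, const unsigned char *e,
                   size_t *non_ascii)
{
    size_t ascii = 0;
    size_t other = 0;

    assert(s <= e);
    for (; s < e; s++) {
        if (*s < 0x80)
            ascii++;
        else
            other++;
    }

    /* Tell the compiler the specialized output is rarely requested. */
    if (SKETCH_UNLIKELY(non_ascii != NULL))
        *non_ascii = other;

    return ascii;
}
```

The hint changes only how the compiler lays out the code; the result is
the same whether or not the parameter is supplied.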


  Commit: be1548c6afe60f52ec4e2fa49a90b1fc6ec7f813
      
https://github.com/Perl/perl5/commit/be1548c6afe60f52ec4e2fa49a90b1fc6ec7f813
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M inline.h
    M proto.h
    M utf8.c

  Log Message:
  -----------
  Inline utf8_to_uvchr_buf

This is a one line function that just calls another function.


  Commit: be98641b0d29edf79bb33f46445d6139c02b1b7b
      
https://github.com/Perl/perl5/commit/be98641b0d29edf79bb33f46445d6139c02b1b7b
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M inline.h
    M proto.h
    M utf8.h

  Log Message:
  -----------
  Merge utf8_to_uvchr_buf() and its helper

The helper adds no value.


  Commit: 395e3b63ac5b0925e14d1347803dd3aca9892e0f
      
https://github.com/Perl/perl5/commit/395e3b63ac5b0925e14d1347803dd3aca9892e0f
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.c
    M utf8.h

  Log Message:
  -----------
  Convert utf8n_to_uvchr_error to macro

It was a macro, but had a long-name function as well.  This converts to
using two macros.


  Commit: ddfa240a44db6da8bd50dabce39a9b39616ddadd
      
https://github.com/Perl/perl5/commit/ddfa240a44db6da8bd50dabce39a9b39616ddadd
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.c
    M utf8.h

  Log Message:
  -----------
  Convert utf8n_to_uvchr() to macro

It was a macro, but had a long-name function as well.  This converts to
using two macros.


  Commit: 0187f3d9c375c14d50be5f9d9c75bd696bba8b0a
      
https://github.com/Perl/perl5/commit/0187f3d9c375c14d50be5f9d9c75bd696bba8b0a
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M inline.h
    M proto.h
    M utf8.c

  Log Message:
  -----------
  Add utf8_to_uv_msgs()

This is the first of several functions with the naming style
utf8_to_uv(), and which are designed to be used instead of the
problematic current ones that are like utf8_to_uvchr().

The previous ones basically throw away crucial information in their
returns upon failure, creating hassles for the caller.  With them it is
hard to recover from malformed input and continue parsing, which is what
modern UTF-8 handlers have settled on doing.

Originally I planned to replace just the most problematic one,
utf8_to_uvchr_buf(), but I realized that each level threw away
information, so it would be better to start at the base level one, which
utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters.  The
previous functions all had to disambiguate failure returns.  This stops
that at the root.

The new series all return a boolean as to their success, with a
consistent API throughout.  The old series had one outlier, again
utf8_to_uvchr_buf(), which had a different calling convention and
returns.

The basic logic in the base level function, which this commit handles,
was sound.  It just failed to return relevant information upon failure.

The new API has somewhat different formal parameter names and uses
Size_t instead of STRLEN for one of the parameters.  It also passes the
end of string position instead of a length.  A length is problematic
because a value that should go negative instead becomes a huge positive
number.

The old base function now merely calls the new one, and throws away the
relevant information, as it always has.
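
The hazard with an unsigned length can be shown in a few lines.  The
two helpers below are hypothetical and exist only for this
illustration.  They assume both pointers refer to the same buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Length-style interface: if the cursor has already run past the end,
 * the negative difference wraps to a huge unsigned value. */
size_t
sketch_remaining_as_length(const unsigned char *s, const unsigned char *e)
{
    return (size_t)(e - s);
}

/* End-pointer-style interface: the same situation is a simple, safe
 * comparison. */
bool
sketch_has_input(const unsigned char *s, const unsigned char *e)
{
    return s < e;
}
```

With a cursor one byte past the end, the first helper reports SIZE_MAX
bytes remaining, while the second reports that no input is left.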


  Commit: e6e110f113b55f3f9b842a5a000b2c2063932429
      
https://github.com/Perl/perl5/commit/e6e110f113b55f3f9b842a5a000b2c2063932429
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.h

  Log Message:
  -----------
  Add utf8_to_uv_error(s)

This is just utf8n_to_uvchr_error() with a more convenient API that is
harder to misuse.

New code should use this new function instead of the old.


  Commit: 16d0f3cb1f7f954597e48cb4ea5d7e1c97bf5ffa
      
https://github.com/Perl/perl5/commit/16d0f3cb1f7f954597e48cb4ea5d7e1c97bf5ffa
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.h

  Log Message:
  -----------
  Add utf8_to_uv_flags()

This is just utf8n_to_uvchr() with a more convenient API that is harder
to misuse.

New code should use this new function instead of the old.


  Commit: 95f8a0bcabcf4b646686b22122f5a38a014bf369
      
https://github.com/Perl/perl5/commit/95f8a0bcabcf4b646686b22122f5a38a014bf369
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.h

  Log Message:
  -----------
  Add utf8_to_uv()

This performs the same function as utf8_to_uvchr_buf() with a more
convenient API that is much harder to misuse.

All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8>
warnings are enabled or not, and no code in core actually takes that
into account.

If warnings are enabled:

 A zero return can mean either success or failure

     Hence a zero return must be disambiguated.  Success would come
     from the next character being a NUL.

 If failure, <retlen> will be -1, so can't be used to find where to
 start parsing again.

If disabled:

 Both the return and <retlen> will be usable values, but the return
 of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
 or it could mean that that was the next character in the input and
 was successfully decoded.  It may very well not matter to you what
 the source of this particular value was.  It likely means a failure
 somewhere.  But there are occasions where you might care.

The new function returns true upon success; false on failure.  And it is
passed pointers to return the computed code point and byte length into.
These values always contain the correct information, regardless of
whether the input is malformed or not.

It is easy to test for failure in a conditional and then to take
appropriate action.  However, most often it seems the appropriate action
is to use, going forward, the REPLACEMENT CHARACTER returned in failure
cases.

And if you don't care particularly if it succeeds or not, you just use
it without testing the result.  This happens when you are confident that
the input is well-formed, or say in converting a string for display.
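
To make the calling convention concrete, here is a self-contained
sketch of a decoder with that shape: a boolean return, plus pointers
that always receive a usable code point and byte length.  It is not
Perl's utf8_to_uv(); sketch_utf8_to_cp() and sketch_count_malformed()
are hypothetical names, the decoder is deliberately simplified, and
consuming exactly one byte on failure is an assumption of this sketch
rather than a statement about what Perl does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Decode one code point from [s, e).  Returns true on success.  On
 * failure, *cp is the REPLACEMENT CHARACTER and *advance is 1, so the
 * caller can always keep going. */
bool
sketch_utf8_to_cp(const unsigned char *s, const unsigned char *e,
                  uint32_t *cp, size_t *advance)
{
    unsigned char b0;
    size_t need;
    size_t i;
    uint32_t v;
    uint32_t min;

    assert(s < e);
    b0 = s[0];

    if (b0 < 0x80) {                    /* ASCII, including NUL */
        *cp = b0;
        *advance = 1;
        return true;
    }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; v = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; v = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; v = b0 & 0x07; min = 0x10000; }
    else goto malformed;                /* stray continuation or bad start */

    if ((size_t)(e - s) < need)
        goto malformed;                 /* truncated sequence */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto malformed;             /* missing continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlongs, surrogates, and values above U+10FFFF. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        goto malformed;

    *cp = v;
    *advance = need;
    return true;

  malformed:
    *cp = SKETCH_REPLACEMENT;
    *advance = 1;
    return false;
}

/* Walk a buffer, counting failures while still making progress. */
size_t
sketch_count_malformed(const unsigned char *s, const unsigned char *e)
{
    size_t bad = 0;

    while (s < e) {
        uint32_t cp;
        size_t advance;

        if (!sketch_utf8_to_cp(s, e, &cp, &advance))
            bad++;
        s += advance;
    }
    return bad;
}
```

In this shape a NUL in the input is an ordinary success with a code
point of 0, and a genuine U+FFFD in the input is distinguishable from a
failure because the return value is true.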


  Commit: 137c0f08f75d51af8799a78b69158b6a7e40d2ff
      
https://github.com/Perl/perl5/commit/137c0f08f75d51af8799a78b69158b6a7e40d2ff
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M inline.h

  Log Message:
  -----------
  Implement utf8_to_uvchr_buf in terms of utf8_to_uv_flags

This is simpler than the existing one.


  Commit: 77b3314b8c71cddecf3ca7f12754a560d115cf08
      
https://github.com/Perl/perl5/commit/77b3314b8c71cddecf3ca7f12754a560d115cf08
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M embed.h
    M proto.h
    M utf8.h

  Log Message:
  -----------
  Add utf8_to_uv() flavors

One of these is a more explicit synonym for that function; the other two
restrict what's acceptable to Unicode's legal interchange or their C9
legal interchange.
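
For readers unfamiliar with the two restricted sets, the difference can
be sketched as predicates on a decoded code point.  The function names
below are hypothetical and are not the names added by this commit.  The
sketch assumes the usual reading: strict interchange excludes
surrogates, non-characters, and anything above U+10FFFF, while the C9
(Corrigendum #9) variant excludes surrogates and above-Unicode values
but accepts non-characters.

```c
#include <stdbool.h>
#include <stdint.h>

bool
sketch_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

bool
sketch_is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * every plane (U+xxFFFE and U+xxFFFF). */
bool
sketch_is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* C9-style acceptance: non-characters are allowed. */
bool
sketch_ok_c9(uint32_t cp)
{
    return !sketch_is_surrogate(cp) && !sketch_is_above_unicode(cp);
}

/* Strict acceptance: additionally rejects non-characters. */
bool
sketch_ok_strict(uint32_t cp)
{
    return sketch_ok_c9(cp) && !sketch_is_nonchar(cp);
}
```

For example, U+FFFE passes the C9 check but fails the strict one, and a
surrogate such as U+D800 fails both.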


  Commit: ae865e73a1fb69da88b4ad0f4b1a2443c73ab8fc
      
https://github.com/Perl/perl5/commit/ae865e73a1fb69da88b4ad0f4b1a2443c73ab8fc
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M embed.fnc
    M inline.h
    M mathoms.c
    M utf8.c

  Log Message:
  -----------
  Document new utf8_to_uv function family


  Commit: cffb5af552dad60649b537b4ffa7cbe9d2f5fcfa
      
https://github.com/Perl/perl5/commit/cffb5af552dad60649b537b4ffa7cbe9d2f5fcfa
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-12-02 (Mon, 02 Dec 2024)

  Changed paths:
    M pod/perldelta.pod
    M utf8.c

  Log Message:
  -----------
  perldelta for utf8_to_uv() family


Compare: https://github.com/Perl/perl5/compare/6a9d009e69dd...cffb5af552da
