Short answer: to my knowledge, if you can make a string contain invalid
codepoints, that is a bug and should be reported so that it can be fixed.
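
If you want an explicit check anyway (question 3), a minimal sketch of the
"Str where ..." idea could look like the following. The names
is-scalar-value, WellFormedStr and takes-clean-string are made up for
illustration, and it assumes, as the question itself notes, that .ords hands
back NFC codepoints, so NFG synthetics never show up in the list. I have not
benchmarked it, so treat it as a starting point rather than the most
performant option.

```raku
# Hypothetical helper: True if $cp is a Unicode scalar value, i.e. in
# 0..0x10FFFF but not a UTF-16 surrogate (0xD800..0xDFFF).
sub is-scalar-value(Int $cp --> Bool) {
    so (0 <= $cp <= 0xD7FF || 0xE000 <= $cp <= 0x10FFFF)
}

# Hypothetical subset: accepts a Str only if every codepoint reported by
# .ords passes the check above. Nothing is coerced; it is a yes/no test.
subset WellFormedStr of Str
    where { .ords.grep({ !is-scalar-value($_) }).elems == 0 };

# Example of an API entry point that only accepts such strings.
sub takes-clean-string(WellFormedStr $s) { $s.chars }

say takes-clean-string("héllo");
```

If the guarantee in the short answer holds, the where clause should never
reject anything, so it mostly serves as documentation of the contract.
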
> On 15 Sep 2019, at 23:08, Darren Duncan wrote:
>
> I'm defining an API that takes only well formed Str objects, meaning it would
> only accept Str whose Unicode codepoints are all in the set
> {0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate
> characters, and it would do so as a yes/no stricture without coercing
> anything outside of the set.
>
> I am aware of how behind the scenes Perl 6 uses multiple levels of
> abstraction for Str objects, and in particular may often use Normal Form G to
> utilize codepoints above 0x10FFFF to be able to represent graphemes in
> constant space.
>
> I have a few questions:
>
> 1. Do I even have to test the Str at all? Does Perl 6 guarantee that all Str
> are well formed, such that for example if one tried to decode UTF-16 that
> contained invalid surrogate codepoints (single ones or ones not properly
> paired up) that this would fail early, or is it possible that a Str could be
> created without fuss that contains the invalid surrogates? I suspect Perl 6's
> inherent laziness would make passing through invalid codepoints more likely,
> but perhaps that isn't the case.
>
> 2. Does Perl 6 ever have Str that are not internally in some normal form?
> That is, if a file contains say a mixture of NFC and NFD, the actual
> codepoints will be preserved at the start until some operation requires them
> to be in a normal form? I'm thinking this may be a good case for laziness,
> eg you don't need normal forms to just move data around, but it can help if
> you want to count graphemes, so it only normalizes when such an operation
> happens.
>
> 3. If a Str can contain invalid surrogates or be wrong in some other way,
> what is the best / most performant way to test that a Str is only valid?
> Context is akin to a "Str where ..." and what we put in the "...".
>
> 4. How can I get the actual codepoints from a Str without normalizing them
> first? I realize for typical use cases, explicitly using the NFC/NFD etc
> methods, or "ords" which uses NFC, is the most correct, but if say I just
> want what we already have, how would I do that? I realize the result may not
> be particularly useful in the face of NFG.
>
> For a wider context, I know that in other programming languages like .NET or
> Java it is possible for their strings to have invalid surrogates, and I'm
> trying to figure out if Perl 6 can have the same problem or not.
>
> Thank you.
>
> -- Darren Duncan