Re: Is it possible for Str to not be well formed?

2019-09-17 Thread Elizabeth Mattijsen
Short answer: to my knowledge, if you can make a string contain invalid 
codepoints, it is a bug and should be reported so that it can be fixed.

> On 15 Sep 2019, at 23:08, Darren Duncan wrote:
> 
> I'm defining an API that takes only well formed Str objects, meaning it would 
> only accept Str whose Unicode codepoints are all in the set 
> {0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate 
> characters, and it would do so as a yes/no stricture without coercing 
> anything outside of the set.
> 
> I am aware of how behind the scenes Perl 6 uses multiple levels of 
> abstraction for Str objects, and in particular may often use Normal Form G to 
> utilize codepoints above 0x10FFFF to be able to represent graphemes in 
> constant space.
> 
> I have a few questions:
> 
> 1. Do I even have to test the Str at all?  Does Perl 6 guarantee that all Str 
> are well formed, such that for example if one tried to decode UTF-16 that 
> contained invalid surrogate codepoints (single ones or ones not properly 
> paired up) that this would fail early, or is it possible that a Str could be 
> created without fuss that contains the invalid surrogates?  I suspect Perl 6's 
> inherent laziness would make passing through invalid codepoints more likely, 
> but perhaps that isn't the case.
> 
> 2. Does Perl 6 ever have Str that are not internally in some normal form?  
> That is, if a file contains, say, a mixture of NFC and NFD, will the actual 
> codepoints be preserved at the start until some operation requires them 
> to be in a normal form?  I'm thinking this may be a good case for laziness, 
> eg you don't need normal forms to just move data around, but it can help if 
> you want to count graphemes, so it only normalizes when such an operation 
> happens.
> 
> 3. If a Str can contain invalid surrogates or be wrong in some other way, 
> what is the best / most performant way to test that a Str is only valid?  
> Context is akin to a "Str where ..." and what we put in the "...".
> 
> 4. How can I get the actual codepoints from a Str without normalizing them 
> first?  I realize for typical use cases, explicitly using the NFC/NFD etc 
> methods, or "ords" which uses NFC, is the most correct, but if say I just 
> want what we already have, how would I do that?  I realize the result may not 
> be particularly useful in the face of NFG.
> 
> For a wider context, I know that in other programming languages like .NET or 
> Java it is possible for their strings to have invalid surrogates, and I'm 
> trying to figure out if Perl 6 can have the same problem or not.
> 
> Thank you.
> 
> -- Darren Duncan


Is it possible for Str to not be well formed?

2019-09-15 Thread Darren Duncan
I'm defining an API that takes only well formed Str objects, meaning it would 
only accept Str whose Unicode codepoints are all in the set 
{0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate 
characters, and it would do so as a yes/no stricture without coercing anything 
outside of the set.


I am aware of how behind the scenes Perl 6 uses multiple levels of abstraction 
for Str objects, and in particular may often use Normal Form G to utilize 
codepoints above 0x10FFFF to be able to represent graphemes in constant space.


I have a few questions:

1. Do I even have to test the Str at all?  Does Perl 6 guarantee that all Str 
are well formed, such that for example if one tried to decode UTF-16 that 
contained invalid surrogate codepoints (single ones or ones not properly paired 
up) that this would fail early, or is it possible that a Str could be created 
without fuss that contains the invalid surrogates?  I suspect Perl 6's inherent 
laziness would make passing through invalid codepoints more likely, but perhaps 
that isn't the case.


2. Does Perl 6 ever have Str that are not internally in some normal form?  That 
is, if a file contains, say, a mixture of NFC and NFD, will the actual codepoints 
be preserved at the start until some operation requires them to be in a normal 
form?  I'm thinking this may be a good case for laziness, eg you don't need 
normal forms to just move data around, but it can help if you want to count 
graphemes, so it only normalizes when such an operation happens.


3. If a Str can contain invalid surrogates or be wrong in some other way, what 
is the best / most performant way to test that a Str is only valid?  Context is 
akin to a "Str where ..." and what we put in the "...".


4. How can I get the actual codepoints from a Str without normalizing them 
first?  I realize for typical use cases, explicitly using the NFC/NFD etc 
methods, or "ords" which uses NFC, is the most correct, but if say I just want 
what we already have, how would I do that?  I realize the result may not be 
particularly useful in the face of NFG.


For a wider context, I know that in other programming languages like .NET or 
Java it is possible for their strings to have invalid surrogates, and I'm trying 
to figure out if Perl 6 can have the same problem or not.


Thank you.

-- Darren Duncan
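
For illustration only, the yes/no range check described in the question can be sketched in a language-neutral way. The sketch below uses Python rather than Perl 6, because a Python str, like a .NET or Java string, can hold a lone surrogate. The function name is_well_formed is hypothetical, and nothing here says how Perl 6 itself behaves; it only shows the test against the set {0..0xD7FF,0xE000..0x10FFFF}.

```python
def is_well_formed(text: str) -> bool:
    """Return True if every code point is a Unicode scalar value,
    i.e. lies in 0..0xD7FF or 0xE000..0x10FFFF (no surrogates)."""
    return all(
        0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF
        for cp in map(ord, text)
    )


# A lone high surrogate: legal inside a Python str, but not well formed.
lone_surrogate = "abc" + chr(0xD800)

print(is_well_formed("héllo \U0001F600"))  # True
print(is_well_formed(lone_surrogate))      # False
```

The check is a pure predicate: it rejects input outside the set and never coerces or repairs it, which matches the "yes/no stricture" asked for.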