On Mon, 20 Mar 2017 01:19:43 -0700, j...@durchholz.org wrote:
> I was wondering how the Unicode consortium might extend this limit, so I 
> investigated a bit.
> 
> TL;DR
> 
> I can confirm that 10ffff is going to remain the maximum for the 
> foreseeable future.

Thanks for sharing your findings!

I looked some more at our code and the tests we have in roast. Things are 
complicated ... It probably wasn't wise of me to mix the original point of 
this issue ("chr() is reducing the supplied codepoint number mod 2**32") with 
the question of the maximum allowed code point. But here we go.

At one point in 2014 we had additional validity checks for nqp::chr. Those 
checks also looked for the upper limit of 0x10ffff. According to the IRC logs 
[1] the checks were removed [2], because the Unicode Consortium made it clear 
in "Corrigendum #9: Clarification About Noncharacters" [3] that Noncharacters 
are not illegal, but reserved for internal use. (See also the answer to the 
question "Are noncharacters invalid in Unicode strings and UTFs?" in the FAQ 
[4].)

AFAIU the check for the upper limit was useful, since 0x110000 and above are 
illegal (as opposed to the Noncharacters). Trying to add those checks back, I 
found failures in S32-str/encode.t on MoarVM. There are tests that expect the 
following code to live. The tests were added for RT 123673: 
https://rt.perl.org/Ticket/Display.html?id=123673

$ ./perl6-m -e '"\x[FFFFFF]".sink; say "alive"'  # .sink to avoid warning
alive

Another thing to note in this context: Since we have \x, the patch from lizmat 
didn't fix the whole mod 2**32 thing:

$ ./perl6-m -e 'chr(0x100000063).sink; say "alive"'  # dies as expected
chr codepoint too large: 4294967395
  in block <unit> at -e line 1
$ ./perl6-m -e '"\x[100000063]".sink; say "alive"'   # does not die
alive

So, adding the check for the upper limit for MoarVM [5] led to failing tests in 
S32-str/encode.t and did not help with the mod 2**32 problem. (AFAIU the 
conversion to 32 bit is done before the code from [5] in src/strings/ops.c 
runs.)

On the JVM backend things look a bit better. Adding similar code to method chr 
in src/vm/jvm/runtime/org/perl6/nqp/runtime/Ops.java helps with the upper limit 
for code points and helps with the mod 2**32 problem (since we cast to int 
after said check). The tests from S32-str/encode.t were failing before (they 
have been fudged for a while).
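
To make the ordering issue concrete, here is a minimal C sketch. It is not 
MoarVM or NQP code: the function names and the MAX_CODEPOINT macro are 
hypothetical, and the narrowing cast only stands in for whatever conversion to 
32 bit happens before the real check runs.

```c
#include <stdint.h>

#define MAX_CODEPOINT 0x10FFFF  /* highest valid Unicode code point */

/* Hypothetical: the value is narrowed to 32 bits first and checked afterwards.
 * The cast keeps only the low 32 bits, so 0x100000063 turns into 0x63 before
 * the range check ever sees it. */
int check_after_narrowing(int64_t requested) {
    uint32_t cp = (uint32_t)requested;  /* well-defined reduction mod 2**32 */
    return cp <= MAX_CODEPOINT;
}

/* Hypothetical: the full 64-bit value is checked before any narrowing, so
 * both the upper limit and the mod 2**32 wrap-around are caught. */
int check_before_narrowing(int64_t requested) {
    return requested >= 0 && requested <= MAX_CODEPOINT;
}
```

With these definitions, check_after_narrowing(0x100000063) accepts the value 
(it only sees 0x63), while check_before_narrowing(0x100000063) rejects it. 
Both variants reject 0xFFFFFF, which is why adding an upper limit check 
conflicts with the tests in S32-str/encode.t regardless of the ordering.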

I'd be glad if someone with a deeper knowledge would double check if these 
tests are correct wrt "\x[FFFFFF]": 
https://github.com/perl6/roast/blob/add852b082a2fca83dbefe03d890dd5939c5ff45/S32-str/encode.t#L70-L89

In case they are dubious, I'd propose to add a validity check for the upper 
limit to MVM_string_chr (MoarVM) and chr (JVM). That would only leave the mod 
2**32 problem on MoarVM.


[1] https://irclog.perlgeek.de/perl6/2014-03-28#i_8509990 (and below)

[2] https://github.com/usev6/nqp/commit/a4eda0bcd2 (JVM) and 
https://github.com/MoarVM/MoarVM/commit/d93a73303f (MoarVM)

[3] http://www.unicode.org/versions/corrigendum9.html

[4] http://www.unicode.org/faq/private_use.html#nonchar8

[5] $ git diff
diff --git a/src/strings/ops.c b/src/strings/ops.c
index 9bfa536..7e77d21 100644
--- a/src/strings/ops.c
+++ b/src/strings/ops.c
@@ -1919,6 +1919,8 @@ MVMString * MVM_string_chr(MVMThreadContext *tc, MVMCodepoint cp) {
 
     if (cp < 0)
         MVM_exception_throw_adhoc(tc, "chr codepoint cannot be negative");
+    else if (cp > 0x10ffff)
+        MVM_exception_throw_adhoc(tc, "chr codepoint cannot be greater than 0x10FFFF");
 
     MVM_unicode_normalizer_init(tc, &norm, MVM_NORMALIZE_NFG);
     if (!MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, cp, &g)) {
