Re: split and nils?

Brad Gilbert Thu, 07 Feb 2019 08:36:28 -0800

First off, a Str is a singular value, not a list.

Which is a good thing.


    my $a = "abc";
    my $b = "\c[COMBINING ACUTE ACCENT]";

    say $a.chars; # 3
    say $b.chars; # 1

If they were lists, then combining them should create something that
is 4 chars long, but it doesn't.

    say ($a ~ $b).chars; # 3
    say "$a$b".chars; # 3

Also let's look at the Unicode names for the characters

    .say for $a.uninames;
    # LATIN SMALL LETTER A
    # LATIN SMALL LETTER B
    # LATIN SMALL LETTER C

    .say for $b.uninames;
    # COMBINING ACUTE ACCENT

Now what are the Unicode names for the combined Str?

    .say for ($a ~ $b).uninames;
    # LATIN SMALL LETTER A
    # LATIN SMALL LETTER B
    # LATIN SMALL LETTER C WITH ACUTE
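
To make the difference between characters and codepoints visible, here
is a small sketch. The variable name `$s` is mine, and the counts assume
Rakudo's usual normalization, where `c` plus the accent is stored as a
single composed character and only comes apart again when you
explicitly ask for the decomposed form.

```raku
my $s = "abc" ~ "\c[COMBINING ACUTE ACCENT]";

say $s.chars;      # 3  (characters as the Str sees them)
say $s.codes;      # 3  (codepoints after composition)
say $s.NFD.elems;  # 4  (codepoints when fully decomposed)
```

So the accent has not been lost; it has been folded into the last
character, which a simple list of codepoints could not do for you.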

---

The `@` is also perfectly consistent.

    my $a = [1,2];
    my $b = 'abc';
    my $c = 45;
    my $d = List;
    my $e = List.new;

    say $a.perl;   # $[1, 2] # The `$` makes it act like a singular value
    say $b.perl;   # "abc"
    say $c.perl;   # 45
    say $d.perl;   # List
    say $e.perl;   # $( )

    say @$a.perl;  # [1, 2]
    say @$b.perl;  # ("abc",)
    say @$c.perl;  # (45,)
    say @$d.perl;  # (List,)
    say @$e.perl;  # ()

The `@$…` is short for `@($…)`

    say @('abc').perl; # ("abc",)
    say @(45).perl; # (45,)
    say @(List).perl; # (List,)
    say @(List.new).perl; # ()

It removes the item context from something that is [an instance of] a list, or
it creates a single element list with that value.

Basically @(…) always returns a list, and singular items act like a
list with a single value in them.

    say 123.elems; # 1
    say "abc".elems; # 1

    say "".elems; # 1
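
If what you actually want is one element per character, you have to ask
for that explicitly. A minimal sketch using `comb`, which I suggested in
my earlier reply (the variable name `$s` is just for illustration):

```raku
my $s = "abcd";

say $s.elems;       # 1  (a Str is one item)
say @$s.elems;      # 1  (still one item, now inside a list)
say $s.comb.elems;  # 4  (one element per character)
say $s.comb[1];     # b
```

The `@` never looks inside the Str; only `comb` does.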

---

Since a Str is an opaque object, the internals can store them however they like.

Someone posted Perl6 code and the C[++] equivalent in #perl6 once.
They reported that the Perl6 code was faster.
My guess is that since C[++] treats strings as an array it has to copy
the strings repeatedly.

So my guess is that Perl6 was faster there because a Str is a singular object.

The way MoarVM deals with strings is structurally similar to the
following Perl6ish pseudo code:
(There are mistakes, but they should not detract from it I hope. Some
of the mistakes are even intentional.)

    role STRING {}

    # strings that are valid ASCII
    class STRING_RAW8 does STRING {
        has int32 $.length;
        has Buf[int8] $.buffer;
    }
    # strings which contain Unicode outside of the ASCII range
    class STRING_NFG32 does STRING {
        has int32 $.length;
        has Buf[int32] $.buffer;
    }
    # strings made from other strings
    class STRING_CONCAT does STRING {
        has STRING @.a;
    }
    # a string made out of part of another string
    class STRING_SUBSTR does STRING {
        has STRING $.ref;
        has int32 $.position;
        has int32 $.length;
    }

So when you write something like this:

    use v6;
    my $a = "123";
    my $b = "⅒";
    my $c = "$a and $b";

That turns into this:

    # pseudo Perl6ish
    my Str $a = STRING_RAW8( "123" );
    my Str $b = STRING_NFG32( "⅒" );
    my Str $TEMP = STRING_RAW8( " and " );
    my Str $c = STRING_CONCAT( $a, $TEMP, $b );

If you then get a substring out of it:

    use v6;
    my Str $d = $c.substr(0,2);

It does something like

    # pseudo Perl6ish
    my STRING $d = STRING_SUBSTR( $c, 0, 2 );

    # the whole structure
    STRING_SUBSTR(
        STRING_CONCAT(
            STRING_RAW8( "123" ),
            STRING_RAW8( " and " ),
            STRING_NFG32( "⅒" )
        ),
        0,
        2
    )

At no point were the contents of the STRING ever copied.
In fact it didn't have to read the contents of the STRING at all.

(In the Real World, string concatenation does have to look at the
first and last characters of each segment for ones that will combine.)
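
If you want to play with the idea, here is a runnable toy version of
that structure. It is only a sketch: the names `Strand`, `Leaf`,
`Concat`, `SubStrand` and `render` are invented for this example and are
not MoarVM's real types, and it ignores the combining-character check
mentioned above. The point is only that building `$z` and `$w` stores
references to the existing pieces, and nothing is flattened until
`render` is called.

```raku
# Toy model only: hypothetical names, not MoarVM's real data structures.
role Strand {
    method render(--> Str) { ... }   # every kind of strand must implement this
}

# A plain piece of text.
class Leaf does Strand {
    has Str $.text;
    method render(--> Str) { $!text }
}

# A string made from other strings; it only holds references to them.
class Concat does Strand {
    has Strand @.parts;
    method render(--> Str) { @!parts.map(*.render).join }
}

# A string made out of part of another string.
class SubStrand does Strand {
    has Strand $.source;
    has Int    $.position;
    has Int    $.length;
    method render(--> Str) { $!source.render.substr($!position, $!length) }
}

my $x = Leaf.new(text => "123");
my $y = Leaf.new(text => "⅒");
my $z = Concat.new(parts => ($x, Leaf.new(text => " and "), $y));
my $w = SubStrand.new(source => $z, position => 0, length => 2);

say $x === $z.parts[0];   # True: the concat refers to the same object
say $w.render;            # 12
```

A real VM would avoid flattening the whole thing just to take a
substring; the toy `render` does it the lazy way to keep the example
short.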

---

Basically C has to copy the contents of strings while MoarVM can just
copy pointers to string objects.

If Perl6 treated strings as an Array then some of this performance
improvement wouldn't quite work as well.

Let's pretend that it acts like an Array:

    use v6;
    my Str $e = $c[0,1];

That would result in the following Perl6ish VM code:

    # pseudo Perl6ish
    my $e = STRING_CONCAT(
        STRING_SUBSTR( $c, 0, 1 ),
        STRING_SUBSTR( $c, 1, 1 )
    );

    # the whole structure
    STRING_CONCAT(
        STRING_SUBSTR(
            STRING_CONCAT(
                STRING_RAW8( "123" ),
                STRING_RAW8( " and " ),
                STRING_NFG32( "⅒" )
            ),
            0,
            1
        ),
        STRING_SUBSTR(
            STRING_CONCAT(
                STRING_RAW8( "123" ),
                STRING_RAW8( " and " ),
                STRING_NFG32( "⅒" )
            ),
            1,
            1
        )
    )

Translating that back into real Perl6:

    use v6;
    my Str $e = substr( $c, 0, 1 ) ~ substr( $c, 1, 1 );

So if Perl6 did treat Str as an Array, then it would be slower, and
use more memory.
It also might not be able to handle Unicode correctly.

Also my guess is that the majority of string related bugs in other languages are
caused by them treating strings as an array of characters.


On Wed, Feb 6, 2019 at 1:56 PM ToddAndMargo via perl6-users
<perl6-users@perl.org> wrote:
>
>  > On Tue, Feb 5, 2019 at 11:05 PM ToddAndMargo via perl6-users
>  > <perl6-users@perl.org> wrote:
>  >>
>  >> Hi All,
>  >>
>  >> What is with the starting ending Nils?  There are only four
>  >> elements, why now six?
>  >>
>  >> And how to I correct this?
>  >>
>  >> $ p6 'my Str $x="abcd";
>  >>        for split( "",@$x ).kv -> $i,$j {
>  >>        say "Index <$i> = <$j> = ord <" ~ ord($j) ~ ">";}'
>  >>
>  >> Use of Nil in string context
>  >>     in block  at -e line 1
>  >> Index <0> = <> = ord <>         <----------------- nil ???
>  >> Index <1> = <a> = ord <97>
>  >> Index <2> = <b> = ord <98>
>  >> Index <3> = <c> = ord <99>
>  >> Index <4> = <d> = ord <100>
>  >> Use of Nil in string context
>  >>     in block  at -e line 1
>  >> Index <5> = <> = ord <>         <----------------- nil ???
>  >>
>  >>
>  >> Many thanks,
>  >> -T
>
> On 2/6/19 5:19 AM, Brad Gilbert wrote:
> > The reason there is a Nil, is you asked for the ord of an empty string.
> >
> >      "".ord =:= Nil
> >
> > The reason there are two empty strings is you asked for them.
> >
> > When you split with "", it will split on every character boundary,
> > which includes before the first character, and after the last.
> > That's literally what you asked for.
> >
> >      my Str $x = "abcd";
> >      say split( "", $x ).perl;
> >      # ("", "a", "b", "c", "d", "").Seq
> >
> > Perl6 doesn't treat this as a special case like other languages do.
> > You basically asked for this:
> >
> >      say split( / <after .> | <before .> /, $x ).perl;
> >      # ("", "a", "b", "c", "d", "").Seq
> >
> > Perl6 gave you what you asked for.
> >
> > That is actually useful btw:
> >
> >      say split( "", "abcd" ).join("|");
> >      # |a|b|c|d|
> >
> > You should be using `comb` if you want a list of characters not `split`.
> >
> >      # these are all identical
> >      'abcd'.comb.kv
> >      'abcd'.comb(1).kv
> >      comb( 1, 'abcd' ).kv
> >
> > Also why did you add a pointless `@` to `$x` ?
> > (Actually I'm fairly sure I know why.)
> >
>
> Hi Brad,
>
> Thank you!
>
> So it is a "feature" of split.  Split sees the non-existent
> index before the start and the non-existent index after
> the end as something.  Mumble. Mumble.
>
> To answer you question about the stray "@".  I forgot
> to remove it.
>
> But it brings up an inconsistency in Perl 6.
>
> This works and also is the source of the stay "@" I forgot
> to remove from the split example.
>
>
> $ p6 'my Buf $x=Buf.new(0x66,0x61,0x62,0x63); for @$x.kv -> $i, $j {say
> "Index <$i> = <$j> = chr <" ~ chr($j) ~ ">";}'
>
> Index <0> = <102> = chr <f>
> Index <1> = <97> = chr <a>
> Index <2> = <98> = chr <b>
> Index <3> = <99> = chr <c>
>
>
>
> So, this should also work, but does not:
>
> $ p6 'my Str $x="abcd"; for @$x.kv -> $i, $j {say "Index <$i> = <$j> =
> ord <" ~ ord($j) ~ ">";}'
>
> Index <0> = <abcd> = ord <97>
>
>
> Strings only have one index (0) and why we have the substr command.
>
> $ p6 'my Str $x="abcd"; say $x[0];'
> abcd
>
>
> So all the rules for other arrays go out the window for
> a Str.  A string is an array of one cell.  And if I
> truly want an array of characters, I need to use Buf
> and not Str.  Only problem is that Str has all the cool
> tools.
>
> -T
