Re: Primitive benchmark comparison (parsing LDIF)

yary Thu, 28 Oct 2021 06:46:51 -0700

A small thing to begin with, in the regex:

m/ ^ (@attributes) ':' \s (.+) $ /;


All the string examples use the literal ': ' (colon+space), so how about
making the regex consistent with them? It could also allow the empty string
as a value, as the string examples do.

m/ ^ (@attributes) ': ' (.*) $ /;
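
To show what the tightened regex accepts, here is a minimal sketch. The parse-attr helper and the sample lines are made up for illustration; only the regex itself comes from the suggestion above.

```raku
# Hypothetical helper: returns a key/value Pair, or Nil if the line
# does not start with a known attribute followed by ': '.
sub parse-attr(Str $line, @attributes) {
    $line ~~ m/ ^ (@attributes) ': ' (.*) $ /
        ?? Pair.new(~$0, ~$1)
        !! Nil
}

my @attributes = <uid uidNumber gidNumber homeDirectory mode>;

# Made-up sample lines, not taken from the real LDIF file.
for 'uid: alice', 'uidNumber: 1000', 'homeDirectory: ', 'cn: Alice', 'uid:alice' -> $line {
    my $pair = parse-attr($line, @attributes);
    say $pair.defined ?? "matched  {$pair.raku}" !! "skipped  {$line.raku}";
}
```

The 'homeDirectory: ' line is the case the original (.+) would have rejected: the value is empty but the line still has a known key followed by colon+space.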

Next, how about adding a second regex test, similar to the "split" one, that
also relies on User ignoring unknown fields? It accepts an empty-string key,
as the "split" string handler does.

m/ ^ (<-[:]>*) ': ' (.*) /;
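
A rough side-by-side of this regex and the split() call, as a sketch only. The by-regex and by-split helpers and the sample lines are invented for illustration. One caveat, if I read the regex right: the two are not quite identical, because <-[:]>* cannot cross the first colon. A line whose first colon is not followed by a space (such as the made-up 'jpegPhoto:: AAAA' below) fails the regex, while split() yields the key 'jpegPhoto:'. Since User ignores unknown fields, that difference should not matter for this benchmark.

```raku
# Hypothetical helpers for a side-by-side comparison.
sub by-regex(Str $line) {
    # Key is everything up to the first colon; Nil when there is no match.
    $line ~~ m/ ^ (<-[:]>*) ': ' (.*) / ?? (~$0, ~$1) !! Nil
}

sub by-split(Str $line) {
    # Split on the first ': ' only, as in the "split" MAIN.
    split( ': ', $line, 2 ).List
}

# Made-up sample lines.
for 'uid: alice', ': empty key', 'description: a: b', 'jpegPhoto:: AAAA' -> $line {
    say "{$line.raku}  regex: {by-regex($line).raku}  split: {by-split($line).raku}";
}
```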


-y


On Thu, Oct 28, 2021 at 2:14 AM Norman Gaywood <ngayw...@une.edu.au> wrote:

> Oh, and I welcome suggestions on how I might do the task more quickly,
> elegantly, differently, etc :-)
> And critiques of the code also welcome. I still have a strong perl5 accent
> I suspect.
>
> On Thu, 28 Oct 2021 at 13:15, Norman Gaywood <ngayw...@une.edu.au> wrote:
>
>> Executive summary:
>>     - comparing raku 2021.10 with raku 2021.9
>>     - comparing 3 ways of parsing (although the 2 string-function ways
>>       are similar)
>>     - raku 2021.10 is more than twice as fast as 2021.9 using the
>>       string functions
>>     - raku 2021.10 is about the same as 2021.9 using a more general
>>       regular expression
>>     - regular expressions are still slow in 2021.10
>>
>> Side note: not shown here is also parsing with Text::LDIF. In 2021.9 it
>> was comparable to the regex method. Not tried with 2021.10.
>>
>> I need to parse a 40K entry LDIF file.
>>
>> Below is some code that uses 3 ways to parse.
>> There are 3 MAIN subs that differ in a few last lines of the for loop.
>> The loop reads the LDIF entries and populates %ldap keyed on the "uid" of
>> the LDIF entry.
>> The values of %ldap are User objects.
>> A %f hash is used to build the values of User on each LDIF entry
>>
>> The aim is to show the difference in timings between 3 ways of parsing
>> the LDIF
>>
>> The 1st MAIN (regex) uses this general regular expression to build %f
>>         next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>>         %f{$0} = "$1";
>>
>> The "starts" MAIN uses starts-with() to build %f
>>         for @attributes -> $a {
>>             if $line.starts-with( $a ~ ": " ) {
>>                 %f{$a} = (split( ": ", $line, 2))[1];
>>                 last;
>>             }
>>         }
>>
>> And finally the "split" MAIN uses split() but also uses the feature that
>> User.new() will ignore attributes that are not used.
>>         ($k, $v) = split( ": ", $line, 2);
>>         %f{$k} = $v;
>>
>> That's the difference between the MAIN()'s below. Sorry I couldn't golf
>> it down more.
>> Running the benchmarks multiple times does vary the times slightly but
>> not significantly.
>>
>> Results for rakudo-pkg-2021.9.0-01:
>> $ ./icheck.raku regex
>> 41391 entries by regex in 27.859560887 seconds
>> $ ./icheck.raku starts
>> 41391 entries by starts-with in 5.970667533 seconds
>> $ ./icheck.raku split
>> 41391 entries by split in 5.12252741 seconds
>>
>> Results for rakudo-pkg-2021.10.0-01
>> $ ./icheck.raku regex
>> 41391 entries by regex in 27.833870158 seconds
>> $ ./icheck.raku starts
>> 41391 entries by starts-with in 2.560101599 seconds
>> $ ./icheck.raku split
>> 41391 entries by split in 2.307679407 seconds
>>
>> -------------------------------------
>> #!/usr/bin/env raku
>>
>> class User {
>>     has $.uid;
>>     has $.uidNumber;
>>     has $.gidNumber;
>>     has $.homeDirectory;
>>     has $.mode = 0;
>>
>>     method attributes {
>>        # return <uid uidNumber gidNumber homeDirectory mode>;
>>        User.^attributes(:local)>>.name>>.substr(2);  # Is the order guaranteed?
>>     }
>> }
>>
>> # Read user info from LDIF file
>> my %ldap;
>> my @attributes = User.attributes;
>>
>> multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
>>     my ( %f );
>>     for $ldif-fn.IO.lines -> $line {
>>         when not $line {  # blank line is LDIF entry terminator
>>             %ldap{%f<uid>} = User.new( |%f );
>>         }
>>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>>
>>         next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>>         %f{$0} = "$1";
>>     }
>>     say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
>> }
>>
>> multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
>>     my ( %f );
>>     for $ldif-fn.IO.lines -> $line {
>>         when not $line {  # blank line is LDIF entry terminator
>>             %ldap{%f<uid>} = User.new( |%f );
>>         }
>>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>>
>>         for @attributes -> $a {
>>             if $line.starts-with( $a ~ ": " ) {
>>                %f{$a} = (split( ": ", $line, 2))[1];
>>                last;
>>             }
>>          }
>>
>>     }
>>     say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
>> }
>>
>> multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
>>     my ( %f, $k, $v );
>>     for $ldif-fn.IO.lines -> $line {
>>         when not $line {  # blank line is LDIF entry terminator
>>             %ldap{%f<uid>} = User.new( |%f );         # attributes not used are ignored
>>         }
>>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>>
>>         ($k, $v) = split( ": ", $line, 2);
>>         %f{$k} = $v;
>>     }
>>     say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
>> }
>>
>> --
>> Norman Gaywood, Computer Systems Officer
>> School of Science and Technology
>> University of New England
>> Armidale NSW 2351, Australia
>>
>> ngayw...@une.edu.au  http://turing.une.edu.au/~ngaywood
>> Phone: +61 (0)2 6773 2412  Mobile: +61 (0)4 7862 0062
>>
>> Please avoid sending me Word or Power Point attachments.
>> See http://www.gnu.org/philosophy/no-word-attachments.html
>>
