After a hiatus of 10+ years from writing any programs I am now writing some
programs to generate synthetic test data for a performance bakeoff between
two (or more) different systems.  So we want exactly the same data on each
system.  One principle I am following is never to call a random number
generator.
The following subroutine deterministic_shuffle uses some of the digits in
fractional part of the log of a number in base e as a pseudo-random number.
I have forgotten most of the Perl I once knew -- even some elementary stuff
like array references. 

I am looking for two different kinds of feedback:
1. How to improve the code.  It is not to be more "perlish" or "perly", but
better from a software engineering perspective.  In fact the less perlish
the better, because it is possible this algorithm may need to be implemented
in other languages.  I suppose I should remove all the warn statement with
info I used to debug this.  But they don't seem to hurt too much.
2. How to improve the mathematics used. I think these digits will be random
enough for my purposes.  See especially the comment starting:    # Maybe
later have this multiplier be something like 10 * $array_size

#!/usr/local/bin/perl -w
# This file is deterministic_shuffle.pl
# Written by Steven Tolkin starting 26-June-2017

use strict;

sub deterministic_shuffle {

# The irrational number e was chosen because almost all systems have
# a way to take natural logarithms, e.g., using the log() or ln() function,
# and the results should all agree to many decimal places

# The idea is to take part of the fractional part, to get a pseudo-random
# number, and then shuffle by that.

# Later do a better job checking arguments
my $array_ref = shift;
my @in_array = @$array_ref;
## warn "dbg in sub in_array:\n", join("\n",@in_array), "\n";

# The first part only cares about the size of the array -- not its contents

my %fraction; # this contains the fractional part of the log

# We must have the same number of elements in the output array as in the
input
# array we are shuffling.

my $array_size = scalar @in_array; # 

for (my $i=0; $i < $array_size; $i++ ) {
    # Must make sure this sub works for arrays of size 0 or 1, as well as
    # other typical sizes. So add 2 to each number to prevent taking the log
    # of 0 (undefined) or the log of 1 (always exactly 0).

    # Maybe later add the array size also, so arrays of different sizes have
    # very different shuffles.
    
    my $log = log($i+2);
    
    # Maybe later have this multiplier be something like 10 * $array_size
    # rather than always be 10000 -- that is suitable for arrays up to about
    # 1000 elements.
    
    # The multiplier is used to avoid the leading digits after the decimal
    # point.  They will be monotonically increasing for each integer portion
    # of the log, which is undesirable.  We do want the output to be
    # pseudo-random.
    my $multiplier = 10000;
    $fraction{$i} = $log*$multiplier - int($log*$multiplier);
    ## warn "dbg ", $fraction{$i},"\n";
}

my %hash;
my $key;
my $value; 

%hash = %fraction; # Just to use the copied code below without editing it.

## while (($key, $value) = each %hash) {warn "dbg hash key:$key
value:$value\n";} # # Just to debug

# Since my values are numeric rather than strings I could use the spaceship
# comparison operator <=> rather than cmp operator.  But I decided to stay
# with cmp since conversion to a string takes little time for the smallish
# arrays (about a thousand elements) we typically use.  We do want to break
# ties deterministically, even in the extremely unlikely case of two
different
# numbers having the exact same fraction.  See also
# http://www.perlmonks.org/?node_id=841581# 
# <q>If this matters, you can add an extra term to your sort so it sorts
first
# by value then by a guaranteed unique quantity, the key</q>

my @keys = sort { $hash{$a} cmp $hash{$b} or $a cmp $b} keys %hash;

# The contents of the @keys is now an array with numbers like: 5,1,3,6,0,4,2
# [for array size of 7]

## warn "dbg keys:\n", join("\n",@keys), "\n";

# Finally do the shuffle part

my @out_array;
for (my $i = 0; $i < $array_size; $i++) {
        $out_array[$i] = $in_array[$keys[$i]];
        ## warn "dbg in sub out_array $i: ", $out_array[$i], "\n";
};

## warn "dbg in sub out_array:\n", join("\n",@out_array), "\n";

return @out_array;
}

# Various tests including empty array and array with just one element.

# my @test_in_array = ();
# my @test_in_array = (1); 
# my @test_in_array = (1,1,1,2,2,3,3,4);
my @test_in_array = (0,1,2,3,4,5,6);

warn "dbg test_in_array:\n", join("\n",@test_in_array), "\n";

# pass in a reference to the array - note the backslash
my @test_out_array = deterministic_shuffle(\@test_in_array); 

warn "dbg test_out_array:\n", join("\n",@test_out_array), "\n";

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

Reply via email to