After a hiatus of 10+ years from writing any programs I am now writing some
programs to generate synthetic test data for a performance bakeoff between
two (or more) different systems. So we want exactly the same data on each
system. One principle I am following is never to call a random number
generator.
The following subroutine deterministic_shuffle uses some of the digits in
fractional part of the log of a number in base e as a pseudo-random number.
I have forgotten most of the Perl I once knew -- even some elementary stuff
like array references.
I am looking for two different kinds of feedback:
1. How to improve the code. It is not to be more "perlish" or "perly", but
better from a software engineering perspective. In fact the less perlish
the better, because it is possible this algorithm may need to be implemented
in other languages. I suppose I should remove all the warn statement with
info I used to debug this. But they don't seem to hurt too much.
2. How to improve the mathematics used. I think these digits will be random
enough for my purposes. See especially the comment starting: # Maybe
later have this multiplier be something like 10 * $array_size
#!/usr/local/bin/perl -w
# This file is deterministic_shuffle.pl
# Written by Steven Tolkin starting 26-June-2017
use strict;
sub deterministic_shuffle {
# The irrational number e was chosen because almost all systems have
# a way to take natural logarithms, e.g., using the log() or ln() function,
# and the results should all agree to many decimal places
# The idea is to take part of the fractional part, to get a pseudo-random
# number, and then shuffle by that.
# Later do a better job checking arguments
my $array_ref = shift;
my @in_array = @$array_ref;
## warn "dbg in sub in_array:\n", join("\n",@in_array), "\n";
# The first part only cares about the size of the array -- not its contents
my %fraction; # this contains the fractional part of the log
# We must have the same number of elements in the output array as in the
input
# array we are shuffling.
my $array_size = scalar @in_array; #
for (my $i=0; $i < $array_size; $i++ ) {
# Must make sure this sub works for arrays of size 0 or 1, as well as
# other typical sizes. So add 2 to each number to prevent taking the log
# of 0 (undefined) or the log of 1 (always exactly 0).
# Maybe later add the array size also, so arrays of different sizes have
# very different shuffles.
my $log = log($i+2);
# Maybe later have this multiplier be something like 10 * $array_size
# rather than always be 10000 -- that is suitable for arrays up to about
# 1000 elements.
# The multiplier is used to avoid the leading digits after the decimal
# point. They will be monotonically increasing for each integer portion
# of the log, which is undesirable. We do want the output to be
# pseudo-random.
my $multiplier = 10000;
$fraction{$i} = $log*$multiplier - int($log*$multiplier);
## warn "dbg ", $fraction{$i},"\n";
}
my %hash;
my $key;
my $value;
%hash = %fraction; # Just to use the copied code below without editing it.
## while (($key, $value) = each %hash) {warn "dbg hash key:$key
value:$value\n";} # # Just to debug
# Since my values are numeric rather than strings I could use the spaceship
# comparison operator <=> rather than cmp operator. But I decided to stay
# with cmp since conversion to a string takes little time for the smallish
# arrays (about a thousand elements) we typically use. We do want to break
# ties deterministically, even in the extremely unlikely case of two
different
# numbers having the exact same fraction. See also
# http://www.perlmonks.org/?node_id=841581#
# <q>If this matters, you can add an extra term to your sort so it sorts
first
# by value then by a guaranteed unique quantity, the key</q>
my @keys = sort { $hash{$a} cmp $hash{$b} or $a cmp $b} keys %hash;
# The contents of the @keys is now an array with numbers like: 5,1,3,6,0,4,2
# [for array size of 7]
## warn "dbg keys:\n", join("\n",@keys), "\n";
# Finally do the shuffle part
my @out_array;
for (my $i = 0; $i < $array_size; $i++) {
$out_array[$i] = $in_array[$keys[$i]];
## warn "dbg in sub out_array $i: ", $out_array[$i], "\n";
};
## warn "dbg in sub out_array:\n", join("\n",@out_array), "\n";
return @out_array;
}
# Various tests including empty array and array with just one element.
# my @test_in_array = ();
# my @test_in_array = (1);
# my @test_in_array = (1,1,1,2,2,3,3,4);
my @test_in_array = (0,1,2,3,4,5,6);
warn "dbg test_in_array:\n", join("\n",@test_in_array), "\n";
# pass in a reference to the array - note the backslash
my @test_out_array = deterministic_shuffle(\@test_in_array);
warn "dbg test_out_array:\n", join("\n",@test_out_array), "\n";
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm