RE: How to manage around 1000 entries which will constantly bechanging

Wagner, David --- Senior Programmer Analyst --- WGO Mon, 03 Apr 2006 11:22:43 -0700

Mr. Shawn H. Corey wrote:
> On Sat, 2006-01-04 at 09:57 -0500, Frank Bax wrote:
>> I'm not the OP, but I have a script with a similar problem.  The
>> script has some logic that generates many (thousands of billions) of
>> combinations from a little bit of data and only the best 100 combos
>> are output. For each combination produced by script, a value is
>> calculated.  I maintain an array of "best" results so far; if size
>> of this array is <100, then a "new" result is automatically added. 
>> If there are already 100 items, I find the "worst" entry in the
>> array.  If the "new" entry is better than the "worst", the "new" one
>> replaces the "worst". 
>> 
>> I am not sure how your reference to "fail early" and "succeed early"
>> apply to this situation. 
>> 
>> At the moment, the array is left unsorted.  If I use a sorted array,
>> it needs to resorted every time a "worst" entry is replaced by a
>> "new" 
>> entry.  Can I avoid sorting the array every iteration?  How else to
>> find the "worst" entry?  If I replace "worst" with a "new" entry,
>> doesn't the array need to be resorted?  How does the cost of
>> searching an unsorted array compare to resorting on every update and
>> searching a sorted array. 
>> 
>> Tom's idea about growing to 200, then chopping back to 100 also
>> sounds interesting. 
>> 
>> Frank
> 
> Below is the algorithm I was talking about.
> 
        I appreciate the input from the list. I am looking at the replies and 
trying to get a handle on what is being suggested. My situation is similar to 
Frank Bax and have the same 'billions of combinations' of which I am looking 
for the best 1000 to pass on to the business partner. They come in random order 
and have a set of data assoicated with them which I will need to maintain, but 
was thinking of just tying the number  and a key to the data so I am not moving 
the key/data constantly(600 to 800 bytes for each key).


        Still trying to a grasp of what is being suggested. Also any time the 
keys are equal I do not want( or need to keep ) the data.

        Again I thank the list for the input and now just need to see what I 
will attempt to do.

Wags ;)

> Let's say you want to keep the k best items of a list of size n, where
> best means the highest or lowest of them.
> 
> If n <= k: sort.
> 
> If k < n <= k * log2( k ): sort, discard excess.
> 
> If n > k *log2( k ): use below.
> 
> If k = 1_000 and n = 1_000_000_000, then 999_999_000 items are not
> going to be in the best list. Therefore, you want your algorithm to
> discard them as quickly as possible, hopefully, with just one test.
> This is what I mean by fail early; if most of the items are going to
> fail, design the algorithm so that failure is detected as early as
> possible. 
> 
> Similarly, succeed early means to design the algorithm to detect
> success as early as possible.
> 
> The script below uses a linear search to insert an item into the best
> list. You could use a binary search or a heap instead. I would not
> use a binary tree or a balanced binary tree. These structures work
> best when you are doing more searches than insertions and in this
> case we are doing an insertion every time.
> 
> 
> #!/usr/bin/perl
> # --------------------------------------
> # best_k -- Find the best k items of a set.
> 
> # --------------------------------------
> # Pragmas
> use strict;
> use warnings;
> 
> # --------------------------------------
> # Modules
> use Data::Dumper;
> use Getopt::Long;
> use POSIX;
> 
> # --------------------------------------
> # Configuration Parameters
> 
> my $K = 10; # top ten
> 
> $Data::Dumper::Sortkeys = 1;
> $Data::Dumper::Indent   = 1;
> $Data::Dumper::Maxdepth = 0;
> 
> # --------------------------------------
> # Globals Variables
> 
> my $Lowest = 0;
> 
> # --------------------------------------
> # Subroutines
> 
> # --------------------------------------
> # help();
> #   Print help via pod2text.
> # --------------------------------------
> sub help {
>   system( "pod2text -t $0" );
>   exit 0;
> }
> 
> # --------------------------------------
> # usage();
> #   Print usage via pod2usage.
> # --------------------------------------
> sub usage {
>   system( "pod2usage $0" );
>   exit 1;
> }
> 
> # --------------------------------------
> # init();
> #   Do things only possible when running.
> # --------------------------------------
> sub init {
>   unless( GetOptions(
>       help => \&help,
>       'top|highest' => sub { $Lowest = 0; },
>       'worst|lowest' => sub { $Lowest = 1; },
>       'k=i' => \$K,
>   )){
>     usage();
>   }
> 
>   die "bad k $K\n" if $K <= 1;
> }
> 
> # --------------------------------------
> # Main
> {
>   my @items = ();
>   my %item = ();
> 
>   init;
> 
>   # Get the best k
>   while( <> ){
>     /^\D*(\d+)/;
>     my $nbr = $1;
>     %item = (
>       nbr => $nbr,
>       orig => $_,
>     );
> 
>     if( @items == 0 ){
>       push @items, { %item };
>       next;
>     }
> 
>     if( @items < $K
>     || ( $Lowest && $nbr < $items[-1]{nbr} )
>     || ( ! $Lowest && $nbr > $items[-1]{nbr} )
>     ){
>       my $i = $#items;
>       while( $i >= 0
>       && ( ( $Lowest && $nbr < $items[$i]{nbr} )
>         || ( ! $Lowest && $nbr > $items[$i]{nbr} ) )
>       ){
>         $i --;
>       }
>       if( $i < 0 ){
>         unshift @items, { %item };
>       }else{
>         splice( @items, $i+1, 0, { %item } );
>       }
>     }
>     $#items = $K - 1 if $#items >= $K;
> 
>   }
> 
>   # Display the results
>   for my $item ( @items ){
>     print $item->{orig};
>   }
> }
> 
> __END__
> 
> =head1 NAME
> 
> best_k -- Find the best k items of a set.
> 
> =head1 SYNOPSIS
> 
>   best_k [--top|highest|worst|lowest] [--k=<k>] [<file>] ...
>   best_k --help
> 
> =head2 Options
> 
> =over 4
> 
> =item --top|highest|worst|lowest
> 
> If top or highest, then the highest K items are chosen.
> If worst or lowest, then the lowest K items are chosen.
> Default it top.
> 
> =item --k=<k>
> 
> Set the number of items to choose.
> Default is 10.
> 
> =item --help
> 
> Print this message.
> 
> =back
> 
> =head1 REQUIRES
> 
> =head1 DESCRIPTION
> 
> Find the best k items of a set.
> 
> =head1 FILES
> 
> =head1 SEE ALSO
> 
> =head1 AUTHOR
> 
> Mr. Shawn H. Corey, B.Sc.
> 
> =head1 HISTORY
> 
>   $Log$
> 
> =cut
> 
> --
> __END__
> 
> Just my 0.00000002 million dollars worth,
>    --- Shawn
> 
> "For the things we have to learn before we can do them,
> we learn by doing them."
>   Aristotle
> 
> * Perl tutorials at http://perlmonks.org/?node=Tutorials
> * A searchable perldoc is at http://perldoc.perl.org/



*******************************************************
This message contains information that is confidential
and proprietary to FedEx Freight or its affiliates.
It is intended only for the recipient named and for
the express purpose(s) described therein.
Any other use is prohibited.
*******************************************************


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

RE: How to manage around 1000 entries which will constantly bechanging

Reply via email to