[algogeeks] Re: programming pearls duplicate value

Gene Tue, 30 May 2006 07:00:46 -0700


Terry wrote:
> Gene wrote:
> > I'm sorry, but the response below is wrong on several levels.  See
> > notes.
> >
> > C Erler wrote:
> > > Because there are infinite integers, there is no need for a list of
> > > 4,300,000,000 of them to contain a duplicate.  If there are
> > > 4,299,999,999 choices, there will be one duplicate.  Because the
> > > solution uses binary search, the list is sorted.
> >
> > The problem is talking about 32-bit integers.  There are 4,294,967,296
> > of these, so 4,300,000,000 must contain a duplicate.
> >
> > Binary search for a _solution_ only means you repeatedly halve the
> > candidate set.  Data need to be sorted when the problem is to find a
> > particular element. Here you are solving a different problem entirely.
> > Indeed if the ints are sorted, the algorithm is trivial: make one pass
> > through them and look for a duplicate adjacent pair.
> >
> > >
> > > So, the full problem is probably something like the following.  There
> > > is a sorted list of the first 4,299,999,999 positive integers.  Someone
> > > duplicates one of them and inserts it into the list right next to the
> > > original.  Figure out which integer is duplicated.
> > >
> > > With that, it is easy to solve.  If you look at the first 5000
> > > integers, the last one should be 5000.  If the last one is 4999, you
> > > know that the duplicate is in that sublist.
> > >
> > > So, basically, you take the first half or so of the list and see if the
> > > last one is too low.  That will tell you which half of the list has the
> > > duplicate.  Repeat until your sublist has only the duplicated entry.
> >
> > As stated above, the data aren't sorted.
> >
> > The algorithm that Bentley is talking about works by repeatedly halving
> > the candidate range in which the duplicate element must lie.  Initially
> > this range is 0..2^32-1.   On each pass it throws away the half of the
> > range that contains fewer data values and it also throws away the data
> > lying in that half.  Eventually the range decreases to a single value,
> > which must be a duplicate because the remaining data has at least two
> > elements.  The problem that Bentley notes is that the data may still
> > have 4 billion elements at this stage!  The final improvement he
> > mentions is that you can throw away data _inside_ the candidate range
> > as long as you keep enough data around to ensure that at least one
> > duplicate occurs in it.  This is equal to the size of the current
> > candidate range plus 1!
>
> Hi Gene,
>  I don't clearly understand what you are saying .
>  What if we had a list of 8 elements like
>  1 3 2 4 2 5 6 7
>  How are we going to find the element.


Hi Terry.  Your example doesn't really apply to this problem.  If you
assume integers have 3 bits (rather than 32), then Bentley's algorithm
will work only if you have a list longer than 2^3=8 elements.  So let's
fix up the example by adding a zero at the end of your list.

So here is pseudocode for Bentley's proposed algorithm (without the
final improvement):

  lo = 0;  // bottom of range
  m = 8; // number of integers currently in range
  tape = [1,3,2,4,2,5,6,7,0]; // list to find dup in

  while (m > 1) {
    s = lo + m/2;  // base of hi half of range
    tape_lo = [];  // set work tapes to empty
    tape_hi = [];
    foreach i from tape
      if (i < s) add i to tape_lo else add i to tape_hi
    if (tape_lo has more integers than tape_hi
      use tape_lo for tape in next pass
    else {
      use tape_hi for tape in next pass
      lo = s;
    }
    m = m/2;
  }
  // answer is in lo

On your data, you get

lo=0 m=8 s=4 tape=(1 3 2 4 2 5 6 7 0)
lo=0 m=4 s=2 tape=(0 2 2 3 1)
lo=2 m=2 s=3 tape=(3 2 2)
answer=2

Now the need for improvement shows up if you use a list
 [7,7,7,0,7,7,7,7]
as input.

lo=0 m=8 s=4 tape=(7 7 7 0 7 7 7 7 7)
lo=4 m=4 s=6 tape=(7 7 7 7 7 7 7 7)
lo=6 m=2 s=7 tape=(7 7 7 7 7 7 7 7)
7

You can see that all the 7's get copied over and over, so the run time
per pass is not decreasing.  The improvement is to note that once you
have m+1 characters on tape_lo or tape_hi you can stop writing values
and the algorithm still works fine.

In case you want to try other examples, here is a perl code that
implements the algorithm for 3 bits.  It includes the improvement to
limit tape length.

sub main {

  my $lo = 0;
  my $m = 8;
  my $tape = [1,6,7,0,2,4,5,3,1];

  while ($m > 1) {
    my $s = $lo + $m/2;
    print "lo=$lo m=$m s=$s tape=(@$tape)\n";
    my $tape_lo = [];
    my $tape_hi = [];
    foreach my $i (@$tape) {
      if ($i < $s) {
        push @$tape_lo, $i unless scalar @$tape_lo > $m
      }
      else {
        push @$tape_hi, $i unless scalar @$tape_hi > $m
      }
    }
    if (scalar @$tape_lo > scalar @$tape_hi) {
      $tape = $tape_lo;
    }
    else {
      $tape = $tape_hi;
      $lo = $s;
    }
    $m /= 2;
  }
  print "$lo\n";
}

main;


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---

[algogeeks] Re: programming pearls duplicate value

Reply via email to