krisp wrote:
> Please see the answers below.
>
> Gene wrote:
> > krisp wrote:
> > > I am looking for an efficient algorithm to search for the k-median
> > > across disjoint vectors.
> > >
> > > For example, I have 10 float arrays of different sizes (from 20k to
> > > 100k elements) from which I would like to get the 1000th element of
> > > the combined set (total set is 500k elements).
> > >
> > > Currently I am extracting 1000 elements from each array and doing
> > > merge-sort. Since this operation is performed at least 100k times (i.e.
> > > a set of 100k * 500k elements), the constant factor of shipping
> > > elements across machines is very high. I am wondering if
> > > there is a better approach to limit the amount of data passed around
> > > while keeping the efficiency.
> >
> > This looks like an interesting problem, but you don't provide enough
> > information for a useful answer.  You imply that the independent
> > vectors are sorted (otherwise how do you peel off the top k elements).
> > True?
>
> Not sorted, but partially ordered with a select k-median pass on each
> individual array, such that a[0], ..., a[k-1] <= a[k] < a[k+1], ..., a[n].
>
> > You imply the vectors are separated by relatively slow
> > communications.  True?  If so, how slow?  I.e. what is the throughput
> > for a large message?  What is the turn-around time for a tiny message?
>
> Currently I am passing 1000 floats on 100Mbps link from each vector.
> That amounts to 4 (bytes) * 1000 (elements) * 10 (vectors) * 100,000
> (number of sets) = 4 GBytes on a highly congested network.
>
> > The reason for these questions is that if it's cheaper to send O(log N)
> > separate short messages than to send M messages with k floats (M is the
> > number of vectors) as you are currently doing, then a speedup is
> > possible.  Otherwise it probably isn't.
>
> If I can reduce the order to log N, then that amounts to 10 times less
> data!

Ok.  But increasing the number of messages may defeat the speedup from
reducing volume.

Anyway, the idea is to query the P processors and use the answers to
guide a slightly tricky binary search.  Presort the top 1000 elements
on each processor to speed the queries.  The two queries are

Mi(N) - return the N'th largest number (on the local machine i) and
Ci(X) - return the count of numbers at least X.
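With each machine's top 1000 presorted, both queries are cheap: Mi(N) is an array lookup and Ci(X) a binary search.  A minimal local sketch (class name and 0-based indexing are my own; counting only the kept top-k is enough here because only the global top K matters):

```python
import bisect

class Machine:
    """Hypothetical per-machine server exposing the two queries."""
    def __init__(self, values, k=1000):
        # Presort once: keep the k largest local values, ascending for bisect.
        self.top = sorted(values)[-k:]

    def m(self, n):
        """Mi(N): the N'th largest kept element, 0-based (m(0) is the max)."""
        return self.top[len(self.top) - 1 - n]

    def c(self, x):
        """Ci(X): how many of the kept elements are >= X."""
        return len(self.top) - bisect.bisect_left(self.top, x)
```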

The algorithm locally tracks only the low and high indices of a search
bracket on each machine.  Initially each is [0..999].  The invariant is
that the median is always somewhere within these brackets.

1.  Collect the middle element Mi((lo[i] + hi[i]) / 2) of the search
bracket from each machine and then find the median M of these values.

2.  Collect values Ci(M) for all i and compute their sum S.

3.  If S = K, return M.  You're done.

4.  If S < K, all the elements >= M are > the K'th element.  Adjust
the brackets accordingly (using the Ci(M) values collected above).

5.  If S > K, all the elements < M are <= the K'th element.
Adjust the brackets accordingly.

6.  If the total size of brackets is now suitably small, collect all
their elements and finish counting centrally.

7.  Otherwise go to 1.
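Steps 1-7 can be simulated end-to-end in one process.  This is a sketch under the assumptions above (distinct elements, 0-based descending ranks); the function and variable names, the `finish_threshold` cutoff, and the break-on-no-progress guard against tie-induced stalls are all mine:

```python
import bisect
import statistics

def nth_largest(asc, n):
    """Mi(N): the N'th largest value of an ascending-sorted list (0-based)."""
    return asc[len(asc) - 1 - n]

def count_at_least(asc, x):
    """Ci(X): how many values of an ascending-sorted list are >= X."""
    return len(asc) - bisect.bisect_left(asc, x)

def kth_largest_distributed(tops, k, finish_threshold=64):
    """Find the k'th largest element (1-based) across machines.
    `tops` holds each machine's presorted (ascending) elements."""
    p = len(tops)
    lo = [0] * p                       # search brackets over descending ranks
    hi = [len(t) - 1 for t in tops]

    while sum(h - l + 1 for l, h in zip(lo, hi) if h >= l) > finish_threshold:
        # Step 1: middle element of each non-empty bracket, then their median M.
        mids = [nth_largest(tops[i], (lo[i] + hi[i]) // 2)
                for i in range(p) if lo[i] <= hi[i]]
        m = statistics.median_low(mids)
        # Step 2: S = total count of elements >= M.
        counts = [count_at_least(tops[i], m) for i in range(p)]
        s = sum(counts)
        if s == k:                     # Step 3: M is exactly the K'th largest.
            return m
        if s < k:                      # Step 4: everything >= M outranks the K'th.
            new_lo, new_hi = [max(lo[i], counts[i]) for i in range(p)], hi
        else:                          # Step 5: everything < M is below the K'th.
            new_lo, new_hi = lo, [min(hi[i], counts[i] - 1) for i in range(p)]
        if new_lo == lo and new_hi == hi:
            break                      # no progress (size-1 brackets pinning the
        lo, hi = new_lo, new_hi        # median); fall through to central finish

    # Step 6: brackets are small -- collect them and finish centrally.
    above = sum(lo)                    # elements already known to outrank the K'th
    pool = sorted((nth_largest(tops[i], r)
                   for i in range(p)
                   for r in range(lo[i], hi[i] + 1)), reverse=True)
    return pool[k - 1 - above]
```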

Duplicate elements create some special cases I have ignored.

Since steps 4 and 5 rule out roughly half of the bracketed elements on
about half of the machines in each iteration, the average number of
messages will be O(P log KP).

Along similar lines you could collect every, say, Y'th element from
each machine and use these to request Y elements from each processor
for central counting.  A value of Y ~= sqrt(K) minimizes data volume
with this approach, and you'd need only 2P queries.
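A quick check of that claim with the thread's numbers (P = 10 machines, K = 1000; function name is mine):

```python
# Two-round cost: each of P machines sends K/Y sample values, then Y
# elements for the central count -- P * (K/Y + Y) floats in total.
def floats_sent(y, p=10, k=1000):
    return p * (k / y + y)

best_y = min(range(1, 1001), key=floats_sent)
# best_y lands next to sqrt(1000) ~ 31.6, and the total volume drops from
# P * K = 10,000 floats per set to about 2 * P * sqrt(K) ~ 632.
```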

Happy to continue the discussion.


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---
