Hi John,

Sorry about the delay -- I was away for holidays and away... from everything. Hope you understand.

> What's the best way to merge/combine HitsCluster[]
> (from different search servers)? I am looking at the problem
> of distributed clustering (in Nutch).

I don't know, actually. I never thought _clustering_ would be performed in a distributed fashion with a merge phase at the end... Now that I think of it, it doesn't even make much sense. Think of it this way: clustering should bring related snippets together into groups. If you have a distributed search cluster with servers A, B and C, and thus snippet sets S(A), S(B) and S(C) and the clusters that correspond to them -- C[S(A)], C[S(B)], C[S(C)] -- then each cluster set was computed from its own server's snippets only, so you can't say anything about the relations between C[S(A)], C[S(B)] and C[S(C)], and hence it seems unreasonable to "merge" them. This is especially true if you create a final snippet set that is different from the original sets --

set S(X) = merge( S(A), S(B), S(C) ).

My suggestion would be to decouple the clustering process from the search servers and perform it once, on the merged snippet set S(X). If you really want a merge phase, with distributed clustering on the end-servers, then I'd suggest the following algorithm:

Pick the n topmost clusters from each of C[S(A)], C[S(B)] and C[S(C)], and order them as follows:

1st cluster from C[S(A)]
1st cluster from C[S(B)]
1st cluster from C[S(C)]
2nd cluster from C[S(A)]
2nd cluster from C[S(B)]
2nd cluster from C[S(C)]
...
nth cluster from C[S(A)]
nth cluster from C[S(B)]
nth cluster from C[S(C)]
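
The ordering above is a plain round-robin interleave. A minimal sketch in Python, assuming each server returns its clusters already ranked best-first; the function name interleave_top_clusters and the string placeholders are made up for illustration and are not part of Nutch:

```python
from itertools import zip_longest


def interleave_top_clusters(per_server, n):
    """Round-robin merge of ranked cluster lists.

    per_server: one list of clusters per search server, best cluster first.
    n: how many top clusters to take from each server.
    Returns 1st of A, 1st of B, 1st of C, 2nd of A, ... up to the n-th.
    """
    missing = object()  # marks servers that returned fewer than n clusters
    tops = [clusters[:n] for clusters in per_server]
    merged = []
    for same_rank in zip_longest(*tops, fillvalue=missing):
        merged.extend(c for c in same_rank if c is not missing)
    return merged


# Placeholder cluster names; real code would pass cluster objects.
ranked = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]
merged = interleave_top_clusters(ranked, n=2)
```

A server that returns fewer than n clusters is simply skipped at the ranks it cannot fill.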

A "pruning" phase can then remove from the above list those clusters whose snippet sets overlap (let's say above some threshold) or whose labels are similar enough.
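
One possible shape for such a pruning pass, as a sketch only. Everything here is an assumption made for illustration: the Cluster class is a hypothetical stand-in (not Nutch's HitsCluster), snippets are identified by some shared ID such as a URL, overlap is measured with the Jaccard coefficient, "similar enough labels" is reduced to a case-insensitive exact match, and the earlier cluster in the list wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical stand-in for a cluster: a label plus snippet IDs."""
    label: str
    snippet_ids: frozenset


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prune(clusters, overlap_threshold=0.5):
    """Keep a cluster only if no earlier kept cluster duplicates it.

    A cluster counts as a duplicate when its normalized label equals a kept
    cluster's label, or when their snippet sets overlap at or above the
    threshold. The 0.5 default is an arbitrary illustrative value.
    """
    kept = []
    for candidate in clusters:
        is_duplicate = any(
            candidate.label.strip().lower() == k.label.strip().lower()
            or jaccard(candidate.snippet_ids, k.snippet_ids) >= overlap_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


interleaved = [
    Cluster("Java", frozenset({1, 2, 3})),
    Cluster("java ", frozenset({9})),        # same label after normalization
    Cluster("Coffee", frozenset({2, 3, 4})),  # Jaccard 2/4 with "Java"
    Cluster("Island", frozenset({7, 8})),
]
pruned = prune(interleaved)
```

Snippet overlap can only trigger if the servers index some of the same documents; with disjoint indexes only the label test does any work.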

Is this reasonable?

D.


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
