I think the bigger issue here is that we are doing extra work to calculate
distances. I'd suggest hanging on a few days to see if we can get that
straightened out.
On Jul 27, 2009, at 2:33 PM, nfantone wrote:
> Well, it does matter to some degree since picking random vectors
> tends to give you dense vectors whereas text gives you very sparse
> vectors. Different patterns of sparsity can cause radically
> different time complexity for the clustering.
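The point about sparsity and time complexity can be illustrated with a minimal sketch. This is not Mahout's actual implementation (Mahout has its own DenseVector and sparse vector classes); it is just a plain-Java illustration, with a sparse vector modeled as an index-to-value map, of why a dot product over a text-like vector scales with the number of stored non-zeros rather than the full dimensionality:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compares the cost model of dense vs. sparse
// dot products, the core operation inside a Euclidean distance.
public class SparsitySketch {

    // Dense dot product: touches every coordinate, so the cost is
    // O(dimension) even when most entries are zero.
    static double denseDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Sparse dot product: iterates only over the non-zeros stored in a,
    // so the cost is O(non-zeros of a) -- a large win for text vectors,
    // where almost all of a high-dimensional vector is zero.
    static double sparseDot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }
}
```

Randomly generated vectors tend to have a non-zero in nearly every coordinate, so they always pay the dense O(dimension) cost; that is one reason random data can behave very differently from real text data in timing experiments.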
I have yet to find a random combination of vectors that substantially
benefits the performance of kMeans. I have also tried real datasets
(like the one I was initially using, drawn from a large amount of data
describing consumers' buying habits) to no avail. How should a
collection of vectors be created so as not to significantly compromise
the algorithm's functionality?
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search