Well, it does matter to some degree since picking random vectors tends to give you dense vectors whereas text gives you very sparse vectors.
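To make the dense-versus-sparse point concrete, here is a minimal sketch in plain Python. It is not Mahout code; the toy corpus and the helper names `density` and `bag_of_words` are made up for illustration. It only compares what fraction of entries end up non-zero in each kind of vector.

```python
import random

def density(vec):
    """Fraction of entries in vec that are non-zero."""
    return sum(1 for x in vec if x != 0) / len(vec)

# Hypothetical toy corpus; the vocabulary is every distinct word in it.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "clustering sparse vectors is cheap",
    "dense vectors cost more per distance computation",
]
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Term-count vector for doc, one dimension per vocabulary word."""
    words = doc.split()
    return [words.count(term) for term in vocab]

# Random vectors: every dimension gets a draw from [0, 1),
# so in practice every entry is non-zero.
rng = random.Random(0)
random_vecs = [[rng.random() for _ in vocab] for _ in docs]

# Text vectors: only the words that occur in a document are non-zero.
text_vecs = [bag_of_words(doc) for doc in docs]

random_density = sum(density(v) for v in random_vecs) / len(random_vecs)
text_density = sum(density(v) for v in text_vecs) / len(text_vecs)

print("random vectors, mean density:", random_density)
print("text vectors, mean density:  ", text_density)
```

With a realistic vocabulary of tens of thousands of terms the gap is far larger than in this toy case, since a single document touches only a tiny fraction of the dimensions.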
Another issue is that raw text without a kill list gives you sparse vectors with common words always non-zero. Different patterns of sparsity can cause radically different time complexity for the clustering.

On Mon, Jul 27, 2009 at 11:05 AM, nfantone <[email protected]> wrote:

> > I'm not sure why testing with Random vectors would be all that useful
> > other than it shows it runs. I wouldn't expect anything useful to come
> > out of it, though.
>
> Well... my point was that it really doesn't matter how you create the
> Vectors: it's the size of the final file/s that's relevant. Then
> again, that IS the problem behind all: it runs - and that's about all
> it does, for now.

--
Ted Dunning, CTO
DeepDyve
