On Jul 23, 2009, at 10:20 AM, nfantone wrote:
That does seem like a long time.
Is your data sparse or dense?
I would say sparse. My vectors are high dimensional and most of their
values are zero.
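
For readers following along, here is a minimal sketch of why sparsity matters for clustering cost. It assumes a vector is stored as a plain dict of index -> non-zero value, which is an illustration only and not Mahout's vector classes. The name sparse_sq_distance is hypothetical.

```python
def sparse_sq_distance(a, b):
    """Squared Euclidean distance between two sparse vectors.

    Each vector is a dict mapping index -> non-zero value; any index
    missing from a dict is treated as zero.
    """
    total = 0.0
    # Indices present in a (b may or may not have them).
    for i, av in a.items():
        diff = av - b.get(i, 0.0)
        total += diff * diff
    # Indices present only in b.
    for i, bv in b.items():
        if i not in a:
            total += bv * bv
    return total


# Two vectors in a nominally 1,000,000-dimensional space,
# each with only three non-zero entries.
x = {3: 1.0, 500: 2.0, 999999: 4.0}
y = {3: 1.0, 42: 3.0, 999999: 1.0}
distance = sparse_sq_distance(x, y)
```

The work done is proportional to the number of non-zero entries rather than to the nominal dimensionality, so a dense representation of the same data would likely be far slower.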
Perhaps a larger convergence value might help (-d, I believe).
I'll try that.
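
As a rough illustration of what a convergence threshold does, the sketch below runs a tiny 1-D k-means and stops once no centroid moves by more than delta. This shows the general idea only; it is not Mahout's code, and the function name kmeans_1d and its parameters are made up for the example.

```python
def kmeans_1d(points, centroids, delta, max_iter=100):
    """Run 1-D k-means until no centroid moves more than `delta`.

    Returns (final_centroids, iterations_used).
    """
    centroids = list(centroids)
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        updated = [sum(c) / len(c) if c else centroids[k]
                   for k, c in enumerate(clusters)]
        shift = max(abs(u - c) for u, c in zip(updated, centroids))
        centroids = updated
        if shift <= delta:
            break
    return centroids, iteration


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
_, strict_iters = kmeans_1d(points, [0.0, 1.0], delta=0.001)
_, loose_iters = kmeans_1d(points, [0.0, 1.0], delta=5.0)
```

On this toy data the looser threshold stops after fewer iterations than the strict one. The trade-off is that the centroids may not have fully settled when the run ends.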
Is there any chance your data is publicly shareable? Come to think of
it, with the vector representations, as long as you don't publish the
key (which terms map to which index), I would think most all data is
publicly shareable.
I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?
As in post a copy of the SequenceFile somewhere for download, assuming
you can. Then others could presumably try it out.
Are you on trunk of Mahout? I think we still need more profiling to get
a better idea of where improvements can be made.
I am. Updated this morning.
I still think this is a configuration issue, and have never considered
Mahout's algorithm implementations to be the actual cause of the poor
performance. For now, I've been running kMeans exclusively. Perhaps I
should try different clustering methods and see whether they take a
similar amount of time to complete.
Well, KMeans normally runs two algorithms: canopy and then KMeans.
You could try the Random seed approach, which would skip the initial
canopy run.
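
To make the difference between the two seeding strategies concrete, here is a simplified 1-D sketch. It is an assumption-laden illustration, not Mahout's implementation: the function names and the single t2 threshold are hypothetical, and a real canopy pass also uses a looser T1 threshold to build the canopies themselves.

```python
import random


def random_seed_centroids(points, k, seed=None):
    """Pick k distinct input points as initial centroids.

    No distances are computed, so this is a cheap way to seed k-means.
    """
    rng = random.Random(seed)
    return rng.sample(points, k)


def canopy_centers(points, t2):
    """Simplified canopy pass that returns only the canopy centers.

    Repeatedly take the first remaining point as a center, then drop
    every remaining point within distance t2 of it. The number of
    centers depends on t2 and the data; it is not chosen up front.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if abs(p - center) > t2]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
seeds_random = random_seed_centroids(points, k=2, seed=0)
seeds_canopy = canopy_centers(points, t2=4.0)
```

The canopy pass has to compare each candidate center against the remaining points, which is an extra job over the whole data set before k-means starts. Random seeding avoids that cost, though it may give k-means a worse starting position.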