Re: Clustering from DB

Grant Ingersoll Thu, 23 Jul 2009 09:51:12 -0700


On Jul 23, 2009, at 10:20 AM, nfantone wrote:

That does seem like a long time.

Is your data sparse or dense?


I would say sparse. My vectors are high dimensional and most of their
values are zero.

Perhaps a larger convergence value might help (-d, I believe).
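To illustrate why a larger convergence value can cut run time, here is a minimal sketch of a plain k-means loop that stops once no centroid moves more than a threshold. This is not Mahout's code: the function name `kmeans` and the parameters `delta` and `max_iter` are made up for illustration, with `delta` standing in for the convergence value mentioned above.

```python
import math


def kmeans(points, centroids, delta, max_iter=100):
    """Plain k-means that stops once no centroid moves more than `delta`.

    Hypothetical helper for illustration only; not Mahout's implementation.
    Returns the final centroids and the number of iterations performed.
    """
    centroids = [list(c) for c in centroids]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    [sum(axis) / len(members) for axis in zip(*members)])
            else:
                new_centroids.append(old)  # empty cluster: leave it in place

        # Largest distance any centroid moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= delta:
            break  # converged under the chosen threshold
    return centroids, iteration
```

With a looser `delta` the loop exits earlier, at the cost of centroids that may not have fully settled. In a MapReduce setting each iteration is a full job, so saving even a few iterations can matter.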


I'll try that.

Is there any chance your data is publicly shareable? Come to think of
it, with the vector representations, as long as you don't publish the
key (which terms map to which index), I would think most all data is
publicly shareable.


I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

As in post a copy of the SequenceFile somewhere for download, assuming
you can. Then others could presumably try it out.

Are you on trunk of Mahout? I think we still need more profiling to get a
better idea of where improvements can be made.


I am. Updated this morning.

I still insist on the configuration issue, and have never considered
Mahout's algorithms implementation to be the actual cause of poor
performance. For now, I've been running kMeans exclusively. Perhaps, I
should try with different clustering methods and see if it takes a
similar amount of time to complete.

Well, KMeans actually runs two algorithms normally: canopy and then
KMeans. You could try the Random seed approach, which would skip the
canopy run first.
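For readers unfamiliar with the two seeding options, the following is a rough sketch of the difference, not Mahout's actual code. The function names `random_seeds` and `canopy_centers` and the thresholds `t1` and `t2` are assumptions for illustration. Canopy makes a pass over the data, computing distances to build the initial centers, while random seeding simply samples k points.

```python
import math
import random


def random_seeds(points, k, rng=None):
    """Pick k distinct input points as initial centroids (no distance work)."""
    rng = rng or random.Random()
    return rng.sample(points, k)


def canopy_centers(points, t1, t2):
    """Single-pass canopy sketch with loose threshold t1 > tight threshold t2.

    Hypothetical helper for illustration; a real implementation may differ.
    Returns one center (mean of the canopy members) per canopy.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # loosely within this canopy
            if d >= t2:
                still_remaining.append(p)  # may still start or join others
        remaining = still_remaining
        centers.append([sum(axis) / len(members) for axis in zip(*members)])
    return centers
```

The canopy pass decides the number of clusters from `t1` and `t2`, whereas random seeding needs k up front. Skipping the canopy pass removes one full scan of the data, though poorly placed random seeds may cost extra KMeans iterations.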
