Re: Clustering from DB

Jeff Eastman Thu, 23 Jul 2009 09:51:08 -0700

nfantone wrote:

That does seem like a long time.


Is your data sparse or dense?


I would say sparse. My vectors are high dimensional and most of their
values are zero.

Perhaps a larger convergence value might help (-d, I believe).


I'll try that.

Is there any chance your data is publicly shareable?  Come to think of it,
with the vector representations, as long as you don't publish the key (which
terms map to which index), I would think most all data is publicly
shareable.


I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

Are you on trunk of Mahout?  I think we still need more profiling to get a
better idea of where improvements can be made.


I am. Updated this morning.

I still insist on the configuration issue, and have never considered
Mahout's algorithms implementation to be the actual cause of poor
performance. For now, I've been running kMeans exclusively. Perhaps, I
should try with different clustering methods and see if it takes a
similar amount of time to complete.

That does seem like an awfully long time for 62 MB on a 6 node cluster.
How many iterations are running? Were they capped at 32 or did it run
longer? How did you generate your initial clusters? Where are the
iteration jobs spending most of their time (map vs. reduce)? Could you
share a copy of your data file so we can take a look at it? If it is
just un-annotated vectors there should be no IP issues.

I've run KMeans over gigabytes of data on 10-node clusters and the jobs
terminate in a few minutes. That is what I would expect from your job.

You could try Canopy on your data. This is a single-pass algorithm that
should take approximately as long as one iteration of KMeans.
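
To show why Canopy needs only one pass, here is a minimal sketch in plain
Python. It is not Mahout's code: the function names, the Euclidean distance
and the threshold values are assumptions chosen for illustration. Each point
is compared against the canopies found so far, joins every canopy whose
center is within the loose threshold t1, and starts a new canopy only if no
center is within the tight threshold t2.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_cluster(points, t1, t2, distance=euclidean):
    """Single-pass canopy clustering sketch (hypothetical helper).

    t1 is the loose threshold and t2 the tight one, with t1 > t2.
    Returns a list of (center, members) tuples.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                # Close enough to belong to this canopy.
                members.append(p)
            if d < t2:
                # Close enough that p should not seed its own canopy.
                strongly_bound = True
        if not strongly_bound:
            # p is far from every existing center, so it becomes a new one.
            canopies.append((p, [p]))
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
    for center, members in canopy_cluster(data, t1=3.0, t2=1.0):
        print(center, len(members))
```

Every point is visited once and compared only against the current canopy
centers, which is why the cost is roughly that of a single KMeans iteration.
The resulting centers can also be used as initial clusters for KMeans.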


Jeff
