nfantone wrote:
That does seem like a long time.

Is your data sparse or dense?

I would say sparse. My vectors are high dimensional and most of their
values are zero.
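
To illustrate why the sparse/dense question matters for running time, here is a minimal sketch of a distance computation that touches only the nonzero entries. The dict-based representation and the function name `sparse_squared_distance` are assumptions made for illustration; this is not Mahout's vector API.

```python
def sparse_squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors.

    Each vector is a dict mapping index -> nonzero value (an assumed
    representation for illustration only).
    """
    total = 0.0
    # Indices stored in a; b contributes 0.0 where it has no entry.
    for i, x in a.items():
        diff = x - b.get(i, 0.0)
        total += diff * diff
    # Indices stored only in b.
    for i, y in b.items():
        if i not in a:
            total += y * y
    return total


# Two high-dimensional vectors with only two nonzero entries each.
u = {3: 1.0, 10000: 2.0}
v = {3: 4.0, 500: 1.0}
d = sparse_squared_distance(u, v)
```

The cost depends on the number of stored entries rather than on the nominal dimensionality, which is why a dense representation of mostly-zero vectors can be much slower.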

Perhaps a larger convergence value might help (-d, I believe).

I'll try that.
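
As a rough sketch of how a convergence value affects the number of iterations, the plain-Python k-means loop below stops once no centroid moves farther than `delta`. The function names, the `delta` parameter, and the default iteration cap are assumptions for illustration, not Mahout's implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, delta, max_iter=32):
    """Run k-means until the largest centroid shift is <= delta.

    Returns (final_centroids, iterations_used). The cap of 32 is an
    illustrative default, not a claim about any library.
    """
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = []
        for old, members in zip(centroids, groups):
            if members:
                new_centroids.append(
                    [sum(m[k] for m in members) / len(members)
                     for k in range(len(old))])
            else:
                new_centroids.append(list(old))  # keep empty clusters in place
        shift = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= delta:
            return centroids, iteration
    return centroids, max_iter


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
start = [[0.0, 0.0], [10.0, 10.0]]
_, strict_iterations = kmeans(points, start, delta=0.001)
_, loose_iterations = kmeans(points, start, delta=1.0)
```

On this toy data the looser threshold stops one iteration earlier than the strict one; on real data the saving depends on how slowly the centroids settle.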

Is there any chance your data is publicly shareable?  Come to think of it,
with the vector representations, as long as you don't publish the key (which
terms map to which index), I would think almost all data is publicly
shareable.

I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

Are you on trunk of Mahout?  I think we still need more profiling to get a
better idea of where improvements can be made.

I am. Updated this morning.

I still suspect a configuration issue, and have never considered
Mahout's algorithm implementations to be the actual cause of the poor
performance. For now, I've been running kMeans exclusively. Perhaps I
should try different clustering methods and see if they take a
similar amount of time to complete.


That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running? Were they capped at 32 or did it run longer? How did you generate your initial clusters? Where are the iteration jobs spending most of their time (map vs. reduce)?

Could you share a copy of your data file so we can take a look at it? If it is just un-annotated vectors there should be no IP issues.

I've run KMeans over gigabytes of data on 10-node clusters and the jobs terminate in a few minutes. That is what I would expect from your job.

You could try Canopy on your data. This is a single-pass algorithm that should take approximately as long as one iteration of KMeans.
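
For readers unfamiliar with Canopy, a minimal single-pass sketch follows. It uses two distance thresholds with `t1 > t2`: a point within `t1` of a canopy center joins that canopy, and a point within `t2` of any center does not start a canopy of its own. The function names and the in-memory, list-based form are assumptions for illustration; this is not Mahout's code.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def canopy(points, t1, t2):
    """Single pass over points; returns a list of (center, members)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            dist = euclidean(p, center)
            if dist < t1:
                members.append(p)      # loosely covered by this canopy
            if dist < t2:
                strongly_bound = True  # close enough: no new canopy needed
        if not strongly_bound:
            canopies.append((p, [p]))  # p becomes a new canopy center
    return canopies


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
result = canopy(points, t1=8.0, t2=2.0)
centers = [center for center, _ in result]
```

Each point is visited once, so the cost is roughly that of one assignment pass, and the resulting centers could serve as initial clusters for KMeans.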

Jeff
