On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:
I've started to experiment with LDA and am finding that it creates only
a single long-running map task for each iteration, which doesn't scale
well. The map is taking 20 mins for 10k of my input SparseVectors, and 5
hours for 100k (the vocabulary size also grows when there are more
vectors).
Is this expected or am I doing something wrong? Are there any existing
performance benchmarks?
That's pretty new code, so I doubt there is much for benchmarks. If
you can share your vectors (the serialized ones, not the originals
with text) then we can profile and look into it a bit more.
Also, you may want to look at MAHOUT-165 in JIRA, as there are some
performance improvements for sparse vectors using primitives.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search