Hi Drew,

Below is a mail I sent to this list a while back. Is this consistent with your experience?

Cheers,
Mark

On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:

> I've started to experiment with LDA and am finding that it creates only
> a single long-running map task for each iteration, which doesn't scale
> well. The map is taking 20 minutes for 10k of my input SparseVectors,
> and 5 hours for 100k (the vocabulary size also grows when there are
> more vectors).
>
> Is this expected or am I doing something wrong? Are there any existing
> performance benchmarks?
>
> -----Original Message-----
> From: Drew Farris [mailto:[email protected]]
> Sent: 18 December 2009 13:59
> To: [email protected]
> Subject: Re: Cluster text docs
>
> Hi Shashi,
>
> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <[email protected]>
> wrote:
>
> > (.. cluster assignment is already there. Wonder why you had to redo
> > it.)
>
> Ahh, yes. I didn't have to redo it, but I wanted to learn the
> internal structure of the data files and to point out that it was easy
> enough to achieve. The code is quite straightforward.
>
> > Drew, are you using the latest code? Overnight sounds too long.
>
> That's good to know. This was a month or two ago, before the
> matrix/math stuff was rolled in. I'll collect exact times on the next
> run I do.
>
> Has anyone else run LDA outside of the canned Reuters example? I would
> be interested to hear about corpus characteristics and the processing
> power required to successfully produce LDA clusters. I've had all
> sorts of issues, but mostly hadoop configuration nits related to my
> environment, however.
>
> Drew
