RE: Cluster text docs

Levy, Mark Fri, 18 Dec 2009 07:04:14 -0800

Hi Drew,

Below is a mail I sent to this list a while back.  Is this consistent with your 
experience?


Cheers,

Mark


On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:

> I've started to experiment with LDA and am finding that it creates only
> a single long-running map task for each iteration, which doesn't scale
> well.  The map is taking 20mins for 10k of my input SparseVectors, and
> 5 hours for 100k (the vocabulary size also grows when there are more
> vectors).
>
> Is this expected or am I doing something wrong?  Are there any existing
> performance benchmarks?
>


> -----Original Message-----
> From: Drew Farris [mailto:[email protected]]
> Sent: 18 December 2009 13:59
> To: [email protected]
> Subject: Re: Cluster text docs
> 
> Hi Shashi,
> 
> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <[email protected]>
> wrote:
> 
> > (.. cluster assignment is already there. Wonder why you had to redo
> > it.)
> 
> Ahh, yes. I didn't have to re-do it, but I did want to learn the
> internal structure of the data files and to point out that it was easy
> enough to achieve. The code is quite straightforward.
> 
> > Drew, are you using the latest code? Overnight sounds too long.
> 
> That's good to know. This was a month or two ago, before the
> matrix/math stuff was rolled in. I'll collect exact times on the next
> run I do.
> 
> Has anyone else run LDA outside of the canned Reuters example? I would
> be interested to hear about corpus characteristics and processing
> power required to successfully produce LDA clusters. I've had all
> sorts of issues, though mostly Hadoop configuration nits specific to
> my environment.
> 
> Drew
