Hi guys,

I was running LDA with 2000 topics on 6 GB of compressed data, roughly 1.2
million docs. I used 3 AWS r3.8xlarge machines as core nodes. It turned out
the Spark application crashed after 3 or 4 iterations. Ganglia indicated
that the disk space was all consumed; I believe it's the shuffle files
filling the disks.
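In case it helps: shuffle files land under `spark.local.dir`, which often defaults to a small root volume, so one thing to check is pointing it at the large ephemeral SSDs on the r3.8xlarge nodes. A minimal sketch, assuming example mount points (substitute your actual volumes; note that on some cluster managers `spark.local.dir` must be set in the cluster config rather than in code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: put shuffle spill space on the big ephemeral SSDs
// (/mnt/spark and /mnt1/spark are example mounts, not real paths)
// and compress shuffle output to reduce disk pressure.
val conf = new SparkConf()
  .setAppName("lda-2000-topics")
  .set("spark.local.dir", "/mnt/spark,/mnt1/spark") // assumption: SSD mounts
  .set("spark.shuffle.compress", "true")
val sc = new SparkContext(conf)
```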
> first contact with ML).
>
> Ok, I am trying to write a DSL where you can run some commands.
>
> I wrote a command that trains the Spark LDA, and it produces the topics I
> want, and I saved the model using the save method provided by LDAModel.
>
> Now I want to load this LDAModel and u
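For what it's worth, loading is a static method on the concrete model class, not on `LDAModel` itself. A minimal sketch, assuming the model was trained with the default EM optimizer (the path is hypothetical; use `LocalLDAModel.load` instead if it was trained with the online optimizer):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.DistributedLDAModel

// Sketch: reload a previously saved LDA model and inspect its topics.
def reload(sc: SparkContext): Unit = {
  val model = DistributedLDAModel.load(sc, "/path/to/saved/model") // hypothetical path
  // describeTopics returns (term indices, term weights) per topic
  model.describeTopics(maxTermsPerTopic = 10).take(2).foreach(println)
}
```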
Hi all - I'm running the Spark LDA algorithm on a dataset of roughly 3
million terms, with a resulting RDD of approximately 20 GB, on a 5-node
cluster with 10 executors (3 cores each) and 14 GB of memory per executor.
As the application runs, I'm seeing progressively longer execution times
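Progressively longer iterations with the EM optimizer are often RDD lineage growth; setting a checkpoint directory and a checkpoint interval usually keeps iteration times flat. A sketch under that assumption (directory, topic count, and interval are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: periodically checkpoint so each iteration's lineage stays short.
def train(sc: SparkContext, corpus: RDD[(Long, Vector)]): LDAModel = {
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints") // hypothetical path
  new LDA()
    .setK(100)                       // example topic count
    .setOptimizer(new EMLDAOptimizer)
    .setCheckpointInterval(10)       // checkpoint every 10 iterations
    .run(corpus)
}
```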