Spark EM LDA Optimizer support

Barona, Ricardo Fri, 21 Jul 2017 11:11:38 -0700

Over the last 9 days, I've been modifying the Apache Spot LDA wrapper so that 
it can save models, load existing models, and then get topic distributions 
for the same corpus or for new documents (see 
https://issues.apache.org/jira/browse/SPOT-196). Until now, the Apache Spot ML 
module has run in batch mode, training and getting topic distributions on 
the same documents, but that needs to change soon because we are aiming for 
near real time.


Earlier this year, Apache Spot enabled the Online optimizer, so users can 
select whether to run LDA using EM or Online. EM was the first option we 
implemented, and then we decided it was a good idea to offer Online as well.

To keep supporting both the EM and Online optimizers, I modified the code so 
that you can train with either one but only get topic distributions with 
LocalLDAModel, because only LocalLDAModel supports getting topic 
distributions for new documents. The problem with that approach is that a 
very simple unit test we have is now failing. The reason is that when I 
convert DistributedLDAModel to LocalLDAModel, the document concentration 
parameter remains the same as originally provided for EM, and that value 
doesn't necessarily work for the LocalLDAModel.topicDistributions method.

Take a look at 
https://issues.apache.org/jira/secure/attachment/12878382/everythingOK.png. 
There you can see the expected result from training and getting topic 
distributions with EM only or Online only, on a data set of two documents 
with one word each.

Then, here is the problem I explained before about converting 
DistributedLDAModel to LocalLDAModel: 
https://issues.apache.org/jira/secure/attachment/12878381/notSoOk.png

A possible solution for this is to use the following code to implement a custom 
function to convert DistributedLDAModel to LocalLDAModel (see 
https://issues.apache.org/jira/secure/attachment/12878380/possibleSolution.png 
and the code below):

package org.apache.spark.mllib.clustering

import org.apache.spark.mllib.linalg.{Matrix, Vector}

object SpotLDA {
  /**
    * Creates a new LocalLDAModel but it can reset alpha and beta
    * (although we just need alpha).
    * @param topicsMatrix Distributed LDA Model topicsMatrix
    * @param alpha New value for alpha, i.e. if the model was trained with
    *              1.002 for alpha using the EM optimizer, this method
    *              allows you to reset alpha to something like 0.0009 and
    *              get topic distributions with the desired document
    *              concentration.
    * @param beta New value for beta
    * @return LocalLDAModel
    */
  def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
    new LocalLDAModel(topicsMatrix, alpha, beta)
  }
}
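
As a rough sketch of how the conversion above might be called (this is 
illustrative only, not existing Spot code): the wrapper object SpotLDAUsage, 
the method name and the onlineAlpha parameter are hypothetical, and reusing 
the EM-trained topic concentration as beta is an assumption. It relies on 
SpotLDA.toLocal from the snippet above being on the classpath.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, SpotLDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper, for illustration only.
object SpotLDAUsage {
  /** Gets topic distributions from an EM-trained model using a new alpha. */
  def topicDistributionsWithAlpha(
      emModel: DistributedLDAModel,
      documents: RDD[(Long, Vector)],
      onlineAlpha: Double): RDD[(Long, Vector)] = {
    // Symmetric document concentration: the same value for each of the k topics.
    val alpha = Vectors.dense(Array.fill(emModel.k)(onlineAlpha))
    // Assumption: keep the topic concentration (beta) the model was trained with.
    val local = SpotLDA.toLocal(emModel.topicsMatrix, alpha, emModel.topicConcentration)
    local.topicDistributions(documents)
  }
}
```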

The only disadvantage I see here is that users will need to provide 3 
parameters instead of only 2 if they are using the EM optimizer:

- EM alpha
- EM beta
- Online alpha

Or provide only 2 parameters if they prefer to work with the Online optimizer 
only:

- Online alpha
- Online beta

Discussing this with Gustavo, he suggested we even set a “default” number for 
Online alpha so if users only configure EM alpha and EM beta the application 
will keep working.
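
A minimal sketch of what that fallback could look like, assuming a simple 
parameter holder: LdaParams, its field names, inferenceAlpha and the default 
value are all hypothetical, and 0.0009 is only a placeholder taken from the 
example in the doc comment above, not a tuned value.

```scala
// Hypothetical parameter holder; names are illustrative, not Spot's actual configuration.
final case class LdaParams(
    optimizer: String,                  // "em" or "online"
    alpha: Double,                      // alpha used for training
    beta: Double,                       // beta used for training
    onlineAlpha: Option[Double] = None) // optional alpha for topic distributions under EM

object LdaParams {
  // Placeholder default; a real default would have to be chosen by testing.
  val DefaultOnlineAlpha: Double = 0.0009

  /** Picks the alpha to use when getting topic distributions with LocalLDAModel. */
  def inferenceAlpha(p: LdaParams): Double = p.optimizer.toLowerCase match {
    case "em"     => p.onlineAlpha.getOrElse(DefaultOnlineAlpha)
    case "online" => p.alpha
    case other    => throw new IllegalArgumentException(s"Unknown optimizer: $other")
  }
}
```

With this shape, users who only configure EM alpha and EM beta would still 
get a usable alpha for topic distributions, and users who want control can 
set the third value explicitly.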

That being said, here is the big question I’d like to ask: should we keep 
supporting both the EM optimizer and the Online optimizer and have users 
configure the required parameters, or do you think it is time to let EM go 
and just keep the Online optimizer?

My vote is to keep both, but let me know what you think.

Thanks,
Ricardo Barona
