Re: Spark EM LDA Optimizer support

Giacomo Bernardi Thu, 27 Jul 2017 07:04:05 -0700

That's a very interesting development!

However, let me take a step back: why do we even need EM? From a user
perspective, what would be the advantage of running anomaly detection on
1-day batches rather than on a continuously online-learning model? I'm
probably missing something, because I don't see value in the former use
case.


Giacomo



On 21 July 2017 at 20:06, Lujan Moreno, Gustavo <
[email protected]> wrote:

> I would suggest supporting both for now. In my experiments, online is
> taking more iterations to converge (although I haven’t measured time,
> online is supposed to be faster). spark.mllib doesn’t allow scoring
> unseen records with EM, only training. The new spark.ml does allow
> training with EM and scoring unseen documents with EM, but Ricardo and I
> found that it is really using online under the hood. I consider that to be
> a bug on the Spark side. Therefore, what Ricardo is suggesting is a
> workaround for this bug.
>
>
>
>
> On 7/21/17, 1:44 PM, "Barona, Ricardo" <[email protected]> wrote:
>
> >Once a saved model is loaded, it needs to be converted to LocalLDAModel if
> it’s a DistributedLDAModel. From what I heard, though, the importance of
> which optimizer you used for training, EM or Online, lies in the topics
> matrix that each one generates. I’m not exactly an expert, but I’d think
> they are going to be different, right? The topics matrix of a LocalLDAModel
> coming from a DistributedLDAModel will remain the same, and topic
> distributions will be calculated based on that.
> >
> >On 7/21/17, 1:26 PM, "Edwards, Brandon" <[email protected]>
> wrote:
> >
> >    A question just came up for me. Is there a true use case for
> utilizing EM that allows one to carry context from previous models into the
> future? It seems that once you save to a local model in order to utilize it
> for future data, from then on you can only use the Online optimizer. If
> this is correct, I vote for getting rid of EM. I don’t see value in
> supporting a use case that does not carry context into future models.
> >
> >    On 7/21/17, 11:08 AM, "Barona, Ricardo" <[email protected]>
> wrote:
> >
> >        During the last 9 days, I've been working on modifying the Apache
> Spot LDA wrapper to make it possible to save models, load existing models,
> and then get topic distributions for the same corpus or for new documents
> (see https://issues.apache.org/jira/browse/SPOT-196). Until now, the Apache
> Spot ML module has been running in batch mode, training and getting topic
> distributions on the same documents it trained on, but that needs to change
> soon as we are looking forward to achieving near real time.
> >
> >        Since this year, Apache Spot has supported the Online optimizer, so
> users can select whether to run LDA using EM or Online; EM was the first
> option we implemented and then we decided it was a good idea to offer
> Online as well.
> >
> >        In my attempt to keep supporting both the EM and Online optimizers,
> I modified the code in such a way that you can train with either one but
> only get topic distributions with LocalLDAModel. The reason is that only
> LocalLDAModel supports getting topic distributions for new documents. The
> problem with that approach is that a very simple unit test we have is now
> failing, because when I convert DistributedLDAModel to LocalLDAModel, the
> document concentration parameter remains the same as it was originally
> provided for EM, but it doesn't necessarily work for the
> LocalLDAModel.topicDistributions method.
> >
> >        Take a look at
> https://issues.apache.org/jira/secure/attachment/12878382/everythingOK.png.
> There you can see the expected result from training and getting topic
> distributions with EM only or Online only on a data set of two documents
> with one word each.
> >
> >        Then, here is the problem I explained before about converting
> DistributedLDAModel to LocalLDAModel:
> https://issues.apache.org/jira/secure/attachment/12878381/notSoOk.png
> >
> >        A possible solution for this is to use the following code to
> implement a custom function to convert DistributedLDAModel to LocalLDAModel
> (see
> https://issues.apache.org/jira/secure/attachment/12878380/possibleSolution.png
> and the code below):
> >
> >        package org.apache.spark.mllib.clustering
> >
> >        import org.apache.spark.mllib.linalg.{Matrix, Vector}
> >
> >        object SpotLDA {
> >          /**
> >            * Creates a new LocalLDAModel, allowing alpha and beta to be reset
> >            * (although we only need to reset alpha).
> >            *
> >            * @param topicsMatrix topicsMatrix of the DistributedLDAModel
> >            * @param alpha New value for alpha, e.g. if the model was trained
> >            *              with 1.002 for alpha using the EM optimizer, this
> >            *              method allows you to reset alpha to something like
> >            *              0.0009 and get topic distributions with the desired
> >            *              document concentration.
> >            * @param beta New value for beta
> >            * @return LocalLDAModel
> >            */
> >          def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
> >            new LocalLDAModel(topicsMatrix, alpha, beta)
> >          }
> >        }
> >
> >        The only disadvantage I see here is that users will need to
> provide 3 parameters instead of only 2 if they are using the EM optimizer:
> >
> >        -          EM alpha
> >
> >        -          EM beta
> >
> >        -          Online alpha
> >
> >        Or provide only 2 parameters if they prefer to work with the Online
> optimizer only:
> >
> >        -          Online alpha
> >
> >        -          Online beta
> >
> >        Discussing this with Gustavo, he suggested we even set a
> “default” number for Online alpha so if users only configure EM alpha and
> EM beta the application will keep working.
> >
> >        With all that said, here is the big question I’d like to ask:
> should we keep supporting both the EM and Online optimizers and have users
> configure the required parameters, or do you think it is time to let EM go
> and just keep the Online optimizer?
> >
> >        My vote is to keep both, but let me know what you think.
> >
> >        Thanks,
> >        Ricardo Barona
>
