That's a very interesting development! However, let me take a step back: why do we even need EM? From a user perspective, what would be the advantage of running anomaly detection on 1-day batches rather than on a continuously online-learning model? I'm probably missing something, because I don't see the value of the latter use case.

Giacomo

On 21 July 2017 at 20:06, Lujan Moreno, Gustavo <[email protected]> wrote:

> I would suggest supporting both for now. In my experiments, Online takes more iterations to converge (although I haven’t measured time; Online is supposed to be faster). spark.mllib doesn’t allow scoring unseen records with EM, only training. The new spark.ml does allow training with EM and scoring unseen documents with EM, but Ricardo and I found that it is really using Online under the hood. I consider that to be a bug on the Spark side. Therefore, what Ricardo is suggesting is a workaround for this bug.
>
> On 7/21/17, 1:44 PM, "Barona, Ricardo" <[email protected]> wrote:
>
> > Once a saved model is loaded, it needs to be converted to LocalLDAModel if it’s a DistributedLDAModel. But from what I heard, what matters about the optimizer you used for training, EM or Online, is the topics matrix that each one generates. I’m not exactly an expert, but I’d think they are going to be different, right? The topics matrix of a LocalLDAModel coming from a DistributedLDAModel will remain the same, and topic distributions will be calculated based on that.
> >
> > On 7/21/17, 1:26 PM, "Edwards, Brandon" <[email protected]> wrote:
> >
> > > A question just came up for me. Is there a true use case for utilizing EM that allows one to carry context from previous models into the future? It seems that once you save to a local model in order to utilize it for future data, from then on you can only use the Online optimizer. If this is correct, I vote for getting rid of EM. I don’t see value in supporting a use case that does not carry context into future models.
> > >
> > > On 7/21/17, 11:08 AM, "Barona, Ricardo" <[email protected]> wrote:
> > >
> > > > During the last 9 days, I've been working on modifying the Apache Spot LDA wrapper to make it possible to save models, load existing models, and then get topic distributions for the same corpus or for new documents (see https://issues.apache.org/jira/browse/SPOT-196). Until now, the Apache Spot ML module has been running in batch mode, training and getting topic distributions with the same documents it trained on, but that needs to change soon as we are looking forward to achieving near real time.
> > > >
> > > > Since this year, Apache Spot has offered the Online optimizer, so users can select whether to run LDA using EM or Online; EM was the first option we implemented, and then we decided it was a good idea to offer Online as well.
> > > >
> > > > In my attempt to keep supporting both the EM and Online optimizers, I modified the code in such a way that you can train with either one but only get topic distributions with LocalLDAModel. The reason is that only LocalLDAModel supports getting topic distributions for new documents. The problem with that approach is that a very simple unit test we have is now failing, and it is because when I convert DistributedLDAModel to LocalLDAModel, the document concentration parameter remains the same as it was originally provided for EM, but it doesn't necessarily work for the LocalLDAModel.topicDistributions method.
> > > >
> > > > Take a look at https://issues.apache.org/jira/secure/attachment/12878382/everythingOK.png. There you can see the expected result from training and getting topic distributions with EM only or Online only on a data set of two documents with one word each.
> > > >
> > > > Then, here is the problem I explained before about converting DistributedLDAModel to LocalLDAModel: https://issues.apache.org/jira/secure/attachment/12878381/notSoOk.png
> > > >
> > > > A possible solution for this is to implement a custom function to convert DistributedLDAModel to LocalLDAModel (see https://issues.apache.org/jira/secure/attachment/12878380/possibleSolution.png and the code below):
> > > >
> > > > package org.apache.spark.mllib.clustering
> > > >
> > > > import org.apache.spark.mllib.linalg.{Matrix, Vector}
> > > >
> > > > object SpotLDA {
> > > >   /**
> > > >    * Creates a new LocalLDAModel, optionally resetting alpha and beta (although we just need alpha).
> > > >    *
> > > >    * @param topicsMatrix topicsMatrix of the DistributedLDAModel
> > > >    * @param alpha New value for alpha, i.e. if the model was trained with 1.002 for alpha using the EM
> > > >    *              optimizer, this method allows you to reset alpha to something like 0.0009 and get
> > > >    *              topic distributions with the desired document concentration.
> > > >    * @param beta New value for beta
> > > >    * @return LocalLDAModel
> > > >    */
> > > >   def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
> > > >     new LocalLDAModel(topicsMatrix, alpha, beta)
> > > >   }
> > > > }
> > > >
> > > > The only disadvantage I see here is that users will need to provide 3 parameters instead of only 2 if they are using the EM optimizer:
> > > >
> > > > - EM alpha
> > > > - EM beta
> > > > - Online alpha
> > > >
> > > > Or provide only 2 parameters if they prefer to work with the Online optimizer only:
> > > >
> > > > - Online alpha
> > > > - Online beta
> > > >
> > > > Discussing this with Gustavo, he suggested we even set a “default” number for Online alpha, so that if users only configure EM alpha and EM beta the application will keep working.
> > > >
> > > > That being said, here is the big question I’d like to ask: should we keep supporting both the EM optimizer and the Online optimizer and have users configure the required parameters, or do you think it is time to let EM go and just keep the Online optimizer?
> > > >
> > > > My vote is to keep both, but let me know what you think.
> > > >
> > > > Thanks,
> > > > Ricardo Barona
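
For reference, the parameter scheme discussed in the thread (EM alpha and EM beta plus an optional Online alpha with a fallback, or just Online alpha and Online beta) could be sketched roughly as below. This is only an illustration in plain Scala with no Spark dependency: the names LdaSettings, resolve and DefaultOnlineAlpha are hypothetical and not part of Apache Spot, and 0.0009 is simply the example value from the quoted Scaladoc, not an agreed default.

```scala
// Hypothetical sketch: none of these names exist in Apache Spot.
// trainAlpha/trainBeta are used for training; scoringAlpha is the document
// concentration that would be used when getting topic distributions.
final case class LdaSettings(
    optimizer: String,
    trainAlpha: Double,
    trainBeta: Double,
    scoringAlpha: Double)

object LdaSettings {
  // Placeholder fallback only; the thread does not settle on a value.
  val DefaultOnlineAlpha: Double = 0.0009

  def resolve(
      optimizer: String,
      alpha: Double,
      beta: Double,
      onlineAlpha: Option[Double] = None): LdaSettings =
    optimizer.toLowerCase match {
      // EM: train with the EM alpha/beta, score with a separate Online alpha,
      // falling back to the placeholder default when none is configured.
      case "em" =>
        LdaSettings("em", alpha, beta, onlineAlpha.getOrElse(DefaultOnlineAlpha))
      // Online: the same alpha is used for training and for scoring.
      case "online" =>
        LdaSettings("online", alpha, beta, alpha)
      case other =>
        throw new IllegalArgumentException(s"Unknown optimizer: $other")
    }
}
```

With this sketch, a user who configures only EM alpha and EM beta, e.g. LdaSettings.resolve("em", 1.002, 1.001), would still get a scoring alpha from the fallback, which matches the "keep working" behaviour Gustavo suggested.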
