GitHub user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/71
I've realized that the main difference between the following two papers lies in
**how to initialize P(w|z) for newly observed words**.
- [Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation](https://pdfs.semanticscholar.org/b66e/c7faf2e4888503f7ad1537d284f350fb3e58.pdf)
- [Using Incremental PLSI for Threshold-Resilient Online Event Analysis](https://pdfs.semanticscholar.org/a258/b33e285da2e93b59e50311d50ff46045a38b.pdf)
The former (i.e., the current implementation) simply initializes them with random
values. The previous P(w|z) can additionally be incorporated by setting a
hyper-parameter `alpha` if desired (`alpha=0` is also possible).
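To make the idea concrete, here is a minimal, hypothetical sketch of that initialization scheme; the class/method names (`PwzInitSketch`, `initWordTopicProb`) and the `prevPwz` map are illustrative and are not the actual PR code:

```java
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not the actual PR code: initialize P(w|z) for a newly
// observed word over `numTopics` topics with random values, optionally mixing
// in the previous estimate weighted by `alpha`. Per-topic normalization over
// the whole vocabulary (so that sum_w P(w|z) = 1) is assumed to happen later
// in the M-step.
final class PwzInitSketch {
    static float[] initWordTopicProb(String word, int numTopics, float alpha,
                                     Map<String, float[]> prevPwz, Random rnd) {
        final float[] pwz = new float[numTopics];
        final float[] prev = prevPwz.get(word); // null if the word is unseen so far
        for (int z = 0; z < numTopics; z++) {
            float v = rnd.nextFloat(); // random initialization
            if (prev != null) {
                v += alpha * prev[z];  // alpha = 0 drops the previous P(w|z)
            }
            pwz[z] = v;
        }
        return pwz;
    }
}
```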
Meanwhile, the latter requires a certain fold-in procedure to compute a
"better" P(w|z) based on a window size. IMO, this approach is too complex for
our goal (i.e., implementing a pLSA UDTF that repeats EM iterations over the
same set of mini-batches).
Thus, I will finalize this PR with the current implementation.
Todo:
- [ ] Double-check if the algorithm is implemented correctly
- [ ] Documentation
  - Differences from LDA
  - Explain the effect of `alpha`