Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/66
  
    @myui I considered designing a prediction UDAF, but IMO your suggestion above, 
`sum(t.value * m.score) as score`, is better for now. 
    
    In order to compute the topic distribution from the `lambda` values (i.e., 
the LDA model), I would actually like to run the E-step for a test sample, as 
[scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/decomposition/online_lda.py#L546-L577)
 and the current `getTopicDistribution()` 
(80a31539bf653e50471346777842ff9478ae352d) do. However, that requires the 
prediction UDAF to know the hyper-parameters used for training (e.g., the 
number of topics and alpha), which is essentially infeasible. In addition, 
users sometimes want the posterior probabilities and their labels for all 
topics, as follows, so a single-column output from a UDAF is not sufficient.
    
    | docid | label | prob |
    |:---:|:---:|:---|
    | 1 | 0 | 0.9957867647115234 |
    | 1 | 1 | 0.004213235288476648 |
    | 2 | 0 | 0.0014898943734896843 |
    | 2 | 1 | 0.9985101056265103 |
    
    See [HERE](https://gist.github.com/takuti/d24324e76d4b2ec7dc4b1d50a4d192d8) 
for details.
    
    Of course, since we do not run the "expectation" step as the theory 
suggests, `prob` is an approximated value in some sense. But I guess it's 
sufficient in practice.
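    One way the approximate `prob` column could be obtained (a sketch with made-up numbers, not the actual implementation): normalize the per-`(docid, label)` scores within each document so they sum to 1, instead of running the full variational E-step.

```python
# Minimal sketch (hypothetical data): turn per-(docid, label) scores
# into an approximate `prob` column by normalizing within each docid,
# rather than running the variational E-step.
from collections import defaultdict

# per-(docid, label) scores, e.g. from a sum(value * score) aggregation
scores = {(1, 0): 0.66, (1, 1): 0.14, (2, 0): 0.25, (2, 1): 0.65}

def to_probabilities(scores):
    # total score per document
    totals = defaultdict(float)
    for (docid, _), s in scores.items():
        totals[docid] += s
    # divide each score by its document's total so probs sum to 1
    return {key: s / totals[key[0]] for key, s in scores.items()}

for (docid, label), p in sorted(to_probabilities(scores).items()):
    print(docid, label, p)
```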

