[GitHub] incubator-hivemall pull request #71: [HIVEMALL-74] Implement pLSA

takuti Wed, 12 Apr 2017 19:33:58 -0700

GitHub user takuti opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/71


    [HIVEMALL-74] Implement pLSA

    ## What changes were proposed in this pull request?
    
    Implement the (incremental) probabilistic latent semantic analysis (pLSA) algorithm:
    
    - Original papers:
      - [Probabilistic Latent Semantic Indexing](http://dl.acm.org/citation.cfm?id=312649)
      - [Probabilistic Latent Semantic Analysis](http://www.iro.umontreal.ca/~nie/IFT6255/Hofmann-UAI99.pdf)
    - Incremental variant, which is the one implemented in this PR:
      - [Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation](https://pdfs.semanticscholar.org/b66e/c7faf2e4888503f7ad1537d284f350fb3e58.pdf)
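
For readers unfamiliar with pLSA, the following is a minimal sketch of the classic batch EM procedure from the original papers, written in plain Python. It is not the incremental variant implemented in this PR and does not reflect the Hivemall code; the function name `fit_plsa`, its parameters, and the dictionary-based document format are assumptions made for illustration only.

```python
import random


def fit_plsa(docs, n_topics=2, n_iter=100, seed=0):
    """Fit batch pLSA with EM on a list of {word: count} dictionaries.

    Hypothetical helper for illustration; returns (P(z|d), P(w|z)).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random initialisation; every distribution sums to one.
    p_z_d = [normalized([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in docs]
    p_w_z = [dict(zip(vocab, normalized([rng.random() + 1e-3 for _ in vocab])))
             for _ in range(n_topics)]

    for _ in range(n_iter):
        # Accumulators for the expected counts n(d, w) * P(z|d, w).
        new_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in docs]

        for d, doc in enumerate(docs):
            for w, n in doc.items():
                # E-step: P(z|d, w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue  # skip terms that underflowed to zero
                for z in range(n_topics):
                    expected = n * joint[z] / total
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected

        # M-step: renormalise the expected counts into probabilities.
        for z in range(n_topics):
            total = sum(new_w_z[z].values())
            if total > 0.0:
                p_w_z[z] = {w: c / total for w, c in new_w_z[z].items()}
        for d in range(len(docs)):
            if sum(new_z_d[d]) > 0.0:
                p_z_d[d] = normalized(new_z_d[d])

    return p_z_d, p_w_z


# Word counts for two short example documents (stopwords already removed).
docs = [
    {"fruits": 1, "vegetables": 1, "healthy": 1},
    {"like": 2, "apples": 1, "oranges": 1, "avocados": 1, "flu": 1, "colds": 1},
]
p_z_d, p_w_z = fit_plsa(docs)
```

Here `p_z_d[d]` holds the topic distribution of document `d`, and `p_w_z[z]` holds the word distribution of topic `z`. The incremental variant cited above updates these quantities as new documents arrive instead of re-running EM over the whole corpus.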
    
    ## What type of PR is it?
    
    Feature
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-74
    
    ## How was this patch tested?
    
    - unit tests
    - manual tests in a local environment
    
    ## How to use this feature?
    
    The interfaces mirror those of the Online LDA implementation in #66: a `train_plsa()` UDTF and a `plsa_predict()` UDAF.
    
    For a `docs` table:
    
    | docid | doc |
    |:---:|:---|
    | 1 | "Fruits and vegetables are healthy." |
    | 2 | "I like apples, oranges, and avocados. I do not like the flu or colds." |
    
    the following query successfully learns topics behind the two documents:
    
    ```sql
    with word_counts as (
      select
        docid,
        feature(word, count(word)) as f
      from docs t1 lateral view explode(tokenize(doc, true)) t2 as word
      where
        not is_stopword(word)
      group by
        docid, word
    ),
    plsa_model as (
      select
        train_plsa(feature, "-topic 2 -iter 10000 -eps 0.00001 -delta 0.00001 -alpha 0.00001") as (label, word, prob)
      from (
        select docid, collect_set(f) as feature
        from word_counts
        group by docid
      ) t
    ),
    test as (
      select
        docid,
        word,
        count(word) as value
      from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word
      where
        not is_stopword(word)
      group by
        docid, word
    ),
    topic as (
      select
        t.docid,
        plsa_predict(t.word, t.value, m.label, m.prob, "-topic 2 -delta 0.01 -alpha 0.00001") as probabilities
      from
        test t
        JOIN plsa_model m ON (t.word = m.word)
      group by
        t.docid
    )
    select docid, probabilities, probabilities[0].label, m.words -- topic each document should be assigned
    from topic t
    join (
      select label, collect_set(feature(word, prob)) as words
      from plsa_model
      group by label
    ) m on t.probabilities[0].label = m.label
    ;
    ```
    
    | docid | probabilities | label | words |
    |:---:|:---|:---:|:---|
    | 1 | [{"label":0,"probability":1.0},{"label":1,"probability":1.1246405E-32}] | 0 | ["fruits:0.33333066","healthy:0.33333066","vegetables:0.33333066","like:2.4555745E-6","avocados:2.0867487E-6","colds:1.665714E-6","flu:1.0358361E-6","apples:5.5809795E-7","oranges:2.2456922E-7"] |
    | 2 | [{"label":1,"probability":0.9999961},{"label":0,"probability":3.886718E-6}] | 1 | ["like:0.28571412","oranges:0.14285718","colds:0.14285718","avocados:0.14285718","apples:0.14285718","flu:0.14285718","healthy:1.841767E-32","vegetables:1.2376679E-32","fruits:7.812756E-34"] |
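
The `words` column in the result above is an array of `word:prob` strings, so the dominant words of a topic are easy to lose among entries with near-zero probability. As a rough sketch of client-side post-processing (the helper name `top_words` and its parameter `k` are hypothetical, not part of this PR), the strings can be parsed and ranked like this:

```python
def top_words(features, k=3):
    """Return the k highest-probability (word, prob) pairs.

    `features` is a list of "word:prob" strings, as in the `words` column.
    """
    pairs = []
    for feature in features:
        # Split on the last colon so the probability is always the suffix.
        word, _, prob = feature.rpartition(":")
        pairs.append((word, float(prob)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# The `words` array reported for docid 2 (topic label 1).
words_topic1 = [
    "like:0.28571412", "oranges:0.14285718", "colds:0.14285718",
    "avocados:0.14285718", "apples:0.14285718", "flu:0.14285718",
    "healthy:1.841767E-32", "vegetables:1.2376679E-32", "fruits:7.812756E-34",
]
print(top_words(words_topic1, k=1))
```

The same ranking could also be done inside the query; the Python form is used here only to keep the sketch independent of any particular Hive function.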


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall plsa

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/71.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #71
    
----
commit afac23fc002d9cbc7cd446a713cba9607564fe54
Author: Takuya Kitazawa <[email protected]>
Date:   2017-04-12T05:25:21Z

    Implement Incremental pLSA model

commit 0af3783ec89c56e09c410f39bc1a4b359cffc2d9
Author: Takuya Kitazawa <[email protected]>
Date:   2017-04-12T08:57:26Z

    Implement `train_plsa` UDTF and its test

commit d7ebee3d8d3c8e950f803fe07aefaee12ae2c806
Author: Takuya Kitazawa <[email protected]>
Date:   2017-04-12T09:26:05Z

    Off debug print

commit f247542f88397f28c93a53d8a0b5140f785b131b
Author: Takuya Kitazawa <[email protected]>
Date:   2017-04-13T02:12:22Z

    Implement `plsa_predict` UDAF and its test

commit 82597a2ca4ffefefb301b89aba57e24960fb0436
Author: Takuya Kitazawa <[email protected]>
Date:   2017-04-13T02:21:38Z

    Add pLSA UDFs to ddl

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
