GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/71
[HIVEMALL-74] Implement pLSA
## What changes were proposed in this pull request?
Implement the (incremental) probabilistic latent semantic analysis (pLSA)
algorithm:
- Original papers:
  - [Probabilistic Latent Semantic Indexing](http://dl.acm.org/citation.cfm?id=312649)
  - [Probabilistic Latent Semantic Analysis](http://www.iro.umontreal.ca/~nie/IFT6255/Hofmann-UAI99.pdf)
- Incremental variant which is implemented in this PR:
  - [Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation](https://pdfs.semanticscholar.org/b66e/c7faf2e4888503f7ad1537d284f350fb3e58.pdf)
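
For reference, the standard (batch) pLSA model from the original papers can be summarized as below. This is a sketch of the textbook EM updates only; the incremental variant implemented in this PR may use different update rules, and the symbols `n(d, w)` (count of word `w` in document `d`) and `z` (latent topic) are notation chosen here for illustration.

```latex
% Aspect model: each word in a document is generated through a latent topic z
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)

% E-step: posterior of topic z for the pair (d, w)
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}

% M-step: re-estimate both distributions from the weighted counts
P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')}

P(z \mid d) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{w'} n(d, w')}
```

Under this reading, the `prob` column emitted by `train_plsa()` would play the role of `P(w | z)` and the output of `plsa_predict()` the role of `P(z | d)`; that mapping is an interpretation of the example output, not something stated by the PR.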
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-74
## How was this patch tested?
- unit tests
- manual tests in a local environment
## How to use this feature?
The interface is similar to that of Online LDA implemented in #66: this PR
provides the `train_plsa()` and `plsa_predict()` functions.
For a `docs` table:
| docid | doc |
|:---:|:---|
| 1 | "Fruits and vegetables are healthy." |
| 2 | "I like apples, oranges, and avocados. I do not like the flu or colds." |
the following query successfully learns topics behind the two documents:
```sql
with word_counts as (
  select
    docid,
    feature(word, count(word)) as f
  from docs t1 lateral view explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
),
plsa_model as (
  select
    train_plsa(feature, "-topic 2 -iter 10000 -eps 0.00001 -delta 0.00001 -alpha 0.00001") as (label, word, prob)
  from (
    select docid, collect_set(f) as feature
    from word_counts
    group by docid
  ) t
),
test as (
  select
    docid,
    word,
    count(word) as value
  from docs t1 lateral view explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
),
topic as (
  select
    t.docid,
    plsa_predict(t.word, t.value, m.label, m.prob, "-topic 2 -delta 0.01 -alpha 0.00001") as probabilities
  from
    test t
    join plsa_model m on (t.word = m.word)
  group by
    t.docid
)
-- topic each document should be assigned to
select docid, probabilities, probabilities[0].label, m.words
from topic t
join (
  select label, collect_set(feature(word, prob)) as words
  from plsa_model
  group by label
) m on t.probabilities[0].label = m.label
;
```
| docid | probabilities | label | words |
|:---:|:---|:---:|:---|
| 1 | [{"label":0,"probability":1.0},{"label":1,"probability":1.1246405E-32}] | 0 | ["fruits:0.33333066","healthy:0.33333066","vegetables:0.33333066","like:2.4555745E-6","avocados:2.0867487E-6","colds:1.665714E-6","flu:1.0358361E-6","apples:5.5809795E-7","oranges:2.2456922E-7"] |
| 2 | [{"label":1,"probability":0.9999961},{"label":0,"probability":3.886718E-6}] | 1 | ["like:0.28571412","oranges:0.14285718","colds:0.14285718","avocados:0.14285718","apples:0.14285718","flu:0.14285718","healthy:1.841767E-32","vegetables:1.2376679E-32","fruits:7.812756E-34"] |
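
Since every topic lists every word, it can be easier to look at only the most probable words per topic. A minimal sketch, assuming the output of `train_plsa()` has been saved to a table named `plsa_model` with the columns `(label, word, prob)` shown above; the table name and the cutoff of 3 are assumptions made for illustration, and only Hive's standard `row_number()` window function is used:

```sql
-- Assumption: plsa_model(label, word, prob) holds the output of train_plsa()
select
  label,
  word,
  prob
from (
  select
    label,
    word,
    prob,
    -- rank words within each topic by descending probability
    row_number() over (partition by label order by prob desc) as rnk
  from plsa_model
) t
where
  rnk <= 3 -- keep the top 3 words per topic (arbitrary cutoff)
;
```

For the example above, this would be expected to keep the three food-related words for one topic and the three highest-ranked words of the other; the order among words with equal probability is not guaranteed.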
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall plsa
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/71.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #71
----
commit afac23fc002d9cbc7cd446a713cba9607564fe54
Author: Takuya Kitazawa <[email protected]>
Date: 2017-04-12T05:25:21Z
Implement Incremental pLSA model
commit 0af3783ec89c56e09c410f39bc1a4b359cffc2d9
Author: Takuya Kitazawa <[email protected]>
Date: 2017-04-12T08:57:26Z
Implement `train_plsa` UDTF and its test
commit d7ebee3d8d3c8e950f803fe07aefaee12ae2c806
Author: Takuya Kitazawa <[email protected]>
Date: 2017-04-12T09:26:05Z
Off debug print
commit f247542f88397f28c93a53d8a0b5140f785b131b
Author: Takuya Kitazawa <[email protected]>
Date: 2017-04-13T02:12:22Z
Implement `plsa_predict` UDAF and its test
commit 82597a2ca4ffefefb301b89aba57e24960fb0436
Author: Takuya Kitazawa <[email protected]>
Date: 2017-04-13T02:21:38Z
Add pLSA UDFs to ddl
----