Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/76
I first [implemented news20 LDA code with
scikit-learn](https://github.com/takuti-sandbox/tmp/blob/0b60c0352a006783dbebc57f6b2115c7c78e9d22/python/lda/news20.py),
and the result was:
```
=== Topic 0 ===
game
team
year
games
play
=== Topic 1 ===
windows
thanks
dos
window
does
=== Topic 2 ===
car
good
like
bike
cars
=== Topic 3 ===
don
just
know
like
think
=== Topic 4 ===
university
new
information
1993
mail
=== Topic 5 ===
does
true
think
wrong
point
=== Topic 6 ===
00
10
15
25
12
=== Topic 7 ===
com
edu
list
cs
send
=== Topic 8 ===
god
people
believe
life
does
=== Topic 9 ===
mr
stephanopoulos
health
medical
high
=== Topic 10 ===
people
armenian
israel
said
armenians
=== Topic 11 ===
key
chip
scsi
encryption
drive
=== Topic 12 ===
db
cx
period
w7
17
=== Topic 13 ===
gun
control
guns
crime
law
=== Topic 14 ===
card
drive
new
video
apple
=== Topic 15 ===
people
don
make
government
think
=== Topic 16 ===
ax
max
g9v
b8f
pl
=== Topic 17 ===
file
files
program
available
use
=== Topic 18 ===
space
nasa
launch
earth
data
=== Topic 19 ===
jesus
christian
come
church
christ
```
Looks fine.
In experiments on EMR, our LDA UDFs produced very similar results with the
following queries:
```sql
select
label, word, avg(lambda) as lambda
from (
select train_lda(features, "-topics 20 -iter 5 -tau0 40 -kappa 0.8 -s 128 -num_docs 10906") as (label, word, lambda)
from news20_raw_multiclass
) t
group by
label, word
;
```
```sql
select
label,
word,
lambda
from (
select
label,
word,
lambda,
rank() over ( partition by label order by lambda desc) as rank
from
lda_model
) t
where rank <= 5
;
```
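For readers less familiar with window functions, the top-5 selection in the second query can be sketched in plain Python. This is only an illustration: `top_k_by_rank` and the toy rows are hypothetical names and data, not part of Hivemall. It mimics `rank()` semantics, where tied values share a rank, so a topic can return more than `k` rows.

```python
from collections import defaultdict


def top_k_by_rank(rows, k=5):
    """Keep rows whose rank within their label is <= k.

    rows: iterable of (label, word, score) tuples.
    Mimics SQL rank(): tied scores share a rank, and the next
    distinct score gets its 1-based position as its rank.
    """
    by_label = defaultdict(list)
    for label, word, score in rows:
        by_label[label].append((word, score))

    result = []
    for label in sorted(by_label):
        # Order each partition by score, highest first (stable for ties).
        entries = sorted(by_label[label], key=lambda e: e[1], reverse=True)
        rank = 0
        prev_score = None
        for position, (word, score) in enumerate(entries, start=1):
            if score != prev_score:
                rank = position
                prev_score = score
            if rank > k:
                break
            result.append((label, word, score, rank))
    return result


# Hypothetical toy rows, not actual model output.
rows = [
    (0, "game", 0.9), (0, "team", 0.7), (0, "year", 0.7), (0, "play", 0.1),
    (1, "windows", 0.8), (1, "dos", 0.5),
]
top2 = top_k_by_rank(rows, k=2)
for row in top2:
    print(row)
```

With `k=2`, the tied words `team` and `year` both get rank 2, so label 0 contributes three rows rather than two.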
However, pLSA was not as good as I expected. This issue is partially related
to [the above hotfix
commit](https://github.com/apache/incubator-hivemall/pull/76/commits/43404eab16416774bfa830db11027b37c0a010ea),
or it might be a limitation of the current incremental pLSA algorithm. This
point needs more discussion.
Here are the training query and the top-5 topic words I obtained:
```sql
create table plsa_model as
select
label, word, avg(prob) as prob
from (
select train_plsa(features, "-topics 20 -iter 5 -s 128 -alpha 0.1 -eps 0.001") as (label, word, prob)
from news20_raw_multiclass
) t
group by
label, word
;
```
```
label word prob
0 believe 0.20311906933784485
0 faith 0.0937182679772377
0 following 0.0937182679772377
0 means 0.0679437667131424
0 accept 0.06247884780168533
1 does 0.2778182625770569
1 believe 0.1281130313873291
1 means 0.061893973499536514
1 following 0.04547790810465813
1 faith 0.04547790810465813
2 believe 0.20221209526062012
2 following 0.09071450680494308
2 faith 0.09071450680494308
2 means 0.07135600596666336
2 accept 0.06047634407877922
3 believe 0.20150509476661682
3 following 0.09483356028795242
3 faith 0.09483356028795242
3 means 0.06596288830041885
3 accept 0.06322237104177475
4 believe 0.2015036791563034
4 following 0.09475517272949219
4 faith 0.09475517272949219
4 means 0.06645548343658447
4 accept 0.06317011266946793
5 god 0.29527589678764343
5 believe 0.13040651381015778
5 faith 0.062352463603019714
5 following 0.062352463603019714
5 means 0.04641261696815491
6 make 0.19650737941265106
6 believe 0.13145385682582855
6 require 0.06410617381334305
6 faith 0.05998770147562027
6 following 0.05998770147562027
7 believe 0.2038946896791458
7 faith 0.09458644688129425
7 following 0.09458644688129425
7 means 0.06742037832736969
7 accept 0.06305762380361557
8 require 0.3033565282821655
8 god 0.1163133978843689
8 believe 0.08740611374378204
8 faith 0.05364157631993294
8 following 0.05364157631993294
9 believe 0.1609620600938797
9 faith 0.08727779984474182
9 following 0.08727779984474182
9 make 0.07537844032049179
9 accept 0.05818519368767738
10 say 0.17923133075237274
10 god 0.09497514367103577
10 believe 0.08155602961778641
10 example 0.07857266813516617
10 require 0.05061650276184082
11 does 0.19571512937545776
11 possibly 0.17480693757534027
11 post 0.17480693757534027
11 make 0.1343548595905304
11 require 0.0641941949725151
12 believe 0.2017214149236679
12 following 0.09417476505041122
12 faith 0.09417476505041122
12 means 0.06638321280479431
12 accept 0.06278317421674728
13 believe 0.20257475972175598
13 faith 0.0937185287475586
13 following 0.0937185287475586
13 means 0.06732570379972458
13 accept 0.062479015439748764
14 believe 0.16900449991226196
14 means 0.08356806635856628
14 following 0.05885913223028183
14 faith 0.05885913223028183
14 accept 0.039239417761564255
15 believe 0.1978302150964737
15 following 0.09184517711400986
15 faith 0.09184517711400986
15 means 0.06536071747541428
15 accept 0.06123011186718941
16 question 0.20881573855876923
16 believe 0.14541400969028473
16 following 0.06769093871116638
16 faith 0.06769093871116638
16 example 0.05178104341030121
17 example 0.3560795783996582
17 following 0.05472135171294212
17 faith 0.05472135171294212
17 believe 0.047892987728118896
17 make 0.03850702941417694
18 god 0.45429855585098267
18 believe 0.07022545486688614
18 possibly 0.0596764013171196
18 post 0.0596764013171196
18 require 0.03801883012056351
19 believe 0.20550598204135895
19 following 0.09249072521924973
19 faith 0.09249072521924973
19 means 0.0718531385064125
19 accept 0.061660487204790115
```
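One way to put a number on "not as good" is to measure how much the topics' top-word sets overlap. The following is a minimal sketch, assuming mean pairwise Jaccard similarity is an acceptable proxy for topic redundancy; the helper names are hypothetical, and only four topics from each output in this comment are copied in for brevity (the LDA words come from the scikit-learn run).

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(topics):
    """Average Jaccard similarity over all pairs of topics' top-word sets."""
    pairs = list(combinations(sorted(topics), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(topics[i], topics[j]) for i, j in pairs) / len(pairs)


# Top-5 words for a few topics, copied from the pLSA output above.
plsa_top = {
    0: ["believe", "faith", "following", "means", "accept"],
    2: ["believe", "following", "faith", "means", "accept"],
    5: ["god", "believe", "faith", "following", "means"],
    17: ["example", "following", "faith", "believe", "make"],
}
# Same topic ids, copied from the scikit-learn LDA output above.
lda_top = {
    0: ["game", "team", "year", "games", "play"],
    2: ["car", "good", "like", "bike", "cars"],
    5: ["does", "true", "think", "wrong", "point"],
    17: ["file", "files", "program", "available", "use"],
}

plsa_overlap = mean_pairwise_jaccard(plsa_top)
lda_overlap = mean_pairwise_jaccard(lda_top)
print(f"pLSA: {plsa_overlap:.2f}, LDA: {lda_overlap:.2f}")
```

On this subset the LDA topics share no top words, while the pLSA topics overlap heavily, which matches the impression that most pLSA topics collapsed onto the same few words.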