[GitHub] incubator-hivemall issue #76: [HIVEMALL-74-2][HIVEMALL-91-2] Revise topic mo...

takuti Tue, 02 May 2017 02:15:08 -0700

Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/76
  
    I first [implemented the news20 LDA code with scikit-learn](https://github.com/takuti-sandbox/tmp/blob/0b60c0352a006783dbebc57f6b2115c7c78e9d22/python/lda/news20.py), and the result was:
    
    ```
    === Topic  0 ===
    game
    team
    year
    games
    play
    === Topic  1 ===
    windows
    thanks
    dos
    window
    does
    === Topic  2 ===
    car
    good
    like
    bike
    cars
    === Topic  3 ===
    don
    just
    know
    like
    think
    === Topic  4 ===
    university
    new
    information
    1993
    mail
    === Topic  5 ===
    does
    true
    think
    wrong
    point
    === Topic  6 ===
    00
    10
    15
    25
    12
    === Topic  7 ===
    com
    edu
    list
    cs
    send
    === Topic  8 ===
    god
    people
    believe
    life
    does
    === Topic  9 ===
    mr
    stephanopoulos
    health
    medical
    high
    === Topic 10 ===
    people
    armenian
    israel
    said
    armenians
    === Topic 11 ===
    key
    chip
    scsi
    encryption
    drive
    === Topic 12 ===
    db
    cx
    period
    w7
    17
    === Topic 13 ===
    gun
    control
    guns
    crime
    law
    === Topic 14 ===
    card
    drive
    new
    video
    apple
    === Topic 15 ===
    people
    don
    make
    government
    think
    === Topic 16 ===
    ax
    max
    g9v
    b8f
    pl
    === Topic 17 ===
    file
    files
    program
    available
    use
    === Topic 18 ===
    space
    nasa
    launch
    earth
    data
    === Topic 19 ===
    jesus
    christian
    come
    church
    christ
    ```
    
    Looks fine.
    
    In experiments on EMR, our LDA UDFs produce very similar results with the following queries:
    
    ```sql
    create table lda_model as
    select
      label, word, avg(lambda) as lambda
    from (
      select train_lda(features, "-topics 20 -iter 5 -tau0 40 -kappa 0.8 -s 128 -num_docs 10906") as (label, word, lambda)
      from news20_raw_multiclass
    ) t
    group by
      label, word
    ;
    ```
    
    ```sql
    select
      label,
      word,
      lambda
    from (
      select
        label,
        word,
        lambda,
        rank() over ( partition by label order by lambda desc) as rank
      from
        lda_model
    ) t
      where rank <= 5
    ;
    ```
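
For readers who want to check the top-5 selection outside Hive, the following is a minimal Python sketch of what the `rank() <= 5` filter does, under the assumption that the model has been exported as `(label, word, lambda)` tuples. The function name `top_words_per_topic` and the toy scores are hypothetical and are not part of Hivemall. Like SQL `rank()`, tied scores share a rank, so a topic can return more than `k` words when there is a tie at the cut-off.

```python
from collections import defaultdict


def top_words_per_topic(rows, k=5):
    """Keep, for each topic label, the words whose rank by score is <= k.

    rows: iterable of (label, word, score) tuples.
    Ties share a rank (as with SQL rank()), so more than k words
    may be returned for a topic.
    """
    by_label = defaultdict(list)
    for label, word, score in rows:
        by_label[label].append((word, score))

    result = {}
    for label, pairs in by_label.items():
        # Highest score first; Python's sort is stable, so ties keep input order.
        pairs.sort(key=lambda p: p[1], reverse=True)
        kept = []
        prev_score, prev_rank = None, 0
        for position, (word, score) in enumerate(pairs, start=1):
            # rank() is the position of the first row with this score.
            rank = prev_rank if score == prev_score else position
            if rank > k:
                break
            kept.append((word, score))
            prev_score, prev_rank = score, rank
        result[label] = kept
    return result


# Toy rows with invented scores, only to illustrate the tie behaviour.
rows = [
    (0, "game", 0.9), (0, "team", 0.7), (0, "year", 0.7), (0, "play", 0.1),
    (1, "windows", 0.8), (1, "dos", 0.3),
]
top2 = top_words_per_topic(rows, k=2)
print(top2)
```

If exactly `k` words per topic are required, `row_number()` in the query (or a plain slice `pairs[:k]` in this sketch) would be the alternative.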
    
    However, pLSA did not perform as well as I expected. This is partially related to [the above hotfix commit](https://github.com/apache/incubator-hivemall/pull/76/commits/43404eab16416774bfa830db11027b37c0a010ea), or it might be a limitation of the current incremental pLSA algorithm. This point needs further discussion.
    
    Here are the training query and the top-5 topic words I obtained:
    
    ```sql
    create table plsa_model as
    select
      label, word, avg(prob) as prob
    from (
      select train_plsa(features, "-topics 20 -iter 5 -s 128 -alpha 0.1 -eps 0.001") as (label, word, prob)
      from news20_raw_multiclass
    ) t
    group by
      label, word
    ;
    ```
    
    ```
    label   word    prob
    0       believe 0.20311906933784485
    0       faith   0.0937182679772377
    0       following       0.0937182679772377
    0       means   0.0679437667131424
    0       accept  0.06247884780168533
    1       does    0.2778182625770569
    1       believe 0.1281130313873291
    1       means   0.061893973499536514
    1       following       0.04547790810465813
    1       faith   0.04547790810465813
    2       believe 0.20221209526062012
    2       following       0.09071450680494308
    2       faith   0.09071450680494308
    2       means   0.07135600596666336
    2       accept  0.06047634407877922
    3       believe 0.20150509476661682
    3       following       0.09483356028795242
    3       faith   0.09483356028795242
    3       means   0.06596288830041885
    3       accept  0.06322237104177475
    4       believe 0.2015036791563034
    4       following       0.09475517272949219
    4       faith   0.09475517272949219
    4       means   0.06645548343658447
    4       accept  0.06317011266946793
    5       god     0.29527589678764343
    5       believe 0.13040651381015778
    5       faith   0.062352463603019714
    5       following       0.062352463603019714
    5       means   0.04641261696815491
    6       make    0.19650737941265106
    6       believe 0.13145385682582855
    6       require 0.06410617381334305
    6       faith   0.05998770147562027
    6       following       0.05998770147562027
    7       believe 0.2038946896791458
    7       faith   0.09458644688129425
    7       following       0.09458644688129425
    7       means   0.06742037832736969
    7       accept  0.06305762380361557
    8       require 0.3033565282821655
    8       god     0.1163133978843689
    8       believe 0.08740611374378204
    8       faith   0.05364157631993294
    8       following       0.05364157631993294
    9       believe 0.1609620600938797
    9       faith   0.08727779984474182
    9       following       0.08727779984474182
    9       make    0.07537844032049179
    9       accept  0.05818519368767738
    10      say     0.17923133075237274
    10      god     0.09497514367103577
    10      believe 0.08155602961778641
    10      example 0.07857266813516617
    10      require 0.05061650276184082
    11      does    0.19571512937545776
    11      possibly        0.17480693757534027
    11      post    0.17480693757534027
    11      make    0.1343548595905304
    11      require 0.0641941949725151
    12      believe 0.2017214149236679
    12      following       0.09417476505041122
    12      faith   0.09417476505041122
    12      means   0.06638321280479431
    12      accept  0.06278317421674728
    13      believe 0.20257475972175598
    13      faith   0.0937185287475586
    13      following       0.0937185287475586
    13      means   0.06732570379972458
    13      accept  0.062479015439748764
    14      believe 0.16900449991226196
    14      means   0.08356806635856628
    14      following       0.05885913223028183
    14      faith   0.05885913223028183
    14      accept  0.039239417761564255
    15      believe 0.1978302150964737
    15      following       0.09184517711400986
    15      faith   0.09184517711400986
    15      means   0.06536071747541428
    15      accept  0.06123011186718941
    16      question        0.20881573855876923
    16      believe 0.14541400969028473
    16      following       0.06769093871116638
    16      faith   0.06769093871116638
    16      example 0.05178104341030121
    17      example 0.3560795783996582
    17      following       0.05472135171294212
    17      faith   0.05472135171294212
    17      believe 0.047892987728118896
    17      make    0.03850702941417694
    18      god     0.45429855585098267
    18      believe 0.07022545486688614
    18      possibly        0.0596764013171196
    18      post    0.0596764013171196
    18      require 0.03801883012056351
    19      believe 0.20550598204135895
    19      following       0.09249072521924973
    19      faith   0.09249072521924973
    19      means   0.0718531385064125
    19      accept  0.061660487204790115
    ```
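
One rough way to put a number on how similar these pLSA topics are is to count the distinct words across all top-5 lists and the largest group of topics that share exactly the same word set. The sketch below does this in plain Python; the word lists are transcribed from the table above, while the helper name `summarize_overlap` is hypothetical and only meant as an illustration.

```python
from collections import Counter

# Top-5 words per topic, transcribed from the pLSA table above.
PLSA_TOP5 = {
    0: ("believe", "faith", "following", "means", "accept"),
    1: ("does", "believe", "means", "following", "faith"),
    2: ("believe", "following", "faith", "means", "accept"),
    3: ("believe", "following", "faith", "means", "accept"),
    4: ("believe", "following", "faith", "means", "accept"),
    5: ("god", "believe", "faith", "following", "means"),
    6: ("make", "believe", "require", "faith", "following"),
    7: ("believe", "faith", "following", "means", "accept"),
    8: ("require", "god", "believe", "faith", "following"),
    9: ("believe", "faith", "following", "make", "accept"),
    10: ("say", "god", "believe", "example", "require"),
    11: ("does", "possibly", "post", "make", "require"),
    12: ("believe", "following", "faith", "means", "accept"),
    13: ("believe", "faith", "following", "means", "accept"),
    14: ("believe", "means", "following", "faith", "accept"),
    15: ("believe", "following", "faith", "means", "accept"),
    16: ("question", "believe", "following", "faith", "example"),
    17: ("example", "following", "faith", "believe", "make"),
    18: ("god", "believe", "possibly", "post", "require"),
    19: ("believe", "following", "faith", "means", "accept"),
}


def summarize_overlap(top_words):
    """Return (number of distinct words, size of the largest group of
    topics whose top-word sets are identical, ignoring word order)."""
    vocabulary = {w for words in top_words.values() for w in words}
    set_counts = Counter(frozenset(words) for words in top_words.values())
    return len(vocabulary), max(set_counts.values())


distinct_words, largest_group = summarize_overlap(PLSA_TOP5)
print(distinct_words, largest_group)
```

On the table above this reports 14 distinct words over 100 slots, and 10 of the 20 topics share one identical top-5 set, which is consistent with the impression that the pLSA topics have collapsed onto nearly the same words.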


