[ https://issues.apache.org/jira/browse/MADLIB-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-899:
-----------------------------------
    Labels: starter  (was: gsoc2016 starter)

> LDA (parsed) model table and output table disagree
> --------------------------------------------------
>
>                 Key: MADLIB-899
>                 URL: https://issues.apache.org/jira/browse/MADLIB-899
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Steve Ziegler
>            Assignee: Rahul Iyer
>            Priority: Minor
>              Labels: starter
>
> {code:sql}
> select * from tester;
>  docid |            documents             |               words
> -------+----------------------------------+------------------------------------
>      2 | Sam ate ham for lunch            | {sam,ate,ham,for,lunch}
>      1 | Monday morning. I ate breakfast! | {monday,morning.,i,ate,breakfast!}
> SELECT madlib.term_frequency('tester','docid','words','my_training',TRUE);
>                                      term_frequency
> ----------------------------------------------------------------------------------------
>  Term frequency output in table my_training, vocabulary in table my_training_vocabulary
> (1 row)
> select madlib.lda_train('my_training','my_model','my_outdata',9,5,10,1,0.1);
>             lda_train
> ----------------------------------
>  (my_model,"model table")
>  (my_outdata,"output data table")
> (2 rows)
> madlib-pg93=# select (madlib.lda_parse_model(model, voc_size, topic_num)).* from my_model;
>                 model_matrix_part1                 |                      model_matrix_part2                       | total_topic_counts
> ---------------------------------------------------+---------------------------------------------------------------+--------------------
>  {{2,0,0,0,0},{0,0,0,0,1},{0,0,1,0,0},{0,0,0,0,1}} | {{0,1,0,0,0},{0,0,1,0,0},{0,1,0,0,0},{0,0,0,0,1},{0,0,0,1,0}} | {2,2,2,1,3}
> (1 row)
> madlib-pg93=# select * from my_outdata;
>  docid | wordcount |    words    |   counts    | topic_count | topic_assignment
> -------+-----------+-------------+-------------+-------------+------------------
>      1 |         5 | {0,1,4,6,7} | {1,1,1,1,1} | {2,1,0,0,2} | {0,4,1,4,0}
>      2 |         5 | {8,0,5,2,3} | {1,1,1,1,1} | {0,2,1,1,1} | {1,1,4,3,2}
> (2 rows)
> madlib-pg93=# select * from my_model
> madlib-pg93-# ;
>  voc_size | topic_num | alpha | beta |                                       model
> ----------+-----------+-------+------+------------------------------------------------------------------------------------
>         9 |         5 |     1 |  0.1 | {2,0,0,0,0,1,0,1,0,0,0,1,4294967296,0,0,0,1,0,4294967296,0,0,0,0,1,0,4294967296,0}
> (1 row)
> madlib-pg93=# select * from my_training_vocabulary
> madlib-pg93-# ;
>  wordid |    word
> --------+------------
>       0 | ate
>       1 | breakfast!
>       2 | for
>       3 | ham
>       4 | i
>       5 | lunch
>       6 | monday
>       7 | morning.
>       8 | sam
> (9 rows)
> {code}
> The total_topic_counts array from the parsed model does not match the sum of
> the topic_count arrays in the output table.
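
The mismatch described above can be checked by hand. The following is a minimal Python sketch, not part of MADlib, that uses only the values copied from the psql output in this report. It assumes that each row of the parsed model matrix (part1 followed by part2) holds the per-topic counts for one wordid, which is consistent with voc_size = 9 and topic_num = 5 but is not stated in the report. The names column_sums, summed_from_output, summed_from_model and mismatched_topics are illustrative only.

```python
# Values copied from the psql output quoted in this issue.
doc_topic_counts = {
    1: [2, 1, 0, 0, 2],  # topic_count for docid 1 in my_outdata
    2: [0, 2, 1, 1, 1],  # topic_count for docid 2 in my_outdata
}
total_topic_counts = [2, 2, 2, 1, 3]  # from lda_parse_model

# model_matrix_part1 followed by model_matrix_part2.
# Assumption: one row per wordid, one column per topic.
model_matrix = [
    [2, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]


def column_sums(rows):
    """Sum equal-length integer lists element by element."""
    return [sum(col) for col in zip(*rows)]


summed_from_output = column_sums(doc_topic_counts.values())
summed_from_model = column_sums(model_matrix)

# Topic indexes (0-based) where the output table and the model disagree.
mismatched_topics = [
    topic
    for topic, (out_n, model_n) in enumerate(
        zip(summed_from_output, total_topic_counts)
    )
    if out_n != model_n
]

print("sum of topic_count (output table):", summed_from_output)
print("column sums of parsed model      :", summed_from_model)
print("total_topic_counts (model)       :", total_topic_counts)
print("topics that disagree             :", mismatched_topics)
```

Under these assumptions, the parsed model is internally consistent (its column sums equal total_topic_counts), while the output table differs from it for topics 1 and 2. Both sides still account for 10 tokens in total. The sketch only locates the discrepancy and does not indicate which table is correct.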



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
