[
https://issues.apache.org/jira/browse/MADLIB-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-899:
-----------------------------------
Labels: starter (was: gsoc2016 starter)
> LDA (parsed) model table and output table disagree
> --------------------------------------------------
>
> Key: MADLIB-899
> URL: https://issues.apache.org/jira/browse/MADLIB-899
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Parallel Latent Dirichlet Allocation
> Reporter: Steve Ziegler
> Assignee: Rahul Iyer
> Priority: Minor
> Labels: starter
>
> {code:sql}
> select * from tester;
>
> docid | documents | words
> -------+----------------------------------+------------------------------------
> 2 | Sam ate ham for lunch | {sam,ate,ham,for,lunch}
> 1 | Monday morning. I ate breakfast! | {monday,morning.,i,ate,breakfast!}
> SELECT madlib.term_frequency('tester','docid','words','my_training',TRUE);
> term_frequency
> ----------------------------------------------------------------------------------------
> Term frequency output in table my_training, vocabulary in table my_training_vocabulary
> (1 row)
> select madlib.lda_train('my_training','my_model','my_outdata',9,5,10,1,0.1);
> lda_train
> ----------------------------------
> (my_model,"model table")
> (my_outdata,"output data table")
> (2 rows)
> madlib-pg93=# select (madlib.lda_parse_model(model, voc_size, topic_num)).* from my_model;
> model_matrix_part1 | model_matrix_part2 | total_topic_counts
> ---------------------------------------------------+---------------------------------------------------------------+--------------------
> {{2,0,0,0,0},{0,0,0,0,1},{0,0,1,0,0},{0,0,0,0,1}} | {{0,1,0,0,0},{0,0,1,0,0},{0,1,0,0,0},{0,0,0,0,1},{0,0,0,1,0}} | {2,2,2,1,3}
> (1 row)
> madlib-pg93=# select * from my_outdata;
> docid | wordcount | words | counts | topic_count | topic_assignment
> -------+-----------+-------------+-------------+-------------+------------------
> 1 | 5 | {0,1,4,6,7} | {1,1,1,1,1} | {2,1,0,0,2} | {0,4,1,4,0}
> 2 | 5 | {8,0,5,2,3} | {1,1,1,1,1} | {0,2,1,1,1} | {1,1,4,3,2}
> (2 rows)
> madlib-pg93=# select * from my_model;
> voc_size | topic_num | alpha | beta | model
> ----------+-----------+-------+------+------------------------------------------------------------------------------------
> 9 | 5 | 1 | 0.1 | {2,0,0,0,0,1,0,1,0,0,0,1,4294967296,0,0,0,1,0,4294967296,0,0,0,0,1,0,4294967296,0}
> (1 row)
> madlib-pg93=# select * from my_training_vocabulary;
> wordid | word
> --------+------------
> 0 | ate
> 1 | breakfast!
> 2 | for
> 3 | ham
> 4 | i
> 5 | lunch
> 6 | monday
> 7 | morning.
> 8 | sam
> (9 rows)
> {code}
> The total_topic_counts array from the parsed model ({2,2,2,1,3}) does not
> match the element-wise sum of the topic_count arrays in the output table
> ({2,1,0,0,2} + {0,2,1,1,1} = {2,3,1,1,3}).
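
The mismatch can be reproduced without a database. The following is a minimal Python sketch that only re-does the arithmetic on values copied from the psql output above; it is not MADlib code. The variable and function names are illustrative. It also assumes that topic_assignment lines up one-to-one with words, which holds here only because every entry in counts is 1.

```python
# Values copied verbatim from the psql transcript in the report.
total_topic_counts = [2, 2, 2, 1, 3]          # from lda_parse_model(...)
model_matrix = [                              # one row per wordid 0..8
    [2, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1],  # part1
    [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1],  # part2
    [0, 0, 0, 1, 0],
]
outdata = [                                   # rows of my_outdata
    {"docid": 1, "words": [0, 1, 4, 6, 7],
     "topic_count": [2, 1, 0, 0, 2], "topic_assignment": [0, 4, 1, 4, 0]},
    {"docid": 2, "words": [8, 0, 5, 2, 3],
     "topic_count": [0, 2, 1, 1, 1], "topic_assignment": [1, 1, 4, 3, 2]},
]
voc_size, topic_num = 9, 5


def column_sums(rows):
    """Element-wise sum of equally sized integer lists."""
    return [sum(col) for col in zip(*rows)]


# Per-topic totals according to the model and according to the output table.
model_sums = column_sums(model_matrix)
output_sums = column_sums([row["topic_count"] for row in outdata])

# Rebuild a word-by-topic matrix from the output table.
# Assumption: words[i] was assigned topic_assignment[i] (all counts are 1).
rebuilt = [[0] * topic_num for _ in range(voc_size)]
for row in outdata:
    for word_id, topic in zip(row["words"], row["topic_assignment"]):
        rebuilt[word_id][topic] += 1

# Word ids whose topic counts differ between the model and the output table.
mismatched_words = [w for w in range(voc_size) if rebuilt[w] != model_matrix[w]]

print("model column sums :", model_sums)
print("output table sums :", output_sums)
print("mismatched wordids:", mismatched_words)
```

On this data the model is consistent with itself (its column sums equal total_topic_counts), so the disagreement is between the model table and the output table, not inside the parsed model.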