Nikhil created MADLIB-1201:
------------------------------

             Summary: Inconsistent lda output tables
                 Key: MADLIB-1201
                 URL: https://issues.apache.org/jira/browse/MADLIB-1201
             Project: Apache MADlib
          Issue Type: Bug
          Components: Module: Parallel Latent Dirichlet Allocation
            Reporter: Jingyi Mei
             Fix For: 1.14


We found an inconsistency in the LDA module between the outputs of lda_train 
and lda_get_word_topic_count. 

Repro Steps
{code}
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, ' b a a c'),
(1, ' d e f f f ');

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents), 
E'[\\s+\\.\\,]');

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', 
TRUE);


DROP TABLE IF EXISTS my_model, my_outdata;
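SELECT madlib.lda_train( 'my_training',  -- input term-frequency table
                         'my_model',     -- output model table
                         'my_outdata',   -- output per-document table
                         7,              -- vocabulary size
                         2,              -- number of topics
                         1,              -- number of iterations
                         5,              -- alpha
                         0.01            -- beta
                       );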

SELECT * FROM my_outdata ORDER BY docid;
```
 docid | wordcount |   words   |  counts   | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
```


DROP TABLE IF EXISTS my_word_topic_count;
SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
SELECT * FROM my_word_topic_count ORDER BY wordid;
```
 wordid | topic_count
--------+-------------
      0 | {1,2}
      1 | {0,2}
      2 | {1,0}
      3 | {0,1}
      4 | {1,0}
      5 | {0,1}
      6 | {0,3}
(7 rows)
```
{code}

The output of 'my_outdata' indicates that wordid 3 is assigned only to topic 
0 (it is the last word of docid 0, and the last entry of that row's 
topic_assignment is 0), whereas the output of 'my_word_topic_count' indicates 
that wordid 3 is assigned only to topic 1. The two outputs are inconsistent 
with each other. 
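
The comparison can be reproduced offline with a minimal Python sketch. It hard-codes the rows printed in the repro output and re-derives per-word topic counts from 'my_outdata'. It assumes that topic_assignment lists one topic per word occurrence, in the order given by expanding words by counts; that ordering is an assumption of this sketch, not something confirmed in this report. The names outdata, reported and derive_word_topic_counts are illustrative only.

```python
from collections import defaultdict

# Rows copied from the my_outdata output in the repro:
# docid -> (words, counts, topic_assignment)
outdata = {
    0: ([2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    1: ([4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
}

# Rows copied from the my_word_topic_count output: wordid -> topic_count
reported = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
    4: [1, 0], 5: [0, 1], 6: [0, 3],
}

NUM_TOPICS = 2  # topic_num passed to lda_train in the repro


def derive_word_topic_counts(rows, num_topics):
    """Aggregate per-word topic counts from per-document assignments."""
    derived = defaultdict(lambda: [0] * num_topics)
    for words, counts, assignment in rows.values():
        # ASSUMPTION: each wordid is repeated `count` times, in order,
        # and topic_assignment lines up with that expanded sequence.
        expanded = [w for w, c in zip(words, counts) for _ in range(c)]
        if len(expanded) != len(assignment):
            raise ValueError("wordcount and topic_assignment length differ")
        for wordid, topic in zip(expanded, assignment):
            derived[wordid][topic] += 1
    return dict(derived)


derived = derive_word_topic_counts(outdata, NUM_TOPICS)

# Collect every wordid whose derived counts differ from the reported ones.
mismatches = {
    wordid: (derived.get(wordid), reported[wordid])
    for wordid in sorted(reported)
    if derived.get(wordid) != reported[wordid]
}

for wordid, (from_outdata, from_model) in mismatches.items():
    print(f"wordid {wordid}: my_outdata -> {from_outdata}, "
          f"my_word_topic_count -> {from_model}")
```

Under that ordering assumption, the counts derived for wordid 3 are [1, 0] against the reported {0,1}, which matches the description above. The same check also flags wordids 0, 4 and 5, while the per-topic totals agree in both tables.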




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
