Nikhil created MADLIB-1201:
------------------------------
Summary: Inconsistent lda output tables
Key: MADLIB-1201
URL: https://issues.apache.org/jira/browse/MADLIB-1201
Project: Apache MADlib
Issue Type: Bug
Components: Module: Parallel Latent Dirichlet Allocation
Reporter: Jingyi Mei
Fix For: 1.14
We found an inconsistency in the LDA module between the outputs of lda_train
and lda_get_word_topic_count.
Repro Steps
{code}
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, ' b a a c'),
(1, ' d e f f f ');
ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents),
E'[\\s+\\.\\,]');
DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training',
TRUE);
DROP TABLE IF EXISTS my_model, my_outdata;
SELECT madlib.lda_train( 'my_training',
'my_model',
'my_outdata',
7,
2,
1,
5,
0.01
);
SELECT * FROM my_outdata ORDER BY docid;
```
docid | wordcount | words | counts | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
0 | 5 | {2,1,0,3} | {1,2,1,1} | {2,3} | {0,1,1,1,0}
1 | 7 | {4,5,0,6} | {1,1,2,3} | {1,6} | {1,0,1,1,1,1,1}
```
DROP TABLE IF EXISTS my_word_topic_count;
SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
SELECT * FROM my_word_topic_count ORDER BY wordid;
```
wordid | topic_count
--------+-------------
0 | {1,2}
1 | {0,2}
2 | {1,0}
3 | {0,1}
4 | {1,0}
5 | {0,1}
6 | {0,3}
(7 rows)
```
{code}
The output of my_outdata indicates that wordid 3 is assigned only to topic 0:
it appears only in docid 0, where its entry in topic_assignment is 0. The
output of my_word_topic_count, however, shows a topic_count of {0,1} for
wordid 3, meaning it is assigned only to topic 1. The two outputs are
inconsistent with each other.
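
The comparison above can be checked mechanically for every wordid, not just wordid 3. The following is a minimal sketch in Python that works only on the rows pasted in this report, with no database connection. It assumes that topic_assignment lists one topic per token in the same order as the words array, with each word occupying as many consecutive slots as its entry in counts, and that position i of a topic_count array refers to topic i. The helper name derive_word_topic_counts is hypothetical and is not part of MADlib.

```python
from collections import defaultdict

NUM_TOPICS = 2  # topic_num passed to lda_train in the repro steps

# Rows copied from the my_outdata output in this report:
# (docid, words, counts, topic_assignment)
OUTDATA = [
    (0, [2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    (1, [4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
]

# Rows copied from the my_word_topic_count output in this report.
REPORTED = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
    4: [1, 0], 5: [0, 1], 6: [0, 3],
}


def derive_word_topic_counts(rows, num_topics):
    """Rebuild per-word topic counts from my_outdata-style rows.

    ASSUMPTION: topic_assignment is ordered like `words`, and each word
    takes `counts[i]` consecutive entries of topic_assignment.
    """
    totals = defaultdict(lambda: [0] * num_topics)
    for _docid, words, counts, assignment in rows:
        pos = 0
        for wordid, count in zip(words, counts):
            for topic in assignment[pos:pos + count]:
                totals[wordid][topic] += 1
            pos += count
    return dict(totals)


derived = derive_word_topic_counts(OUTDATA, NUM_TOPICS)

# Collect every wordid whose derived counts differ from the reported ones.
mismatches = {
    wordid: (derived.get(wordid), REPORTED[wordid])
    for wordid in sorted(REPORTED)
    if derived.get(wordid) != REPORTED[wordid]
}

for wordid, (from_outdata, from_model) in mismatches.items():
    print(f"wordid {wordid}: my_outdata -> {from_outdata}, "
          f"my_word_topic_count -> {from_model}")
```

Under the layout assumption stated above, this check flags wordid 3 (derived {1,0} versus reported {0,1}) and also wordids 0, 4 and 5, while the per-topic totals across all words agree between the two tables. If the assumed layout of topic_assignment is wrong, these results do not apply.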