[ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan closed MADLIB-1160.
-----------------------------------

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list:
> [http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E]
>
> Tasks
> 1) Term frequency
> [http://madlib.apache.org/docs/latest/group__grp__text__utilities.html]
> and LDA
> [http://madlib.apache.org/docs/latest/group__grp__lda.html]
> should both create indexes that start at 1, to make them consistent with
> other MADlib modules. One or both of these currently create indexes starting
> at 0.
> 2) In the output_data_table, *topic_assignment* is a dense vector but *words*
> is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to
> keep the model state as small as possible, so the sparse format is preferred
> to the dense format in this case, although svecs are a bit harder to work
> with. We have hit the Postgres 1 GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper
> functions are used in the examples but not described above.
> 4) The helper function `madlib.lda_get_topic_desc` should return the top k
> words (and ties). It seems to be returning the top k-1 words (and ties) now.
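
For illustration only, the "top k words (and ties)" behavior requested in task 4 might be sketched as below. This is not MADlib code: the function name `top_k_with_ties`, the dictionary input, and the sample words and probabilities are all hypothetical, and it assumes "ties" means every word whose probability equals that of the k-th ranked word.

```python
def top_k_with_ties(word_probs, k):
    """Return (word, prob) pairs in the top k, plus any words tied with the k-th.

    Hypothetical helper for illustration; not part of MADlib.
    """
    if k <= 0 or not word_probs:
        return []
    # Sort by descending probability, breaking ties alphabetically for stable output.
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    if k >= len(ranked):
        return ranked
    # The cutoff is the probability of the k-th ranked word (index k - 1).
    # Using index k - 2 here would reproduce the reported top k-1 behavior.
    cutoff = ranked[k - 1][1]
    return [(word, prob) for word, prob in ranked if prob >= cutoff]


# Made-up probabilities for one topic.
topic = {"apple": 0.4, "banana": 0.3, "cherry": 0.2, "date": 0.2, "elder": 0.1}
print(top_k_with_ties(topic, 3))
```

With k = 3, "cherry" and "date" share the third-highest probability, so the sketch returns four words rather than three.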