[ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347538#comment-16347538 ]
Jingyi Mei commented on MADLIB-1160:
------------------------------------

[~fmcquillan] Yes. At this line of the code
[https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/lda/lda.py_in#L615]
it ranks by prob and then keeps rank < top_k. Since rank starts at 1, the filter should be rank <= top_k to get the top k records.

A further question: should we use dense rank instead of rank to get the top k words here? Which kind of rank makes more sense? The difference is explained here:
http://www.sql-tutorial.ru/en/book_rank_dense_rank_functions.html

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> Tasks
> 1) Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, to make them consistent with
> other MADlib modules. One or both of these currently create indexes starting
> at 0.
> 2) In the output_data_table *topic_assignment* is a dense vector but
> *words* is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to
> keep the model state as small as possible, so the sparse format is preferred
> to the dense format in this case, although svecs are a bit harder to work
> with. We have hit the Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper
> functions are used in the examples but not described above.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
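
To make the rank question in the comment concrete, here is a minimal Python sketch that mimics SQL RANK() and DENSE_RANK() over a descending sort. It is not the MADlib code: the word list, the probabilities, and the helper names sql_rank, sql_dense_rank, and top_words are hypothetical, chosen only so that a tie at the top makes the three filters give different answers.

```python
def sql_rank(values):
    """Mimic RANK() OVER (ORDER BY value DESC): ties share a rank, gaps follow."""
    ordered = sorted(values, reverse=True)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)  # keep the first (lowest) position per value
    return [first_pos[v] for v in values]


def sql_dense_rank(values):
    """Mimic DENSE_RANK() OVER (ORDER BY value DESC): ties share a rank, no gaps."""
    distinct = sorted(set(values), reverse=True)
    pos = {v: i for i, v in enumerate(distinct, start=1)}
    return [pos[v] for v in values]


# Hypothetical topic: five words, with a tie for the highest probability.
words = ["a", "b", "c", "d", "e"]
probs = [0.4, 0.4, 0.3, 0.2, 0.1]
top_k = 3


def top_words(ranks, keep):
    """Return the words whose rank passes the given filter."""
    return [w for w, r in zip(words, ranks) if keep(r)]


rank = sql_rank(probs)         # [1, 1, 3, 4, 5]
dense = sql_dense_rank(probs)  # [1, 1, 2, 3, 4]

strict = top_words(rank, lambda r: r < top_k)            # ['a', 'b']
inclusive = top_words(rank, lambda r: r <= top_k)        # ['a', 'b', 'c']
dense_incl = top_words(dense, lambda r: r <= top_k)      # ['a', 'b', 'c', 'd']
```

With this data, rank < top_k returns only two words, rank <= top_k returns exactly three, and dense rank <= top_k returns four, because dense rank counts distinct probability values rather than rows. Which behavior is wanted depends on whether top_k should bound the number of words or the number of distinct probability levels.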