[jira] [Commented] (MADLIB-1160) Usability changes for LDA

Jingyi Mei (JIRA) Wed, 10 Jan 2018 17:22:36 -0800

    [ 
https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321521#comment-16321521
 ]


Jingyi Mei commented on MADLIB-1160:
------------------------------------

[~fmcquillan] For the LDA user doc:
Currently, we only introduce madlib.lda_train, madlib.lda_predict and 
madlib.lda_get_perplexity at the top. Other functions that users may need to 
call, such as madlib.lda_get_topic_desc and madlib.lda_get_word_topic_count, 
appear only in the examples. I would suggest mentioning them near the top so 
users know all the tools available with LDA and can follow the examples more 
easily.

Also, it seems we don't yet have a helper function for LDA and term frequency?

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> Tasks
> 1)  Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, to make them consistent with 
> other MADlib modules.  One or both of these currently create indexes starting 
> at 0.
> 2)  In the output_data_table  *topic_assignment* is a dense vector but 
> *words* is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note:  the reason sparse vectors were used in the first place (I think) is to 
> keep the model state as small as possible, so sparse format is preferred to 
> dense in this case, although svecs are a bit harder to work with.  We have hit 
> the Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time.  E.g., helper 
> functions are used in the examples but not described above.
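
To illustrate the dense versus sparse trade-off in task 2, here is a minimal Python sketch of run-length encoding, the general idea behind a sparse vector such as svec. The function names and sample data are hypothetical, chosen only for illustration; this is not MADlib's implementation.

```python
from itertools import groupby


def to_run_length(dense):
    """Collapse a dense list into parallel (counts, values) lists."""
    counts, values = [], []
    for value, run in groupby(dense):
        counts.append(len(list(run)))  # length of this run of equal values
        values.append(value)           # the repeated value itself
    return counts, values


def from_run_length(counts, values):
    """Expand (counts, values) back into the original dense list."""
    dense = []
    for count, value in zip(counts, values):
        dense.extend([value] * count)
    return dense


# Hypothetical topic assignments with repeated neighbouring values.
topic_assignment = [3, 3, 3, 3, 1, 1, 5]
counts, values = to_run_length(topic_assignment)
restored = from_run_length(counts, values)
```

The encoded form stores one count and one value per run, so it shrinks only when neighbouring entries repeat; a vector with few repeats gains nothing from this encoding.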



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
