[ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1160:
------------------------------------
Description:
Context
Please see this thread from the user mailing list
http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
Tasks
1) Term frequency
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
and LDA
http://madlib.apache.org/docs/latest/group__grp__lda.html
should both create indexes that start at 1, to make them consistent with other
MADlib modules. One or both of these currently create indexes starting at 0.
2) In the output_data_table *topic_assignment* is a dense vector but *words*
is a sparse vector (svec).
We should change *topic_assignment* to be a sparse vector to be consistent.
Note: the reason sparse vectors were used in the first place (I think) is to
keep the model state as small as possible, so the sparse format is preferred to
the dense format in this case, although svecs are a bit harder to work with. We
have hit the Postgres 1GB field size limit in some use cases.
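
To illustrate the two tasks, here is a minimal Python sketch (not MADlib code).
It assumes the svec text form is run-length encoded as {run lengths}:{values};
the function names and the sample topic_assignment values are hypothetical.

```python
from itertools import groupby


def to_one_based(ids):
    """Shift 0-based indexes to 1-based (task 1)."""
    return [i + 1 for i in ids]


def rle_encode(dense):
    """Run-length encode a dense list into (run_lengths, values)."""
    counts, values = [], []
    for value, run in groupby(dense):
        counts.append(len(list(run)))
        values.append(value)
    return counts, values


def svec_text(dense):
    """Render a dense list in an svec-like text form (assumed format, task 2)."""
    counts, values = rle_encode(dense)
    return "{%s}:{%s}" % (
        ",".join(str(c) for c in counts),
        ",".join(str(v) for v in values),
    )


# Hypothetical dense topic_assignment with 0-based topic indexes.
topic_assignment = [0, 0, 0, 2, 2, 1]
print(to_one_based(topic_assignment))  # [1, 1, 1, 3, 3, 2]
print(svec_text(topic_assignment))     # {3,2,1}:{0,2,1}
```

Run-length encoding only saves space when equal values are adjacent, so the
benefit for *topic_assignment* depends on how the assignments are ordered.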
was:
Context
Please see this thread from the user mailing list
http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
1) Term frequency
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
and LDA
http://madlib.apache.org/docs/latest/group__grp__lda.html
should both create indexes that start at 1, which makes them consistent with
other MADlib modules.
2) In the output_data_table *topic_assignment* is a dense vector but *words*
is a sparse vector (svec).
We should change *topic_assignment* to be a sparse vector also to be consistent.
Note: the reason sparse vectors are used (I think) is to keep the model states
as small as possible, so it is preferred to dense format in this case,
although svecs are a bit harder to work with.
> Usability changes for LDA
> -------------------------
>
> Key: MADLIB-1160
> URL: https://issues.apache.org/jira/browse/MADLIB-1160
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v2.0