Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r168248554

    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
      \b Arguments
      <dl class="arglist">
      <dt>data_table</dt>
    - <dd>TEXT. The name of the table storing the training dataset. Each row is
    + <dd>TEXT. Name of the table storing the training dataset. Each row is
      in the form <tt><docid, wordid, count></tt> where \c docid, \c wordid, and \c count
    - are non-negative integers.
    -
    + are non-negative integers.
      The \c docid column refers to the document ID, the \c wordid column is the word ID (the index of a word in the vocabulary), and \c count is the
    - number of occurrences of the word in the document.
    -
    - Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    - exact names in the data_table.</dd>
    + number of occurrences of the word in the document. Please note:
    +
    + - \c wordid must be
    + contiguous integers going from from 0 to \c voc_size − \c 1.
    + - column names for \c docid, \c wordid, and \c count are currently fixed,
    + so you must use these exact names in the data_table.
    +
    + The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    + can be used to generate vocabulary in the required format from raw documents.
    + </dd>
      <dt>model_table</dt>
    - <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    + <dd>TEXT. This is an output table generated by LDA which contains the learned model.
    + It has one row with the following columns:
      <table class="output">
      <tr>
      <th>voc_size</th>
    - <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size − \c 1. A data validation routine is called to validate the dataset.</td>
    + <td>INTEGER. Size of the vocabulary. As mentioned above for the input
    + table, \c wordid consists of contiguous integers going
    + from 0 to \c voc_size − \c 1.
    + </td>
      </tr>
      <tr>
      <th>topic_num</th>
      <td>INTEGER. Number of topics.</td>
      </tr>
      <tr>
      <th>alpha</th>
    - <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    + <td>DOUBLE PRECISION. Dirichlet prior for the per-document
    + topic multinomial.</td>
      </tr>
      <tr>
      <th>beta</th>
    - <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    + <td>DOUBLE PRECISION. Dirichlet prior for the per-topic
    + word multinomial.</td>
      </tr>
      <tr>
      <th>model</th>
    - <td>BIGINT[].</td>
    + <td>BIGINT[]. The encoded model description (not human readable).</td>
      </tr>
      </table>
      </dd>
      <dt>output_data_table</dt>
    - <dd>TEXT. The name of the table to store the output data. It has the following columns:
    + <dd>TEXT. The name of the table generated by LDA that stores
    + the output data. It has the following columns:
      <table class="output">
      <tr>
      <th>docid</th>
    - <td>INTEGER.</td>
    + <td>INTEGER. Document id from input 'data_table'.</td>
      </tr>
      <tr>
      <th>wordcount</th>
    - <td>INTEGER.</td>
    + <td>INTEGER. Count of number of words in the document,
    + including repeats. For example, if a word appears 3 times
    + in the document, it is counted 3 times.</td>
      </tr>
      <tr>
      <th>words</th>
    - <td>INTEGER[].</td>
    + <td>INTEGER[]. Array of \c wordid in the document, not
    + including repeats. For example, if a word appears 3 times
    + in the document, it appears only once in the \c words array.</td>
      </tr>
      <tr>
      <th>counts</th>
    - <td>INTEGER[].</td>
    + <td>INTEGER[]. Frequency of occurance of a word in the document,
    + indexed the same as the \c words array above. For example, if the
    + 2nd element of the \c counts array is 4, it means that the word
    + in the 2nd element of the \c words array occurs 4 times in the
    + document.</td>
      </tr>
      <tr>
      <th>topic_count</th>
    - <td>INTEGER[].</td>
    + <td>INTEGER[]. Array of the count of words in the document
    + that correspond to each topic.</td>
    --- End diff --

    Array indexing is a developer thing not a user thing, but I will add: "This array is of length \c topic_num."
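
For readers of the new column descriptions, here is a rough Python sketch of how `wordcount`, `words`, and `counts` relate to the `<docid, wordid, count>` input rows. It is not MADlib code: the function name `summarize_documents` is hypothetical, and sorting `words` by `wordid` is an assumption, since the docs above do not state an ordering.

```python
from collections import defaultdict


def summarize_documents(rows, voc_size):
    """Group (docid, wordid, count) rows into per-document summaries.

    Illustrative only; this mirrors the column descriptions in the diff,
    not the actual MADlib implementation.
    """
    per_doc = defaultdict(dict)
    for docid, wordid, count in rows:
        # The docs require wordid to lie in 0 .. voc_size - 1.
        if not 0 <= wordid < voc_size:
            raise ValueError(f"wordid {wordid} outside 0..{voc_size - 1}")
        per_doc[docid][wordid] = per_doc[docid].get(wordid, 0) + count

    summary = {}
    for docid, word_counts in per_doc.items():
        words = sorted(word_counts)               # each wordid once (ordering assumed)
        counts = [word_counts[w] for w in words]  # indexed the same as words
        summary[docid] = {
            "wordcount": sum(counts),             # total words, including repeats
            "words": words,
            "counts": counts,
        }
    return summary


# Document 0 contains word 0 three times and word 2 once.
rows = [(0, 0, 3), (0, 2, 1), (1, 1, 4)]
summary = summarize_documents(rows, voc_size=3)
print(summary[0])
```

In this sketch the 1st element of `counts` for document 0 is 3, meaning the word in the 1st element of `words` occurs 3 times, which matches the wording proposed for the `counts` column.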
---