Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/232#discussion_r167708065
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
\b Arguments
<dl class="arglist">
<dt>data_table</dt>
- <dd>TEXT. The name of the table storing the training dataset. Each row
is
+ <dd>TEXT. Name of the table storing the training dataset. Each row is
in the form <tt><docid, wordid, count></tt> where \c docid, \c
wordid, and \c count
- are non-negative integers.
-
+ are non-negative integers.
The \c docid column refers to the document ID, the \c wordid column is
the
word ID (the index of a word in the vocabulary), and \c count is the
- number of occurrences of the word in the document.
-
- Please note that column names for \c docid, \c wordid, and \c count
are currently fixed, so you must use these
- exact names in the data_table.</dd>
+ number of occurrences of the word in the document. Please note:
+
+ - \c wordid must be
+ contiguous integers going from 0 to \c voc_size − \c 1.
+ - column names for \c docid, \c wordid, and \c count are currently
fixed,
+ so you must use these exact names in the data_table.
+
+ The function <a href="group__grp__text__utilities.html">Term
Frequency</a>
+ can be used to generate vocabulary in the required format from raw
documents.
+ </dd>
<dt>model_table</dt>
- <dd>TEXT. The name of the table storing the learned models. This table
has one row and the following columns.
+ <dd>TEXT. This is an output table generated by LDA which contains the
learned model.
+ It has one row with the following columns:
<table class="output">
<tr>
<th>voc_size</th>
- <td>INTEGER. Size of the vocabulary. Note that the \c
wordid should be continous integers starting from 0 to \c voc_size − \c
1. A data validation routine is called to validate the dataset.</td>
+ <td>INTEGER. Size of the vocabulary. As mentioned above
for the input
+ table, \c wordid consists of contiguous integers going
+ from 0 to \c voc_size − \c 1.
+ </td>
</tr>
<tr>
<th>topic_num</th>
<td>INTEGER. Number of topics.</td>
</tr>
<tr>
<th>alpha</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc
topic multinomial (e.g. 50/topic_num).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-document
+ topic multinomial.</td>
</tr>
<tr>
<th>beta</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the
per-topic word multinomial (e.g. 0.01).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-topic
+ word multinomial.</td>
</tr>
<tr>
<th>model</th>
- <td>BIGINT[].</td>
+ <td>BIGINT[]. The encoded model description (not human
readable).</td>
</tr>
</table>
</dd>
<dt>output_data_table</dt>
- <dd>TEXT. The name of the table to store the output data. It has the
following columns:
+ <dd>TEXT. The name of the table generated by LDA that stores
+ the output data. It has the following columns:
<table class="output">
<tr>
<th>docid</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Document id from input 'data_table'.</td>
</tr>
<tr>
<th>wordcount</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Total number of words in the document,
+ including repeats. For example, if a word appears 3 times
+ in the document, it is counted 3 times.</td>
</tr>
<tr>
<th>words</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of \c wordid in the document, not
+ including repeats. For example, if a word appears 3 times
+ in the document, it appears only once in the \c words
array.</td>
</tr>
<tr>
<th>counts</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Frequency of occurrence of a word in the
document,
+ indexed the same as the \c words array above. For
example, if the
+ 2nd element of the \c counts array is 4, it means that the
word
+ in the 2nd element of the \c words array occurs 4 times in
the
+ document.</td>
</tr>
<tr>
<th>topic_count</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of the count of words in the document
+ that correspond to each topic.</td>
--- End diff --
Maybe mention that the array index corresponds to topic IDs 0 through topic_num-1.
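
To illustrate the indexing, here is a minimal Python sketch of how the columns of one output row could relate to each other. All values are made up for illustration and do not come from a real LDA run; the `topic_assignment` list (one topic ID per word occurrence) is an assumption introduced only for this example.

```python
# Hypothetical values for one row of output_data_table (not from a real run).
topic_num = 3

words = [0, 4, 7]   # distinct wordids appearing in the document
counts = [2, 1, 3]  # counts[i] is the number of occurrences of words[i]

# Assumed for illustration: one topic ID per word occurrence,
# each in the range 0 .. topic_num - 1.
topic_assignment = [0, 2, 2, 1, 1, 1]

# wordcount includes repeats, so it is the sum of the counts array.
wordcount = sum(counts)

# topic_count[k] is the number of word occurrences assigned to topic k,
# so position k in the array corresponds to topic ID k.
topic_count = [0] * topic_num
for topic in topic_assignment:
    topic_count[topic] += 1

print(wordcount, topic_count)
```

Under these assumptions, the array has exactly `topic_num` entries and its entries sum to `wordcount`.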
---