GitHub user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167708544
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
    in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
    The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed,
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA that contains the learned model.
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human-readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document ID from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Total number of words in the document,
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the
    +                document corresponds to.  This array is of length \c wordcount.</td>
                 </tr>
             </table>
         </dd>
         <dt>voc_size</dt>
    -    <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</dd>
    +    <dd>INTEGER. Size of the vocabulary. As mentioned above for the
    +                input 'data_table', \c wordid consists of contiguous integers going
    +                from 0 to \c voc_size &minus; \c 1.
    +    </dd>
         <dt>topic_num</dt>
    -    <dd>INTEGER. Number of topics.</dd>
    +    <dd>INTEGER. Desired number of topics.</dd>
         <dt>iter_num</dt>
    -    <dd>INTEGER. Number of iterations (e.g. 60).</dd>
    +    <dd>INTEGER. Desired number of iterations.</dd>
         <dt>alpha</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic 
    +    multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
         <dt>beta</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +    word multinomial (e.g., 0.01 is a typical value to start with).</dd>
     </dl>
     
     @anchor predict
     @par Prediction Function
     
    -Prediction&mdash;labelling test documents using a learned LDA model&mdash;is accomplished with the following function:
    +Prediction involves labelling test documents using a learned LDA model:
     <pre class="syntax">
     lda_predict( data_table,
                  model_table,
    -             output_table
    +             output_predict_table
                );
     </pre>
    -
    -This function stores the prediction results in
    -<tt><em>output_table</em></tt>. Each row in the table stores the topic
    -distribution and the topic assignments for a document in the dataset. The
    -table has the following columns:
    -<table class="output">
    -    <tr>
    -        <th>docid</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>wordcount</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>words</th>
    -        <td>INTEGER[]. List of word IDs in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>counts</th>
    -        <td>INTEGER[]. List of word counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_count</th>
    -        <td>INTEGER[]. Of length topic_num, list of topic counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_assignment</th>
    -        <td>INTEGER[]. Of length wordcount, list of topic index for each word.</td>
    -    </tr>
    -</table>
    +\b Arguments
    +<dl class="arglist">
    +<dt>data_table</dt>
    +    <dd>TEXT. Name of the table storing the test dataset 
    +    (new document to be labeled).
    +    </dd>
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table. 
    +    Each row in the table stores the topic 
    +    distribution and the topic assignments for a 
    +    document in the dataset. This table has the exact 
    +    same columns and interpretation as 
    +    the 'output_data_table' from the training function above. 
    +    </dd>
    +</dl>
     
     @anchor perplexity
    -@par Perplexity Function
    -This module provides a function for computing the perplexity.
    +@par Perplexity
    +Perplexity describes how well the model fits the data by 
    +computing word likelihoods averaged over the test documents.
    +This function returns a single perplexity value. 
     <pre class="syntax">
     lda_get_perplexity( model_table,
    -                    output_data_table
    +                    output_predict_table
                       );
     </pre>
    +\b Arguments
    +<dl class="arglist">
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table generated by the 
    +    predict function above.
    +    </dd>
    +</dl>
    +
    +@anchor helper
    +@par Helper Functions
    +
    +The helper functions can help to interpret the output 
    +from LDA training and LDA prediction.
    +
    +<b>Topic description by top-k words</b>
    +
    +Applies to LDA training only.
    +
    +<pre class="syntax">
    +lda_get_topic_desc( model_table,
    +                    vocab_table,
    +                    output_table,
    +                    top_k
    +                  )
    +</pre>
    +\b Arguments
    +<dl class="arglist">
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>vocab_table</dt>
    +    <dd>TEXT. The vocabulary table in the form <tt>&lt;wordid, word&gt;</tt>.
    --- End diff --
    
    Can mention that this table can be generated from term_frequency.
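    
    For example, something along these lines (a sketch only; it assumes a raw table `documents(docid, words)` where `words` is a `TEXT[]` column, and MADlib installed in the `madlib` schema; all table and column names here are illustrative):
    
    ```sql
    -- Build the <docid, wordid, count> input for LDA and, via the last
    -- argument (compute_vocab = TRUE), also create the vocabulary table
    -- 'documents_tf_vocabulary' in <wordid, word> form.
    SELECT madlib.term_frequency('documents',    -- input: raw documents
                                 'docid',        -- document id column
                                 'words',        -- TEXT[] column of words
                                 'documents_tf', -- output: <docid, wordid, count>
                                 TRUE);          -- also compute vocabulary table
    ```
    
    The generated 'documents_tf_vocabulary' table would then have the <wordid, word> form expected here by lda_get_topic_desc.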

