[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

jingyimei Mon, 12 Feb 2018 15:39:56 -0800

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167708360
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the 
    +                document corresponds to.  This array is of length \c wordcount.</td>
    --- End diff --
    
    We could also mention that a word occurring N times in the document appears N times consecutively in this array.
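
    To make the layout concrete, here is a rough sketch in plain Python (not MADlib code). The row values are invented for illustration, `expand_words` is a hypothetical helper, and it assumes the occurrences follow the order of the \c words array, which would need to be confirmed against the implementation:

```python
from itertools import chain, repeat


def expand_words(words, counts):
    """Expand parallel (words, counts) arrays into one entry per occurrence.

    Hypothetical helper: a word with count N is emitted N times in a row,
    in the same order as the `words` array (an assumption, see above).
    """
    if len(words) != len(counts):
        raise ValueError("words and counts must have the same length")
    return list(chain.from_iterable(repeat(w, c) for w, c in zip(words, counts)))


# Invented example row of output_data_table.
words = [0, 3, 5]                      # distinct wordids in the document
counts = [2, 1, 3]                     # occurrences of each wordid
topic_assignment = [1, 1, 0, 2, 2, 0]  # one topic per word occurrence

expanded = expand_words(words, counts)
wordcount = sum(counts)

# topic_assignment has one entry per occurrence, so its length is wordcount.
assert len(expanded) == wordcount == len(topic_assignment)

# Pair each word occurrence with the topic assigned to it.
pairs = list(zip(expanded, topic_assignment))
```

    With that ordering, the i-th entry of topic_assignment can be read against the i-th entry of the expanded word list, which is what a user would need in order to interpret the column.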
