[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

jingyimei Mon, 12 Feb 2018 15:40:01 -0800

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167715245
  
    --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
    @@ -74,175 +81,231 @@ tasks related to text.
         Flag to indicate if a vocabulary is to be created. If TRUE, an additional
         output table is created containing the vocabulary of all words, with an id
         assigned to each word. The table is called <em>output_table</em>_vocabulary
    -    (suffix added to the <em>output_table</em> name) and contains the
    +    (i.e., suffix added to the <em>output_table</em> name) and contains the
         following columns:
    -        - \c wordid: An id assignment for each word
    -        - \c word: The word/term
    +        - \c wordid: An id for each word.
    +        - \c word: The word/term corresponding to the id.
         </dd>
     </dl>
     
     @anchor examples
     @par Examples
     
    --# Prepare datasets with some example documents
    +-# First we create a document table with one document per row:
     <pre class="example">
     DROP TABLE IF EXISTS documents;
    -CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
    +CREATE TABLE documents(docid INT4, contents TEXT);
     INSERT INTO documents VALUES
    -(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'),
    -(2, 'Chinchillas and kittens are cute.'),
    -(3, 'My sister adopted two kittens yesterday'),
    -(4, 'Look at this cute hamster munching on a piece of broccoli');
    +(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
    +(1, 'Chinchillas and kittens are cute.'),
    +(2, 'My sister adopted two kittens yesterday.'),
    +(3, 'Look at this cute hamster munching on a piece of broccoli.');
     </pre>
    -
    --# Add a new column containing the words (lower-cased) in a text array
    +You can apply stemming, stop word removal and tokenization at this point 
    +in order to prepare the documents for text processing. 
    +Depending upon your database version, various tools are 
    +available. Databases based on more recent versions of 
    +PostgreSQL may do something like:
    +<pre class="example">
    +SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
    +</pre>
    +<pre class="result">
    +                    tsvector_to_array                     
    ++----------------------------------------------------------
    + {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
    + {chinchilla,cute,kitten}
    + {adopt,kitten,sister,two,yesterday}
    + {broccoli,cute,hamster,look,munch,piec}
    +(4 rows)
    +</pre>
    +In this example, we assume a database based on an older 
    +version of PostgreSQL and just perform basic punctuation 
    +removal and tokenization. The array of words is added as 
    +a new column to the documents table:
     <pre class="example">
     ALTER TABLE documents ADD COLUMN words TEXT[];
    -UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\\\s+\\\\.]');
    +UPDATE documents SET words = 
    +    regexp_split_to_array(lower(
    +    regexp_replace(contents, E'[,.;\\']','', 'g')
    +    ), E'[\\\\s+]');
    +\\x on   
    +SELECT * FROM documents ORDER BY docid;
    +</pre>
    +<pre class="result">
    +-[ RECORD 1 ]------------------------------------------------------------------------------------
    +docid    | 0
    +contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
    +words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
    +-[ RECORD 2 ]------------------------------------------------------------------------------------
    +docid    | 1
    +contents | Chinchillas and kittens are cute.
    +words    | {chinchillas,and,kittens,are,cute}
    +-[ RECORD 3 ]------------------------------------------------------------------------------------
    +docid    | 2
    +contents | My sister adopted two kittens yesterday.
    +words    | {my,sister,adopted,two,kittens,yesterday}
    +-[ RECORD 4 ]------------------------------------------------------------------------------------
    +docid    | 3
    +contents | Look at this cute hamster munching on a piece of broccoli.
    +words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
     </pre>
     
    --# Compute the frequency of each word in each document
    +-# Compute the frequency of each word in each document:
     <pre class="example">
    -DROP TABLE IF EXISTS documents_tf;
    -SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf');
    -SELECT * FROM documents_tf order by docid;
    +DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
    +SELECT madlib.term_frequency('documents',    -- input table
    +                             'docid',        -- document id
    --- End diff --
    
    The inline comment here should read "document id column".
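
For readers following the example in the diff, here is a rough Python analogue of the tokenization and per-document word counting that it walks through. This is only an illustrative sketch, not MADlib's implementation: the `tokenize` and `term_frequency` helpers are hypothetical names, and the regular expressions are a Python reading of the patterns in the `UPDATE` statement above.

```python
import re
from collections import Counter

# Sample documents copied from the example in the diff (docid -> contents).
documents = {
    0: "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.",
    1: "Chinchillas and kittens are cute.",
    2: "My sister adopted two kittens yesterday.",
    3: "Look at this cute hamster munching on a piece of broccoli.",
}


def tokenize(contents):
    """Hypothetical helper mirroring the SQL: strip , . ; ' then lower-case and split."""
    stripped = re.sub(r"[,.;']", "", contents)
    # The SQL character class [\s+] matches one whitespace character or a literal '+'.
    return re.split(r"[\s+]", stripped.lower())


def term_frequency(docs):
    """Hypothetical helper: count how often each word occurs in each document."""
    return {docid: Counter(tokenize(text)) for docid, text in docs.items()}


tf = term_frequency(documents)
for docid in sorted(tf):
    print(docid, dict(tf[docid]))
```

The second argument in the SQL call plays the role of the dictionary key here: it names the column holding the document id, which is why "document id column" is the more precise comment.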
