[
https://issues.apache.org/jira/browse/MADLIB-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168664#comment-16168664
]
Frank McQuillan commented on MADLIB-1159:
-----------------------------------------
Showing step by step but you could collapse these steps if you want.
1) Create data table
{code}
DROP TABLE IF EXISTS test1;
CREATE TABLE test1(
doc_id integer,
document varchar,
term varchar,
freq integer
);
INSERT INTO test1 VALUES
(1, 'do androids dream of electric sheep', 'rachel', 75),
(1, 'do androids dream of electric sheep', 'andy', 56),
(1, 'do androids dream of electric sheep', 'hands', 128),
(2, 'da vinci code book review', 'vapid',1326),
(2, 'da vinci code book review', 'uninspired',265),
(2, 'da vinci code book review', 'nauseating',879293),
(2, 'da vinci code book review', 'inane',471);
SELECT * FROM test1;
doc_id | document | term | freq
--------+-------------------------------------+------------+--------
1 | do androids dream of electric sheep | rachel | 75
1 | do androids dream of electric sheep | andy | 56
1 | do androids dream of electric sheep | hands | 128
2 | da vinci code book review | vapid | 1326
2 | da vinci code book review | uninspired | 265
2 | da vinci code book review | nauseating | 879293
2 | da vinci code book review | inane | 471
(7 rows)
{code}
2) Encode categorical variables and output as array
http://madlib.apache.org/docs/latest/group__grp__encode__categorical.html
{code}
DROP TABLE IF EXISTS test1_out, test1_out_dictionary;
SELECT madlib.encode_categorical_variables (
'test1', -- Source table
'test1_out', -- Output table
'term', -- Categorical columns
NULL, -- Categorical_cols_to_exclude
NULL, -- Row id
NULL, -- Top
NULL, -- Value to drop
NULL, -- Encode_null
'array'); -- Output_type
SELECT * FROM test1_out;
doc_id | document | freq | __encoded_variables__
--------+-------------------------------------+--------+-----------------------
1 | do androids dream of electric sheep | 75 | {0,0,0,0,1,0,0}
1 | do androids dream of electric sheep | 56 | {1,0,0,0,0,0,0}
1 | do androids dream of electric sheep | 128 | {0,1,0,0,0,0,0}
2 | da vinci code book review | 1326 | {0,0,0,0,0,0,1}
2 | da vinci code book review | 265 | {0,0,0,0,0,1,0}
2 | da vinci code book review | 879293 | {0,0,0,1,0,0,0}
2 | da vinci code book review | 471 | {0,0,1,0,0,0,0}
(7 rows)
{code}
The dictionary into the array is:
{code}
SELECT index, value as ftr FROM test1_out_dictionary;
index | ftr
-------+------------
1 | andy
2 | hands
3 | inane
4 | nauseating
5 | rachel
6 | uninspired
7 | vapid
(7 rows)
{code}
3) Create a dense matrix using MADlib array operation
http://madlib.apache.org/docs/latest/group__grp__array.html
{code}
DROP TABLE IF EXISTS mat_dense;
CREATE TABLE mat_dense AS
SELECT doc_id, madlib.sum(__encoded_variables__) as ftr FROM test1_out GROUP
BY doc_id;
SELECT * FROM mat_dense;
doc_id | ftr_array
--------+-----------------
1 | {1,1,0,0,1,0,0}
2 | {0,0,1,1,0,1,1}
(2 rows)
{code}
You could use the above directly, or sparsify in next step 4.
4) Sparsify the matrix using matrix operations
http://madlib.apache.org/docs/latest/group__grp__matrix.html
{code}
DROP TABLE IF EXISTS mat_sparse;
SELECT madlib.matrix_sparsify('mat_dense', 'row=doc_id, val=ftr',
'mat_sparse');
SELECT * FROM mat_sparse ORDER BY doc_id, col_num;
doc_id | col_num | ftr
--------+---------+-----
1 | 1 | 1
1 | 2 | 1
1 | 5 | 1
2 | 3 | 1
2 | 4 | 1
2 | 6 | 1
2 | 7 | 1
(7 rows)
{code}
where col_num is the index from the dictionary in step 2.
> Provide examples for common sparse matrix cases
> -----------------------------------------------
>
> Key: MADLIB-1159
> URL: https://issues.apache.org/jira/browse/MADLIB-1159
> Project: Apache MADlib
> Issue Type: Documentation
> Reporter: Brian Dolan
>
> A fairly common table structure is of the form `key1, key2, value` like a
> triples in a graph. These are often not normalized.
> It would be useful to provide an example of transforming this class of tables
> into a sparse matrix. Perhaps an example dataset could be a term-document
> matrix.
> TABLE doc_term;
> document, term, freq
> "do androids dream of electric sheep", "rachel", 75
> "do androids dream of electric sheep", "andy", 56
> "do androids dream of electric sheep", "hands", 128
> "da vinci code book review", "vapid",1326
> "da vinci code book review", "uninspired",265
> "da vinci code book review", "nauseating",879293
> "da vinci code book review", "inane",471
> Into a sparse matrix table of documents by features.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)