Altools" by udanax

Apache Wiki Tue, 18 Sep 2007 22:32:30 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.


The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell/Altools

The comment on the change is:
The name Altools will no longer be used due to trademark issues

------------------------------------------------------------------------------
+ deleted
- [[TableOfContents(5)]]
- ----
-  ''-- This project is currently in the planning stage.  
[https://issues.apache.org/jira/browse/HADOOP-1608 HADOOP-1608] to add 
"Relational Algrebra Operators" is currently in process.
  
- = Hbase Shell Altools Plan =
- 
- Hbase altools is an Hbase Shell sub 'interpreter' (or 'shell)' program to 
provide scalable data processing capabilities like  aggregation, algebraic 
calculation(groups and sets, commutative rings, algebraic geometry, and linear 
algebra) on Hadoop + Hbase based parallel machines. especially, it will focus 
on storing and manipulating numeric, sparse matrices on Hbase.
- 
- Altools operations will show or explain how Google search's LSI, Google 
Earth's algebraic topology, Google News' recommendation system are related to 
Bigtable.
- 
- == Initial Contributor ==
-  * [:udanax:Edward Yoon] (R&D center, NHN corp.)
-   -- If you have constructive ideas, Please advise me. [[MailTo(webmaster AT 
SPAMFREE udanax DOT org)]]''
- 
- == Background ==
- I expect Hadoop + Hbase to handle sparsity and data explosion very well in 
near future. Moreover, i believe the design of the multi-dimensional map 
structure and the 3d space model of the data are optimized for rapid ad-hoc 
information retrieval in any orientation, as well as for fast, flexible 
calculation and transformation of raw data based on formulaic relationships. It 
is advantageous with respect to Analysis Processing as it allows users to 
easily formulate complex queries, and filter or slice data into meaningful 
subsets, among other things.
- 
- === Parallelization ===
- 
- Altools provides automatic parallelization of the most time-consuming 
relational/matrix/vector operations, and will ensure that the iterative solvers 
are scalable.
- 
-  * Survey List
-   * Parallel Processing of Relational Data
-   * Parallel Algorithms of Multi-Dimensional Matrix Operations
-   * Parallel Gaussian Elimination Algorithm
- 
- ----
- 
- = Suggested Hbase altools Syntax =
- '''Note''' that Data should be located by their row, column, and timestamp.
- 
- == Commands ==
- ||<bgcolor="#E5E5E5">'''Command''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Table ||<99%>'''Table''' command loads specified table. 
[[BR]][[BR]]~-''Table('table_name');''-~ ||
- ||Matrix ||<99%>'''Matrix''' command constructs the configuration of the 
logic matrix.[[BR]]'''Options''' : features not yet. 
[[BR]][[BR]]~-''Matrix(table_name, columnfamily_name[, option]);''-~ ||
- ||Substitute ||<99%>'''Substitute''' specific algebraic expression to 
[A~Z][[BR]][[BR]]~-''A = Projection(column-list);''-~ ||
- ||IF...ELSE ||<99%>'''IF...ELSE''', Imposes conditions on the execution. 
[[BR]][[BR]]~-''IF ( boolean_expression )[[BR]]{{{    }}}B = 
command_statements;[[BR]]ELSE[[BR]]{{{    }}}B = command_statements;''-~||
- ||Store ||<99%>'''Store''' command will store results to specified table or 
file. [[BR]][[BR]]~-''A = Table('table_name'); [[BR]]B = 
A.Selection(condition_expression); [[BR]]Store B TO table(result_table)[or 
file('result_file_name')];''-~ ||
- 
- '''Type''' 'help;' for Hbase altools usage.
- 
- {{{
- HBase > altools;
- 
- Hbase altools, 0.0.1 version
- Type 'help;' for Hbase altools usage.
- 
- Hbase.altools > help;
- }}}
- 
- == Relational Operators ==
- ||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Projection ||<99%>'''Projection''' of a relation ~+R+~, It makes a new 
relation as the set that is obtained when all tuples(rows) in ~+R+~ are 
restricted to the set 
{columnfamily,,1,,,...,columnfamily,,n,,}.[[BR]][[BR]]~-''A = 
Table('table_name');[[BR]]B = A.Projection(column-list);{{{    
}}}'''//Ï,,column-list,,(A)''' ''-~ ||
- ||Selection ||<99%>'''Selection''' of a relation ~+R+~, It makes a new 
relation as the set of specified tuples(rows) of the relation 
~+R+~.[[BR]]'''Set Operations''' : ~-''OR, AND, NOT''-~[[BR]][[BR]]~-''A = 
Table('table_name');[[BR]]B = A.Selection(condition_expression);{{{    
}}}'''//Ï,,condition,,(A)''' ''-~ ||
- ||JOINs ||<99%>Table '''JOIN''' operations, linking and extracting data from 
two different internal source.[[BR]]'''Operations''' : ~-''naturalJoin(), 
thetaJoin(), cartesianProduct() ''-~ [[BR]][[BR]]~-''R = 
Table('table_name1');[[BR]]S = Table('table_name2');[[BR]]C = 
R.naturalJoin(S);{{{    }}}'''//C = Râ·âS''' ''-~ ||
- ||Group ||<99%>'''Group''' tuples by value of an attribute and apply 
aggregate function independently to each group of tuples.[[BR]]'''Aggregate 
Functions''' : ~-''AVG(attribute), SUM(attribute), COUNT(attribute), 
MIN(attribute), MAX(attribute)''-~[[BR]][[BR]]~-''A = 
Table('table_name');[[BR]]B = A.Group(column-list);{{{    
}}}'''//Î³,,column-list,,(A)''' ''-~ ||
- ||Sort ||<99%>'''Sort''' of tuples(rows) of R, ordered according to 
columnfamilies on columnfamily-list.[[BR]][[BR]]~-''A = 
Table('table_name');[[BR]]B = Sort A by (column-list);{{{    
}}}'''//Ï,,column-list,,(A)''' ''-~ ||
- 
- '''(ex. 1)''' Search the subject and the year of the movies which were 
produced by 'Fox' company and where running time is more than 100 minutes.
- [[BR]]~-'''''Ï ,,title.year,, (Ï ,,length > 100,, (movieLog_table) â© Ï 
,,studioName = 'Fox',, (movieLog_table))'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieLog_table'); 
- Hbase.altools > B = A.Selection(length > 100 AND studioName = 'Fox'); 
- Hbase.altools > C = B.Projection('year'); 
- 
- Hbase.altools > store C to table('result_table'); 
- }}}
- 
- '''(ex. 2)''' Theta Join : â·â,,C,,
- [[BR]]~-'''''movieStars_tableâ·â,,actor < year,,movieLog_table = Ï,,actor 
< year,,(movieStars_table X movieLog_table)'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieStars_table'); 
- Hbase.altools > B = Table('movieLog_table');
- Hbase.altools > C = A.thetaJoin(B);
- 
- Hbase.altools > store C to table('result_table'); 
- }}}
- 
- '''(ex. 3)''' Find the year of the earliest movie for each actor.
- [[BR]]~-'''''Î³ ,,starName.MIN(year) â minYear,, (movieStars_table)'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieStars_table');
- Hbase.altools > B = A.Group('starName',MIN('year'));
- 
- Hbase.altools > store B to table('result_table');
- }}}
- 
- == Matrix Operators ==
- '''Note''' that matrix operations are the core of many linear systems.
- === Arithmetic Operators ===
- ||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Addition ||<99%>'''Adding''' entries with the same indices. 
[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = 
Matrix('table_name2','columnfamily_name2');[[BR]]C = A + B;{{{    }}}'''// 
c,,ij,, = a,,ij,, + b,,ij,, (i : row key, j : column key)''' ''-~ ||
- ||Subtraction ||<99%>'''Subtracting''' entries with the same 
indices.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B 
= Matrix('table_name2','columnfamily_name2');[[BR]]C = A - B;{{{    }}}'''// 
c,,ij,, = a,,ij,, - b,,ij,, (i : row key, j : column key)''' ''-~ ||
- ||Multiplication ||<99%>'''Multiplication''' of two matrices, Product C of 
two matrices A and B.[[BR]][[BR]]~-''A = 
Matrix('table_name1','columnfamily_name1');[[BR]]B = 
Matrix('table_name2','columnfamily_name2');[[BR]]C = A * B;{{{    }}}'''//C = A 
Â· B''' ''-~ ||
- ||Division ||<99%>'''Division''' is solving the matrix equation AX = B for 
X.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = 
Matrix('table_name2','columnfamily_name2');[[BR]]C = A /[or \] B;{{{    
}}}'''// C = A / B''' ''-~||
- ||Transpose ||<99%>'''Transpose''' of a Matrix, A matrix which is formed by 
turning all the rows of a given matrix into columns and 
vice-versa.[[BR]][[BR]]~-''A = 
Matrix('table_name1','columnfamily_name1');[[BR]]B = Transpose(A);{{{    
}}}'''// B = A'''' ''-~||
- 
- 
- '''(ex. 1)''' Matrix Addition
- [[BR]]~-'''''C = A + B = (a,,ij,, + b,,ij,,)'''''-~
- 
- {{{
- //Set up the matrix A, B from mapped matrix in hbase.
- 
- Hbase.altools > A = Matrix('m_table','cf_1');
- Hbase.altools > B = Matrix('m_table','cf_2');
- Hbase.altools > C = A + B;
- }}}
- 
- 
- '''(ex. 2)''' The product C of two matrices A and B
- [[BR]]~-'''''C,,ij,, = Î£A,,ik,,B,,kj,, (1 â¤ i â¤ m , 1 â¤ j â¤n)'''''-~
- 
- {{{
- //Set up the matrix A, B from mapped matrix in hbase.
- 
- Hbase.altools > A = Matrix('m_table','cf_1');
- Hbase.altools > B = Matrix('m_table','cf_2');
- Hbase.altools > C = A * B;
- }}}
- 
- === Factorization and Decomposition Operators ===
- 
- ||<bgcolor="#E5E5E5">'''Function''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||LU ||<99%>'''LU Decomposition'''[[BR]]A procedure for decomposing an N by N 
matrix A into a product of a lower triangular matrix L and an upper triangular 
matrix U, LU = A.[[BR]]'''Functions''' : ~-''getL(), getU(), isSingular(), 
getPivot()''-~ [[BR]][[BR]]~-''A = 
Matrix('table_name','columnfamily_name');[[BR]]B = LUDecomposition(A);[[BR]]C = 
B.getU();[[BR]]D = B.getL();''-~||
- ||QR ||<99%>'''QR Decomposition'''[[BR]]For an m-by-n matrix A with m >= n, 
the QR decomposition is an m-by-n orthogonal matrix Q and an n-by-n upper 
triangular matrix R so that A = Q*R.[[BR]]'''Functions''' : ~-''getH(), getQ(), 
getR()''-~[[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B 
= QRDecomposition(A);[[BR]]C = B.getH();''-~||
- ||Cholesky ||<99%>'''Cholesky Decomposition'''[[BR]]It is a special case of 
LU decomposition applicable only if matrix to be decomposed is symmetric 
positive definite.[[BR]]'''Functions''' : ~-''getL(), getU(), isSPD()''-~ 
[[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B = 
CholeskyDecomposition(A);[[BR]]C = B.getL();''-~||
- ||SVD ||<99%>'''SVD(Singular Value Decomposition)'''[[BR]]For an m-by-n 
matrix A with m >= n, the singular value decomposition is an m-by-n orthogonal 
matrix U, an n-by-n diagonal matrix S, and an n-by-n orthogonal matrix V so 
that A = U*S*V'.[[BR]]'''Functions''' : ~-''getS(), getU(), getV(), norm2(), 
rank()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B 
= SVDdecomposition(A);[[BR]]C = B.getU();''-~||
- 
- {{{
- //Set up the matrix M from mapped matrix in hbase.
- Hbase.altools > M = Matrix('m_table','cf_1'); 
- 
- M ([1, 2],
-    [3, 4])
- }}}
- 
- '''(ex. 1)''' To find the Singular Value decomposition in Altools, do the 
following:
- [[BR]]~-'''''M = UÎ£V*'''''-~
- 
- {{{
- Hbase.altools > A = M.SVDdecomposition();
- Hbase.altools > U = A.getU();
- Hbase.altools > S = A.getS();
- Hbase.altools > V = A.getV();
- 
- U ([[-0.40455358, -0.9145143 ],
-     [-0.9145143 ,  0.40455358]])
- 
- S ([ 5.4649857 ,  0.36596619])
- 
- V ([[-0.57604844, -0.81741556],
-     [ 0.81741556, -0.57604844]])
- }}}
- 
- '''(ex. 2)''' To find the QR decomposition in Altools, do the following:
- [[BR]]~-'''''M = QR'''''-~
- 
- {{{
- Hbase.altools > A = M.QRDecomposition();
- Hbase.altools > U = A.getQ();
- Hbase.altools > U = A.getR();
- 
- Q ([[-0.31622777, -0.9486833 ],
-     [-0.9486833 ,  0.31622777]])
- 
- R ([[-3.16227766, -4.42718872],
-     [ 0.        , -0.63245553]])
- }}}
- ----
- = Example Of Altools Use =
- == Latent Semantic Analysis by SVD ==
- Latent semantic analysis (LSA) is a technique in natural language processing, 
in particular in vectorial semantics, of analyzing relationships between a set 
of documents and the terms they contain by producing a set of concepts related 
to the documents and terms. LSA was patented in 1988 by Scott Deerwester, Susan 
Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and 
Lynn Streeter. In the context of its application to information retrieval, it 
is sometimes called latent semantic indexing (LSI). -- wikipedia
- 
- This LSA example used SVD decomposition with k=3. 
- 
- The SVD is typically computed using large matrix methods (for example, 
Lanczos methods) but may also be computed incrementally and with greatly 
reduced resources via a neural network-like approach which does not require the 
large, full-rank matrix to be held in memory -- wikipedia
- 
-  * ~-'''NOTATION'''-~
-   * ~-''T,,0,,'' : orthogonal, unit-length columns-~
-   * ~-''D,,0,,'' : orthogonal, unit-length columns-~
-   * ~-''S,,0,,'' : eigenvalues of a diagonal matrix-~
-   * ~-''t'' : Row number of matrix X-~
-   * ~-''d'' : Column number of matrix X-~
-   * ~-''m'' : Rank of matrix X ( â¤ min(t,d) )-~
- 
- I made a new relation "t_table" using "relational operators" from raw 
data(web_table) as described below :
- [[BR]]~-''TODO: do explain "theoretical way of manipulating table(web_table) 
using relational operators"''-~
- 
- ||Row Key ||<-6>Column Families ||
- ||<rowbgcolor="#ececec">term ||<-2> corpus: ||<-2>link: ||<-2>.. ||
- ||t01 ||corpus:c01 ||0 ||link:cnn.com ||1 ||.. ||.. ||
- || ||corpus:c03 ||0 ||  ||  ||.. ||.. ||
- ||t02 ||corpus:c01 ||1 || || ||.. ||.. ||
- || ||corpus:c02 ||0 ||link:google.com  ||0  || || ||
- ||.. ||.. ||.. ||.. ||.. ||.. ||.. ||
- 
- {{{
- //Set up the matrix X
- Hbase.altools > X = Matrix('t_table','corpus'); 
- }}}
- 
- This example used tf*idf for more efficient and economical calculations.
- 
-  * ~-''w,,ij,, = tf,,ij,, '''Â·''' log( N/df,,j,, )''-~
-   * ~-''w'' : weighted value-~
-   * ~-''tf'' : the number of times the word appears in a i-th document-~
-   * ~-''N'' : the number of documents of an entire corpus-~
-   * ~-''df'' : the number of documents containing the j-th term-~
- 
- ||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
- ||<bgcolor="#ececec">t01 ||0.00 ||0.00 ||0.15 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.00 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.00 ||.. ||
- ||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||
- 
- Weighted Matrix W
- 
- {{{
- //Diagonal eigenvalue S
- 
- Hbase.altools > M = W.SVDdecomposition(); 
- Hbase.altools > S = M.getS();
- }}}
- 
-  * ~-''X = T,,0,,S,,0,,D,,0,,''-~
- 
- {{{
- //Othonormal eigenvector T, D
- 
- Hbase.altools > T = M.getU();
- Hbase.altools > D = M.getV();
- }}}
- 
- SVD allows a simple strategy for optimal approximate fit using smaller 
matrices. If the singular
- values in S0 are ordered by size, the first k largest values may be kept and 
the remaining smaller ones
- set to zero. The product of the resulting matrices is a matrix Xâ² which is 
only approximately equal to X,
- and is of rank k. Since zeros were introduced into S0, the representation can 
be simplified by deleting
- the zero rows and columns of S0 to obtain a new diagonal matrix S, and 
deleting the corresponding
- columns of T0 and D0 to obtain T and D respectively. The result is a reduced 
model :
- 
-  * ~-''Xâ² = TSDT â X''-~
- 
- which is the rank-k model with the best possible least square fit to X.
- 
- ||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
- ||<bgcolor="#ececec">t01 ||0.18 ||0.03 ||0.12 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.30 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.07 ||.. ||
- ||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||
- 
- Reduced Matrix R
- 
- ----
- = Papers =
-  * [http://www.uib.no/People/nmabh/art/hpj.pdf High performance numerical 
libraries in Java]
-  * ''[http://labs.google.com/papers/bigtable.html Bigtable] : A Distributed 
Storage System for Structured Data''
-  * ''Interpreting the Data: Parallel Analysis with 
[http://labs.google.com/papers/sawzall.html Sawzall]''
-  * ''Y!'s Research Project : [http://research.yahoo.com/project/pig Pig] 
Document''
-  * ''[http://portal.acm.org/citation.cfm?doid=1247480.1247602 
Map-Reduce-Merge] : Simplified Relational Data Processing on Large Clusters''
-  * ''[http://www.pytables.org/ PyTables] : Hierarchical Datasets in Python''
-  * ''[http://numpy.scipy.org/ Numpy] : Scientific Tools for Python''
-  * ''[http://db.lcs.mit.edu/projects/cstore/ C-Store] : A Column Oriented 
DBMS''
-

[Lucene-hadoop Wiki] Update of "Hbase/HbaseShell/Altools" by udanax

Reply via email to