Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change 
notification.

The "SpMV" page has been changed by Mikalai Parafeniuk:
http://wiki.apache.org/hama/SpMV?action=diff&rev1=8&rev2=9

  == Distributed Sparse Matrix-Vector Multiplication on Hama ==
  
  === Introduction ===
- In the following description we will study the problem in the form u = Av. Most computational algorithms spend a large percentage of their time solving systems of linear equations. In general, a system of linear equations can be represented in matrix form as Ax = b, where A is a matrix with n rows and n columns, b is a vector of size n, and x is the unknown solution vector we are looking for. Some approaches for solving linear systems are iterative. Assume we know an initial approximation x = x_0. We then represent our system in the form x_n = Bx_{n-1} + c, where c is a vector of size n, and compute successive approximations of x until convergence. In the real world most matrices contain a relatively small number of non-zero items compared to the total number of items. Such matrices are called sparse; matrices filled mostly with non-zero items are called dense. So this page describes the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As shown above, SpMV can be used in different iterative solvers for systems of linear equations.
+ In the following description we will study the problem in the form u = Av. Most computational algorithms spend a large percentage of their time solving large systems of linear equations. In general, a system of linear equations can be represented in matrix form as Ax = b, where A is a matrix with n rows and n columns, b is a vector of size n, and x is the unknown solution vector we are looking for. Some approaches for solving linear systems are iterative. Assume we know an initial approximation x = x_0. We then represent our system in the form x_n = Bx_{n-1} + c, where c is a vector of size n, and compute successive approximations of x until convergence. In the real world most matrices contain a relatively small number of non-zero items compared to the total number of items. Such matrices are called sparse; matrices filled mostly with non-zero items are called dense. Sparse matrices arise when each variable in a set is connected with only a small subset of the other variables (for example, in the differential equation of heat conduction). So this page describes the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As shown above, SpMV can be used in different iterative solvers for systems of linear equations.
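+ As a concrete illustration of the iterative scheme above (a standard textbook example, not part of the original page), consider the classical Jacobi method. Split A into its diagonal part D and the remainder R, so that A = D + R; then Ax = b can be rearranged into the fixed-point form used above:
+ {{{
+ Ax = b
+ (D + R)x = b
+ Dx = b - Rx
+ x_n = D^{-1} (b - R x_{n-1})
+ }}}
+ Here B = -D^{-1}R and c = D^{-1}b, and the cost of every iteration is dominated by the product R x_{n-1}, which is exactly an SpMV when A is sparse.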
- The Bulk Synchronous Parallel model proposes its own way of parallelizing programs. The input problem is split among peers. Peers can be processors, threads, separate machines, or different nodes of a cloud. A BSP algorithm is divided into a sequence of supersteps, and barrier synchronization of all peers is performed after each superstep. The implementation of BSP (Apache Hama) contains primitives for determining the peer number, communicating with other peers through different communication primitives, and optimizing the communication between peers.
+ The Bulk Synchronous Parallel model proposes its own way of parallelizing programs. We specify the input path for the problem and the number of peers; the framework reads the input and divides it between the peers. Peers can be processors, threads, separate machines, or different nodes of a cloud. A BSP algorithm is divided into a sequence of supersteps, and barrier synchronization of all peers is performed after each superstep. The implementation of BSP (Apache Hama) contains primitives for determining the peer number, communicating with other peers through different communication primitives, and optimizing the communication between peers; it also inherits many features and approaches from the Hadoop project.
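+ A minimal sketch of these primitives in Java is shown below. It follows the org.apache.hama.bsp API as I understand it (exact signatures vary between Hama versions), so treat it as an illustration rather than a verified program:
+ {{{
+ import java.io.IOException;
+ 
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hama.bsp.BSP;
+ import org.apache.hama.bsp.BSPPeer;
+ import org.apache.hama.bsp.sync.SyncException;
+ 
+ /** Skeleton BSP program: one communication superstep, then a barrier. */
+ public class SkeletonBSP extends
+     BSP<IntWritable, Text, IntWritable, Text, Text> {
+ 
+   @Override
+   public void bsp(BSPPeer<IntWritable, Text, IntWritable, Text, Text> peer)
+       throws IOException, SyncException, InterruptedException {
+     // Superstep 1: every peer sends a message to every other peer.
+     for (String other : peer.getAllPeerNames()) {
+       peer.send(other, new Text("hello from " + peer.getPeerName()));
+     }
+     peer.sync(); // barrier: messages become visible after this point
+ 
+     // Superstep 2: drain the incoming message queue.
+     Text msg;
+     while ((msg = peer.getCurrentMessage()) != null) {
+       // process msg ...
+     }
+   }
+ }
+ }}}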
  
  === Problem description ===
  As a sequential problem, SpMV is almost trivial. But for a parallel version we should think about some additional aspects:
-  1. Partitioning of matrix and vector components. This means that we should split the input matrix and vectors among the peers if we want to benefit from using a parallel algorithm. The partitioning must be chosen wisely, or communication time will grow very much and the algorithm will be inefficient.
+  1. Partitioning of matrix and vector components. This means that we should split the input matrix and vectors among the peers if we want to benefit from using a parallel algorithm. The partitioning must be chosen wisely, or communication time will grow very much, we will get a great load imbalance, and the algorithm will be inefficient.
   2. Load balancing. This means that each peer must perform nearly the same 
amount of work, and none of them should idle.
-  3. We should keep communication within bounds. In the case of parallel SpMV we should choose the partitioning wisely to keep communication within appropriate bounds independently of the sparsity patterns of the input matrix and vector.
+  3. We must take into account the Hadoop and Hama approach to parallelization.
  
  === Implementation tips ===
-  1. Order of distribution and representation. We have two choices in this aspect: represent the matrix first and distribute it later, or distribute the matrix first and represent it later. In the first case (represent first, distribute later) all simple operations will be non-local and will bring some unnecessary overhead. In the other case (distribute first, represent later) all local operations remain local: the algorithm first determines the responsible processor, which then performs the operation locally. That is why I prefer the distribute-first, represent-later approach.
-  2. Data transmission direction. Here we also have two choices: deliver vector components to the processors that possess the non-zero matrix components, or vice versa. In most cases the number of non-zero items in the matrix is much larger than the vector length; that is why we prefer transmitting the vector.
+  1. The framework splits the input file among the peers automatically, so we don't need to map the matrix to the peers manually. We only have to define how the matrix is written to a file and how it is read back. If we create a matrix that consists of separate cells, the framework will give some subset of cells to each peer. If we create a matrix consisting of rows, the framework will give a subset of rows to each peer. The ways to influence the partitioning are: creating different writables for matrices as described above, and overriding the default partitioner class behavior.
+  2. We don't need to care about communication in the case of row-wise matrix access. First of all, the rows of the matrix are split among the peers automatically by the framework. After that we can compute the inner product of the vector and a concrete matrix row, and the result can be written directly to the output, because it is one of the cells of the result vector (see the sketch after this list). In this case we assume that a peer's memory can fit two dense vectors. Even for a million x million matrix the vectors have size one million, so some megabytes will be enough to store them, and even if we split the input vector the gain in memory will be insignificant.
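+ A sketch of the local computation from item 2: the inner product of one sparse matrix row with the dense input vector. The SparseRow class is hypothetical, invented here for illustration; the actual writables may be shaped differently:
+ {{{
+ /** Hypothetical sparse row: parallel arrays of column indices and values. */
+ public final class SparseRow {
+   private final int[] columns;    // column indices of the non-zero cells
+   private final double[] values;  // values of those cells
+ 
+   public SparseRow(int[] columns, double[] values) {
+     this.columns = columns;
+     this.values = values;
+   }
+ 
+   /** Inner product with a dense vector: one cell of the result u = Av. */
+   public double multiply(double[] denseVector) {
+     double sum = 0.0;
+     for (int i = 0; i < columns.length; i++) {
+       sum += values[i] * denseVector[columns[i]];
+     }
+     return sum;
+   }
+ }
+ }}}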
  
  === Algorithm description ===
- The generic algorithm will be divided into three supersteps:
+ The generic algorithm will contain one superstep, because no communication is 
needed:
   0. Matrix and vector distribution.
-  1. Fanout.
+  1. Custom partitioning.
   2. Local computation.
+  3. Output of result vector.
+ In the setup stage every peer reads the input dense vector from a file. After that, the framework partitions the matrix rows automatically, using the algorithm provided in the custom partitioner. Then the local computation is performed: we obtain some cells of the result vector, and they are written to the output file.
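+ Putting these stages together, a sketch of the SpMV peer might look as follows. It reuses the hypothetical SparseRow from above (SparseRowWritable is assumed to be its Writable wrapper, and DenseVectorFile.load is an invented helper); the setup/bsp split and the I/O calls follow the Hama BSP API as I understand it:
+ {{{
+ import java.io.IOException;
+ 
+ import org.apache.hadoop.io.DoubleWritable;
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hama.bsp.BSP;
+ import org.apache.hama.bsp.BSPPeer;
+ import org.apache.hama.bsp.sync.SyncException;
+ 
+ /** One-superstep SpMV: no messages are sent, so the message type is unused. */
+ public class SpMVPeer extends
+     BSP<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, IntWritable> {
+ 
+   private double[] v; // dense input vector, assumed to fit in peer memory
+ 
+   @Override
+   public void setup(
+       BSPPeer<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, IntWritable> peer)
+       throws IOException, SyncException, InterruptedException {
+     v = DenseVectorFile.load(peer.getConfiguration()); // hypothetical helper
+   }
+ 
+   @Override
+   public void bsp(
+       BSPPeer<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, IntWritable> peer)
+       throws IOException, SyncException, InterruptedException {
+     IntWritable rowIndex = new IntWritable();
+     SparseRowWritable row = new SparseRowWritable();
+     // Each row assigned to this peer yields one cell of the result vector.
+     while (peer.readNext(rowIndex, row)) {
+       double cell = row.get().multiply(v); // local inner product
+       peer.write(new IntWritable(rowIndex.get()), new DoubleWritable(cell));
+     }
+   }
+ }
+ }}}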
-  3. Fanin.
- In the Fanout phase all processors get the needed components of v. In the local computation phase the local contribution to the result vector is calculated. In the Fanin phase all local contributions are sent to the owner of u. Most effort should go into choosing the right matrix and vector distribution, which will improve the communication volume of the Fanout and Fanin phases. As a base implementation of the distribution I propose to create a Cartesian (column mappings do not depend on row mappings and vice versa) block-cyclic distribution with cyclic distribution of the matrix diagonal. I also assume that distr(u) != distr(v), which gives us more freedom in optimizing the vector distribution. This type of distribution has the following advantages: it is simple; in Fanin only communication within a processor column is needed; in Fanout only communication within a processor row is needed; and we can easily predict the performance of the algorithm. After the matrix distribution we perform the vector distribution in a greedy way: each processor grabs items until it reaches its optimal state. After that stage some vector components (nearly 10%) can remain unassigned; we distribute them in a greedy fashion too. To support efficient local computation, the data structure used should provide the indices of the rows and columns which contain non-zero items. Local computation must be performed with local indices.
- 
- === Dealing with large matrices ===
- The current code contains classes which work with the matrix in memory. That is why the algorithm will fail on large matrices. So I propose some steps to modify the SpMV algorithm to work with large matrices. First of all, a simple matrix format based on the file system will be created. Let's call this class FileMatrix. This format will give the following possibilities:
-  1. we can set a matrix cell, and it will be appended to the file without any check.
-  2. we can iterate through the entire file to get all matrix cells.
- These constraints are chosen because it is hard to imagine how we could efficiently implement some matrix operations, for example, getting a cell with a specified index. At the same time, these constraints match the constraints of HDFS (large block size; data is written once and read many times; fast sequential reading of the entire file). Creating such a class won't take much time, and it will make it possible to store and read large matrices. The memory-consumption bottleneck in the current algorithm is the matrix distribution phase: the array of local matrices is stored in memory. We can correct the situation in the following way: the input matrix is stored in a file, after which we iterate through the matrix and map its components to local matrices, also represented as FileMatrix. From then on we won't store the array of local matrices in memory; at this step we assume that a matrix taken from a local file can fit in the memory of the local processor. But even this can be improved: when the local matrix can't fit in the local processor's memory, we will invoke the local SpMV algorithm on parts of the matrix. I propose to create a class which implements the Mapper interface from the linearalgebra package and splits the local matrix recursively into chunks, represented as FileMatrix, until each chunk fits in local memory. After that the local chunks will be verified. I will further call this process two-phase mapping. After these steps we avoid storing the entire matrix in memory; now we only assume that the matrix can fit in the total space of the hard drives and that the components of the local vector fit in memory. Two-phase mapping can also be used in RandomMatrixGenerator for large matrices.
  
  === Possible improvements ===
+  1. A significant improvement in the total time of the algorithm can be achieved by creating a custom partitioner class (see the sketch after this list). It will give us load balancing and therefore better efficiency. This is the main opportunity for optimization, because we decided that row-wise matrix access is acceptable. It may be achieved by reordering the input or by customizing the partitioning algorithm of the framework.
-  1. Implementation of the Mondrian distribution. In most cases it gives better results than the Cartesian block-cyclic scheme.
-  2. Implement an algorithm for detecting matrix sparsity patterns. This will give us the possibility to determine, for example, whether a matrix is a random sparse matrix or a Laplacian matrix. In the case of random sparse matrices we can use distribution patterns which are independent of the matrix sparsity pattern. In the case of Laplacian matrices a diamond distribution can give better results.
-  3. In the vector distribution phase, when some vector components remain unassigned, we can use graph algorithms to determine the owner of each vector component.
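+ A sketch of what such a custom partitioner might look like, assuming Hama exposes a Partitioner interface of the usual getPartition(key, value, numTasks) shape (its exact form may differ between versions). The cyclic rule below is only a placeholder; a real implementation would weigh each row by its number of non-zero cells:
+ {{{
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hama.bsp.Partitioner;
+ 
+ /** Hypothetical partitioner: assigns matrix rows to peers cyclically. */
+ public class RowCyclicPartitioner
+     implements Partitioner<IntWritable, SparseRowWritable> {
+ 
+   @Override
+   public int getPartition(IntWritable rowIndex, SparseRowWritable row,
+       int numTasks) {
+     // Cyclic assignment; replace with a non-zero-aware rule for balance.
+     return rowIndex.get() % numTasks;
+   }
+ }
+ }}}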
  
  === Literature ===
   1. Rob H. Bisseling. Parallel Scientific Computation (chapter 4).
