[Hama Wiki] Update of "SpMV" by Mikalai Parafeniuk

Apache Wiki Wed, 27 Jun 2012 01:25:21 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change 
notification.


The "SpMV" page has been changed by Mikalai Parafeniuk:
http://wiki.apache.org/hama/SpMV?action=diff&rev1=5&rev2=6

  
  === Problem description ===
  As a sequential problem SpMV is almost trivial problem. But in case of 
parallel version we should think about some additional aspects:
- 1. Partitioning of matrix and vector components. This means that we should 
split the input matrix and vectors by peers, if we want to have benefits from 
usage of parallel algorithm. Wise partitioning should be taken or communication 
time will rise very much and algorithm will be inefficient.
+  1. Partitioning of matrix and vector components. This means that we should 
split the input matrix and vectors by peers, if we want to have benefits from 
usage of parallel algorithm. Wise partitioning should be taken or communication 
time will rise very much and algorithm will be inefficient.
- 2. Load balancing. This means that each peer must perform nearly the same 
amount of work, and none of them should idle.
+  2. Load balancing. This means that each peer must perform nearly the same 
amount of work, and none of them should idle.
- 3. We should keep communication in bounds. In case of paralel SpMV we should 
take partitioning wise to keep communication in appropriate bounds 
independently of sparsity patterns of input matrix and vector.
+  3. We should keep communication in bounds. In case of paralel SpMV we should 
take partitioning wise to keep communication in appropriate bounds 
independently of sparsity patterns of input matrix and vector.
  
  === Implementation tips ===
- 1. Order of distribution and representation. We have two choices in this 
aspect: represent matrix first and distribute later, or distribute matrix first 
and represent later. In first case  (represent first, distribute later) all 
simple operations will be non-local and will bring some unnecessary overhead. 
In other case (distribute first, represent later) all local operations on 
processor remain local: algorithm first determines responsible processor and it 
performs operation locally. Thats why I prefer distribution first 
representation later approach.
+  1. Order of distribution and representation. We have two choices in this 
aspect: represent matrix first and distribute later, or distribute matrix first 
and represent later. In first case  (represent first, distribute later) all 
simple operations will be non-local and will bring some unnecessary overhead. 
In other case (distribute first, represent later) all local operations on 
processor remain local: algorithm first determines responsible processor and it 
performs operation locally. Thats why I prefer distribution first 
representation later approach.
- 2. Data transmission direction. Here we also have two choices: delivery 
vector component to processor which possesses non-zero matrix component or vice 
versa. In most cases a number of non-zero items in matrix is much larger than 
vector length, thats why we prefer transmission of vector.
+  2. Data transmission direction. Here we also have two choices: delivery 
vector component to processor which possesses non-zero matrix component or vice 
versa. In most cases a number of non-zero items in matrix is much larger than 
vector length, thats why we prefer transmission of vector.
  
  === Algorithm description ===
  The generic algorithm will be divided in three supersteps:
- 0. Matrix and vector distribution.
+  0. Matrix and vector distribution.
- 1. Fanout.
+  1. Fanout.
- 2. Local computation.
+  2. Local computation.
- 3. Fanin.
+  3. Fanin.
  In Fanout phase all processors gets needed v components. In local computation 
phase local contribution to result vector is calculated. In Fanin phase all 
local contributions are sent to an owner of u. Most of efforts should be taken 
to choose right matrix and vector distribution which will improve the 
comunication volume of Fanout and Fanin phase. As base implementation of 
distribution I propose to create Cartesian (column mappings are not dependent 
of row mappings and vice versa) cyclic-block distribution with cyclical 
distribution of matrix diagonal. Also I assume that distr(u) != distr(v), which 
gives us more freedom in optimising vector distribution. This type of 
distribution has such advantages: it is simple, in fanin only communication 
with processor column is needed, in fanout only communication with processor 
row is needed, we can easily predict the productivity of algorithm. After 
matrix distribution we perform vector distribution in greedy way for each 
processor: processor grabs items until he reaches it's optimum state. After 
that stage some vector components can be unassigned (nearly 10%). We well 
distribute them in greedy fashion to. To support efficient local computation 
used data structure should provide indeces of rows and columns which have the 
non-zero item in them. Local computation must be performed with local indeces.
  
  === Possible improvements ===
- 1. Implementation of Mondrian distribution. In most cases it gives better 
results in comparison with cyclic-block Cartesian scheme.
+  1. Implementation of Mondrian distribution. In most cases it gives better 
results in comparison with cyclic-block Cartesian scheme.
- 2. Implement algorithm for detecting matrix sparsity patterns. This will give 
us a possibility to define, for example, if matrix is random sparse matrix, or 
Laplacian matrix. In case of random sparse matrices we can use distribution 
patterns which are independent of matrix sparsity pattern. In case of Laplacian 
matrices we diamond distribution can give better result.
+  2. Implement algorithm for detecting matrix sparsity patterns. This will 
give us a possibility to define, for example, if matrix is random sparse 
matrix, or Laplacian matrix. In case of random sparse matrices we can use 
distribution patterns which are independent of matrix sparsity pattern. In case 
of Laplacian matrices we diamond distribution can give better result.
- 3. In phase of vector distribution when some vectors remain unassigned we can 
use graph algoritms to determine the owner of vector component.
+  3. In phase of vector distribution when some vectors remain unassigned we 
can use graph algoritms to determine the owner of vector component.
  
  === Literature ===
- 1. Rob H. Bisseling - Parallel Scientific computation. (chapter 4).
+  1. Rob H. Bisseling - Parallel Scientific computation. (chapter 4).
- 2. Steve Rennich - Block SpMV on GPU.
+  2. Steve Rennich - Block SpMV on GPU.

[Hama Wiki] Update of "SpMV" by Mikalai Parafeniuk

Reply via email to