Mikalai,

You don't seem to understand the I/O mechanism of the BSP framework yet.
Like MapReduce, the task scheduler of the BSP framework automatically
partitions the input data using the splitter function and executes
each BSP task on its own split.

See this file loading test code:

https://svn.apache.org/repos/asf/hama/trunk/core/src/test/java/org/apache/hama/bsp/TestPartitioning.java
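For intuition, here's a sketch of the concept in plain Java — how a row-based splitter could assign contiguous row ranges of an n-row input to p BSP tasks. This is only an illustration of the idea, not Hama's actual splitter code:

```java
// Sketch of the concept only: a row-based splitter assigning contiguous
// row ranges of an n-row matrix to p BSP tasks. Illustration, not the
// actual Hama implementation.
public class RowSplitSketch {
    // Returns {startRow, endRow} (end exclusive) for the given task.
    static int[] splitForTask(int numRows, int numTasks, int taskId) {
        int base = numRows / numTasks;
        int rem = numRows % numTasks;
        // The first 'rem' tasks each get one extra row.
        int start = taskId * base + Math.min(taskId, rem);
        int end = start + base + (taskId < rem ? 1 : 0);
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        // 10 rows split across 3 tasks -> [0,4), [4,7), [7,10)
        for (int t = 0; t < 3; t++) {
            int[] s = splitForTask(10, 3, t);
            System.out.println("task " + t + ": rows [" + s[0] + ", " + s[1] + ")");
        }
    }
}
```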

Also, I don't understand why you want to access each cell of the matrix
directly. Please try writing the matrices to a file first, and then load
the sub-matrices in the bsp() function. Basically, it's a row-based
partitioning.
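As a sketch of that approach — note the "row col value" text format and the class name here are hypothetical, not an existing Hama API — loading only the rows belonging to one peer inside bsp() could look like:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: parse "row col value" lines of a sparse matrix file and
// keep the entries belonging to this peer's row range, as one could do
// inside bsp(). The text format and class name are hypothetical.
public class SubMatrixLoader {
    public static class Cell {
        public final int row, col;
        public final double value;
        Cell(int row, int col, double value) {
            this.row = row; this.col = col; this.value = value;
        }
    }

    // Keep only cells whose row falls in [startRow, endRow).
    public static List<Cell> loadRows(String[] lines, int startRow, int endRow) {
        List<Cell> cells = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            int row = Integer.parseInt(parts[0]);
            if (row >= startRow && row < endRow) {
                cells.add(new Cell(row, Integer.parseInt(parts[1]),
                        Double.parseDouble(parts[2])));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        String[] lines = { "0 1 2.0", "1 0 3.5", "2 2 1.0" };
        // This peer owns rows [0, 2), so it keeps the first two cells.
        System.out.println(loadRows(lines, 0, 2).size()); // prints 2
    }
}
```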

On Thu, Jul 19, 2012 at 6:30 PM, Apache Wiki <[email protected]> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Hama Wiki" for change 
> notification.
>
> The "SpMV" page has been changed by Mikalai Parafeniuk:
> http://wiki.apache.org/hama/SpMV?action=diff&rev1=7&rev2=8
>
>   The current code contains classes that work with the matrix in memory, 
> which is why the algorithm will fail for large matrices. So I propose some 
> steps to modify the SpMV algorithm to work with large matrices. First of 
> all, a simple matrix format based on the file system will be created. Let's 
> call this class FileMatrix. This format will provide the following:
>    1. we can set a matrix cell and it will be appended to the file, without 
> any checks.
>    2. we can iterate through the entire file to get all matrix cells.
> + Such constraints are chosen because it is hard to imagine how we could 
> efficiently implement some matrix operations, for example getting a cell 
> at a specified index. At the same time, these constraints match those of 
> HDFS (large block size, data written once and read many times, fast 
> sequential reading of an entire file). Creating such a class won't take 
> much time, and it will make it possible to store and read large matrices. 
> The bottleneck in the current algorithm is memory consumption during the 
> matrix distribution phase: the array of local matrices is stored in 
> memory. We can correct this as follows: the input matrix is stored in a 
> file; we then iterate through the matrix and map its components to local 
> matrices, also represented as FileMatrix. From then on we won't store the 
> array of local matrices in memory; at this step we assume that a matrix 
> taken from a local file can fit in the memory of the local processor. But 
> even this can be improved: when a local matrix can't fit in local 
> processor memory, we will invoke the local SpMV algorithm on matrix parts. 
> I propose to create a class which implements the Mapper interface from the 
> linearalgebra package and splits the local matrix recursively into chunks, 
> represented as FileMatrix, until each chunk fits in local memory. After 
> that the local chunks will be verified. I will call this process two-phase 
> mapping. After these steps we avoid storing the entire matrix in memory: 
> we now only assume that the matrix fits on the hard drives and that the 
> components of the local vector fit in memory. Two-phase mapping can also 
> be used in RandomMatrixGenerator for large matrices.
>
>   === Possible improvements ===
>    1. Implement the Mondrian distribution. In most cases it gives better 
> results than the cyclic-block Cartesian scheme.
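The FileMatrix format described in the quoted text — append-only cell writes plus sequential iteration over the whole file — could be sketched roughly like this. The class name comes from the proposal; everything else here is hypothetical, and a single sequential pass is already enough to run SpMV:

```java
import java.io.*;

// Rough sketch of the proposed FileMatrix: cells are appended to a file
// without any checks, and read back only by iterating the entire file
// sequentially, matching HDFS's write-once / sequential-read strengths.
// The class name follows the proposal; the rest is hypothetical.
public class FileMatrix {
    private final File file;
    private final DataOutputStream out;

    public FileMatrix(File file) throws IOException {
        this.file = file;
        this.out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)));
    }

    // 1. set a cell: it is simply appended, without any checks.
    public void setCell(int row, int col, double value) throws IOException {
        out.writeInt(row);
        out.writeInt(col);
        out.writeDouble(value);
    }

    public void close() throws IOException { out.close(); }

    public interface CellVisitor { void visit(int row, int col, double value); }

    // 2. iterate through the entire file, visiting every stored cell.
    public void forEachCell(CellVisitor visitor) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (in.available() > 0) {
                visitor.visit(in.readInt(), in.readInt(), in.readDouble());
            }
        }
    }

    // One sequential pass computes y = A * x: y[row] += value * x[col].
    public double[] multiply(double[] x, int numRows) throws IOException {
        double[] y = new double[numRows];
        forEachCell((row, col, value) -> y[row] += value * x[col]);
        return y;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("filematrix", ".bin");
        f.deleteOnExit();
        FileMatrix m = new FileMatrix(f);
        m.setCell(0, 0, 2.0);
        m.setCell(1, 1, 3.0);
        m.close();
        double[] y = m.multiply(new double[] { 1.0, 1.0 }, 2);
        System.out.println(y[0] + " " + y[1]); // prints 2.0 3.0
    }
}
```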



-- 
Best Regards, Edward J. Yoon
@eddieyoon
