hi all,

I am thinking about how to use map/reduce to bulk-load a matrix from a
file.

we can split the file and let many mappers each load a part of it. but if
the matrix is huge, lots of region splits will happen during loading, which
may hurt load performance.

I think a file that stores a matrix tends to have a regular layout.
without compression, it may look like the following:
d11 d12 d13 ... d1m
d21 d22 d23 ... d2m
...
dn1 dn2 dn3 ... dnm

An optimization would be (a code sketch follows these steps):
(1) read one line from the matrix file to learn the row size; call it RS.
(2) get the file size FS from the filesystem's metadata.
(3) compute the number of rows: N(R) = FS / RS.
(4) from the row count, estimate the number of regions the matrix table
will need.
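here is a minimal sketch of steps (1)-(3) in Java, assuming an
uncompressed, fixed-width text file on HDFS where every row has the same
byte length (the class and method names are just for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RowEstimator {
    /** Steps (1)-(3): N(R) = FS / RS for a fixed-width text matrix file. */
    public static long estimateRows(Configuration conf, Path matrixFile)
            throws Exception {
        FileSystem fs = matrixFile.getFileSystem(conf);

        // (2) FS: file size from the filesystem metadata
        long fileSize = fs.getFileStatus(matrixFile).getLen();

        // (1) RS: byte length of the first line, plus one for the '\n'
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(matrixFile),
                                      StandardCharsets.US_ASCII))) {
            long rowSize = reader.readLine()
                    .getBytes(StandardCharsets.US_ASCII).length + 1;

            // (3) N(R) = FS / RS
            return fileSize / rowSize;
        }
    }
}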
finally, we can pre-split the matrix's table in hbase first and run the
matrix load in parallel without any further splitting.
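and a sketch of the pre-split itself against the HBase 2.x Admin API; the
table name "matrix", the column family "d", and the 10-digit zero-padded
row-key scheme are all assumptions for illustration:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplit {
    public static void createPreSplitTable(long rows, int numRegions)
            throws Exception {
        // region boundaries at even row-index intervals; assumes the row
        // key is the row index zero-padded to 10 digits, numRegions >= 2
        byte[][] splits = new byte[numRegions - 1][];
        long step = rows / numRegions;
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = Bytes.toBytes(String.format("%010d", i * step));
        }

        try (Connection conn = ConnectionFactory.createConnection(
                     HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("matrix"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build(),
                splits);   // regions exist before the first mapper writes
        }
    }
}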

certainly, no one will store a matrix as plain text like the above; some
compressed encoding will be used for a dense or sparse matrix.
but even with a compressed matrix file, we can still pay a small up-front
cost to estimate the number of regions and gain a performance improvement
for the matrix bulk load.
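for example, many sparse encodings carry the dimensions in a header, so
the row count can be read directly rather than estimated. a minimal
sketch, assuming a Matrix Market coordinate file (the helper is
hypothetical):

import java.io.BufferedReader;
import java.io.IOException;

public class HeaderDims {
    // returns {rows, cols, nonZeros} from the size line of a
    // Matrix Market coordinate file
    static long[] readDims(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("%")) continue;          // banner, comments
            String[] p = line.trim().split("\\s+");
            return new long[] { Long.parseLong(p[0]),    // rows
                                Long.parseLong(p[1]),    // columns
                                Long.parseLong(p[2]) };  // non-zero entries
        }
        throw new IOException("no size line found");
    }
}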

Am I right?

regards,

samuel
