Hi all, I am thinking about how to use MapReduce to bulk-load a matrix from a file.
We can split the file and let many mappers load parts of it, but if the matrix is huge, many region splits will happen during the load, which hurts load performance.

A file that stores a matrix tends to be regular. Without compression it might look like this:

  d11 d12 d13 ... d1m
  d21 d22 d23 ... d2m
  ...
  dn1 dn2 dn3 ... dnm

An optimization would be:

(1) Read one line from the matrix file to learn its row size; call it RS.
(2) Get the file size FS from the filesystem's metadata.
(3) Compute the number of rows: N(R) = FS / RS.
(4) Knowing the number of rows, estimate the number of regions the matrix will need.

Finally, we can pre-split the matrix's table in HBase first, and then load the matrix in parallel without any further splitting (see the sketch in the P.S. below).

Of course, nobody would store a matrix exactly as above; some compression scheme will be used for a dense or sparse matrix. But even with a compressed matrix file, we can still pay a small up-front cost to estimate the number of regions and gain a real performance improvement in the matrix bulk load.

Am I right?

regards,
samuel
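P.S. Here is a minimal sketch of steps (1)-(4) plus the pre-split, written against the older HBase client API (HBaseAdmin). The table name "matrix", the column family "d", the ROWS_PER_REGION target, and the zero-padded row-key format are all my assumptions for illustration, not part of the idea itself.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MatrixPreSplit {

  // Hypothetical tuning knob: how many matrix rows one region should hold.
  static final long ROWS_PER_REGION = 1000000L;

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path matrixFile = new Path(args[0]);
    FileSystem fs = matrixFile.getFileSystem(conf);

    // (1) Read one line to learn the row size RS (bytes, +1 for '\n').
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(matrixFile)));
    long rs = reader.readLine().getBytes().length + 1;
    reader.close();

    // (2) Get the file size FS from the filesystem's metadata.
    long fileSize = fs.getFileStatus(matrixFile).getLen();

    // (3) Estimate the number of rows: N(R) = FS / RS.
    long numRows = fileSize / rs;

    // (4) Estimate the number of regions and derive the split keys.
    int numRegions = (int) Math.max(1, numRows / ROWS_PER_REGION);
    byte[][] splitKeys = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      // Assumes row keys are zero-padded row indices, e.g. "0000012345",
      // so lexicographic order matches numeric order.
      long boundary = i * (numRows / numRegions);
      splitKeys[i - 1] = Bytes.toBytes(String.format("%010d", boundary));
    }

    // Pre-split the table so the bulk load never triggers region splits.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("matrix");
    desc.addFamily(new HColumnDescriptor("d"));
    if (splitKeys.length > 0) {
      admin.createTable(desc, splitKeys);
    } else {
      admin.createTable(desc);
    }
    admin.close();
  }
}

With the table pre-split this way, the mappers write into regions that already exist, so no region splits (and no blocked writes) should happen in the middle of the load.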
