On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> The table is nothing more and nothing less than a matrix. So, we can
> think about bulk load such as
> http://wiki.apache.org/hadoop/Hbase/MapReduce

Yes. MapReduce should be used to load a matrix. But if the matrix is huge
(many rows, many columns), the load will still cause a lot of splits of the
matrix table. Is that right?

> And, I think we can provide some regular format to store the matrix,
> such as Hadoop's SequenceFile format.

It is great!

> Then, file->matrix, matrix->file, matrix operations,..., all done.
>
> /Edward
>
> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> > hi all,
> >
> > I am considering how to use map/reduce to bulk-load a matrix from a
> > file.
> >
> > We can split the file and let many mappers each load part of it. But
> > lots of region splits will happen during loading if the matrix is huge,
> > which may hurt matrix load performance.
> >
> > I think that a file that stores a matrix may be regular.
> > Without compression, it may look like this:
> > d11 d12 d13 .................... d1m
> > d21 d22 d23 .................... d2m
> > .............................................
> > dn1 dn2 dn3......................dnm
> >
> > An optimization method would be:
> > (1) Read a line from the matrix file to learn its row size. Assume it
> > is RS.
> > (2) Get the file size from the filesystem's metadata. Assume it is FS.
> > (3) Compute the number of rows: N(R) = FS/RS.
> > (4) Once we know the number of rows, we can estimate the number of
> > regions of the matrix.
> > Finally, we can split the matrix's table in HBase up front and let the
> > matrix load run in parallel without splitting again.
> >
> > Certainly, no one will store a matrix in a file as above. Some
> > compression will be used to store a dense or sparse matrix.
> > But even for a compressed matrix file, we can still estimate the number
> > of regions at little cost and gain a performance improvement in matrix
> > bulk-loading.
> >
> > Am I right?
> >
> > regards,
> >
> > samuel
> >
>
> --
> Best regards, Edward J. Yoon
> [EMAIL PROTECTED]
> http://blog.udanax.org
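
For illustration, steps (1) to (4) of the quoted proposal could be sketched roughly as below. This is only a sketch under stated assumptions, not code from HBase or from this thread: the function names, the `max_region_bytes` threshold, and the idea of using row indices as split points are all hypothetical, and the on-disk file size is used as a rough proxy for the size of the loaded table (the table will in practice be larger because of per-cell overhead).

```python
import os


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_rows(file_size: int, row_size: int) -> int:
    """Step (3): N(R) = FS / RS, rounded up to count a trailing partial row."""
    if row_size <= 0:
        raise ValueError("row_size must be positive")
    return ceil_div(file_size, row_size)


def estimate_regions(file_size: int, max_region_bytes: int) -> int:
    """Step (4): rough region count, assuming a region splits once it
    exceeds max_region_bytes (a hypothetical threshold for this sketch)."""
    if max_region_bytes <= 0:
        raise ValueError("max_region_bytes must be positive")
    return max(1, ceil_div(file_size, max_region_bytes))


def split_points(num_rows: int, num_regions: int) -> list[int]:
    """Row indices at which to pre-split the table so that each region
    receives roughly the same number of rows."""
    num_regions = max(1, min(num_regions, num_rows))
    return [(i * num_rows) // num_regions for i in range(1, num_regions)]


def estimate_from_file(path: str, max_region_bytes: int) -> tuple[int, list[int]]:
    """Steps (1) and (2): read one line for RS, ask the filesystem for FS,
    then derive the row count and the pre-split points."""
    with open(path, "rb") as f:
        row_size = len(f.readline())  # RS, in bytes, including the newline
    file_size = os.path.getsize(path)  # FS, from filesystem metadata
    rows = estimate_rows(file_size, row_size)
    regions = estimate_regions(file_size, max_region_bytes)
    return rows, split_points(rows, regions)
```

The resulting split points would then be handed to whatever mechanism creates the table with predefined region boundaries, so that the parallel load does not trigger further splits. The estimate only holds when every row has about the same byte length, which is the "regular file" assumption in the proposal.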
