The table is nothing more and nothing less than a matrix. So, we can think about a bulk load such as the one described at http://wiki.apache.org/hadoop/Hbase/MapReduce
And, I think we can provide some regular format to store the matrix, such as Hadoop's SequenceFile format. Then file->matrix, matrix->file, matrix operations, ..., all done.

/Edward

On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> hi all,
>
> I am considering how to use map/reduce to bulk-load a matrix from a file.
>
> We can split the file and let many mappers load parts of it, but many
> region splits will happen while loading if the matrix is huge, which may
> hurt the matrix-load performance.
>
> I think that a file that stores a matrix may be regular.
> Without compression, it may look like this:
> d11 d12 d13 .................... d1m
> d21 d22 d23 .................... d2m
> .............................................
> dn1 dn2 dn3 .................... dnm
>
> An optimization method would be:
> (1) Read a line from the matrix file; from it we know the row size. Assume it is RS.
> (2) We can get the file size from the filesystem's metadata. Assume it is FS.
> (3) We can compute the number of rows: N(R) = FS / RS.
> (4) If we know the number of rows, we can estimate the number of regions of the matrix.
> Finally, we can pre-split the matrix's table in HBase first and run the
> matrix loading in parallel without splitting again.
>
> Certainly, no one will store a matrix in a file exactly as above; some
> compression will be used to store a dense or sparse matrix.
> But even with a compressed matrix file, we can still pay a small cost to
> estimate the number of regions of the matrix and gain a large performance
> improvement for matrix bulk-loading.
>
> Am I right?
>
> regards,
>
> samuel

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
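
A minimal sketch of the row-per-record layout suggested above: each matrix row is stored as one SequenceFile record, keyed by its row index. The IntWritable/Text key-value choice and the class name MatrixSequenceFile are illustrative assumptions, not a format fixed by this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MatrixSequenceFile {

  /** matrix -> file: append each row as one record, keyed by row index. */
  public static void write(double[][] matrix, Path path, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        StringBuilder row = new StringBuilder();
        for (int j = 0; j < matrix[i].length; j++) {
          if (j > 0) row.append(' ');
          row.append(matrix[i][j]);
        }
        writer.append(new IntWritable(i), new Text(row.toString()));
      }
    } finally {
      writer.close();
    }
  }

  /** file -> matrix: read the records back; each mapper would read one split like this. */
  public static void read(Path path, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable row = new IntWritable();
      Text values = new Text();
      while (reader.next(row, values)) {
        String[] cells = values.toString().split(" ");
        // parse the cells and write the row into the HBase table here
      }
    } finally {
      reader.close();
    }
  }
}

With one record per row, each map task can read its own split of the SequenceFile and write the corresponding rows into the table, which is the bulk-load pattern pointed to above.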

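And a rough sketch of the estimate Samuel describes: read one row to learn RS, take FS from the filesystem metadata, and derive N(R) = FS / RS plus a region count. The 64 MB region-size figure and the rows-per-region arithmetic are assumptions made only for illustration; the thread does not fix them.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionEstimator {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path matrixFile = new Path(args[0]);

    // (1) read one line from the matrix file to learn the row size RS (plus newline)
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(matrixFile)));
    long rowSize = in.readLine().getBytes().length + 1;
    in.close();

    // (2) get the file size FS from the filesystem metadata
    long fileSize = fs.getFileStatus(matrixFile).getLen();

    // (3) number of rows N(R) = FS / RS
    long rows = fileSize / rowSize;

    // (4) estimate the number of regions, assuming (hypothetically) that a row
    // takes roughly rowSize bytes in the table and regions split at about 64 MB
    long regionSizeBytes = 64L * 1024 * 1024;
    long rowsPerRegion = Math.max(1, regionSizeBytes / rowSize);
    long regions = (rows + rowsPerRegion - 1) / rowsPerRegion;

    System.out.println("rows=" + rows + ", estimated regions=" + regions);
    // With these figures the table could be created pre-split at row keys
    // 0, rowsPerRegion, 2*rowsPerRegion, ... before the parallel load starts.
  }
}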