> but still, if the matrix is huge (many rows, many columns), the loading
> will cause a lot of matrix-table split actions. Is that right?
Yes, but

>> finally, we can split the matrix's table in hbase first and let
>> matrix-loading run in parallel without splitting again.

I don't understand exactly. Do you mean creating the tablets directly by
pre-splitting and assigning them to the region servers? Then, this is a
role of HBase. The merges/splits are issued after compaction, so I guess
it will be the same as the HBase compaction mechanism.

(A few rough Java sketches of the ideas discussed here follow the quoted
thread below: the row-count estimation, the pre-split, the map/reduce
load, and the SequenceFile format.)

/Edward

On Mon, Sep 29, 2008 at 7:21 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
>
>> The table is nothing more and nothing less than a matrix. So, we can
>> think about bulk load such as
>> http://wiki.apache.org/hadoop/Hbase/MapReduce
>
> Yes. MapReduce should be used to load a matrix.
> But still, if the matrix is huge (many rows, many columns), the loading
> will cause a lot of matrix-table split actions. Is that right?
>
>> And, I think we can provide some regular format to store the matrix,
>> such as the Hadoop SequenceFile format.
>
> That is great!
>
>> Then, file->matrix, matrix->file, matrix operations, ..., all done.
>>
>> /Edward
>>
>> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
>> > hi all,
>> >
>> > I am considering how to use map/reduce to bulk-load a matrix from a
>> > file.
>> >
>> > We can split the file and let many mappers load parts of it, but lots
>> > of region splits will happen during loading if the matrix is huge.
>> > That may hurt the matrix-load performance.
>> >
>> > I think a file that stores a matrix has a regular layout.
>> > Without compression, it may look like this:
>> >
>> > d11 d12 d13 .................... d1m
>> > d21 d22 d23 .................... d2m
>> > ....................................
>> > dn1 dn2 dn3 .................... dnm
>> >
>> > An optimization method would be:
>> > (1) Read a line from the matrix file to learn its row size; assume it
>> > is RS.
>> > (2) Get the file size from the filesystem's metadata; assume it is FS.
>> > (3) Compute the number of rows: N(R) = FS / RS.
>> > (4) Knowing the number of rows, we can estimate the number of regions
>> > of the matrix.
>> > Finally, we can pre-split the matrix's table in HBase and load the
>> > matrix in parallel without splitting again.
>> >
>> > Certainly, no one will store a matrix in a file exactly as above; some
>> > compression will be used for a dense or sparse matrix. But even with a
>> > compressed matrix file, we can still pay a small cost to estimate the
>> > number of regions of the matrix and gain a large performance
>> > improvement in matrix bulk loading.
>> >
>> > Am I right?
>> >
>> > regards,
>> >
>> > samuel
>>
>> --
>> Best regards, Edward J. Yoon
>> [EMAIL PROTECTED]
>> http://blog.udanax.org

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
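Samuel's four steps map almost directly onto the Hadoop FileSystem API.
Below is a minimal sketch, assuming an uncompressed single-byte-encoded
matrix file with fixed-width rows; the class name and the 256 MB region
size are illustrative, not from any Hama or HBase source.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionEstimator {

  // Rough region size in bytes; tune to the cluster's split threshold.
  static final long REGION_SIZE = 256L * 1024 * 1024;

  public static long estimateRegions(Configuration conf, Path matrixFile)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);

    // (2) FS: total file size from the filesystem's metadata.
    long fileSize = fs.getFileStatus(matrixFile).getLen();

    // (1) RS: size of one row, taken from the first line (+1 for '\n').
    // Char count equals byte count only for single-byte encodings.
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(matrixFile)));
    long rowSize = in.readLine().length() + 1;
    in.close();

    // (3) N(R) = FS / RS.
    long rows = fileSize / rowSize;

    // (4) Estimate regions from how many rows fit into one region.
    long rowsPerRegion = Math.max(1, REGION_SIZE / rowSize);
    return (rows + rowsPerRegion - 1) / rowsPerRegion;
  }
}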
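The pre-split itself could look like the sketch below. It is written
against a later HBase client API (HBaseAdmin.createTable with split keys)
than the one available when this thread was written, so treat it as an
illustration of the idea rather than of the 2008 API; the table and
column-family names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MatrixTableCreator {

  public static void createPreSplit(Configuration conf, String tableName,
      long rows, int regions) throws Exception {
    HTableDescriptor desc = new HTableDescriptor(tableName);
    desc.addFamily(new HColumnDescriptor("column"));

    // One split key per internal region boundary; row keys are assumed to
    // be zero-padded decimal row indices so they sort correctly as bytes.
    byte[][] splits = new byte[regions - 1][];
    long rowsPerRegion = rows / regions;
    for (int i = 1; i < regions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%010d", i * rowsPerRegion));
    }

    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // All regions exist before the first Put arrives, so the load never
      // blocks on a split.
      admin.createTable(desc, splits);
    } finally {
      admin.close();
    }
  }
}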
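For the load step, each mapper turns its slice of the text file into Puts,
in the spirit of the Hbase/MapReduce wiki page above. Again a sketch
against the later org.apache.hadoop.hbase.mapreduce API; the
matrix.row.size property is a made-up configuration key, and the job
would be wired to the table with TableMapReduceUtil.initTableReducerJob
plus zero reduce tasks.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixLoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("column");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The row index falls out of the byte offset and the fixed row size
    // RS, since the file layout is regular (see the estimation sketch).
    long rowSize = context.getConfiguration().getLong("matrix.row.size", 1);
    long row = offset.get() / rowSize;
    // Same zero-padded key scheme as the pre-split sketch above.
    byte[] rowKey = Bytes.toBytes(String.format("%010d", row));

    Put put = new Put(rowKey);
    String[] values = line.toString().trim().split("\\s+");
    for (int j = 0; j < values.length; j++) {
      put.add(FAMILY, Bytes.toBytes(j),
          Bytes.toBytes(Double.parseDouble(values[j])));
    }
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}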
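Finally, on the regular storage format: a sketch of writing a matrix as a
Hadoop SequenceFile, one record per row keyed by the row index. Using
Text for the row values keeps the sketch short; a real implementation
would more likely define a dedicated Writable row type.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MatrixSequenceFileWriter {

  public static void write(Configuration conf, Path out, double[][] matrix)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        // One record per matrix row: key = row index, value = the row as
        // space-separated values.
        StringBuilder row = new StringBuilder();
        for (double d : matrix[i]) row.append(d).append(' ');
        writer.append(new IntWritable(i), new Text(row.toString().trim()));
      }
    } finally {
      writer.close();
    }
  }
}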
