On Mon, Sep 29, 2008 at 7:36 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> > but still, if the matrix is huge (many rows, many columns), the loading
> > will cause a lot of matrix-table split actions. Is it right?
>
> Yes, but
>
> >> finally, we can split the matrix's table in hbase first and let the
> >> matrix load in parallel without splitting again.
>
> I don't understand exactly. Do you mean creating tablets directly by
> pre-splitting and assigning them to region servers?
>
> Then, this is a role of hbase. The merge/split will be issued after
> compaction. I guess it will be the same as the hbase compaction
> mechanism.

yes. it is the role of hbase.

> /Edward
>
> On Mon, Sep 29, 2008 at 7:21 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> > On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> >
> >> The table is nothing more and nothing less than a matrix. So, we can
> >> think about bulk load such as
> >> http://wiki.apache.org/hadoop/Hbase/MapReduce
> >
> > Yes. MapReduce should be used to load a matrix.
> > but still, if the matrix is huge (many rows, many columns), the loading
> > will cause a lot of matrix-table split actions. Is it right?
> >
> >> And, I think we can provide some regular format to store the matrix,
> >> such as hadoop's SequenceFileFormat.
> >
> > It is great!
> >
> >> Then, file->matrix, matrix->file, matrix operations, ..., all done.
> >>
> >> /Edward
> >>
> >> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> >> > hi all,
> >> >
> >> > I am considering how to use map/reduce to bulk-load a matrix from a
> >> > file.
> >> >
> >> > we can split the file and let many mappers load parts of the file.
> >> > but lots of region splits will happen while loading if the matrix is
> >> > huge. It may hurt the matrix load performance.
> >> >
> >> > I think that a file that stores a matrix may be regular.
> >> > without compression, it may look like this:
> >> >
> >> > d11 d12 d13 .................... d1m
> >> > d21 d22 d23 .................... d2m
> >> > .............................................
> >> > dn1 dn2 dn3 .................... dnm
> >> >
> >> > An optimization method would be:
> >> > (1) read a line from the matrix file to learn its row size. assume
> >> > it is RS.
> >> > (2) get the file size from the filesystem's metadata. assume it is
> >> > FS.
> >> > (3) then we can compute the number of rows: N(R) = FS/RS.
> >> > (4) if we know the number of rows, we can estimate the number of
> >> > regions of the matrix.
> >> > finally, we can split the matrix's table in hbase first and let the
> >> > matrix load in parallel without splitting again.
> >> >
> >> > certainly, no one will store a matrix as above in a file. some
> >> > compression will be used to store a dense or sparse matrix.
> >> > but even with a compressed matrix file, we can still pay a small
> >> > cost to estimate the number of regions of the matrix and gain a
> >> > performance improvement in matrix bulk-loading.
> >> >
> >> > Am I right?
> >> >
> >> > regards,
> >> >
> >> > samuel
> >>
> >> --
> >> Best regards, Edward J. Yoon
> >> [EMAIL PROTECTED]
> >> http://blog.udanax.org
>
> --
> Best regards, Edward J. Yoon
> [EMAIL PROTECTED]
> http://blog.udanax.org
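To make Samuel's four-step estimation concrete, here is a minimal sketch
against the Hadoop FileSystem API. It assumes an uncompressed, fixed-width
text matrix exactly as in his example; the class name and the rowsPerRegion
parameter are illustrative only, not anything from Hama or HBase.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionEstimator {

  /**
   * Steps (1)-(4) from the thread: RS = bytes in the first row plus its
   * newline, FS = total file size, N(R) = FS / RS, then
   * regions = N(R) / rowsPerRegion.
   */
  public static long estimateRegions(Configuration conf, Path matrixFile,
      long rowsPerRegion) throws IOException {
    FileSystem fs = matrixFile.getFileSystem(conf);

    // (2) file size FS from the filesystem's metadata.
    long fileSize = fs.getFileStatus(matrixFile).getLen();

    // (1) read one line to learn the row size RS (+1 for the newline).
    FSDataInputStream in = fs.open(matrixFile);
    long rowSize;
    try {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      rowSize = reader.readLine().length() + 1;
    } finally {
      in.close();
    }

    // (3) N(R) = FS / RS, rounded up.
    long numRows = (fileSize + rowSize - 1) / rowSize;

    // (4) estimate regions; rowsPerRegion is however many fixed-width
    // rows fit under the configured region size threshold.
    return (numRows + rowsPerRegion - 1) / rowsPerRegion;
  }
}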
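The pre-splitting step Edward asks about maps onto the split-keys overload
of table creation that later HBase clients expose: HBaseAdmin.createTable(desc,
splitKeys), available from roughly HBase 0.90, i.e. after this thread. A
sketch under that assumption; the "column" family name and the zero-padded
row keys are also assumptions, not anything the thread specifies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {

  /**
   * Creates the matrix table already divided into numRegions regions
   * (numRegions > 1), so the bulk load triggers no splits at all.
   */
  public static void create(String tableName, long numRows, int numRegions)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes(tableName));
    desc.addFamily(new HColumnDescriptor("column"));

    // One split key per internal region boundary. Row keys are
    // zero-padded row indexes so they sort in numeric order.
    long rowsPerRegion = numRows / numRegions;
    byte[][] splitKeys = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splitKeys[i - 1] =
          Bytes.toBytes(String.format("%010d", i * rowsPerRegion));
    }
    admin.createTable(desc, splitKeys);
  }
}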
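For the load itself, the Hbase/MapReduce wiki pattern is a map-only job
whose mapper writes Puts through TableOutputFormat. A sketch against the
newer org.apache.hadoop.hbase.mapreduce client API (again later than this
thread): deriving the row index from the input byte offset only works under
the fixed-width-row assumption, and the matrix.row.size key and column
family name are invented for illustration.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Parses one text row "d1 d2 ... dm" of the matrix and writes it as one
 * HBase row; with the table pre-split as above, the mappers load in
 * parallel without causing any region splits.
 */
public class MatrixLoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Derive the row index from the byte offset, assuming fixed-width
    // rows of matrix.row.size bytes (the RS estimated earlier).
    long rowSize = context.getConfiguration().getLong("matrix.row.size", 1);
    String rowKey = String.format("%010d", offset.get() / rowSize);

    Put put = new Put(Bytes.toBytes(rowKey));
    String[] values = line.toString().trim().split("\\s+");
    for (int j = 0; j < values.length; j++) {
      // One cell per matrix column: column:<j> -> d(i,j)
      put.add(Bytes.toBytes("column"), Bytes.toBytes(String.valueOf(j)),
          Bytes.toBytes(values[j]));
    }
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}

The job would be wired up with TableMapReduceUtil.initTableReducerJob(tableName,
null, job) plus job.setNumReduceTasks(0), so map output goes straight to the
table.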
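Edward's "regular format" could be as simple as one SequenceFile record per
matrix row: the key is the row index and the value holds column -> value
pairs, which also covers sparse rows by omitting zeros. The MapWritable
encoding below is only a placeholder guess, not Hama's actual on-disk format.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;

public class MatrixSequenceFile {

  /**
   * Writes a dense matrix to a SequenceFile: key = row index, value =
   * map of column index -> cell value.
   */
  public static void write(Configuration conf, Path out, double[][] matrix)
      throws Exception {
    FileSystem fs = out.getFileSystem(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, MapWritable.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        MapWritable row = new MapWritable();
        for (int j = 0; j < matrix[i].length; j++) {
          row.put(new IntWritable(j), new DoubleWritable(matrix[i][j]));
        }
        writer.append(new IntWritable(i), row);
      }
    } finally {
      writer.close();
    }
  }
}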
