On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:

> The table is nothing more and nothing less than a matrix. So, we can
> think about a bulk load like the one described at
> http://wiki.apache.org/hadoop/Hbase/MapReduce


Yes, MapReduce should be used to load a matrix. But if the matrix is huge
(many rows, many columns), the load will still trigger a lot of region splits
on the matrix table. Is that right?

>
>
> And I think we can provide a regular format for storing the matrix,
> such as Hadoop's SequenceFile format.


It is great!


>
>
> Then, file->matrix, matrix->file, matrix operations,..., all done.
>
> /Edward
>
> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> > hi all,
> >
> > I am thinking about how to use map/reduce to bulk-load a matrix from a
> > file.
> >
> > We can split the file and let many mappers each load part of it. But
> > lots of region splits will happen during loading if the matrix is huge,
> > which may hurt load performance.
> >
> > I think a file that stores a matrix may have a regular layout.
> > Without compression, it may look like this:
> > d11 d12 d13 .................... d1m
> > d21 d22 d23 .................... d2m
> > .............................................
> > dn1 dn2 dn3......................dnm
> >
> > An optimization would be:
> > (1) Read one line from the matrix file to learn the row size. Assume
> > it is RS.
> > (2) Get the file size from the filesystem's metadata. Assume it is FS.
> > (3) Compute the number of rows: N(R) = FS/RS.
> > (4) Knowing the number of rows, estimate the number of regions of the
> > matrix.
> > Finally, we can pre-split the matrix's table in HBase and then load the
> > matrix in parallel without any further splitting.
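
The estimate in steps (1) to (4) can be sketched roughly as below. This is
only an illustration, not HBase code: the function names and the
max_region_bytes parameter are hypothetical, rows are assumed to have a
fixed length, row keys are assumed to be the row indices 0..N-1, and the
size of the data inside HBase is assumed to be about the same as in the
file (in practice the stored size would likely be larger).

```python
def estimate_rows(file_size: int, row_size: int) -> int:
    """Step (3): N(R) = FS / RS, rounded up to count a trailing partial row."""
    if row_size <= 0:
        raise ValueError("row_size must be positive")
    return -(-file_size // row_size)  # integer ceiling division


def estimate_regions(num_rows: int, row_size: int, max_region_bytes: int) -> int:
    """Step (4): number of regions needed if each holds at most max_region_bytes.

    max_region_bytes is a hypothetical stand-in for the configured region
    size limit; the file size is used as a proxy for the stored size.
    """
    if max_region_bytes <= 0:
        raise ValueError("max_region_bytes must be positive")
    total_bytes = num_rows * row_size
    return max(1, -(-total_bytes // max_region_bytes))


def split_rows(num_rows: int, num_regions: int) -> list[int]:
    """Row indices at which to pre-split the table into num_regions parts.

    Returns num_regions - 1 split points; each is the first row of a region.
    """
    return [(i * num_rows) // num_regions for i in range(1, num_regions)]


if __name__ == "__main__":
    rs = 100    # row size RS in bytes, learned from reading one line
    fs = 1000   # file size FS in bytes, from filesystem metadata
    n = estimate_rows(fs, rs)
    regions = estimate_regions(n, rs, max_region_bytes=300)
    print(n, regions, split_rows(n, regions))
```

The split points would then be turned into row keys and used to create the
table with its regions already in place before the mappers start loading.
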
> >
> > Certainly, no one will store a matrix in a file like the above; some
> > compression will be used to store a dense or sparse matrix.
> > But even for a compressed matrix file, we can still estimate the number
> > of regions at little cost and gain a performance improvement in
> > matrix bulk-loading.
> >
> > Am I right?
> >
> > regards,
> >
> > samuel
> >
>
>
>
> --
> Best regards, Edward J. Yoon
> [EMAIL PROTECTED]
> http://blog.udanax.org
>
