The table is nothing more and nothing less than a matrix. So, we can think about a bulk load such as the one described at http://wiki.apache.org/hadoop/Hbase/MapReduce
And, I think we can provide some regular format to store the matrix, such as Hadoop's SequenceFile format. Then file->matrix, matrix->file, matrix operations, ..., all done.

/Edward

On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> hi all,
>
> I am considering how to use map/reduce to bulk-load a matrix from a file.
>
> We can split the file and let many mappers load parts of it, but many
> region splits will happen while loading if the matrix is huge, which may
> hurt the matrix-load performance.
>
> I think that a file that stores a matrix may be regular.
> Without compression, it may look like this:
> d11 d12 d13 .................... d1m
> d21 d22 d23 .................... d2m
> .............................................
> dn1 dn2 dn3 .................... dnm
>
> An optimization method would be:
> (1) Read a line from the matrix file; from it we know the row size. Assume it is RS.
> (2) We can get the file size from the filesystem's metadata. Assume it is FS.
> (3) We can compute the number of rows: N(R) = FS / RS.
> (4) If we know the number of rows, we can estimate the number of regions of the matrix.
> Finally, we can pre-split the matrix's table in HBase first and run the
> matrix loading in parallel without splitting again.
>
> Certainly, no one will store a matrix in a file exactly as above; some
> compression will be used to store a dense or sparse matrix.
> But even with a compressed matrix file, we can still pay a small cost to
> estimate the number of regions of the matrix and gain a large performance
> improvement for matrix bulk-loading.
>
> Am I right?
>
> regards,
>
> samuel

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
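
A minimal sketch of the row-per-record layout suggested above: each matrix row is stored as one SequenceFile record, keyed by its row index. The IntWritable/Text key-value choice and the class name MatrixSequenceFile are illustrative assumptions, not a format fixed by this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MatrixSequenceFile {

  /** matrix -> file: append each row as one record, keyed by row index. */
  public static void write(double[][] matrix, Path path, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        StringBuilder row = new StringBuilder();
        for (int j = 0; j < matrix[i].length; j++) {
          if (j > 0) row.append(' ');
          row.append(matrix[i][j]);
        }
        writer.append(new IntWritable(i), new Text(row.toString()));
      }
    } finally {
      writer.close();
    }
  }

  /** file -> matrix: read the records back; each mapper would read one split like this. */
  public static void read(Path path, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable row = new IntWritable();
      Text values = new Text();
      while (reader.next(row, values)) {
        String[] cells = values.toString().split(" ");
        // parse the cells and write the row into the HBase table here
      }
    } finally {
      reader.close();
    }
  }
}

With one record per row, each map task can read its own split of the SequenceFile and write the corresponding rows into the table, which is the bulk-load pattern pointed to above.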

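And a rough sketch of the estimate Samuel describes: read one row to learn RS, take FS from the filesystem metadata, and derive N(R) = FS / RS plus a region count. The 64 MB region-size figure and the rows-per-region arithmetic are assumptions made only for illustration; the thread does not fix them.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionEstimator {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path matrixFile = new Path(args[0]);

    // (1) read one line from the matrix file to learn the row size RS (plus newline)
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(matrixFile)));
    long rowSize = in.readLine().getBytes().length + 1;
    in.close();

    // (2) get the file size FS from the filesystem metadata
    long fileSize = fs.getFileStatus(matrixFile).getLen();

    // (3) number of rows N(R) = FS / RS
    long rows = fileSize / rowSize;

    // (4) estimate the number of regions, assuming (hypothetically) that a row
    // takes roughly rowSize bytes in the table and regions split at about 64 MB
    long regionSizeBytes = 64L * 1024 * 1024;
    long rowsPerRegion = Math.max(1, regionSizeBytes / rowSize);
    long regions = (rows + rowsPerRegion - 1) / rowsPerRegion;

    System.out.println("rows=" + rows + ", estimated regions=" + regions);
    // With these figures the table could be created pre-split at row keys
    // 0, rowsPerRegion, 2*rowsPerRegion, ... before the parallel load starts.
  }
}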