Hey. First off, I have only been playing around with HBase and Hadoop in school, so I have no in-depth knowledge of it.
I think you should not use HBase but just store the files in HDFS directly,
and then produce these HDF files with a map/reduce job in some way. Just my
2 cents. A rough, untested sketch of the kind of mapper and job I have in
mind is at the bottom of this mail, below the quoted message.

Cheers.

2010/4/15 Håkon Sagehaug <hakon.sageh...@googlemail.com>

> Hi
>
> Does anyone have any input on my question?
>
> Håkon
>
> 2010/4/9 Håkon Sagehaug <hakon.sageh...@googlemail.com>
>
> > Hi all,
> >
> > I work in a project where we need to deal with different types of biology
> > data. For the first case, where I'm now investigating whether HBase is
> > something we might use, the scenario is like this.
> >
> > The raw text data is public, so we can download it and store it as
> > regular files. The content looks like this:
> >
> > 1 2 3 4 5 6 7 8
> >
> > 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
> > 24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> > 24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> > 24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> > 24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >
> > One file is normally between 1-2 GB (20-30 million lines), and we have
> > between 23-60 files. The data is something called LD_data, if anyone is
> > interested. To store this better we've turned all these files into an HDF
> > file, which is a binary format; this can then be handed over to
> > applications using LD_data in the analysis of biology problems. The reason
> > why we're thinking of HBase for storing the raw text files is that we want
> > to offer users the ability to trigger the creation of these HDF files
> > themselves, with a cutoff value from one of the two last columns in the
> > file as input. Until now we've just turned the whole file into an HDF
> > file, and the application receiving the file deals with the cutoff. So a
> > "query" from a user who needs the lines with a value of column 6 > 0.1
> > gets
> >
> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> > 24915 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> > 24915 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> > 24915 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >
> > Is this something that sounds reasonable to use HBase for? I guess I could
> > also use Hadoop and do a map/reduce job, but I'm not sure how to define
> > the map and/or the reduce job for this. Would the best maybe be to go
> > through the files and map column 3, which can be looked at as a key, to a
> > list of its values over the cutoff? The map for the query above would then
> > be
> >
> > <rs2003280, {
> >   24915 50733 CHB rs4079417 1.0 0.130 0.09 0
> > }>
> >
> > <rs2003282, {
> >   24915 59354 CHB rs1500098 1.0 0.157 0.91 0,
> >   24915 61880 CHB rs11063263 1.0 0.157 0.91 0,
> >   24915 62481 CHB rs10774263 1.0 0.157 0.91 0
> > }>
> >
> > If HBase were used, I'm a bit unsure how the data would best be
> > structured; one way is to store one row per line in the file, but maybe
> > that's not the best. Maybe another one is something like this, for the
> > first line in the example above:
> >
> > rs2003280 {
> >   col1:24915 = 24915,
> >   col:31643 = 31643,
> >   col4:rs1500095 = rs1500095,
> >   col4:rs7299571 = rs7299571,
> >   col4:rs4079417 = rs4079417,
> >   value:1 = 1.0,
> >   value:2 = 0.0,
> >   value:3 = 0.02,
> >   value:4 = 0,
> > }
> >
> > As you all can see I've got some questions; I'm in the process of grasping
> > HBase/Hadoop concepts.
> >
> > cheers, Håkon
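
To make the map/reduce suggestion a bit more concrete: below is a minimal,
untested sketch of a mapper that keeps only the lines above the cutoff and
keys them on the rs id, roughly the <rsId, {lines...}> map described in the
quoted mail. I'm assuming whitespace-separated columns, that the cutoff
applies to column 6 counted from 0 (the 0.130/0.157 column in the example),
that the rs id is column 3 counted from 0, and the property name "ld.cutoff"
is just something I made up for passing the user's value in.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: filter LD lines by a user-supplied cutoff, keyed on the rs id.
public class LdCutoffMapper extends Mapper<LongWritable, Text, Text, Text> {

  private double cutoff;
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  @Override
  protected void setup(Context context) {
    // "ld.cutoff" is a made-up job property carrying the user's cutoff value
    cutoff = Double.parseDouble(context.getConfiguration().get("ld.cutoff", "0.1"));
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    if (fields.length < 9) {
      return; // skip header or malformed lines
    }
    double value = Double.parseDouble(fields[6]); // column the cutoff is applied to
    if (value <= cutoff) {
      return;
    }
    outKey.set(fields[3]); // rs id, e.g. rs2003280
    // rebuild the line without the key column, like the value lists in the example
    StringBuilder rest = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i == 3) {
        continue;
      }
      if (rest.length() > 0) {
        rest.append(' ');
      }
      rest.append(fields[i]);
    }
    outValue.set(rest.toString());
    context.write(outKey, outValue);
  }
}

A reducer could then gather all surviving lines per rs id and write whatever
the HDF conversion step needs, or you could skip the reduce step entirely and
just keep the filtered lines.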
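
And this is roughly how the job itself could be wired up so that the cutoff
the user typed in ends up in the mapper above. Again untested, with made-up
class names and argument order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: wires the filtering mapper into a map-only job.
public class LdCutoffJob {

  public static void main(String[] args) throws Exception {
    // expected args: <input dir with raw LD files> <output dir> <cutoff>
    Configuration conf = new Configuration();
    conf.set("ld.cutoff", args[2]); // picked up by LdCutoffMapper.setup()

    Job job = new Job(conf, "LD cutoff filter");
    job.setJarByClass(LdCutoffJob.class);
    job.setMapperClass(LdCutoffMapper.class);
    job.setNumReduceTasks(0); // map-only: just keep the lines above the cutoff
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The output is then plain text in HDFS that the HDF conversion can read; if
you want the per-rs-id grouping from your mail, add a reducer instead of
setting the number of reduce tasks to zero.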
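
If you do end up trying HBase anyway, the row-per-rs-id layout you sketch
would translate to puts along these lines. Only a sketch I haven't run: the
table name "ld_data" and the column families "col" and "value" are taken from
your example and would have to be created up front, and the exact client
constructors differ a bit between HBase versions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: store one LD line under its rs id, mirroring the quoted layout.
public class LdHBaseSketch {

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "ld_data");

    // first line of the example: 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
    String[] f = "24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0".split("\\s+");

    Put put = new Put(Bytes.toBytes(f[3])); // row key = rs id, e.g. rs2003280
    put.add(Bytes.toBytes("col"), Bytes.toBytes("1:" + f[0]), Bytes.toBytes(f[0]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("2:" + f[1]), Bytes.toBytes(f[1]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("3:" + f[2]), Bytes.toBytes(f[2]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("4:" + f[4]), Bytes.toBytes(f[4]));
    // the numeric columns the cutoff would run over
    put.add(Bytes.toBytes("value"), Bytes.toBytes("1"), Bytes.toBytes(f[5]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("2"), Bytes.toBytes(f[6]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("3"), Bytes.toBytes(f[7]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("4"), Bytes.toBytes(f[8]));

    table.put(put);
    table.close();
  }
}

One thing to watch with this layout: the value:1..value:4 qualifiers get
overwritten each time another line with the same rs id is written, so they
would probably need the position column in the qualifier as well; that is
part of why one row per file line may turn out simpler after all.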