Hi,

Does anyone have any input on my question?
Håkon

2010/4/9 Håkon Sagehaug <hakon.sageh...@googlemail.com>:

Hi all,

I work on a project where we need to deal with different types of biology data. For the first case, where I'm now investigating whether HBase is something we might use, the scenario is like this.

The raw text data is public, so we can download it and store it as regular files. The content looks like this:

    1 2 3 4 5 6 7 8

    24915 31643 CHB rs2003280 rs1500095  1.0 0.0   0.02 0
    24915 36594 CHB rs2003280 rs7299571  1.0 0.025 0.21 0
    24915 50733 CHB rs2003280 rs4079417  1.0 0.130 0.09 0
    24916 59354 CHB rs2003282 rs1500098  1.0 0.157 0.91 0
    24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
    24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0

One file is normally 1-2 GB (20-30 million lines), and we have between 23 and 60 files. The data is something called LD data, if anyone is interested. To store this better we've turned all these files into an HDF file, which is a binary format; this can then be handed over to applications that use LD data in the analysis of biology problems. The reason we're thinking of HBase for storing the raw text files is that we want to let users trigger the creation of these HDF files themselves, with a cutoff value on one of the two last columns as input. So far we've just turned the whole file into an HDF file, and the application receiving the file deals with the cutoff. So a "query" from a user who needs the lines with a value of column 6 > 0.1 gets:

    24915 50733 CHB rs2003280 rs4079417  1.0 0.130 0.09 0
    24916 59354 CHB rs2003282 rs1500098  1.0 0.157 0.91 0
    24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
    24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0

Does this sound like something reasonable to use HBase for? I guess I could also use plain Hadoop and run a MapReduce job, but I'm not sure how to define the map and/or the reduce step for this. Would the best approach maybe be to go through the files and map column 3, which can be treated as a key, to the list of its lines with values over the cutoff? The map output for the query above would then be (rough mapper sketch at the end of this mail):

    <rs2003280, {
        24915 50733 CHB rs4079417 1.0 0.130 0.09 0
    }>

    <rs2003282, {
        24916 59354 CHB rs1500098  1.0 0.157 0.91 0,
        24916 61880 CHB rs11063263 1.0 0.157 0.91 0,
        24916 62481 CHB rs10774263 1.0 0.157 0.91 0
    }>

If HBase were used, I'm a bit unsure how the data would best be structured. One way is to store one HBase row per line in the file, but that's maybe not the best. Another option might be something like this, for the first line in the example above (loading sketch at the end of this mail):

    rs2003280 {
        col1:24915 = 24915,
        col2:31643 = 31643,
        col4:rs1500095 = rs1500095,
        col4:rs7299571 = rs7299571,
        col4:rs4079417 = rs4079417,
        value:1 = 1.0,
        value:2 = 0.0,
        value:3 = 0.02,
        value:4 = 0
    }

As you can all see I've got some questions; I'm still in the process of grasping the HBase and Hadoop concepts.

cheers,
Håkon
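P.S. To make the MapReduce idea a bit more concrete, here is a rough sketch of the mapper I had in mind (new org.apache.hadoop.mapreduce API). The class name, the column indexes, and the cutoff are just placeholders for the example; I haven't tested this against the real files:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits <rs-id, whole line> for every line whose score column
    // passes the cutoff; everything else is dropped in the map step.
    public class LdCutoffMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int KEY_COLUMN = 3;    // e.g. rs2003280 (0-based index)
        private static final int SCORE_COLUMN = 6;  // the column the cutoff applies to
        private static final double CUTOFF = 0.1;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] cols = line.toString().trim().split("\\s+");
            if (cols.length < 9) {
                return; // skip the header and malformed lines
            }
            if (Double.parseDouble(cols[SCORE_COLUMN]) > CUTOFF) {
                context.write(new Text(cols[KEY_COLUMN]), line);
            }
        }
    }

The reduce step would then just collect the emitted lines per rs-id, which matches the <key, {lines}> structure sketched above, so I think the default identity reducer would do.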
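P.P.S. For the HBase side, a rough sketch of how a single line could be loaded into a layout like the one above. Here the row key is the rs-id and one column family holds one cell per paired rs-id, with the rest of the line as the cell value, so lines sharing the same key don't overwrite each other (with the value:1, value:2, ... qualifiers above, each new line for the same rs-id would overwrite the previous one's values). The table name "ld_data" and the family name "col" are made up for the example, and the table would have to be created with that family first:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LdLineLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "ld_data");

            // First line of the example:
            // 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
            // Row key = rs-id, qualifier = paired rs-id, value = the rest.
            Put put = new Put(Bytes.toBytes("rs2003280"));
            put.add(Bytes.toBytes("col"), Bytes.toBytes("rs1500095"),
                    Bytes.toBytes("24915 31643 CHB 1.0 0.0 0.02 0"));
            table.put(put);
            table.close();
        }
    }

With that layout, the cutoff query could be a scan with a filter on the cell values, but maybe someone here knows a better way to structure it.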