Hi,

Does anyone have any input on my question?
Håkon

2010/4/9 Håkon Sagehaug <hakon.sageh...@googlemail.com>:

Hi all,

I work on a project where we need to deal with different types of biology data. For the first case, where I'm now investigating whether HBase is something we might use, the scenario is like this.

The raw text data is public, so we can download it and store it as regular files. The content looks like this:

    1 2 3 4 5 6 7 8

    24915 31643 CHB rs2003280 rs1500095  1.0 0.0   0.02 0
    24915 36594 CHB rs2003280 rs7299571  1.0 0.025 0.21 0
    24915 50733 CHB rs2003280 rs4079417  1.0 0.130 0.09 0
    24916 59354 CHB rs2003282 rs1500098  1.0 0.157 0.91 0
    24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
    24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0

One file is normally 1-2 GB (20-30 million lines), and we have between 23 and 60 files. The data is something called LD data, if anyone is interested. To store this better we've turned all these files into an HDF file, which is a binary format; this can then be handed over to applications that use LD data in the analysis of biology problems. The reason we're thinking of HBase for storing the raw text files is that we want to let users trigger the creation of these HDF files themselves, with a cutoff value on one of the two last columns as input. So far we've just turned the whole file into an HDF file, and the application receiving the file deals with the cutoff. So a "query" from a user who needs the lines with a value of column 6 > 0.1 gets:

    24915 50733 CHB rs2003280 rs4079417  1.0 0.130 0.09 0
    24916 59354 CHB rs2003282 rs1500098  1.0 0.157 0.91 0
    24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
    24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0

Does this sound like something reasonable to use HBase for? I guess I could also use plain Hadoop and run a MapReduce job, but I'm not sure how to define the map and/or the reduce step for this. Would the best approach maybe be to go through the files and map column 3, which can be treated as a key, to the list of its lines with values over the cutoff? The map output for the query above would then be (rough mapper sketch at the end of this mail):

    <rs2003280, {
        24915 50733 CHB rs4079417 1.0 0.130 0.09 0
    }>

    <rs2003282, {
        24916 59354 CHB rs1500098  1.0 0.157 0.91 0,
        24916 61880 CHB rs11063263 1.0 0.157 0.91 0,
        24916 62481 CHB rs10774263 1.0 0.157 0.91 0
    }>

If HBase were used, I'm a bit unsure how the data would best be structured. One way is to store one HBase row per line in the file, but that's maybe not the best. Another option might be something like this, for the first line in the example above (loading sketch at the end of this mail):

    rs2003280 {
        col1:24915 = 24915,
        col2:31643 = 31643,
        col4:rs1500095 = rs1500095,
        col4:rs7299571 = rs7299571,
        col4:rs4079417 = rs4079417,
        value:1 = 1.0,
        value:2 = 0.0,
        value:3 = 0.02,
        value:4 = 0
    }

As you can all see I've got some questions; I'm still in the process of grasping the HBase and Hadoop concepts.

cheers,
Håkon
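P.S. To make the MapReduce idea a bit more concrete, here is a rough sketch of the mapper I had in mind (new org.apache.hadoop.mapreduce API). The class name, the column indexes, and the cutoff are just placeholders for the example; I haven't tested this against the real files:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits <rs-id, whole line> for every line whose score column
    // passes the cutoff; everything else is dropped in the map step.
    public class LdCutoffMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int KEY_COLUMN = 3;    // e.g. rs2003280 (0-based index)
        private static final int SCORE_COLUMN = 6;  // the column the cutoff applies to
        private static final double CUTOFF = 0.1;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] cols = line.toString().trim().split("\\s+");
            if (cols.length < 9) {
                return; // skip the header and malformed lines
            }
            if (Double.parseDouble(cols[SCORE_COLUMN]) > CUTOFF) {
                context.write(new Text(cols[KEY_COLUMN]), line);
            }
        }
    }

The reduce step would then just collect the emitted lines per rs-id, which matches the <key, {lines}> structure sketched above, so I think the default identity reducer would do.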
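P.P.S. For the HBase side, a rough sketch of how a single line could be loaded into a layout like the one above. Here the row key is the rs-id and one column family holds one cell per paired rs-id, with the rest of the line as the cell value, so lines sharing the same key don't overwrite each other (with the value:1, value:2, ... qualifiers above, each new line for the same rs-id would overwrite the previous one's values). The table name "ld_data" and the family name "col" are made up for the example, and the table would have to be created with that family first:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LdLineLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "ld_data");

            // First line of the example:
            // 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
            // Row key = rs-id, qualifier = paired rs-id, value = the rest.
            Put put = new Put(Bytes.toBytes("rs2003280"));
            put.add(Bytes.toBytes("col"), Bytes.toBytes("rs1500095"),
                    Bytes.toBytes("24915 31643 CHB 1.0 0.0 0.02 0"));
            table.put(put);
            table.close();
        }
    }

With that layout, the cutoff query could be a scan with a filter on the cell values, but maybe someone here knows a better way to structure it.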