Hey. First off, I have only been playing around with HBase and Hadoop in school, so I have no in-depth knowledge of it.
I think you should not use HBase but just store the files in HDFS directly,
and then produce these HDF files with a map/reduce job in some way. Just my
2 cents. A rough, untested sketch of the kind of mapper and job I have in
mind is at the bottom of this mail, below the quoted message.

Cheers.

2010/4/15 Håkon Sagehaug <hakon.sageh...@googlemail.com>

> Hi
>
> Does anyone have any input on my question?
>
> Håkon
>
> 2010/4/9 Håkon Sagehaug <hakon.sageh...@googlemail.com>
>
> > Hi all,
> >
> > I work in a project where we need to deal with different types of biology
> > data. For the first case, where I'm now investigating whether HBase is
> > something we might use, the scenario is like this.
> >
> > The raw text data is public, so we can download it and store it as
> > regular files. The content looks like this:
> >
> > 1 2 3 4 5 6 7 8
> >
> > 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
> > 24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> > 24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> > 24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> > 24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >
> > One file is normally between 1-2 GB (20-30 million lines), and we have
> > between 23-60 files. The data is something called LD_data, if anyone is
> > interested. To store this better we've turned all these files into an HDF
> > file, which is a binary format; this can then be handed over to
> > applications using LD_data in the analysis of biology problems. The reason
> > why we're thinking of HBase for storing the raw text files is that we want
> > to offer users the ability to trigger the creation of these HDF files
> > themselves, with a cutoff value from one of the two last columns in the
> > file as input. Until now we've just turned the whole file into an HDF
> > file, and the application receiving the file deals with the cutoff. So a
> > "query" from a user who needs the lines with a value of column 6 > 0.1
> > gets
> >
> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> > 24915 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> > 24915 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> > 24915 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >
> > Is this something that sounds reasonable to use HBase for? I guess I could
> > also use Hadoop and do a map/reduce job, but I'm not sure how to define
> > the map and/or the reduce job for this. Would the best maybe be to go
> > through the files and map column 3, which can be looked at as a key, to a
> > list of its values over the cutoff? The map for the query above would then
> > be
> >
> > <rs2003280, {
> >   24915 50733 CHB rs4079417 1.0 0.130 0.09 0
> > }>
> >
> > <rs2003282, {
> >   24915 59354 CHB rs1500098 1.0 0.157 0.91 0,
> >   24915 61880 CHB rs11063263 1.0 0.157 0.91 0,
> >   24915 62481 CHB rs10774263 1.0 0.157 0.91 0
> > }>
> >
> > If HBase were used, I'm a bit unsure how the data would best be
> > structured; one way is to store one row per line in the file, but maybe
> > that's not the best. Maybe another one is something like this, for the
> > first line in the example above:
> >
> > rs2003280 {
> >   col1:24915 = 24915,
> >   col:31643 = 31643,
> >   col4:rs1500095 = rs1500095,
> >   col4:rs7299571 = rs7299571,
> >   col4:rs4079417 = rs4079417,
> >   value:1 = 1.0,
> >   value:2 = 0.0,
> >   value:3 = 0.02,
> >   value:4 = 0,
> > }
> >
> > As you all can see I've got some questions; I'm in the process of grasping
> > HBase/Hadoop concepts.
> >
> > cheers, Håkon
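
To make the map/reduce suggestion a bit more concrete: below is a minimal,
untested sketch of a mapper that keeps only the lines above the cutoff and
keys them on the rs id, roughly the <rsId, {lines...}> map described in the
quoted mail. I'm assuming whitespace-separated columns, that the cutoff
applies to column 6 counted from 0 (the 0.130/0.157 column in the example),
that the rs id is column 3 counted from 0, and the property name "ld.cutoff"
is just something I made up for passing the user's value in.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: filter LD lines by a user-supplied cutoff, keyed on the rs id.
public class LdCutoffMapper extends Mapper<LongWritable, Text, Text, Text> {

  private double cutoff;
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  @Override
  protected void setup(Context context) {
    // "ld.cutoff" is a made-up job property carrying the user's cutoff value
    cutoff = Double.parseDouble(context.getConfiguration().get("ld.cutoff", "0.1"));
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    if (fields.length < 9) {
      return; // skip header or malformed lines
    }
    double value = Double.parseDouble(fields[6]); // column the cutoff is applied to
    if (value <= cutoff) {
      return;
    }
    outKey.set(fields[3]); // rs id, e.g. rs2003280
    // rebuild the line without the key column, like the value lists in the example
    StringBuilder rest = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i == 3) {
        continue;
      }
      if (rest.length() > 0) {
        rest.append(' ');
      }
      rest.append(fields[i]);
    }
    outValue.set(rest.toString());
    context.write(outKey, outValue);
  }
}

A reducer could then gather all surviving lines per rs id and write whatever
the HDF conversion step needs, or you could skip the reduce step entirely and
just keep the filtered lines.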
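
And this is roughly how the job itself could be wired up so that the cutoff
the user typed in ends up in the mapper above. Again untested, with made-up
class names and argument order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: wires the filtering mapper into a map-only job.
public class LdCutoffJob {

  public static void main(String[] args) throws Exception {
    // expected args: <input dir with raw LD files> <output dir> <cutoff>
    Configuration conf = new Configuration();
    conf.set("ld.cutoff", args[2]); // picked up by LdCutoffMapper.setup()

    Job job = new Job(conf, "LD cutoff filter");
    job.setJarByClass(LdCutoffJob.class);
    job.setMapperClass(LdCutoffMapper.class);
    job.setNumReduceTasks(0); // map-only: just keep the lines above the cutoff
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The output is then plain text in HDFS that the HDF conversion can read; if
you want the per-rs-id grouping from your mail, add a reducer instead of
setting the number of reduce tasks to zero.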
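
If you do end up trying HBase anyway, the row-per-rs-id layout you sketch
would translate to puts along these lines. Only a sketch I haven't run: the
table name "ld_data" and the column families "col" and "value" are taken from
your example and would have to be created up front, and the exact client
constructors differ a bit between HBase versions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: store one LD line under its rs id, mirroring the quoted layout.
public class LdHBaseSketch {

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "ld_data");

    // first line of the example: 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
    String[] f = "24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0".split("\\s+");

    Put put = new Put(Bytes.toBytes(f[3])); // row key = rs id, e.g. rs2003280
    put.add(Bytes.toBytes("col"), Bytes.toBytes("1:" + f[0]), Bytes.toBytes(f[0]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("2:" + f[1]), Bytes.toBytes(f[1]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("3:" + f[2]), Bytes.toBytes(f[2]));
    put.add(Bytes.toBytes("col"), Bytes.toBytes("4:" + f[4]), Bytes.toBytes(f[4]));
    // the numeric columns the cutoff would run over
    put.add(Bytes.toBytes("value"), Bytes.toBytes("1"), Bytes.toBytes(f[5]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("2"), Bytes.toBytes(f[6]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("3"), Bytes.toBytes(f[7]));
    put.add(Bytes.toBytes("value"), Bytes.toBytes("4"), Bytes.toBytes(f[8]));

    table.put(put);
    table.close();
  }
}

One thing to watch with this layout: the value:1..value:4 qualifiers get
overwritten each time another line with the same rs id is written, so they
would probably need the position column in the qualifier as well; that is
part of why one row per file line may turn out simpler after all.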