Hey St.Ack: Thank you for your reply.
I chose to start with HBase after getting the answer to my original post on
the hadoop list :) As of now I use two fields to form a composite key; the
other fields are organized into one column family. I will discuss with my
manager and see what she thinks about getting more nodes to continue the
testing.

Thanks!
Xueling

On Thu, Dec 17, 2009 at 4:40 PM, stack <[email protected]> wrote:

> Hey Xueling:
>
> Now I notice that you are the fellow who recently wrote up on the hadoop
> list.
>
> Todd's described scheme won't work for you then, I take it? There'd be
> fewer moving parts for sure.
>
> Up on the hadoop list you gave a description of your records as so:
>
> "1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24
>
> "The highlighted field is called 'position of match' and the query we are
> interested in is the # of sequences in a certain range of this 'position
> of match'. For instance the range can be 'position of match' > 200 and
> 'position of match' + 36 < 200,000."
>
> What are you thinking regarding the row key? Will each of the fields
> above be concatenated as the row key, or will they each be individual
> columns, all in one column family or in many?
>
> I'd suggest you get some subset of your dataset, say a million records or
> so. This should load into a single HBase node fine. Use this small
> dataset to figure out the schema that best serves the way you'll be
> querying the data.
>
> If you can get away with a single family, work on writing an import that
> writes hfiles directly:
>
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
>
> It'll run an order of magnitude or more faster than going via the API.
>
> Now, as to the size of the cluster, see the presentations section where
> Ryan describes the hardware used loading up a 9B-row table. His hardware
> might be more than you need. I'd suggest you start with 4 or 5 nodes and
> see how loading goes. Check query latency.
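[Editor's aside on the two-field composite key mentioned above: the row in the exception down-thread looks like a fixed-width binary key, so here is a minimal plain-Java sketch of one way such a key could be encoded. The field names and widths (a 4-byte sequence id plus a 4-byte "position of match") are illustrative assumptions, not Xueling's actual schema.]

```java
import java.nio.ByteBuffer;

// Sketch: encode two integer fields as a fixed-width, big-endian row key.
// For non-negative values, big-endian encoding makes lexicographic byte
// order agree with numeric order, which is what makes HBase range scans
// over the key meaningful.
public class CompositeKey {

    public static byte[] encode(int sequenceId, int positionOfMatch) {
        return ByteBuffer.allocate(8)
                .putInt(sequenceId)       // bytes 0-3: first key field
                .putInt(positionOfMatch)  // bytes 4-7: second key field
                .array();
    }

    // Unsigned lexicographic byte comparison, the order HBase uses for rows.
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] key = encode(1, 103570835);
        System.out.println(key.length);          // 8
        byte[] lo = encode(1, 200);
        byte[] hi = encode(1, 199964);
        System.out.println(compare(lo, hi) < 0); // true
    }
}
```

In the real table these bytes would be passed to `new Put(rowKey)`; the point of the sketch is only that fixed-width big-endian fields keep the sort order scan-friendly.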
> If the numbers are not to your liking, add more nodes. HBase generally
> scales linearly.
>
> Hope this helps,
> St.Ack
>
> On Thu, Dec 17, 2009 at 4:00 PM, Xueling Shu <[email protected]> wrote:
>
> > Hi St.Ack:
> >
> > Wondering how many nodes in a cluster you recommend to hold 5B records?
> > Eventually we need to handle X times 5B records. I want to get an idea
> > of how many resources we need.
> >
> > Thanks,
> > Xueling
> >
> > On Thu, Dec 17, 2009 at 3:45 PM, stack <[email protected]> wrote:
> >
> > > Hey Xueling, 5B into a single node ain't going to work. Get yourself
> > > a bit of a cluster somewhere. Single node is for messing around, not
> > > for doing 'real' stuff.
> > >
> > > St.Ack
> > >
> > > On Thu, Dec 17, 2009 at 3:29 PM, stack <[email protected]> wrote:
> > >
> > > > On Thu, Dec 17, 2009 at 2:38 PM, Xueling Shu <[email protected]> wrote:
> > > >
> > > >> Things started fine until 5 mins after the data population started.
> > > >>
> > > >> Here is the exception:
> > > >>
> > > >> Exception in thread "main"
> > > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> > > >> region server 10.0.176.64:39045 for region Genome,,1261087437258, row
> > > >> '\x00\x00\x00\x00\x0E\xB00\xAC\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00s\xAD',
> > > >> but failed after 10 attempts.
> > > >> Exceptions:
> > > >> java.io.IOException: java.io.IOException: Server not running, aborting
> > > >
> > > > See why it quit by looking in the regionserver log.
> > > >
> > > > Make sure you have the latest HBase and read the 'Getting Started'
> > > > section.
> > > > St.Ack
> > > >
> > > >> at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2347)
> > > >> at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1826)
> > > >> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > >> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >> at java.lang.reflect.Method.invoke(Method.java:597)
> > > >> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
> > > >> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> > > >>
> > > >> java.net.ConnectException: Connection refused
> > > >> [the same "Connection refused" line repeated, nine times in total]
> > > >>
> > > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1002)
> > > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
> > > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
> > > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
> > > >> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
> > > >> at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470)
> > > >> at HadoopTrigger.populateData(HadoopTrigger.java:126)
> > > >> at HadoopTrigger.main(HadoopTrigger.java:52)
> > > >>
> > > >> Can anybody let me know how to fix it?
> > > >>
> > > >> Thanks,
> > > >> Xueling
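[Editor's aside on the range query discussed up-thread ("position of match" > 200 and "position of match" + 36 < 200,000, i.e. positions in the open interval (200, 199964)): if "position of match" were the leading fixed-width field of the row key (an assumption about the schema, not something the thread confirms), the scan boundaries could be derived as in this plain-Java sketch.]

```java
import java.nio.ByteBuffer;

// Sketch: derive start/stop row prefixes for a range scan on a key whose
// leading field is a 4-byte big-endian "position of match".
public class ScanRange {

    public static byte[] positionPrefix(int position) {
        return ByteBuffer.allocate(4).putInt(position).array();
    }

    public static void main(String[] args) {
        // HBase scan start rows are inclusive and stop rows exclusive, so:
        byte[] startRow = positionPrefix(201);    // first position > 200
        byte[] stopRow  = positionPrefix(199964); // first position NOT < 199,964
        // These would feed a scan, roughly:  Scan s = new Scan(startRow, stopRow);
        // with matching rows then counted client-side or in a MapReduce job.
        System.out.println(startRow.length + " " + stopRow.length);
    }
}
```

The design point is that with the position leading the key, the "# of sequences in a range" question becomes a contiguous scan instead of a full-table filter.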
