Thanks again for the responses. Stack: What's the issue with 25 families? I will mostly be accessing HBase as a map-reduce source and will be looking at ~5 column families at a time. Is there any documentation on column family limits in practice?
-Calvin

On Mon, Nov 30, 2009 at 7:38 PM, stack <[email protected]> wrote:

> There is little art to the HFileOutputFormat. You could play with it to
> make it support multiple column families (as soon as any family hits the
> region size boundary, start up a new set of files). If keys are already
> sorted, you could hook up HFileOutputFormat to the map (I think) and avoid
> the mapreduce framework sort.
>
> 25 families is probably too many for current hbase -- depends on how
> you'll be accessing them.
>
> Can you put your input files under an http server and then write a
> mapreduce that pulls via HTTP?
>
> St.Ack
>
> On Mon, Nov 30, 2009 at 3:33 PM, Calvin <[email protected]> wrote:
>
> > Thanks for the responses. If I can avoid writing a map-reduce job, that
> > would be preferable (getting map-reduce to work with / depend on my
> > existing infrastructure is turning out to be annoying).
> >
> > I have no good way of randomizing my dataset since it's a very large
> > stream of sequential data (ordered by some key). I have a fair number of
> > column families (~25), and every column is a long or a double. A
> > standalone program that writes rows using the HTable / Put API seems to
> > run at ~2,000-5,000 rows/sec, which seems ridiculously slow. Is it
> > possible I am doing something terribly wrong?
> >
> > -Calvin
> >
> > On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
> >
> > > Sequentially ordered rows are the worst insert case in HBase -- you
> > > end up writing everything to 1 server even if you have 500. If you can
> > > randomize your input (I have pasted a Randomize.java map reduce that
> > > will randomize the lines of a file), then your performance will
> > > improve.
> > >
> > > I have seen sustained inserts of 100-300k rows/sec on small rows
> > > before. Obviously large blob rows will be slower, since the limiting
> > > factor is how fast we can write data to HDFS; it isn't the actual row
> > > count but the amount of data involved.
> > >
> > > Try the Randomize.java and see where that gets you. I think it's on
> > > the list archives.
> > >
> > > -ryan
> > >
> > > On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > Could you put your data in HDFS and load it from there with a
> > > > MapReduce job?
> > > >
> > > > J-D
> > > >
> > > > On Mon, Nov 30, 2009 at 2:33 PM, Calvin <[email protected]> wrote:
> > > > > I have a large amount of sequentially ordered rows I would like to
> > > > > write to an HBase table. What is the preferred way to do bulk
> > > > > writes of multi-column tables in HBase? Using the get/put
> > > > > interface seems fairly slow even if I batch writes with
> > > > > table.put(List<Put>).
> > > > >
> > > > > I have followed the directions on:
> > > > > * http://wiki.apache.org/hadoop/PerformanceTuning
> > > > > * http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > > > >
> > > > > Are there any other resources for improving the throughput of my
> > > > > bulk writes? On
> > > > > http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> > > > > I see there's a way to write HFiles directly, but HFileOutputFormat
> > > > > can only write a single column family at a time
> > > > > (https://issues.apache.org/jira/browse/HBASE-1861).
> > > > >
> > > > > Thanks!
> > > > >
> > > > > -Calvin
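
A minimal sketch of the batched HTable / Put writer discussed in this thread, against the 0.20-era client API. The table name ("datatable"), family ("f1"), qualifier ("v"), buffer size, and batch size are made-up placeholders, not anything from the thread. Turning off auto-flush so Puts are buffered client-side instead of being sent as soon as they are submitted is the usual first fix for a slow single-client loader.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutWriter {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "datatable");      // hypothetical table name

    // Buffer edits client-side instead of issuing one RPC per Put.
    table.setAutoFlush(false);
    table.setWriteBufferSize(12 * 1024 * 1024);        // 12 MB; tune for your row size

    byte[] family = Bytes.toBytes("f1");               // hypothetical family
    byte[] qualifier = Bytes.toBytes("v");             // hypothetical qualifier

    List<Put> batch = new ArrayList<Put>(1000);
    for (long i = 0; i < 1000000L; i++) {
      Put put = new Put(Bytes.toBytes(i));             // row key
      put.add(family, qualifier, Bytes.toBytes((double) i));
      batch.add(put);
      if (batch.size() == 1000) {
        table.put(batch);                              // still buffered client-side
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits();                              // push whatever is still buffered
  }
}

This only helps on the client side; with sequentially ordered keys, all of the load still lands on a single region server, which is the bottleneck Ryan describes in the thread.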
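
Ryan's Randomize.java is not reproduced in this message, so the following is only a guess at its shape: a plain Hadoop job (0.20 mapreduce API) that tags each input line with a random key so the shuffle scatters the original ordering, then drops the key on output. Class and job names here are invented.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RandomizeLines {

  // Tag each line with a random key so the shuffle scatters the input order.
  static class RandomKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Random random = new Random();
    private final LongWritable outKey = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      outKey.set(random.nextLong());
      context.write(outKey, line);
    }
  }

  // Drop the random key and emit the lines in their new (shuffled) order.
  static class DropKeyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> lines, Context context)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "randomize-lines");
    job.setJarByClass(RandomizeLines.class);
    job.setMapperClass(RandomKeyMapper.class);
    job.setReducerClass(DropKeyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Loading the shuffled output then spreads writes across region servers instead of hammering one region at a time.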
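
For the write-HFiles-directly route from the package summary Calvin links, here is a rough, heavily hedged sketch of a single-family, single-reducer job using org.apache.hadoop.hbase.mapreduce.HFileOutputFormat as it existed around this time. The input layout ("rowkey<TAB>value" text lines), family/qualifier names, and class names are assumptions; HBASE-1861 means one column family per job, and a single reducer is the simplest way to keep the output totally ordered by row key. The generated files still have to be loaded into the table afterwards, as described in the package summary above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileWriteJob {

  // Assumed input: "rowkey<TAB>value" text lines, one column family only.
  static class ToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    private static final byte[] FAMILY = Bytes.toBytes("f1");   // hypothetical family
    private static final byte[] QUALIFIER = Bytes.toBytes("v"); // hypothetical qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "hfile-write");
    job.setJarByClass(HFileWriteJob.class);
    job.setMapperClass(ToKeyValueMapper.class);
    // KeyValueSortReducer sorts the KeyValues within each row; one reducer
    // keeps the whole output in a single, totally ordered key range.
    job.setReducerClass(KeyValueSortReducer.class);
    job.setNumReduceTasks(1);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}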
