Thanks for the responses. If I can avoid writing a map-reduce job, that
would be preferable (getting map-reduce to work with, and depend on, my
existing infrastructure is turning out to be annoying).
I have no good way of randomizing my dataset since it's a very large
stream of sequential data (ordered by some key). I have a fair number
of column families (~25), and every column is a long or a double.

A standalone program that writes rows using the HTable / Put API runs
at roughly 2,000-5,000 rows/sec, which seems ridiculously slow. Is it
possible I am doing something terribly wrong? (I've pasted a
stripped-down sketch of my write loop at the bottom, below the quoted
thread.)

-Calvin

On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
> Sequentially ordered rows are the worst insert case in HBase - you
> end up writing everything to 1 server even if you have 500. If you
> can randomize your input - I have pasted a Randomize.java map-reduce
> job that will randomize lines of a file - then your performance will
> improve.
>
> I have seen sustained inserts of 100-300k rows/sec on small rows
> before. Obviously large blob rows will be slower, since the limiting
> factor is how fast we can write data to HDFS; it isn't the actual
> row count that matters, but the amount of data involved.
>
> Try the Randomize.java, see where that gets you. I think it's on the
> list archives.
>
> -ryan
>
> On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans
> <[email protected]> wrote:
> > Could you put your data in HDFS and load it from there with a
> > MapReduce job?
> >
> > J-D
> >
> > On Mon, Nov 30, 2009 at 2:33 PM, Calvin <[email protected]> wrote:
> >> I have a large number of sequentially ordered rows I would like
> >> to write to an HBase table. What is the preferred way to do bulk
> >> writes of multi-column tables in HBase? Using the get/put
> >> interface seems fairly slow even if I batch writes with
> >> table.put(List<Put>).
> >>
> >> I have followed the directions on:
> >> * http://wiki.apache.org/hadoop/PerformanceTuning
> >> * http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> >>
> >> Are there any other resources for improving the throughput of my
> >> bulk writes? On
> >> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> >> I see there's a way to write HFiles directly, but
> >> HFileOutputFormat can only write a single column family at a time
> >> (https://issues.apache.org/jira/browse/HBASE-1861).
> >>
> >> Thanks!
> >>
> >> -Calvin
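P.S. Here is the stripped-down sketch of the write loop I mentioned
above (table name, family names, and buffer/batch sizes are made-up
placeholders, and error handling is omitted - this is against the 0.20
client API, with the write-buffer settings from the PerformanceTuning
wiki):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkWriter {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // Buffer puts client-side instead of doing one RPC per row.
    table.setAutoFlush(false);
    table.setWriteBufferSize(12 * 1024 * 1024); // 12 MB

    List<Put> batch = new ArrayList<Put>(1000);
    for (long i = 0; i < 1000000; i++) {
      Put put = new Put(Bytes.toBytes(i)); // sequential key - the bad case
      // ~25 families of longs/doubles in the real program; two shown here.
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("c"), Bytes.toBytes(i));
      put.add(Bytes.toBytes("f2"), Bytes.toBytes("c"), Bytes.toBytes((double) i));
      batch.add(put);
      if (batch.size() == 1000) {
        table.put(batch); // lands in the client write buffer
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits(); // push out whatever is left in the buffer
  }
}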
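Also, regarding Ryan's Randomize.java - I haven't dug it out of the
archives yet, but I take the idea to be shuffling the input lines so
the row keys no longer arrive in order (his version is presumably a
map-reduce job so it works on files that don't fit in memory). A
minimal in-memory version of the same idea, for the record:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shuffle the lines of an input file so that sequential row keys hit
// the cluster in random order. Only viable if the file fits in memory.
public class ShuffleLines {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    List<String> lines = new ArrayList<String>();
    String line;
    while ((line = in.readLine()) != null) {
      lines.add(line);
    }
    in.close();

    Collections.shuffle(lines);

    PrintWriter out =
        new PrintWriter(new BufferedWriter(new FileWriter(args[1])));
    for (String l : lines) {
      out.println(l);
    }
    out.close();
  }
}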
