An interesting thing about HBase is that it really performs better with more data. Pre-splitting tables is one way to avoid those split pauses.
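Something along these lines might do it (untested sketch against the 0.20 client API; the createTable overload that takes split keys may not exist in 0.19/0.20, in which case splitting from the shell or UI after creation is the fallback, and the table name, family name, and key format below are just made up to match your description):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitInbox {
      public static void main(String[] args) throws Exception {
        // Connects using whatever hbase-site.xml is on the classpath.
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        // Hypothetical "Inbox" table with a single "items" family.
        HTableDescriptor desc = new HTableDescriptor("Inbox");
        desc.addFamily(new HColumnDescriptor("items"));

        // Carve the key space into 10 regions up front. Assumes zero-padded,
        // monotonically increasing row keys ("00000000" .. "00999999").
        byte[][] splitKeys = new byte[9][];
        for (int i = 0; i < splitKeys.length; i++) {
          splitKeys[i] = Bytes.toBytes(String.format("%08d", (i + 1) * 100000));
        }

        // NOTE: this overload may require a newer release than 0.19/0.20.
        admin.createTable(desc, splitKeys);
      }
    }

With the regions created in advance, the load never has to wait on a split.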
Another performance bottleneck is the write-ahead log. You can disable it by calling Put.setWriteToWAL(false), and you will see a substantial speedup (rough sketch at the bottom of this mail).

Good luck!
-ryan

On Tue, Sep 22, 2009 at 3:39 PM, stack <st...@duboce.net> wrote:
> Split your table in advance? You can do it from the UI or shell (script
> it?)
>
> Regarding the same performance for 10 nodes as for 5 nodes, how many
> regions are in your table? What happens if you pile on more data?
>
> The split algorithm will be sped up in coming versions for sure. Two
> minutes seems like a long time. It's under load at this time?
>
> St.Ack
>
>
> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <guy.molin...@disney.com> wrote:
>
>> Hello all,
>>
>> I've been working with HBase for the past few months on a proof of
>> concept / technology-adoption evaluation. I wanted to describe my
>> scenario to the user/development community to get some input on my
>> observations.
>>
>> I've written an application that is comprised of two tables. It models
>> a classic many-to-many relationship. One table stores "User" data and
>> the other represents an "Inbox" of items assigned to that user. The
>> key for the user is a string generated by the JDK's UUID.randomUUID()
>> method. The key for the "Inbox" is a monotonically increasing value.
>>
>> It works just fine. I've reviewed the performance tuning info on the
>> HBase wiki page. The client application spins up 100 threads, each one
>> grabbing a range of keys (for the "Inbox"). The I/O mix is about
>> 50/50 read/write. The test client inserts 1,000,000 "Inbox" items and
>> verifies the existence of a "User" (FK check). It uses column families
>> to maintain the integrity of the relationships.
>>
>> I'm running versions 0.19.3 and 0.20.0. The behavior is basically the
>> same. The cluster consists of 10 nodes. I'm running my namenode and
>> HBase master on one dedicated box. The other 9 run datanodes/region
>> servers.
>>
>> I'm seeing around 1,000 "Inbox" transactions per second (dividing total
>> time for the batch by total count inserted). The problem is that I
>> get the same results with 5 nodes as with 10. Not quite what I was
>> expecting.
>>
>> The bottleneck seems to be the splitting algorithm. I've set my
>> region size to 2M. I can see that as the process moves forward, HBase
>> pauses and redistributes the data and splits regions. It does this
>> first for the "Inbox" table and then pauses again and redistributes the
>> "User" table. This pause can be quite long, often 2 minutes or more.
>>
>> Can the key ranges be pre-defined somehow in advance to avoid this? I
>> would rather not burden application developers/DBAs with this.
>> Perhaps the divvy algorithms could be sped up? Any configuration
>> recommendations?
>>
>> Thanks in advance,
>> Guy
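
P.S. The rough sketch referred to above: a single Put with the WAL turned off, against the 0.20 client API (0.19 uses BatchUpdate instead of Put, so the call doesn't apply there). The table, family, qualifier, and row key are placeholders based on Guy's description. Keep in mind that skipping the WAL means edits that haven't been flushed yet are lost if a region server dies, so it trades durability for write throughput.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalPut {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "Inbox");

        Put put = new Put(Bytes.toBytes("00000042"));           // row key
        put.setWriteToWAL(false);                                // skip the WAL for this edit
        put.add(Bytes.toBytes("items"), Bytes.toBytes("user"),  // family:qualifier
                Bytes.toBytes("some-user-uuid"));                // value (FK to "User")
        table.put(put);
        table.flushCommits();
      }
    }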