Hi Edward,

The problem is that even if the workload is 5% write and 95% read, if you can't load the data, you need more machines. In the 167 billion insert test, HBase failed with *Concurrent mode failure* after 20% of the data was loaded. One of our customers has loaded 1/2 trillion records of historical financial market data on 16 machines. If you do the back-of-the-envelope calculation, it would take about 180 machines for HBase to load 1/2 trillion cells. That makes HBase 10X more expensive in terms of hardware, power consumption, and data center real estate.
- Doug

On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo <[email protected]> wrote:

> I would almost agree with that perspective. But there is a problem with the
> 'Java is slow' theory. The reason is that in a 100 percent write workload
> GC might be a factor.
>
> But in the real world people have to read data, and reads become disk bound
> as your data gets larger than memory.
>
> Unless C++ can make your disk spin faster than Java, it is a wash. Making a
> claim that you're going to need more servers for Java/HBase is bogus. To put
> it in perspective, if the workload is 5% write and 95% read you are
> probably going to need just the same amount of hardware.
>
> You might get some win on the read side because your custom caching could
> be more efficient in terms of object size in memory and other GC issues, but
> it is not 2 or 3 to one.
>
> If a million writes fall into a Hypertable forest but it takes a billion
> years to read them back, did the writes ever sync? :)
>
> On Monday, February 13, 2012, Doug Judd <[email protected]> wrote:
> > Hey Todd,
> >
> > Bulk loading isn't always an option when data is streaming in from a live
> > application. Many big data use cases involve massive amounts of smaller
> > items in the size range of 10-100 bytes, for example URLs, sensor
> > readings, genome sequence reads, network traffic logs, etc. If HBase
> > requires 2-3 times the amount of hardware to avoid *Concurrent mode
> > failures*, then that makes HBase 2-3 times more expensive from the
> > standpoint of hardware, power consumption, and datacenter real estate.
> >
> > What takes the most time is getting the core database mechanics right
> > (we're going on 5 years now). Once the core database is stable,
> > integrations with applications such as Solr and others are short-term
> > projects. I believe that sooner or later, most engineers working in this
> > space will come to the conclusion that Java is the wrong language for
> > this kind of database application. At that point, folks on the HBase
> > project will realize that they are five years behind.
> >
> > - Doug
> >
> > On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon <[email protected]> wrote:
> >
> >> Hey Doug,
> >>
> >> Want to also run a comparison test with inter-cluster replication
> >> turned on? How about Kerberos-based security on secure HDFS? How about
> >> ACLs or other table permissions even without strong authentication?
> >> Can you run a test comparing performance running on top of Hadoop
> >> 0.23? How about running other ecosystem products like Solbase,
> >> Havrobase, and Lily, or commercial products like Digital Reasoning's
> >> Synthesys, etc.?
> >>
> >> For those unfamiliar, the answer to all of the above is that those
> >> comparisons can't be run because Hypertable is years behind HBase in
> >> terms of features, adoption, etc. They've found a set of benchmarks
> >> they win at, but bulk loading either database through the "put" API is
> >> the wrong way to go about it anyway. Anyone loading 5T of data like
> >> this would use the bulk load APIs, which are one to two orders of
> >> magnitude more efficient. Just ask the Yahoo crawl cache team, which has
> >> ~1PB stored in HBase, or Facebook, or eBay, or many others who store
> >> hundreds to thousands of TBs in HBase today.
> >>
> >> Thanks,
> >> -Todd
> >>
> >> On Mon, Feb 13, 2012 at 9:07 AM, Doug Judd <[email protected]> wrote:
> >> > In our original test, we mistakenly ran the HBase test with the
> >> > hbase.hregion.memstore.mslab.enabled property set to false. We re-ran
> >> > the test with the hbase.hregion.memstore.mslab.enabled property set to
> >> > true and have reported the results in the following addendum:
> >> >
> >> > Addendum to Hypertable vs. HBase Performance Test
> >> > <http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/addendum/>
> >> >
> >> > Synopsis: It slowed performance on the 10KB and 1KB tests and still
> >> > failed the 100 byte and 10 byte tests with *Concurrent mode failure*
> >> >
> >> > - Doug
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
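
The back-of-the-envelope estimate in the top message (about 180 machines, roughly 10X) is stated without its arithmetic. A minimal sketch of one way such an estimate could be reproduced is below. It assumes linear scaling and treats the 20% that loaded before the failure as the capacity of the test cluster. The test cluster size is not given anywhere in this thread, so `ASSUMED_TEST_MACHINES = 12` is a hypothetical value, as is the helper name `machines_needed`; this may not be the calculation the author actually used.

```python
import math


def machines_needed(target_cells, test_cells, fraction_loaded, test_machines):
    """Scale a test cluster linearly to the size needed to hold target_cells.

    Treats the cells that loaded before the failure as the capacity of the
    test cluster, then scales the machine count proportionally.
    """
    cells_loaded = test_cells * fraction_loaded
    return math.ceil(target_cells / cells_loaded * test_machines)


# Figures quoted in the thread.
TEST_CELLS = 167e9          # the 167 billion insert test
FRACTION_LOADED = 0.20      # HBase failed after 20% of the data was loaded
TARGET_CELLS = 0.5e12       # 1/2 trillion records
HYPERTABLE_MACHINES = 16    # machines in the customer deployment

# ASSUMPTION: not stated in the thread; hypothetical test cluster size.
ASSUMED_TEST_MACHINES = 12

hbase_machines = machines_needed(
    TARGET_CELLS, TEST_CELLS, FRACTION_LOADED, ASSUMED_TEST_MACHINES
)
cost_ratio = hbase_machines / HYPERTABLE_MACHINES

print(f"Estimated HBase machines: {hbase_machines}")
print(f"Ratio to the 16-machine deployment: {cost_ratio:.2f}")
```

With these inputs the sketch gives 180 machines, a little over 11 times the 16-machine deployment, which is in line with the "about 180" and "10X" figures quoted above. The result is directly proportional to the assumed test cluster size, so a different value there changes the estimate by the same factor.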
