Most log data is time-oriented, so the 'natural' schema is to use the timestamp as the row key. But because HBase stores rows sorted by key, a monotonically increasing key concentrates all inserts on a single region, and therefore on a single node. This is fixable by changing the key to something other than a monotonically increasing value.
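For example, one common fix is to "salt" the key with a hash-derived prefix so writes spread across several regions. A minimal sketch in plain Java (the bucket count, key layout, and names like makeRowKey and sourceId are illustrative, not from the slides):

public class SaltedKey {
    // Number of salt buckets; roughly one per region server (illustrative).
    static final int NUM_BUCKETS = 16;

    static byte[] makeRowKey(String sourceId, long timestamp) {
        // Mask off the sign bit so the modulo is always non-negative.
        int bucket = (sourceId.hashCode() & 0x7fffffff) % NUM_BUCKETS;
        // Bucket prefix first, so writes spread across regions;
        // zero-padded timestamp second, so each bucket stays time-ordered.
        String key = String.format("%02d-%s-%013d", bucket, sourceId, timestamp);
        return key.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(
            makeRowKey("web-03", System.currentTimeMillis())));
    }
}

The tradeoff is that a time-range scan now has to issue one scan per bucket and merge the results, so you trade some read convenience for write parallelism.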
If all inserts land on one region, you are gated by the write performance of a single node, which limits intake/insert scalability. As for that slide, I am its originator, and the reasons above are why I made the suggestion quoted below.

On Mon, Feb 15, 2010 at 4:45 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Hello,
>
> I've seen the following in a few HBase presentations now:
>
> * What to store in HBase?
> * Maybe not your raw log data...
> * ...but the results of processing it with Hadoop
>
> e.g. slides 26 & 27:
> http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install
>
> Is there anything wrong with storing raw log data directly into HBase, and
> doing so in real time, even when that means having to insert a few hundred
> rows/second?
>
> Is the above advice purely because of the data volume associated with
> storing lots of raw logs, or is there some other reason?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/