[Hadoop Wiki] Update of "Hbase/PoweredBy" by AndrewPurtell

Apache Wiki Tue, 15 Dec 2009 10:17:17 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hbase/PoweredBy" page has been changed by AndrewPurtell.
http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=37&rev2=38

--------------------------------------------------

  
  [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; 
it uses HBase to store more than a billion URLs and Outlinks (!AnchorText + 
LinkedURL). It was initially designed as a Nutch-Hadoop extension, then (due to 
a very specific 'shopping' scenario) moved to SOLR + MySQL (InnoDB) (tens of 
thousands of queries per second), and now to HBase. HBase is significantly 
faster because it needs no huge transaction logs, its column-oriented design 
exactly matches the 'lazy' business logic, and it offers data compression and 
!MapReduce support. The number of mutable 'indexes' (a term from RDBMS) is 
significantly reduced because each 'row::column' structure is physically sorted 
by 'row'. The MySQL InnoDB engine is the best DB choice for highly concurrent 
updates. However, the need to flush a block of data to the hard drive even when 
only a few bytes have changed is an obvious bottleneck. HBase greatly helps: 
the 'delete-insert', 'mutable primary key', and 'natural primary key' patterns, 
not so popular in modern DBMSs, become a big advantage with HBase.
  
- [[http://trendmicro.com/|Trend Micro]] Advanced Threats Research is running 
Hadoop 0.18.1 and HBase 0.18.0. Our application is a web crawling application 
with concurrent batch content analysis of various kinds. All of the workflow 
components are implemented as subclasses of !TableMap and/or !TableReduce on a 
cluster of 25 nodes. We see a constant rate of 2500 requests/sec or greater, 
peaking periodically near 100K/sec when some of the batch scan tasks run.
+ [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for 
cloud-scale storage for a variety of applications. We have been developing with 
HBase since version 0.1 and have been in production since version 0.20.0.
  
  [[http://www.veoh.com/|Veoh Networks]] uses HBase to store and process 
visitor (human) and entity (non-human) profiles, which are used for behavioral 
targeting, demographic detection, and personalization services. Our site reads 
this data in real time (heavily cached) and submits updates via various batch 
map/reduce jobs. With 25 million unique visitors a month, storing this data in 
a traditional RDBMS is not an option. We currently have a 24-node Hadoop/HBase 
cluster, and our profiling system shares this cluster with our other Hadoop 
data pipeline processes.
