Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by ChrisPaterson:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

The comment on the change is:
Corrected spelling of Trend Micro's entry.

------------------------------------------------------------------------------
  
  [http://www.tokenizer.org Shopping Engine at Tokenizer] is a web crawler; it 
uses HBase to store more than a billion URLs and outlinks (anchor text + 
linked URL). It was initially designed as a Nutch-Hadoop extension, then (due 
to a very specific 'shopping' scenario) moved to SOLR + MySQL (InnoDB), 
handling tens of thousands of queries per second, and has now moved to HBase. 
HBase is significantly faster because it needs no huge transaction logs, its 
column-oriented design exactly matches the 'lazy' business logic, and it 
provides data compression and MapReduce support. The number of mutable 
'indexes' (in the RDBMS sense) is significantly reduced because each 
'row::column' structure is physically sorted by 'row'. MySQL's InnoDB engine 
is the best DB choice for highly concurrent updates; however, the need to 
flush a whole block of data to disk even when only a few bytes have changed 
is an obvious bottleneck. HBase helps greatly here: the 'delete-insert', 
'mutable primary key', and 'natural primary key' patterns, not so popular in 
modern DBMSs, become a big advantage with HBase.
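
  The row-sorted 'row::column' layout and the key patterns named above can be 
made concrete with a short sketch. This is a minimal illustration using the 
current HBase Java client, not Tokenizer's actual code; the 'crawl' table, 
the 'out' column family, and the example URLs are hypothetical stand-ins for 
the URL + outlink (anchor text + linked URL) schema the entry describes.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OutlinkStore {
  // Hypothetical schema: table 'crawl', one column family for outlinks.
  private static final TableName TABLE = TableName.valueOf("crawl");
  private static final byte[] OUT = Bytes.toBytes("out");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TABLE)) {
      // 'Natural primary key' pattern: the source URL itself is the row key,
      // so all outlinks of a page live in one physically row-sorted record.
      Put put = new Put(Bytes.toBytes("http://example.com/page"));
      // One column per outlink: qualifier = linked URL, value = anchor text.
      put.addColumn(OUT, Bytes.toBytes("http://example.com/shoes"),
                    Bytes.toBytes("cheap shoes"));
      put.addColumn(OUT, Bytes.toBytes("http://example.com/hats"),
                    Bytes.toBytes("summer hats"));
      // 'Delete-insert' pattern: re-putting the same row on a re-crawl simply
      // supersedes the old versions; no RDBMS-style in-place update is needed.
      table.put(put);
    }
  }
}
}}}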
  
- [http://trendmicro.com/ Trend Micro] Advanced Threads Research is running 
Hadoop 0.18.1 and HBase 0.18.0. Our application is a web crawling application 
with concurrent batch content analysis of various kinds. All of the workflow 
components are implemented as subclasses of TableMap and/or TableReduce on a 
cluster of 25 nodes. We see a constant rate of 2500 requests/sec or greater, 
peaking periodically near 100K/sec when some of the batch scan tasks run.
+ [http://trendmicro.com/ Trend Micro] Advanced Threats Research is running 
Hadoop 0.18.1 and HBase 0.18.0. Our application is a web crawling application 
with concurrent batch content analysis of various kinds. All of the workflow 
components are implemented as subclasses of TableMap and/or TableReduce on a 
cluster of 25 nodes. We see a constant rate of 2500 requests/sec or greater, 
peaking periodically near 100K/sec when some of the batch scan tasks run.
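
  For readers who don't know the classes named above: TableMap and TableReduce 
are the scan-fed mapper/reducer base types of the org.apache.hadoop.hbase.mapred 
package in that 0.18-era API. Below is a minimal sketch of the same idea, 
assuming the later org.apache.hadoop.hbase.mapreduce.TableMapper API, a 
hypothetical 'pages' table, and an invented 'meta:type' column; none of this 
is Trend Micro's actual workflow code.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class BatchContentScan {
  // The map side of a batch scan task: called once per row of the crawl table.
  static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // Stand-in analysis step: bucket pages by a stored content-type column.
      byte[] type = value.getValue(Bytes.toBytes("meta"), Bytes.toBytes("type"));
      if (type != null) {
        ctx.write(new Text(Bytes.toString(type)), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "batch content analysis");
    job.setJarByClass(BatchContentScan.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for full-table scans
    scan.setCacheBlocks(false);  // keep batch scans out of the block cache

    TableMapReduceUtil.initTableMapperJob("pages", scan, AnalyzeMapper.class,
        Text.class, IntWritable.class, job);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}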
  
  [http://www.videosurf.com/ VideoSurf] - "The video search engine that has 
taught computers to see". We're using HBase to persist various large graphs 
of data and other statistics. HBase was a real win for us because it let us 
store substantially larger datasets without manually partitioning the data, 
and its column-oriented nature allowed us to create schemas that are 
substantially more efficient for storing and retrieving data.
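
  One common column-oriented layout for such graphs keeps one row per vertex 
and one column per edge. The sketch below is an assumption about how such a 
schema could look, using the current HBase Java client; the 'graph' table, 
the 'adj' family, and the vertex ids are hypothetical, not VideoSurf's 
actual design.

{{{
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GraphTable {
  // Hypothetical schema: row key = vertex id, family 'adj' holds the edges,
  // qualifier = neighbour vertex id, value = edge weight.
  private static final TableName TABLE = TableName.valueOf("graph");
  private static final byte[] ADJ = Bytes.toBytes("adj");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TABLE)) {
      // Write two weighted edges out of vertex 'video:123'.
      Put put = new Put(Bytes.toBytes("video:123"));
      put.addColumn(ADJ, Bytes.toBytes("video:456"), Bytes.toBytes(0.83d));
      put.addColumn(ADJ, Bytes.toBytes("video:789"), Bytes.toBytes(0.41d));
      table.put(put);

      // Read the whole adjacency list back with a single row fetch. Rows are
      // range-partitioned across region servers automatically, which is what
      // removes the need for manual partitioning mentioned above.
      Result r = table.get(new Get(Bytes.toBytes("video:123")).addFamily(ADJ));
      for (Map.Entry<byte[], byte[]> e : r.getFamilyMap(ADJ).entrySet()) {
        System.out.println(Bytes.toString(e.getKey())
            + " -> " + Bytes.toDouble(e.getValue()));
      }
    }
  }
}
}}}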
  
