Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hbase/PoweredBy" page has been changed by BryanMcCormick. http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=40&rev2=41 -------------------------------------------------- [[http://www.powerset.com/|Powerset (a Microsoft company)]] uses HBase to store raw documents. We have a ~110 node hadoop cluster running DFS, mapreduce, and hbase. In our wikipedia hbase table, we have one row for each wikipedia page (~2.5M pages and climbing). We use this as input to our indexing jobs, which are run in hadoop mapreduce. Uploading the entire wikipedia dump to our cluster takes a couple hours. Scanning the table inside mapreduce is very fast -- the latency is in the noise compared to everything else we do. + [[http://www.readpath.com/|ReadPath]] uses HBase to store several hundred million RSS items and dictionary for its RSS newsreader. Readpath is currently running on an 8 node cluster. + [[http://www.runa.com/|Runa Inc.]] offers a SaaS that enables online merchants to offer dynamic per-consumer, per-product promotions embedded in their website. To implement this we collect the click streams of all their visitors to determine along with the rules of the merchant what promotion to offer the visitor at different points of their browsing the Merchant website. So we have lots of data and have to do lots of off-line and real-time analytics. HBase is the core for us. We also use Clojure and our own open sourced distributed processing framework, Swarmiji. The HBase Community has been key to our forward movement with HBase. We're looking for experienced developers to join us to help make things go even faster! [[http://www.socialmedia.com/|SocialMedia]] uses HBase to store and process user events which allows us to provide near-realtime user metrics and reporting. HBase forms the heart of our Advertising Network data storage and management system. We use HBase as a data source and sink for both realtime request cycle queries and as a backend for mapreduce analysis. @@ -32, +34 @@ [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; it uses HBase to store URLs and Outlinks (!AnchorText + LinkedURL): more than a billion. It was initially designed as Nutch-Hadoop extension, then (due to very specific 'shopping' scenario) moved to SOLR + MySQL(InnoDB) (ten thousands queries per second), and now - to HBase. HBase is significantly faster due to: no need for huge transaction logs, column-oriented design exactly matches 'lazy' business logic, data compression, !MapReduce support. Number of mutable 'indexes' (term from RDBMS) significantly reduced due to the fact that each 'row::column' structure is physically sorted by 'row'. MySQL InnoDB engine is best DB choice for highly-concurrent updates. However, necessity to flash a block of data to harddrive even if we changed only few bytes is obvious bottleneck. HBase greatly helps: not-so-popular in modern DBMS 'delete-insert', 'mutable primary key', and 'natural primary key' patterns become a big advantage with HBase. - [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage for a variety of applications. We have been developing with HBase since version 0.1 and production since version 0.20.0. + [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage for a variety of applications. We have been developing with HBase since version 0.1 and production since version 0.20.0. 
  [[http://www.veoh.com/|Veoh Networks]] uses HBase to store and process visitor (human) and entity (non-human) profiles, which are used for behavioral targeting, demographic detection, and personalization services. Our site reads this data in real time (heavily cached) and submits updates via various batch map/reduce jobs. With 25 million unique visitors a month, storing this data in a traditional RDBMS is not an option. We currently have a 24-node Hadoop/HBase cluster, and our profiling system shares this cluster with our other Hadoop data pipeline processes.
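
A note for readers: the Powerset and Veoh entries above both describe scanning an HBase table as the input to a Hadoop MapReduce job. Below is a minimal, hypothetical sketch of that pattern using the org.apache.hadoop.hbase.mapreduce client API from the 0.20-era releases the page mentions; the table name "pages" and the column "content:raw" are illustrative assumptions, not taken from any entry.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class PageScanJob {

  /** Called once per row: the key is the row key, the Result holds its cells. */
  static class PageMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // "content:raw" is a hypothetical family:qualifier holding the document body.
      byte[] raw = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
      if (raw != null) {
        // A real indexing job would feed its indexer here.
        context.write(new Text(Bytes.toString(row.get())), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan 'pages' table");
    job.setJarByClass(PageScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches to cut RPC round trips
    scan.setCacheBlocks(false);  // a full scan shouldn't evict the block cache

    // Wire the table in as the job's input; each region becomes one map task.
    TableMapReduceUtil.initTableMapperJob(
        "pages", scan, PageMapper.class, Text.class, IntWritable.class, job);
    job.setNumReduceTasks(0);                         // map-only for this sketch
    job.setOutputFormatClass(NullOutputFormat.class); // discard output in the sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}

Because each region becomes one map task, the scan parallelizes across the cluster, which is one reason full-table scans from MapReduce can be "in the noise" as the Powerset entry reports.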

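A second aside, on the Tokenizer entry's point that each 'row::column' structure is physically sorted by 'row': storing outlinks as columns under the source URL's row key keeps all of a page's link data contiguous, with no separate mutable index to maintain. The sketch below uses the classic HTable client API; the table "outlinks", its family "link", and the sample URLs are hypothetical.

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OutlinkStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumes a table 'outlinks' with column family 'link' already exists
    // (e.g. created via the HBase shell).
    HTable table = new HTable(conf, "outlinks");
    try {
      // Row key: the source URL (the 'natural primary key' the entry mentions).
      // Column qualifier: the linked URL. Cell value: the anchor text.
      Put put = new Put(Bytes.toBytes("http://example.com/shop"));
      put.add(Bytes.toBytes("link"),
              Bytes.toBytes("http://example.com/cart"),
              Bytes.toBytes("View cart"));
      // Re-crawling a page is a plain overwrite of the same row -- the
      // 'delete-insert' pattern -- with no big transaction log to pay for.
      table.put(put);
    } finally {
      table.close();
    }
  }
}
}}}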