Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by stack:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

The comment on the change is:
Added wikia

------------------------------------------------------------------------------
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All the markup that powers the wiki is stored in HBase. It's been in use for a few months now. !MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's in-house editors produce a lot of revisions per day, which was not working well in an RDBMS. An HBase-based solution for this was built and tested, and the data was migrated out of MySQL and into HBase. Right now it's at something like 6 million items in HBase. The upload tool runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10 minutes to run - and does not slow down production at all.
  
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents. We have a ~70 node Hadoop cluster running DFS, MapReduce, and HBase. In our Wikipedia HBase table, we have one row for each Wikipedia page (~2.5M pages and climbing). We use this as input to our indexing jobs, which are run in Hadoop MapReduce. Uploading the entire Wikipedia dump to our cluster takes a couple of hours. Scanning the table inside MapReduce is very fast -- the latency is in the noise compared to everything else we do. (A sketch of this table-as-input pattern appears after the diff.)
+ 
+ [http://www.subrecord.org SubRecord Project] is an Open Source project that uses HBase as a repository of records (persisted map-like data) for the aspects it provides, such as logging, tracing, and metrics. HBase and a Lucene index together constitute the repo/storage for this platform.
  
  [http://www.tokenizer.org Shopping Engine at Tokenizer] is a web crawler; it uses HBase to store URLs and outlinks (!AnchorText + LinkedURL): more than a billion of them. It was initially designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping' scenario) moved to SOLR + MySQL (InnoDB) at tens of thousands of queries per second, and has now moved to HBase. HBase is significantly faster for several reasons: no need for huge transaction logs, a column-oriented design that exactly matches the 'lazy' business logic, data compression, and !MapReduce support. The number of mutable 'indexes' (an RDBMS term) is significantly reduced because each 'row::column' structure is physically sorted by 'row'. The MySQL InnoDB engine is the best DB choice for highly concurrent updates; however, having to flush a whole block of data to the hard drive when only a few bytes have changed is an obvious bottleneck. HBase helps greatly here: the 'delete-insert', 'mutable primary key', and 'natural primary key' patterns, not so popular in modern DBMSs, become a big advantage with HBase. (A sketch of this row::column layout appears after the diff.)
@@ -10, +12 @@
  [http://www.videosurf.com/ VideoSurf] - "The video search engine that has taught computers to see". We're using HBase to persist various large graphs of data and other statistics. HBase was a real win for us because it let us store substantially larger datasets without the need for manually partitioning the data, and its column-oriented nature allowed us to create schemas that were substantially more efficient for storing and retrieving data.
  
+ [http://www.wikia.com/wiki/Wikia Wikia] hosts its user and keyword databases on a cluster of 7 machines.
  
  [http://www.yahoo.com/ Yahoo!] uses HBase to store document fingerprints for detecting near-duplicates. We have a cluster of a few nodes that runs HDFS, MapReduce, and HBase. The table contains millions of rows. We use this for querying duplicate documents with realtime traffic.
- 
- [http://www.subrecord.org SubRecord Project] is an Open Source project that uses HBase as a repository of records (persisted map-like data) for the aspects it provides, such as logging, tracing, and metrics. HBase and a Lucene index together constitute the repo/storage for this platform.
- 
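For subscribers curious how the Powerset-style "HBase table as MapReduce input" setup looks in practice, here is a minimal, map-only sketch using HBase's MapReduce integration (the org.apache.hadoop.hbase.mapreduce package). It is an illustration only, not Powerset's code: the table name "wikipedia", the emitted output, and the output-path argument are assumptions, and the client API shown is a later one than what shipped when this page was edited.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Map-only sketch: one map() call per HBase row, i.e. per stored page. */
public class TableScanSketch extends TableMapper<Text, LongWritable> {

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
      throws IOException, InterruptedException {
    // The row key identifies the page; a real indexing job would parse
    // the document stored in the row's columns here.
    ctx.write(new Text(rowKey.get()), new LongWritable(row.size()));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-scan-sketch");
    job.setJarByClass(TableScanSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // stream rows to each mapper in batches
    scan.setCacheBlocks(false);  // a full scan shouldn't churn the block cache

    // Wires the table in as the job's input, one split per region.
    TableMapReduceUtil.initTableMapperJob("wikipedia", scan,
        TableScanSketch.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}

TableMapReduceUtil splits the scan by region, so each mapper streams one region's rows, typically from a nearby region server; that locality is why scanning the table "inside mapreduce" stays cheap as the table grows.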

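Similarly, here is a minimal sketch of the row::column layout described in the Tokenizer entry, assuming a table named "crawl" with an "outlinks" column family where the qualifier is the linked URL and the cell value is the anchor text. The table and family names and the helper method are hypothetical, and again this uses a later HBase client API than was current at the time.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OutlinkSketch {
  // Assumed schema: row key = crawled page URL,
  // column qualifier = linked URL, cell value = anchor text.
  private static final byte[] OUTLINKS = Bytes.toBytes("outlinks");

  static void storeOutlink(Table table, String pageUrl, String linkedUrl,
      String anchorText) throws IOException {
    Put put = new Put(Bytes.toBytes(pageUrl));
    put.addColumn(OUTLINKS, Bytes.toBytes(linkedUrl), Bytes.toBytes(anchorText));
    // Re-crawling the same page simply writes a newer version of the cell;
    // no separate mutable index has to be updated.
    table.put(put);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("crawl"))) {
      storeOutlink(table, "http://example.com/", "http://example.com/shop",
          "Shop");
    }
  }
}
}}}

Because HBase keeps cells physically sorted by row, then column qualifier, then timestamp, re-crawling a page just lays down newer cell versions; there is no mutable secondary index to maintain and no RDBMS-style delete-insert round trip.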