[Hadoop Wiki] Update of "Hbase/PoweredBy" by DaveLatham

Apache Wiki Fri, 26 Jun 2009 10:50:17 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by DaveLatham:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

The comment on the change is:
added Flurry, moved OpenPlaces to alphabetical order

------------------------------------------------------------------------------
  [http://www.adobe.com Adobe] - We currently have about 30 nodes running HDFS, 
Hadoop and HBase in clusters ranging from 5 to 14 nodes in both production and 
development. We plan a deployment on an 80-node cluster. We are using HBase in 
several areas from social services to structured data and processing for 
internal use. We constantly write data to HBase and run mapreduce jobs to 
process it, then store it back to HBase or external systems. Our production cluster 
has been running since Oct 2008.
+ 
+ [http://www.flurry.com Flurry] provides mobile application analytics.  We use 
HBase and Hadoop for all of our analytics processing, and serve all of our live 
requests directly out of HBase in our production cluster with billions of rows 
over several tables.
  
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search 
engine". All the markup that powers the wiki is stored in HBase. It's been in 
use for a few months now. !MediaWiki - the same software that powers Wikipedia - 
has version/revision control. Mahalo's in-house editors produce a lot of 
revisions per day, which was not working well in an RDBMS. An HBase-based 
solution for this was built and tested, and the data migrated out of MySQL and 
into HBase. Right now it's at something like 6 million items in HBase. The 
upload tool runs every hour from a shell script to back up that data, and on 6 
nodes takes about 5-10 minutes to run - and does not slow down production at 
all. 
  
+ [http://www.openplaces.org Openplaces] is a search engine for travel that 
uses HBase to store terabytes of web pages and travel-related entity records 
(countries, cities, hotels, etc.). We have dozens of MapReduce jobs that crunch 
data on a daily basis.  We use a 20-node cluster for development, a 40-node 
cluster for offline production processing and an EC2 cluster for the live web 
site. 
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store 
raw documents.  We have a ~110 node Hadoop cluster running DFS, mapreduce, and 
HBase.  In our wikipedia HBase table, we have one row for each Wikipedia page 
(~2.5M pages and climbing).  We use this as input to our indexing jobs, which 
are run in Hadoop mapreduce.  Uploading the entire Wikipedia dump to our 
cluster takes a couple of hours.  Scanning the table inside mapreduce is very fast 
-- the latency is in the noise compared to everything else we do.
  
  [http://www.streamy.com/ Streamy] is a recently launched realtime social news 
site.  We use HBase for all of our data storage, query, and analysis needs, 
replacing an existing SQL-based system.  This includes hundreds of millions of 
documents, sparse matrices, logs, and everything else once done in the 
relational system.  We perform significant in-memory caching of query results 
similar to a traditional Memcached/SQL setup as well as other external 
components to perform joining and sorting.  We also run thousands of daily 
MapReduce jobs using HBase tables for log analysis, attention data processing, 
and feed crawling.  HBase has helped us scale and distribute in ways we could 
not otherwise, and the community has provided consistent and invaluable 
assistance.
@@ -22, +25 @@

  
  [http://www.yahoo.com/ Yahoo!] uses HBase to store document fingerprints for 
detecting near-duplicates. We have a cluster of a few nodes that runs HDFS, 
mapreduce, and HBase. The table contains millions of rows. We use this for 
querying duplicated documents with realtime traffic.
  
- [http://www.openplaces.org Openplaces] is a search engine for travel that 
uses HBase to store terabytes of web pages and travel-related entity records 
(countries, cities, hotels, etc.). We have dozens of MapReduce jobs that crunch 
data on a daily basis.  We use a 20-node cluster for development, a 40-node 
cluster for offline production processing and an EC2 cluster for the live web 
site. 
-
