Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hbase/PoweredBy" page has been changed by BryanMcCormick. http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=40&rev2=41 -------------------------------------------------- [[http://www.powerset.com/|Powerset (a Microsoft company)]] uses HBase to store raw documents. We have a ~110 node hadoop cluster running DFS, mapreduce, and hbase. In our wikipedia hbase table, we have one row for each wikipedia page (~2.5M pages and climbing). We use this as input to our indexing jobs, which are run in hadoop mapreduce. Uploading the entire wikipedia dump to our cluster takes a couple hours. Scanning the table inside mapreduce is very fast -- the latency is in the noise compared to everything else we do. + [[http://www.readpath.com/|ReadPath]] uses HBase to store several hundred million RSS items and dictionary for its RSS newsreader. Readpath is currently running on an 8 node cluster. + [[http://www.runa.com/|Runa Inc.]] offers a SaaS that enables online merchants to offer dynamic per-consumer, per-product promotions embedded in their website. To implement this we collect the click streams of all their visitors to determine along with the rules of the merchant what promotion to offer the visitor at different points of their browsing the Merchant website. So we have lots of data and have to do lots of off-line and real-time analytics. HBase is the core for us. We also use Clojure and our own open sourced distributed processing framework, Swarmiji. The HBase Community has been key to our forward movement with HBase. We're looking for experienced developers to join us to help make things go even faster! [[http://www.socialmedia.com/|SocialMedia]] uses HBase to store and process user events which allows us to provide near-realtime user metrics and reporting. HBase forms the heart of our Advertising Network data storage and management system. We use HBase as a data source and sink for both realtime request cycle queries and as a backend for mapreduce analysis. @@ -32, +34 @@ [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; it uses HBase to store URLs and Outlinks (!AnchorText + LinkedURL): more than a billion. It was initially designed as Nutch-Hadoop extension, then (due to very specific 'shopping' scenario) moved to SOLR + MySQL(InnoDB) (ten thousands queries per second), and now - to HBase. HBase is significantly faster due to: no need for huge transaction logs, column-oriented design exactly matches 'lazy' business logic, data compression, !MapReduce support. Number of mutable 'indexes' (term from RDBMS) significantly reduced due to the fact that each 'row::column' structure is physically sorted by 'row'. MySQL InnoDB engine is best DB choice for highly-concurrent updates. However, necessity to flash a block of data to harddrive even if we changed only few bytes is obvious bottleneck. HBase greatly helps: not-so-popular in modern DBMS 'delete-insert', 'mutable primary key', and 'natural primary key' patterns become a big advantage with HBase. - [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage for a variety of applications. We have been developing with HBase since version 0.1 and production since version 0.20.0. + [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage for a variety of applications. We have been developing with HBase since version 0.1 and production since version 0.20.0. 
  [[http://www.veoh.com/|Veoh Networks]] uses HBase to store and process visitor (human) and entity (non-human) profiles, which are used for behavioral targeting, demographic detection, and personalization services. Our site reads this data in real time (heavily cached) and submits updates via various batch map/reduce jobs. With 25 million unique visitors a month, storing this data in a traditional RDBMS is not an option. We currently have a 24-node Hadoop/HBase cluster, and our profiling system shares this cluster with our other Hadoop data pipeline processes.
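
A note for readers: the Powerset and Veoh entries above both describe scanning an HBase table as the input to a Hadoop MapReduce job. Below is a minimal, hypothetical sketch of that pattern using the org.apache.hadoop.hbase.mapreduce client API from the 0.20-era releases the page mentions; the table name "pages" and the column "content:raw" are illustrative assumptions, not taken from any entry.

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class PageScanJob {

  /** Called once per row: the key is the row key, the Result holds its cells. */
  static class PageMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // "content:raw" is a hypothetical family:qualifier holding the document body.
      byte[] raw = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
      if (raw != null) {
        // A real indexing job would feed its indexer here.
        context.write(new Text(Bytes.toString(row.get())), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan 'pages' table");
    job.setJarByClass(PageScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches to cut RPC round trips
    scan.setCacheBlocks(false);  // a full scan shouldn't evict the block cache

    // Wire the table in as the job's input; each region becomes one map task.
    TableMapReduceUtil.initTableMapperJob(
        "pages", scan, PageMapper.class, Text.class, IntWritable.class, job);
    job.setNumReduceTasks(0);                         // map-only for this sketch
    job.setOutputFormatClass(NullOutputFormat.class); // discard output in the sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}

Because each region becomes one map task, the scan parallelizes across the cluster, which is one reason full-table scans from MapReduce can be "in the noise" as the Powerset entry reports.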

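A second aside, on the Tokenizer entry's point that each 'row::column' structure is physically sorted by 'row': storing outlinks as columns under the source URL's row key keeps all of a page's link data contiguous, with no separate mutable index to maintain. The sketch below uses the classic HTable client API; the table "outlinks", its family "link", and the sample URLs are hypothetical.

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OutlinkStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumes a table 'outlinks' with column family 'link' already exists
    // (e.g. created via the HBase shell).
    HTable table = new HTable(conf, "outlinks");
    try {
      // Row key: the source URL (the 'natural primary key' the entry mentions).
      // Column qualifier: the linked URL. Cell value: the anchor text.
      Put put = new Put(Bytes.toBytes("http://example.com/shop"));
      put.add(Bytes.toBytes("link"),
              Bytes.toBytes("http://example.com/cart"),
              Bytes.toBytes("View cart"));
      // Re-crawling a page is a plain overwrite of the same row -- the
      // 'delete-insert' pattern -- with no big transaction log to pay for.
      table.put(put);
    } finally {
      table.close();
    }
  }
}
}}}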