[Hadoop Wiki] Update of "Hbase/PoweredBy" by Misty

Apache Wiki Tue, 13 Oct 2015 23:29:03 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hbase/PoweredBy" page has been changed by Misty:
https://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=91&rev2=92

- This page documents a roughly alphabetical list of institutions that are 
using HBase. Please include details about your cluster hardware and size. 
Entries without this may be mistaken for spam references and deleted.
+ The HBase Wiki is in the process of being decommissioned. The info that used 
to be on this page has moved to http://hbase.apache.org/poweredbyhbase.html. 
Please update your bookmarks.
  
- To add entries you need write permission to the wiki, which you can get by 
subscribing to the [email protected] mailing list and asking for 
permissions on the wiki account username you've registered yourself as. If you 
are using HBase in production you ought to consider getting involved in the 
development process anyway, by filing bugs, testing beta releases, reviewing 
the code and turning your notes into shared documentation. Your participation 
in this process will ensure your needs get met.
- 
- [[http://www.adobe.com|Adobe]] - We currently have about 30 nodes running 
HDFS, Hadoop and HBase  in clusters ranging from 5 to 14 nodes on both 
production and development. We plan a deployment on an 80-node cluster. We are 
using HBase in several areas from social services to structured data and 
processing for internal use. We constantly write data to HBase and run 
mapreduce jobs to process then store it back to HBase or external systems. Our 
production cluster has been running since Oct 2008.
- 
- [[http://axibase.com/products/axibase-time-series-database/|Axibase Time 
Series Database]] (ATSD) runs on top of HBase to collect, analyze and visualize 
time series data at scale. ATSD capabilities include optimized storage schema, 
built-in rule engine, forecasting algorithms (Holt-Winters and ARIMA) and 
next-generation graphics designed for high-frequency data. Primary use cases: 
IT infrastructure monitoring, data consolidation, operational historian in OPC 
environments.
- 
- [[http://www.benipaltechnologies.com|Benipal Technologies]] - We have a 35 
node cluster used for HBase and Mapreduce with Lucene / SOLR and katta 
integration to create and finetune our search databases. Currently, our HBase 
installation has over 10 Billion rows with 100s of datapoints per row. We 
compute over 10¹⁸ calculations daily using MapReduce directly on HBase. We 
heart HBase. 
- 
- [[https://github.com/ermanpattuk/BigSecret|BigSecret]] - is a security 
framework that is designed to secure Key-Value data, while preserving efficient 
processing capabilities. It achieves cell-level security, using combinations of 
different cryptographic techniques, in an efficient and secure manner. It 
provides a wrapper library around HBase.
- 
- [[http://caree.rs|Caree.rs]] - Accelerated hiring platform for HiTech 
companies. We use HBase and Hadoop for all aspects of our backend - job and 
company data storage, analytics processing, machine learning algorithms for our 
hire recommendation engine. Our live production site is directly served from 
HBase. We use cascading for running offline data processing jobs.
- 
- [[http://www.celer-tech.com/|Celer Technologies]] is a global financial 
software company that creates modular-based systems that have the flexibility 
to meet tomorrow's business environment, today.  The Celer framework uses 
Hadoop/HBase for storing all financial data for trading, risk, clearing in a 
single data store. With our flexible framework and all the data in 
Hadoop/HBase, clients can build new features to quickly extract data based on 
their trading, risk and clearing activities from one single location.
- 
- [[http://www.explorys.net|Explorys]] uses an HBase cluster containing over a 
billion anonymized clinical records, to enable subscribers to search and 
analyze patient populations, treatment protocols, and clinical outcomes.
- 
- 
[[http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919|Facebook]]
 uses HBase to power their Messages infrastructure.
- 
- [[http://www.filmweb.pl|Filmweb]] is a film web portal with a large dataset 
of films, persons and movie-related entities. We have just started a small 
cluster of 3 HBase nodes to handle our web cache persistency layer. We plan to 
increase the cluster size, and also to start migrating some of the data from 
our databases which have some demanding scalability requirements.
- 
- [[http://www.flurry.com|Flurry]] provides mobile application analytics.  We 
use HBase and Hadoop for all of our analytics processing, and serve all of our 
live requests directly out of HBase on our 50 node production cluster with tens 
of billions of rows over several tables.
- 
- [[http://gumgum.com|GumGum]] is an In-Image Advertising Platform. We use 
HBase on a 15-node Amazon EC2 High-CPU Extra Large (c1.xlarge) cluster for 
both real-time data and analytics. Our production cluster has been running 
since June 2010.
- 
- HubSpot, see dev.hubspot.com, is an online marketing platform, providing 
analytics, email, and segmentation of leads/contacts.  HBase is our primary 
datastore for our customers' customer data, with multiple HBase clusters 
powering the majority of our product.  We have nearly 200 regionservers across 
the various clusters, and 2 hadoop clusters also with nearly 200 tasktrackers.  
We use c1.xlarge in EC2 for both, but are starting to move some of that to 
baremetal hardware.  We've been running HBase for over 2 years.
- 
- [[http://helprace.com/help-desk/|Helprace]], a customer service platform, uses 
Hadoop for analytics and internal searching and filtering. Being on HBase we 
can share our HBase & Hadoop cluster with other Hadoop processes - this 
particularly helps in keeping community speeds up. We use Hadoop and HBase on 
a small cluster whose nodes each have 4 cores and 32 GB RAM.
- 
- [[http://www.infolinks.com/|Infolinks]] - Infolinks is an In-Text ad 
provider. We use HBase to process advertisement selection and user events for 
our In-Text ad network. The reports generated from HBase are used as feedback 
for our production system to optimize ad selection.
- 
- [[http://www.kalooga.com|Kalooga]] is a discovery service for image 
galleries. We use Hadoop, HBase and Pig on a 20-node cluster for our crawling, 
analysis and events processing.
- 
- [[http://www.ngdata.com|NGDATA]] delivers 
[[http://www.ngdata.com/site/products/lily.html|Lily]], the consumer 
intelligence solution that delivers a unique combination of  Big Data 
management, machine learning technologies and consumer intelligence 
applications in one integrated solution to allow better, and more dynamic, 
consumer insights. Lily allows companies to process and analyze massive 
structured and unstructured data, scale storage elastically and locate 
actionable data quickly from large data sources in near real time. 
- 
- [[http://www.mahalo.com|Mahalo]], "...the world's first human-powered search 
engine". All the markup that powers the wiki is stored in HBase. It's been in 
use for a few months now. !MediaWiki - the same software that powers Wikipedia - 
has version/revision control. Mahalo's in-house editors produce a lot of 
revisions per day, which was not working well in an RDBMS. An HBase-based 
solution for this was built and tested, and the data migrated out of MySQL and 
into HBase. Right now it's at something like 6 million items in HBase. The 
upload tool runs every hour from a shell script to back up that data, and on 6 
nodes takes about 5-10 minutes to run - and does not slow down production at 
all.
- 
- [[http://www.meetup.com|Meetup]] is on a mission to help the world’s people 
self-organize into local groups.  We use Hadoop and HBase to power a site-wide, 
real-time activity feed system for all of our members and groups.  Group 
activity is written directly to HBase, and indexed per member, with the 
member's custom feed served directly from HBase for incoming requests.  We're 
running HBase 0.20.0 on an 11-node cluster.
- 
- [[http://www.mendeley.com|Mendeley]] We are creating a platform for 
researchers to collaborate and share their research online. HBase is helping us 
to create the world's largest research paper collection and is being used to 
store all our raw imported data. We use a lot of map reduce jobs to process 
these papers into pages displayed on the site. We also use HBase with Pig to do 
analytics and produce the article statistics shown on the web site. You can 
find out more about how we use HBase in these slides 
[http://www.slideshare.net/danharvey/hbase-at-mendeley].
- 
- [[http://ning.com|Ning]] uses HBase to store and serve the results of 
processing user events and log files, which allows us to provide near-real time 
analytics and reporting. We use a small cluster of commodity machines with 4 
cores and 16GB of RAM per machine to handle all our analytics and reporting 
needs.
- 
- [[http://www.worldcat.org|OCLC]] uses HBase as the main data store for 
WorldCat, a union catalog which aggregates the collections of 72,000 libraries 
in 112 countries and territories.  WorldCat is currently comprised of nearly 1 
billion records with nearly 2 billion library ownership indications. We're 
running a 50 Node HBase cluster and a separate offline map-reduce cluster.
- 
- [[http://olex.openlogic.com|OpenLogic]] stores all the world's Open Source 
packages, versions, files, and lines of code in HBase for both near-real-time 
access and analytical purposes.  The production cluster has well over 100TB of 
disk spread across nodes with 32GB+ RAM and dual-quad or dual-hex core CPUs.
- 
- [[http://www.openplaces.org|Openplaces]] is a search engine for travel that 
uses HBase to store terabytes of web pages and travel-related entity records 
(countries, cities, hotels, etc.). We have dozens of MapReduce jobs that crunch 
data on a daily basis.  We use a 20-node cluster for development, a 40-node 
cluster for offline production processing and an EC2 cluster for the live web 
site.
- 
- [[http://www.pnl.gov|Pacific Northwest National Laboratory]] - Hadoop and 
HBase (Cloudera distribution) are being used within PNNL's Computational 
Biology & Bioinformatics Group for a systems biology data warehouse project 
that integrates high throughput proteomics and transcriptomics data sets coming 
from instruments in the Environmental  Molecular Sciences Laboratory, a US 
Department of Energy national user facility located at PNNL. The data sets are 
being merged and annotated with other public genomics information in the data 
warehouse environment, with Hadoop analysis programs operating on the annotated 
data in the HBase tables. This work is hosted by olympus, a large PNNL 
institutional computing cluster (http://www.pnl.gov/news/release.aspx?id=908) , 
with the HBase tables being stored in olympus's Lustre file system.
- 
- [[http://www.readpath.com/|ReadPath]] uses HBase to store several hundred 
million RSS items and dictionary for its RSS newsreader. ReadPath is currently 
running on an 8 node cluster.
- 
- [[http://resu.me/|resu.me]] - Career network for the net generation. We use 
HBase and Hadoop for all aspects of our backend - user and resume data storage, 
analytics processing, machine learning algorithms for our job recommendation 
engine. Our live production site is directly served from HBase. We use 
cascading for running offline data processing jobs.
- 
- [[http://www.runa.com/|Runa Inc.]] offers a SaaS that enables online 
merchants to offer dynamic per-consumer, per-product promotions embedded in 
their website. To implement this we collect the click streams of all their 
visitors to determine along with the rules of the merchant what promotion to 
offer the visitor at different points of their browsing the Merchant website. 
So we have lots of data and have to do lots of off-line and real-time 
analytics. HBase is the core for us. We also use Clojure and our own open 
sourced distributed processing framework, Swarmiji. The HBase Community has 
been key to our forward movement with HBase. We're looking for experienced 
developers to join us to help make things go even faster!
- 
- [[http://www.sematext.com/|Sematext]] runs 
[[http://www.sematext.com/search-analytics/index.html|Search Analytics]], a 
service that uses HBase to store search activity and MapReduce to produce 
reports showing user search behaviour and experience.
- 
- [[http://www.sematext.com/search-analytics/index.html|Sematext]] runs 
[[http://www.sematext.com/spm/index.html|Scalable Performance Monitoring]] 
(SPM), a service that uses HBase to store performance data over time, crunch it 
with the help of MapReduce, and display it in a visually rich browser-based UI. 
 Interestingly, SPM features 
[[http://www.sematext.com/spm/hbase-performance-monitoring/index.html|SPM for 
HBase]], which is specifically designed to monitor all HBase performance 
metrics.
- 
- [[http://www.socialmedia.com/|SocialMedia]] uses HBase to store and process 
user events which allows us to provide near-realtime user metrics and 
reporting. HBase forms the heart of our Advertising Network data storage and 
management system. We use HBase as a data source and sink for both realtime 
request cycle queries and as a backend for mapreduce analysis.
- 
- [[http://www.splicemachine.com/|Splice Machine]] is built on top of HBase.  
Splice Machine is a full-featured ANSI SQL database that provides real-time 
updates, secondary indices, ACID transactions, optimized joins, triggers, and 
UDFs.
- 
- [[http://www.streamy.com/|Streamy]] is a recently launched realtime social 
news site.  We use HBase for all of our data storage, query, and analysis 
needs, replacing an existing SQL-based system.  This includes hundreds of 
millions of documents, sparse matrices, logs, and everything else once done in 
the relational system.  We perform significant in-memory caching of query 
results similar to a traditional Memcached/SQL setup as well as other external 
components to perform joining and sorting.  We also run thousands of daily 
MapReduce jobs using HBase tables for log analysis, attention data processing, 
and feed crawling.  HBase has helped us scale and distribute in ways we could 
not otherwise, and the community has provided consistent and invaluable 
assistance.
- 
- [[http://www.stumbleupon.com/|Stumbleupon]] and [[http://su.pr|Su.pr]] use 
HBase as a real time data storage and analytics platform. Serving directly out 
of HBase, various site features and statistics are kept up to date in a real 
time fashion. We also use HBase as a map-reduce data source to overcome 
traditional query speed limits in MySQL.
- 
- [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; 
it uses HBase to store URLs and Outlinks (!AnchorText + LinkedURL): more than a 
billion. It was initially designed as a Nutch-Hadoop extension, then (due to a very 
specific 'shopping' scenario) moved to SOLR + MySQL(InnoDB) (ten thousands 
queries per second), and now - to HBase. HBase is significantly faster due to: 
no need for huge transaction logs, column-oriented design exactly matches 
'lazy' business logic, data compression, !MapReduce support. Number of mutable 
'indexes' (term from RDBMS) significantly reduced due to the fact that each 
'row::column' structure is physically sorted by 'row'. MySQL InnoDB engine is 
best DB choice for highly-concurrent updates. However, the need to flush a 
block of data to the hard drive even if only a few bytes changed is an obvious 
bottleneck. HBase greatly helps: not-so-popular in modern DBMS 'delete-insert', 
'mutable primary key', and 'natural primary key' patterns become a big 
advantage with HBase.
- 
- [[http://traackr.com/|Traackr]] uses HBase to store and serve online 
influencer data in real-time. We use MapReduce to frequently re-score our 
entire data set as we keep updating influencer metrics on a daily basis.
- 
- [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud 
scale storage for a variety of applications. We have been developing with HBase 
since version 0.1 and production since version 0.20.0.
- 
- [[http://www.twitter.com|Twitter]] runs HBase across its entire Hadoop 
cluster.  HBase provides a distributed, read/write backup of all  mysql tables 
in Twitter's production backend, allowing engineers to run MapReduce jobs over 
the data while maintaining the ability to apply periodic row updates (something 
that is more difficult to do with vanilla HDFS).  A number of applications 
including people search rely on HBase internally for data generation. 
Additionally, the operations team uses HBase as a timeseries database for 
cluster-wide monitoring/performance data.
- 
- [[http://www.udanax.org|Udanax.org]] (URL shortener) uses a 10-node HBase 
cluster to store URLs and web log data and to respond to real-time requests on its 
web server. This application is now used by some Twitter clients and a number 
of web sites. Currently API requests are almost 30 per second and web 
redirection requests are about 300 per second.
- 
- [[http://www.veoh.com/|Veoh Networks]] uses HBase to store and process 
visitor(human) and entity(non-human) profiles which are used for behavioral 
targeting, demographic detection, and personalization services.  Our site reads 
this data in real-time (heavily cached) and submits updates via various batch 
map/reduce jobs. With 25 million unique visitors a month storing this data in a 
traditional RDBMS is not an option. We currently have a 24 node Hadoop/HBase 
cluster and our profiling system is sharing this cluster with our other Hadoop 
data pipeline processes.
- 
- [[http://www.videosurf.com/|VideoSurf]] - "The video search engine that has 
taught computers to see". We're using HBase to persist various large graphs of 
data and other statistics. HBase was a real win for us because it let us store 
substantially larger datasets without the need for manually partitioning the 
data and its column-oriented nature allowed us to create schemas that were 
substantially more efficient for storing and retrieving data.
- 
- [[http://www.visibletechnologies.com/|Visible Technologies]] - We use Hadoop, 
HBase, Katta, and more to collect, parse, store, and search hundreds of 
millions of Social Media content. We get incredibly fast throughput and very 
low latency on commodity hardware. HBase enables our business to exist.
- 
- [[http://www.worldlingo.com/|WorldLingo]] - The !WorldLingo Multilingual 
Archive. We use HBase to store millions of documents that we scan using 
Map/Reduce jobs to machine translate them into all or selected target languages 
from our set of available machine translation languages. We currently store 12 
million documents but plan to eventually reach the 450 million mark. HBase 
allows us to scale out as we need to grow our storage capacities. Combined with 
Hadoop to keep the data replicated and therefore fail-safe we have the backbone 
our service can rely on now and in the future. !WorldLingo has been using HBase 
since December 2007 and is, along with a few others, one of the longest-running 
HBase installations. Currently we are running the latest HBase 0.20 and serving 
directly from it: 
[[http://www.worldlingo.com/ma/enwiki/en/HBase|MultilingualArchive]].
- 
- [[http://www.yahoo.com/|Yahoo!]] uses HBase to store document fingerprints for 
detecting near-duplicates. We have a cluster of a few nodes that runs HDFS, 
mapreduce, and HBase. The table contains millions of rows. We use this for 
querying duplicated documents with realtime traffic.
- 
- [[http://h50146.www5.hp.com/products/software/security/icewall/eng/|HP 
IceWall SSO]] - is a web-based single sign-on solution and uses HBase to store 
user data to authenticate users. We have supported RDB and LDAP previously but 
have newly supported HBase with a view to authenticate over tens of millions of 
users and devices.
- 
- 
[[http://www.ymc.ch/en/big-data-analytics-en?utm_source=hadoopwiki&utm_medium=poweredbypage&utm_campaign=ymc.ch|YMC
 AG]]
-   * operating a Cloudera Hadoop/HBase cluster for media monitoring purposes
-   * offering technical and operative consulting for the Hadoop stack + 
ecosystem
-   * editor of 
[[http://www.ymc.ch/en/hbase-split-visualisation-introducing-hannibal?utm_source=hadoopwiki&utm_medium=poweredbypage&utm_campaign=ymc.ch|Hannibal]],
 an open-source tool to visualize HBase regions sizes & splits that helps 
running HBase in production
-
