Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "PoweredBy" page has been changed by anil madan.
http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev1=208&rev2=209

--------------------------------------------------

    * Each night, we run 112 Hadoop jobs
    * It is roughly 4X faster to export the transaction tables from each of our 
reporting databases, transfer the data to the cluster, perform the rollups, and 
then import the results back into the databases than to perform the same 
rollups in the databases themselves.
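
   A minimal Hadoop Streaming sketch of such a rollup, assuming tab-separated 
export rows whose first field is the grouping key and whose last field is a 
numeric amount (the field layout and script names are illustrative, not taken 
from this entry):

{{{#!python
#!/usr/bin/env python
# rollup_mapper.py -- emit (key, amount) pairs from tab-separated export rows.
# Assumes the grouping key is the first field and the amount is the last one.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed rows
    print("%s\t%s" % (fields[0], fields[-1]))
}}}

{{{#!python
#!/usr/bin/env python
# rollup_reducer.py -- sum amounts per key; streaming delivers keys sorted,
# so a simple running total per key is enough.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%s" % (current_key, total))
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print("%s\t%s" % (current_key, total))
}}}

   Such a pair would be launched with the standard streaming jar 
(-mapper/-reducer/-input/-output, plus -file to ship the scripts), and the 
output imported back into the reporting databases.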
  
- 
   * [[http://www.adobe.com|Adobe]]
    * We use Hadoop and HBase in several areas from social services to 
structured data storage and processing for internal use.
-   * We currently have about 30 nodes running HDFS, Hadoop and HBase  in 
clusters ranging from 5 to 14 nodes on both production and development. We plan 
a deployment on an 80 nodes cluster.
+   * We currently have about 30 nodes running HDFS, Hadoop and HBase in 
clusters ranging from 5 to 14 nodes on both production and development. We plan 
a deployment on an 80 node cluster.
    * We constantly write data to HBase and run MapReduce jobs to process it, 
then store the results back to HBase or external systems.
    * Our production cluster has been running since Oct 2008.
  
@@ -35, +34 @@

    * Each node has 8 cores, 16G RAM and 1.4T storage.
  
   * [[http://aws.amazon.com/|Amazon Web Services]]
-   * We provide [[http://aws.amazon.com/elasticmapreduce|Amazon Elastic 
MapReduce]].  It's a web service that provides a hosted Hadoop framework 
running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon 
EC2) and Amazon Simple Storage Service (Amazon S3).
+   * We provide [[http://aws.amazon.com/elasticmapreduce|Amazon Elastic 
MapReduce]]. It's a web service that provides a hosted Hadoop framework running 
on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) 
and Amazon Simple Storage Service (Amazon S3).
    * Our customers can instantly provision as much or as little capacity as 
they like to perform data-intensive tasks for applications such as web 
indexing, data mining, log file analysis, machine learning, financial analysis, 
scientific simulation, and bioinformatics research.
  
   * [[http://aol.com/|AOL]]
@@ -52, +51 @@

  
   * [[http://www.backdocsearch.com|backdocsearch.com]] - search engine for 
chiropractic information, local chiropractors, products and schools
  
-  * [[http://www.baidu.cn|Baidu]] -  the leading Chinese language search engine
+  * [[http://www.baidu.cn|Baidu]] - the leading Chinese language search engine
    * Hadoop is used to analyze search logs and do some mining work on the web 
page database
    * We handle about 3000TB per week
    * Our clusters vary from 10 to 500 nodes
@@ -62, +61 @@

    * 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    * We use hadoop for matching dating profiles
  
-  * [[http://www.benipaltechnologies.com|Benipal Technologies]] -  
Outsourcing, Consulting, Innovation
+  * [[http://www.benipaltechnologies.com|Benipal Technologies]] - Outsourcing, 
Consulting, Innovation
    * 35 Node Cluster (Core2Quad Q9400 Processor, 4-8 GB RAM, 500 GB HDD)
    * Largest Data Node with Xeon E5420*2 Processors, 64GB RAM, 3.5 TB HDD
    * Total Cluster capacity of around 20 TB on a gigabit network with failover 
and redundancy
@@ -73, +72 @@

    * We're doing a 200M page/5TB crawl as part of the 
[[http://bixolabs.com/datasets/public-terabyte-dataset-project/|public terabyte 
dataset project]].
    * This runs as a 20 machine 
[[http://aws.amazon.com/elasticmapreduce/|Elastic MapReduce]] cluster.
  
-  * [[http://www.brainpad.co.jp|BrainPad]] -  Data mining and analysis
+  * [[http://www.brainpad.co.jp|BrainPad]] - Data mining and analysis
    * We use Hadoop to summarize users' tracking data.
    * We also use it for analysis.
  
@@ -86, +85 @@

  
   * [[http://www.contextweb.com/|Contextweb]] - ADSDAQ Ad Exchange
    * We use Hadoop to store ad serving logs and use them as a source for ad 
optimization/analytics/reporting/machine learning.
-   * Currently we have a 23 machine cluster with 184 cores and about 35TB raw 
storage.  Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.
+   * Currently we have a 23 machine cluster with 184 cores and about 35TB raw 
storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.
  
   * [[http://www.cooliris.com|Cooliris]] - Cooliris transforms your browser 
into a lightning fast, cinematic way to browse photos and videos, both online 
and on your hard drive.
    * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB 
ram, and 3-4 TB of storage.
@@ -104, +103 @@

    * We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster 
sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of 
millions to billions of RDF statements).
  
   * [[http://www.datameer.com|Datameer]]
-   * Datameer Analytics Solution (DAS) is the first Hadoop-based solution for 
big data analytics that includes data source integration, storage, an analytics 
engine and visualization. 
+   * Datameer Analytics Solution (DAS) is the first Hadoop-based solution for 
big data analytics that includes data source integration, storage, an analytics 
engine and visualization.
    * DAS Log File Aggregator is a plug-in to DAS that makes it easy to import 
large numbers of log files stored on disparate servers.
  
   * [[http://www.deepdyve.com|Deepdyve]]
@@ -119, +118 @@

    * We generate Pig Latin scripts that describe structural and semantic 
conversions between data contexts (see the sketch after this entry)
    * We use Hadoop to execute these scripts for production-level deployments
    * Eliminates the need for explicit data and schema mappings during database 
integration
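
   A toy Python sketch of what generating such a Pig Latin script could look 
like, assuming a simple per-field rename/projection mapping between two 
tab-separated schemas (the schemas, field names and paths below are invented 
for illustration):

{{{#!python
# generate_pig.py -- emit a tiny Pig Latin script that projects and renames
# fields from a source schema into a target schema. The schemas and paths
# here are illustrative only.
SOURCE_FIELDS = ["cust_id", "cust_name", "amount"]
MAPPING = {"id": "cust_id", "name": "cust_name", "total": "amount"}

def pig_script(src_path, dst_path):
    load = "src = LOAD '%s' USING PigStorage('\\t') AS (%s);" % (
        src_path, ", ".join(SOURCE_FIELDS))
    gen = "dst = FOREACH src GENERATE %s;" % ", ".join(
        "%s AS %s" % (src_field, dst_field)
        for dst_field, src_field in MAPPING.items())
    store = "STORE dst INTO '%s' USING PigStorage('\\t');" % dst_path
    return "\n".join([load, gen, store])

if __name__ == "__main__":
    print(pig_script("/data/source", "/data/converted"))
}}}

   The generated script would then be submitted to the cluster with the pig 
command-line client, as the entry above describes.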
+ 
+  * [[http://www.ebay.com|EBay]]
+   * 532 node cluster (8 * 532 cores, 5.3PB).
+   * Heavy usage of Java MapReduce, Pig, Hive
+   * Using it for search optimization and research.
  
   * [[http://www.enormo.com/|Enormo]]
    * 4 node cluster (32 cores, 1TB).
@@ -145, +149 @@

     * A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
     * A 300-machine cluster with 2400 cores and about 3 PB raw storage.
     * Each (commodity) node has 8 cores and 12 TB of storage.
-   * We are heavy users of both streaming as well as the Java apis. We have 
built a higher level data warehousing framework using these features called 
Hive (see the http://hadoop.apache.org/hive/).  We have also developed a FUSE 
implementation over hdfs.
+   * We are heavy users of both streaming and the Java APIs. We have built a 
higher-level data warehousing framework using these features, called Hive (see 
http://hadoop.apache.org/hive/). We have also developed a FUSE implementation 
over HDFS.
  
   * [[http://www.foxaudiencenetwork.com|FOX Audience Network]]
    * 40 machine cluster (8 cores/machine, 2TB/machine storage)
@@ -171, +175 @@

    * 
[[http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html|University
 Initiative to Address Internet-Scale Computing Challenges]]
  
   * [[http://www.gruter.com|Gruter. Corp.]]
-   * 30 machine cluster  (4 cores, 1TB~2TB/machine storage)
+   * 30 machine cluster (4 cores, 1TB~2TB/machine storage)
    * storage for blog data and web documents
    * used for data indexing by MapReduce
    * link analysis and machine learning by MapReduce
@@ -243, +247 @@

    * Uses Hadoop FileSystem, RPC and IO
  
   * [[http://www.koubei.com/|Koubei.com]] Large local community and local 
search in China.
-   . Using Hadoop to process apache log, analyzing user's action and click 
flow and the links click with any specified page in site and more.  Using 
Hadoop to process whole price data user input with map/reduce.
+   . Using Hadoop to process Apache logs, analyzing users' actions, click 
flow, and the links clicked from any specified page on the site, and more. Also 
using Hadoop with map/reduce to process all of the price data that users enter.
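
   A rough sketch of the log-processing side, as a Hadoop Streaming mapper in 
Python that parses Apache combined-log lines and emits one count per requested 
page (the log format and the per-page counting goal are assumptions for 
illustration; a sum-by-key reducer like the one sketched earlier would complete 
the job):

{{{#!python
#!/usr/bin/env python
# clicks_mapper.py -- parse Apache combined-log lines and emit (page, 1).
# The combined log format and counting clicks per page are illustrative
# assumptions, not details from the entry above.
import re
import sys

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')

for line in sys.stdin:
    match = LOG_RE.match(line)
    if not match:
        continue  # skip lines that do not look like access-log entries
    ip, timestamp, method, path, status = match.groups()
    if method == "GET" and status.startswith("2"):
        print("%s\t1" % path)
}}}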
  
   * [[http://krugle.com/|Krugle]]
    * Source code search engine uses Hadoop and Nutch.
@@ -272, +276 @@

   * [[http://www.markt24.de/|Markt24]]
    * We use Hadoop to filter user behaviour, recommendations and trends from 
external sites
    * Using zkpython
-   * Used EC2, no using many small machines (8GB Ram, 4 cores, 1TB) 
+   * Used EC2, now using many small machines (8GB RAM, 4 cores, 1TB)
  
   * [[http://www.crmcs.com//|MicroCode]]
    * 18 node cluster (Quad-Core Intel Xeon, 1TB/node storage)
@@ -304, +308 @@

    * Powers data for search and aggregation
  
   * [[http://lucene.apache.org/mahout|Mahout]]
-   . Another Apache project using Hadoop to build scalable machine learning   
algorithms like canopy clustering, k-means and many more to come (naive bayes 
classifiers, others)
+   . Another Apache project using Hadoop to build scalable machine learning 
algorithms like canopy clustering, k-means and many more to come (naive Bayes 
classifiers, others)
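
   For readers unfamiliar with the algorithm being scaled out, here is a tiny 
single-machine Python sketch of k-means (Lloyd's assignment and update steps). 
It is purely illustrative of the technique; Mahout runs the equivalent steps as 
MapReduce jobs over HDFS rather than looping in memory like this.

{{{#!python
# kmeans_toy.py -- a minimal in-memory k-means on 2-D points, for illustration.
# Mahout distributes the same assign/update steps as MapReduce jobs.
import random

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / float(len(cluster))
                                     for dim in zip(*cluster))
    return centroids

if __name__ == "__main__":
    data = [(random.random(), random.random()) for _ in range(200)]
    print(kmeans(data, k=3))
}}}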
  
   * [[http://metrixcloud.com/|MetrixCloud]] - provides commercial support, 
installation, and hosting of Hadoop Clusters. 
[[http://metrixcloud.com/contact.php|Contact Us.]]
  
@@ -338, +342 @@

   * [[http://www.powerset.com|Powerset / Microsoft]] - Natural Language Search
    * up to 400 instances on 
[[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon
 EC2]]
    * data storage in 
[[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon
 S3]]
-   * Microsoft is now contributing to HBase, a Hadoop subproject (   
[[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).
+   * Microsoft is now contributing to HBase, a Hadoop subproject 
([[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).
  
   * [[http://pressflip.com|Pressflip]] - Personalized Persistent Search
    * Using Hadoop on EC2 to process documents from a continuous web crawl and 
distributed training of support vector machines
@@ -351, +355 @@

  
   * [[http://www.psgtech.edu/|PSG Tech, Coimbatore, India]]
    * Multiple alignment of protein sequences helps to determine evolutionary 
linkages and to predict molecular structures. The dynamic nature of the 
algorithm coupled with data and compute parallelism of hadoop data grids 
improves the accuracy and speed of sequence alignment. Parallelism at the 
sequence and block level reduces the time complexity of MSA problems. Scalable 
nature of Hadoop makes it apt to solve large scale alignment problems.
-   * Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 
Quad Core  Rack Server,  with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to 
E7200 / E7400 processors with 4 GB RAM and 160 GB HDD.
+   * Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 
Quad Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to E7200 
/ E7400 processors with 4 GB RAM and 160 GB HDD.
  
   * [[http://www.quantcast.com/|Quantcast]]
    * 3000 cores, 3500TB. 1PB+ processing each day.
@@ -376, +380 @@

    * We intend to parallelize some traditional classification and clustering 
algorithms like Naive Bayes, K-Means and EM so that they can deal with 
large-scale data sets.
  
   * [[http://alpha.search.wikia.com|Search Wikia]]
-   * A project to help develop open source social search tools.  We run a 125 
node hadoop cluster.
+   * A project to help develop open source social search tools. We run a 125 
node hadoop cluster.
  
   * [[http://wwwse.inf.tu-dresden.de/SEDNS/SEDNS_home.html|SEDNS]] - Security 
Enhanced DNS Group
    * We are gathering worldwide DNS data in order to discover content 
distribution networks and
@@ -418, +422 @@

    * 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine.
  
   * [[http://www.twitter.com|Twitter]]
-   * We use Hadoop to store and process tweets, log files, and many other 
types of data generated across Twitter.  We use Cloudera's CDH2 distribution of 
Hadoop, and store all data as compressed LZO files.
+   * We use Hadoop to store and process tweets, log files, and many other 
types of data generated across Twitter. We use Cloudera's CDH2 distribution of 
Hadoop, and store all data as compressed LZO files.
    * We use both Scala and Java to access Hadoop's MapReduce APIs
    * We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability 
to accomplish a lot with few statements.
    * We employ committers on Pig, Avro, Hive, and Cassandra, and contribute 
much of our internal Hadoop work to open source (see 
[[http://github.com/kevinweil/hadoop-lzo|hadoop-lzo]])
@@ -429, +433 @@

    We use Hadoop to facilitate information retrieval research & 
experimentation, particularly for TREC, using the Terrier IR platform. The open 
source release of [[http://ir.dcs.gla.ac.uk/terrier/|Terrier]] includes 
large-scale distributed indexing using Hadoop Map Reduce.
  
   * 
[[http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html|University of 
Maryland]]
-   . We are one of six universities participating in IBM/Google's academic 
cloud computing initiative.  Ongoing research and teaching efforts include 
projects in machine translation, language modeling, bioinformatics, email 
analysis, and image processing.
+   . We are one of six universities participating in IBM/Google's academic 
cloud computing initiative. Ongoing research and teaching efforts include 
projects in machine translation, language modeling, bioinformatics, email 
analysis, and image processing.
  
   * [[http://t2.unl.edu|University of Nebraska Lincoln, Research Computing 
Facility]]
-   . We currently run one medium-sized Hadoop cluster (200TB) to store and 
serve up physics data for the computing portion of the Compact Muon Solenoid 
(CMS) experiment.  This requires a filesystem which can download data at 
multiple Gbps and process data at an even higher rate locally.  Additionally, 
several of our students are involved in research projects on Hadoop.
+   . We currently run one medium-sized Hadoop cluster (200TB) to store and 
serve up physics data for the computing portion of the Compact Muon Solenoid 
(CMS) experiment. This requires a filesystem which can download data at 
multiple Gbps and process data at an even higher rate locally. Additionally, 
several of our students are involved in research projects on Hadoop.
  
   * [[http://www.veoh.com|Veoh]]
    * We use a small Hadoop cluster to reduce usage data for internal metrics, 
for search indexing and for recommendation data.
  
-  * [[http://www.visiblemeasures.com|Visible Measures Corporation]] uses 
Hadoop as a component in our Scalable Data Pipeline, which ultimately powers 
!VisibleSuite and other products.  We use Hadoop to aggregate, store, and 
analyze data related to in-stream viewing behavior of Internet video audiences. 
  Our current grid contains more than 128 CPU cores and in excess of 100 
terabytes of storage, and we plan to grow that substantially during 2008.
+  * [[http://www.visiblemeasures.com|Visible Measures Corporation]] uses 
Hadoop as a component in our Scalable Data Pipeline, which ultimately powers 
!VisibleSuite and other products. We use Hadoop to aggregate, store, and 
analyze data related to in-stream viewing behavior of Internet video audiences. 
Our current grid contains more than 128 CPU cores and in excess of 100 
terabytes of storage, and we plan to grow that substantially during 2008.
  
   * [[http://www.vksolutions.com/|VK Solutions]]
    * We use a small Hadoop cluster in the scope of our general research 
activities at [[http://www.vklabs.com|VK Labs]] to get faster data access from 
web applications.
