Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "PoweredBy" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev1=220&rev2=221

--------------------------------------------------

  Applications and organizations using Hadoop include (alphabetically):
  
+ <<TableOfContents(3)>>
+ 
+ = A =
+ 
-  * [[http://a9.com/|A9.com]] - Amazon
+  * [[http://a9.com/|A9.com]] - Amazon *
    * We build Amazon's product search indices using the streaming API and 
pre-existing C++, Perl, and Python tools.
    * We process millions of sessions daily for analytics, using both the Java 
and streaming APIs.
    * Our clusters vary from 1 to 100 nodes.
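+   * A minimal sketch of the Hadoop Streaming pattern referred to above (word 
count as a stand-in task; the script names, paths, and jar location are 
illustrative assumptions, not A9's actual jobs):
+ {{{
+ #!/usr/bin/env python
+ # mapper.py -- Hadoop Streaming mapper: reads raw text lines from stdin and
+ # emits tab-separated (word, 1) pairs on stdout.
+ import sys
+ 
+ for line in sys.stdin:
+     for word in line.split():
+         sys.stdout.write("%s\t1\n" % word)
+ }}}
+ {{{
+ #!/usr/bin/env python
+ # reducer.py -- Hadoop Streaming reducer: input arrives sorted by key, so we
+ # sum consecutive counts for each word and emit one total per word.
+ import sys
+ 
+ current, total = None, 0
+ for line in sys.stdin:
+     word, count = line.rstrip("\n").split("\t", 1)
+     if word != current:
+         if current is not None:
+             sys.stdout.write("%s\t%d\n" % (current, total))
+         current, total = word, 0
+     total += int(count)
+ if current is not None:
+     sys.stdout.write("%s\t%d\n" % (current, total))
+ 
+ # Submit with something like (streaming jar path varies by Hadoop version):
+ #   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
+ #     -input input/ -output wordcounts/ \
+ #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
+ }}}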
@@ -45, +49 @@

  
   * [[http://atbrox.com/|Atbrox]]
    * We use Hadoop for information extraction & search, and data analysis 
consulting.
- 
    * Cluster: we primarily use Amazon's Elastic MapReduce.
+ 
+ = B =
  
   * [[http://www.babacar.org/|BabaCar]]
    * 4-node cluster (32 cores, 1TB).
@@ -81, +86 @@

    * We use Hadoop to summarize users' tracking data.
    * We also use it for analysis.
  
+ = C =
+ 
   * [[http://www.cascading.org/|Cascading]] - Cascading is a feature-rich API 
for defining and executing complex, fault-tolerant data processing workflows 
on a Hadoop cluster.
  
   * [[http://www.cloudera.com|Cloudera, Inc]] - Cloudera provides commercial 
support and professional training for Hadoop.
    * We provide [[http://www.cloudera.com/hadoop|Cloudera's Distribution for 
Hadoop]]. Stable packages for Red Hat and Ubuntu (RPMs / debs), EC2 images, and 
web-based configuration.
- 
    * Check out our [[http://www.cloudera.com/blog|Hadoop and Big Data Blog]]
- 
    * Get [[http://oreilly.com/catalog/9780596521998/index.html|"Hadoop: The 
Definitive Guide"]] (Tom White/O'Reilly)
  
   * [[http://www.contextweb.com/|Contextweb]] - Ad Exchange
@@ -101, +106 @@

   * [[http://www.weblab.infosci.cornell.edu/|Cornell University Web Lab]]
    * Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 
72GB Hard Drive)
  
+ = D =
+ 
   * [[http://datagraph.org/|Datagraph]]
    * We use Hadoop for batch-processing large [[http://www.w3.org/RDF/|RDF]] 
datasets, in particular for indexing RDF data.
- 
    * We also use Hadoop for executing long-running offline 
[[http://en.wikipedia.org/wiki/SPARQL|SPARQL]] queries for clients.
- 
    * We use Amazon S3 and Cassandra to store input RDF datasets and output 
files.
    * We've developed [[http://rdfgrid.rubyforge.org/|RDFgrid]], a Ruby 
framework for map/reduce-based processing of RDF data.
- 
    * We primarily use Ruby, [[http://rdf.rubyforge.org/|RDF.rb]] and RDFgrid 
to process RDF data with Hadoop Streaming.
- 
    * We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster 
sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of 
millions to billions of RDF statements).
  
   * [[http://www.datameer.com|Datameer]]
@@ -130, +133 @@

    * We use Hadoop to execute these scripts for production-level deployments
    * Eliminates the need for explicit data and schema mappings during database 
integration
  
+ = E =
+ 
   * [[http://www.ebay.com|EBay]]
    * 532-node cluster (8 * 532 cores, 5.3PB).
    * Heavy usage of Java MapReduce, Pig, Hive, HBase
- 
    * Using it for search optimization and research.
  
   * [[http://www.enormo.com/|Enormo]]
@@ -148, +152 @@

  
   * [[http://www.systems.ethz.ch/education/courses/hs08/map-reduce/|ETH Zurich 
Systems Group]]
    * We are using Hadoop in a course that we are currently teaching: 
"Massively Parallel Data Analysis with MapReduce". The course projects are 
based on real use-cases from biological data analysis.
- 
    * Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
  
   * [[http://www.eyealike.com/|Eyealike]] - Visual Media Search Platform
    * Facial similarity and recognition across large datasets.
    * Image content based advertising and auto-tagging for social media.
    * Image based video copyright protection.
+ 
+ = F =
  
   * [[http://www.facebook.com/|Facebook]]
    * We use Hadoop to store copies of internal log and dimension data sources 
and use it as a source for reporting/analytics and machine learning.
@@ -162, +167 @@

     * A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
     * A 300-machine cluster with 2400 cores and about 3 PB raw storage.
     * Each (commodity) node has 8 cores and 12 TB of storage.
- 
-   * We are heavy users of both streaming as well as the Java apis. We have 
built a higher level data warehousing framework using these features called 
Hive (see the http://hadoop.apache.org/hive/). We have also developed a FUSE 
implementation over hdfs.
+    * We are heavy users of both the streaming and Java APIs. We have built a 
higher-level data warehousing framework called Hive using these features (see 
http://hadoop.apache.org/hive/). We have also developed a FUSE implementation 
over HDFS.
  
   * [[http://www.foxaudiencenetwork.com|FOX Audience Network]]
    * 40 machine cluster (8 cores/machine, 2TB/machine storage)
@@ -175, +179 @@

    * 5 machine cluster (8 cores/machine, 5TB/machine storage)
    * Existing 19 virtual machine cluster (2 cores/machine 30TB storage)
    * Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) 
using [[http://github.com/trafficbroker/mandy|our Ruby library]], or see the 
[[http://oobaloo.co.uk/articles/2010/1/12/mapreduce-with-hadoop-and-ruby.html|canonical
 WordCount example]].
- 
    * Daily batch ETL with a slightly modified 
[[http://github.com/pingles/clojure-hadoop|clojure-hadoop]]
- 
    * Log analysis
    * Data mining
    * Machine learning
@@ -187, +189 @@

    * Our Hadoop environment produces the original database for fast access 
from our web application.
    * We also use Hadoop to analyze similarities in users' behavior.
  
+ = G =
+ 
   * [[http://www.google.com|Google]]
    * 
[[http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html|University
 Initiative to Address Internet-Scale Computing Challenges]]
  
@@ -194, +198 @@

    * 30 machine cluster (4 cores, 1TB~2TB/machine storage)
    * storage for blog data and web documents
    * used for data indexing by MapReduce
- 
    * link analysis and machine learning by MapReduce
  
   * [[http://gumgum.com|GumGum]]
    * 20+ node cluster (Amazon EC2 c1.medium)
    * Nightly MapReduce jobs on 
[[http://aws.amazon.com/elasticmapreduce/|Amazon Elastic MapReduce]] process 
data stored in S3
- 
    * MapReduce jobs written in [[http://groovy.codehaus.org/|Groovy]] use 
Hadoop Java APIs
- 
    * Image and advertising analytics
+ 
+ = H =
  
   * [[http://www.hadoop.co.kr/|Hadoop Korean User Group]], a Korean Local 
Community Team Page.
    * 50 node cluster in the Korea University network environment.
     * Pentium 4 PC, HDFS 4TB Storage
- 
    * Used for development projects
     * Retrieving and Analyzing Biomedical Knowledge
     * Latent Semantic Analysis, Collaborative Filtering
@@ -228, +230 @@

    * We use a customised version of Hadoop and Nutch in a currently 
experimental 6 node/Dual Core cluster environment.
    * We crawl our clients' websites and, from the information we gather, 
fingerprint old and outdated software packages in that shared hosting 
environment. After matching a signature against a database, we can inform our 
clients that they are running old, unpatched software. With that information we 
know which sites require patching, which we offer as a free courtesy service to 
protect the majority of users. Without Nutch and Hadoop this would be a far 
harder task to accomplish.
  
+ = I =
+ 
   * [[http://www.ibm.com|IBM]]
    * [[http://www-03.ibm.com/press/us/en/pressrelease/22613.wss|Blue Cloud 
Computing Clusters]]
  
@@ -250, +254 @@

  
   * [[http://infochimps.org|Infochimps]]
    * 30 node AWS EC2 cluster (varying instance size, currently EBS-backed) 
managed by Chef & Poolparty running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 
0.04, [[http://github.com/infochimps/wukong|Wukong]]
- 
    * Used for ETL & data analysis on terascale datasets, especially social 
network data (on [[http://api.infochimps.com|api.infochimps.com]])
  
   * [[http://www.iterend.com/|Iterend]]
    * Using a 10-node HDFS cluster to store and process retrieved data.
  
+ = J =
+ 
   * [[http://joost.com|Joost]]
    * Session analysis and report generation
  
   * [[http://www.journeydynamics.com|Journey Dynamics]]
    * Using Hadoop MapReduce to analyse billions of lines of GPS data to create 
TrafficSpeeds, our accurate traffic speed forecast product.
  
+ = K =
+ 
   * [[http://www.karmasphere.com/|Karmasphere]]
    * Distributes [[http://www.hadoopstudio.org/|Karmasphere Studio for 
Hadoop]], which allows cross-version development and management of Hadoop jobs 
in a familiar integrated development environment.
  
@@ -273, +280 @@

  
   * [[http://krugle.com/|Krugle]]
    * Source code search engine uses Hadoop and Nutch.
+ 
+ = L =
  
   * [[http://www.last.fm|Last.fm]]
    * 44 nodes
@@ -285, +294 @@

    * HBase & Hadoop version 0.20
  
   * [[http://www.linkedin.com|LinkedIn]]
+   * We have multiple grids, divided up by purpose. They are composed of the 
following types of hardware:
-   * 2x50 Nehalem-based node grids, with 2x4 cores, 24GB RAM, 8x1TB storage 
using ZFS in a JBOD configuration.
+     * 100 Nehalem-based nodes, with 2x4 cores, 24GB RAM, 8x1TB storage using 
ZFS in a JBOD configuration on Solaris.
+     * 120 Westmere-based nodes, with 2x4 cores, 24GB RAM, 6x2TB storage using 
ext4 in a JBOD configuration on CentOS 5.5.
    * We use Hadoop and Pig for discovering People You May Know and other fun 
facts.
  
   * [[http://www.lookery.com|Lookery]]
@@ -294, +305 @@

  
   * [[http://www.lotame.com|Lotame]]
    * Using Hadoop and HBase for storage, log analysis, and pattern 
discovery/analysis.
+ 
+ = M =
  
   * [[http://www.markt24.de/|Markt24]]
    * We use Hadoop to filter user behaviour, recommendations, and trends from 
external sites.
@@ -315, +328 @@

    * We use Hadoop to develop MapReduce algorithms:
     * Information retrieval and analytics
     * Machine generated content - documents, text, audio, & video
- 
     * Natural Language Processing
- 
    * Project portfolio includes:
     * Natural Language Processing
     * Mobile Social Network Hacking
     * Web Crawlers/Page scraping
     * Text to Speech
     * Machine generated Audio & Video with remuxing
- 
     * Automatic PDF creation & IR
- 
    * 2 node cluster (Windows Vista/CYGWIN, & CentOS) for developing MapReduce 
programs.
  
   * [[http://www.mylife.com/|MyLife]]
@@ -338, +347 @@

  
   * [[http://metrixcloud.com/|MetrixCloud]] - provides commercial support, 
installation, and hosting of Hadoop Clusters. 
[[http://metrixcloud.com/contact.php|Contact Us.]]
  
+ = N =
+ 
   * [[http://www.openneptune.com|Neptune]]
    * Another Bigtable cloning project using Hadoop to store large structured 
data sets.
    * 200 nodes (each node has: 2 dual core CPUs, 2TB storage, 4GB RAM)
  
   * [[http://www.netseer.com|NetSeer]] -
    * Up to 1000 instances on 
[[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon
 EC2]]
- 
    * Data storage in 
[[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon
 S3]]
- 
    * 50 node cluster in Coloc
    * Used for crawling, processing, serving and log analysis
  
   * [[http://nytimes.com|The New York Times]]
    * 
[[http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/|Large
 scale image conversions]]
- 
    * Used EC2 to run Hadoop on a large virtual cluster
  
   * [[http://www.ning.com|Ning]]
    * We use Hadoop to store and process our log files
    * We rely on Apache Pig for reporting and analytics, on Cascading for 
machine learning, and on a proprietary JavaScript API for ad-hoc queries
- 
    * We use commodity hardware, with 8 cores and 16 GB of RAM per machine
  
   * [[http://lucene.apache.org/nutch|Nutch]] - flexible web search engine 
software
+ 
+ = P =
  
   * [[http://parc.com|PARC]] - Used Hadoop to analyze Wikipedia conflicts 
[[http://asc.parc.googlepages.com/2007-10-28-VAST2007-RevertGraph-Wiki.pdf|paper]].
  
   * [[http://pentaho.com|Pentaho]] – Open Source Business Intelligence
    * Pentaho provides the only complete, end-to-end open source BI alternative 
to proprietary offerings like Oracle, SAP and IBM
- 
    * We provide an easy-to-use, graphical ETL tool that is integrated with 
Hadoop for managing data and coordinating Hadoop-related tasks in the broader 
context of your ETL and Business Intelligence workflow
- 
    * We also provide Reporting and Analysis capabilities against big data in 
Hadoop
- 
    * Learn more at 
[[http://www.pentaho.com/hadoop/|http://www.pentaho.com/hadoop]]
  
   * [[http://pharm2phork.org|Pharm2Phork Project]] - Agricultural Traceability
@@ -380, +386 @@

  
   * [[http://www.powerset.com|Powerset / Microsoft]] - Natural Language Search
    * up to 400 instances on 
[[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon
 EC2]]
- 
    * data storage in 
[[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon
 S3]]
- 
    * Microsoft is now contributing to HBase, a Hadoop subproject ( 
[[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).
  
   * [[http://pressflip.com|Pressflip]] - Personalized Persistent Search
@@ -398, +402 @@

    * Multiple alignment of protein sequences helps to determine evolutionary 
linkages and to predict molecular structures. The dynamic nature of the 
algorithm, coupled with the data and compute parallelism of Hadoop data grids, 
improves the accuracy and speed of sequence alignment. Parallelism at the 
sequence and block level reduces the time complexity of MSA problems. The 
scalable nature of Hadoop makes it apt for solving large-scale alignment 
problems.
    * Our cluster size varies from 5 to 10 nodes. Cluster nodes range from 2950 
quad-core rack servers, with 2x6MB cache and 4 x 500 GB SATA hard drives, to 
E7200 / E7400 processors with 4 GB RAM and 160 GB HDD.
  
+ = Q =
+ 
   * [[http://www.quantcast.com/|Quantcast]]
    * 3000 cores, 3500TB. 1PB+ processing each day.
    * Hadoop scheduler with fully custom data path / sorter
    * Significant contributions to KFS filesystem
  
+ = R =
+ 
   * [[http://www.rackspace.com/email_hosting/|Rackspace]]
    * 30 node cluster (Dual-Core, 4-8GB RAM, 1.5TB/node storage)
     * Parses and indexes logs from email hosting system for search: 
http://blog.racklabs.com/?p=66
@@ -420, +428 @@

    * Hardware: 35 nodes (2*4cpu 10TB disk 16GB RAM each)
    * We intend to parallelize some traditional classification and clustering 
algorithms, like Naive Bayes, K-Means, and EM, so that they can deal with 
large-scale data sets.
  
+ = S =
+ 
   * [[http://alpha.search.wikia.com|Search Wikia]]
    * A project to help develop open source social search tools. We run a 
125-node Hadoop cluster.
  
@@ -429, +439 @@

  
   * [[http://www.slcsecurity.com/|SLC Security Services LLC]]
    * 18 node cluster (each node has: 4 dual core CPUs, 1TB storage, 4GB RAM, 
Red Hat OS)
- 
    * We use Hadoop for our high speed data mining applications
  
   * [[http://www.socialmedia.com/|Socialmedia.com]]
@@ -438, +447 @@

  
   * [[http://www.spadac.com/|Spadac.com]]
    * We are developing the MrGeo (Map/Reduce Geospatial) application to allow 
our users to bring cloud computing to geospatial processing.
- 
    * We use HDFS and MapReduce to store, process, and index geospatial imagery 
and vector data.
- 
    * MrGeo is soon to be open sourced as well.
  
   * [[http://stampedehost.com/|Stampede Data Solutions (Stampedehost.com)]]
    * Hosted Hadoop data warehouse solution provider
+ 
+ = T =
  
   * [[http://www.taragana.com|Taragana]] - Web 2.0 Product development and 
outsourcing services
    * We are using 16 consumer-grade computers to create the cluster, connected 
by a 100 Mbps network.
@@ -468, +477 @@

   * [[http://www.twitter.com|Twitter]]
    * We use Hadoop to store and process tweets, log files, and many other 
types of data generated across Twitter. We use Cloudera's CDH2 distribution of 
Hadoop, and store all data as compressed LZO files.
    * We use both Scala and Java to access Hadoop's MapReduce APIs
- 
    * We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability 
to accomplish a lot with few statements.
    * We employ committers on Pig, Avro, Hive, and Cassandra, and contribute 
much of our internal Hadoop work to open source (see 
[[http://github.com/kevinweil/hadoop-lzo|hadoop-lzo]])
- 
    * For more on our use of Hadoop, see the following presentations: 
[[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|Hadoop
 and Pig at Twitter]] and 
[[http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter|Protocol
 Buffers and Hadoop at Twitter]]
  
   * [[http://tynt.com|Tynt]]
@@ -479, +486 @@

    * We use Pig and custom Java map-reduce code, as well as Chukwa.
    * We have 94 nodes (752 cores) in our clusters, as of July 2010, but the 
number grows regularly.
  
+ = U =
+ 
   * [[http://glud.udistrital.edu.co|Universidad Distrital Francisco Jose de 
Caldas (Grupo GICOGE/Grupo Linux UD GLUD/Grupo GIGA)]]
    * 5-node low-profile cluster. We use Hadoop to support the research 
project: Territorial Intelligence System of Bogota City.
  
@@ -492, +501 @@

   * [[http://t2.unl.edu|University of Nebraska Lincoln, Research Computing 
Facility]]
    . We currently run one medium-sized Hadoop cluster (200TB) to store and 
serve up physics data for the computing portion of the Compact Muon Solenoid 
(CMS) experiment. This requires a filesystem which can download data at 
multiple Gbps and process data at an even higher rate locally. Additionally, 
several of our students are involved in research projects on Hadoop.
  
+ = V =
+ 
   * [[http://www.veoh.com|Veoh]]
    * We use a small Hadoop cluster to reduce usage data for internal metrics, 
for search indexing and for recommendation data.
  
@@ -499, +510 @@

  
   * [[http://www.vksolutions.com/|VK Solutions]]
    * We use a small Hadoop cluster in the scope of our general research 
activities at [[http://www.vklabs.com|VK Labs]] to get faster data access from 
web applications.
- 
    * We also use Hadoop for filtering and indexing listings, for log analysis, 
and for recommendation data.
+ 
+ = W =
  
   * [[http://www.worldlingo.com/|WorldLingo]]
    * Hardware: 44 servers (each server has: 2 dual core CPUs, 2TB storage, 8GB 
RAM)
@@ -510, +522 @@

    * HBase is used as a scalable and fast storage back end for millions of 
documents.
    * Currently we store 12 million documents, with a target of 450 million in 
the near future.
  
+ = Y =
+ 
   * [[http://www.yahoo.com/|Yahoo!]]
    * More than 100,000 CPUs in >36,000 computers running Hadoop
- 
    * Our biggest cluster: 4000 nodes (2*4cpu boxes with 4*1TB disk & 16GB RAM)
     * Used to support research for Ad Systems and Web Search
     * Also used to do scaling tests to support development of Hadoop on larger 
clusters
- 
    * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about 
how we use Hadoop.
- 
    * >60% of Hadoop jobs within Yahoo are Pig jobs.
+ 
+ = Z =
  
   * [[http://www.zvents.com/|Zvents]]
    * 10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
