[GitHub] [incubator-training] RyanSkraba commented on a change in pull request #63: Hadoop Slides

GitBox Fri, 15 Nov 2019 03:04:03 -0800

RyanSkraba commented on a change in pull request #63: Hadoop Slides
URL: https://github.com/apache/incubator-training/pull/63#discussion_r346756987


 ##########
 File path: content/ApacheHadoop/src/main/asciidoc/index.adoc
 ##########
 @@ -0,0 +1,80 @@
+////
+
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+
+////
+:revealjs_progress: true
+:revealjs_slidenumber: true
+:sourcedir: ../java
+
+== What is Apache Hadoop?
+Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
+
+Two major layers
+
+- Processing layer (MapReduce)
+- Storage layer (Hadoop Distributed File System)
+
+== MapReduce
+
+Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
+
+The MapReduce framework consists of:
+- single master ResourceManager
+- one worker NodeManager per cluster-node
+- MRAppMaster per application 
+
+== Hadoop Distributed File System
+Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
+
+- similar to other distributed file systems
+- highly fault-tolerant 
+- designed to be deployed on low-cost hardware
+- provides high throughput access to application data
+
+== MapReduce Inputs and Outputs
+
+The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
+
+The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
+
+Input and Output types of a MapReduce job:
+
+(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
+
+== HDFS Architecture
+HDFS has a master/slave architecture. An HDFS cluster consists of:
 
 Review comment:
   Add newline before bullets.
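
   As an illustration only, here is a sketch of the same fix applied to the MapReduce list earlier in this file, where the bullets also directly follow the introductory line. It assumes the usual Asciidoctor behaviour, in which a list that is not separated from the preceding paragraph by a blank line may be rendered as part of that paragraph. The bullets for the HDFS slide are not visible in this hunk, so they are not reproduced here.

   ```asciidoc
   The MapReduce framework consists of:

   - single master ResourceManager
   - one worker NodeManager per cluster-node
   - MRAppMaster per application
   ```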

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
