greatgautam commented on a change in pull request #63: Training-27: Hadoop Training Slides
URL: https://github.com/apache/incubator-training/pull/63#discussion_r349243065
##########
File path: content/ApacheHadoop/src/main/asciidoc/index.adoc
##########
@@ -0,0 +1,80 @@
+////
+
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+
+////
+:revealjs_progress: true
+:revealjs_slidenumber: true
+:sourcedir: ../java
+
+== What is Apache Hadoop?
+Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
+
+Two major layers
+
+- Processing layer (MapReduce)
+- Storage layer (Hadoop Distributed File System)
+
+== MapReduce
+
+Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
+
+The MapReduce framework consists of:
+- single master ResourceManager
+- one worker NodeManager per cluster-node
+- MRAppMaster per application
+
+== Hadoop Distributed File System
+Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
+
+- similar to other distributed file systems
+- highly fault-tolerant
+- designed to be deployed on low-cost hardware
+- provides high throughput access to application data
+
+== MapReduce Inputs and Outputs
+
+The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
+
+The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
+
+Input and Output types of a MapReduce job:
+
+(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
+
+== HDFS Architecture
+HDFS has a master/slave architecture. An HDFS cluster consists of:
+- single NameNode, a master server that manages the file system namespace and regulates access to files by clients
+- one of more of DataNodes used for storage and serving read and write requests from the file system’s clients
+
+== HDFS New Features
+New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:

Review comment:
  I have updated it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services