[GitHub] [incubator-training] brahmareddybattula commented on a change in pull request #63: Training-27: Hadoop Training Slides

GitBox Fri, 10 Jan 2020 01:55:15 -0800

brahmareddybattula commented on a change in pull request #63: Training-27: 
Hadoop Training Slides
URL: https://github.com/apache/incubator-training/pull/63#discussion_r363568719


 ##########
 File path: content/Hadoop/src/main/asciidoc/index.adoc
 ##########
 @@ -0,0 +1,126 @@
+////
+
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+
+////
+:revealjs_progress: true
+:revealjs_slidenumber: true
+:sourcedir: ../java
+
+== What is Apache Hadoop?
+
+Apache Hadoop is a framework that allows for the distributed processing of 
large data sets across clusters of computers using simple programming models.
+
+Two main layers:
+
+- Processing layer (MapReduce)
+- Storage layer (Hadoop Distributed File System)
+
+== MapReduce
+
+Hadoop MapReduce is a software framework for easily writing applications that 
process vast amounts of data (multi-terabyte data sets) in parallel on large 
clusters (thousands of nodes) of commodity hardware in a reliable, 
fault-tolerant manner. 
+
+The MapReduce framework consists of:
+
+- a single master ResourceManager
+- one worker NodeManager per cluster node
+- one MRAppMaster per application
+
+== Hadoop Distributed File System
+
+Hadoop Distributed File System (HDFS) is a distributed file system designed to 
run on commodity hardware. 
+
+- similar to other distributed file systems
+- highly fault-tolerant 
+- designed to be deployed on low-cost hardware
+- provides high throughput access to application data
+
+== MapReduce Inputs and Outputs
+
+The MapReduce framework operates exclusively on <key, value> pairs, that is, 
the framework views the input to the job as a set of <key, value> pairs and 
produces a set of <key, value> pairs as the output of the job, conceivably of 
different types.
+
+The key and value classes have to be serializable by the framework and hence 
need to implement the Writable interface. Additionally, the key classes have to 
implement the WritableComparable interface to facilitate sorting by the 
framework.
+
+Input and Output types of a MapReduce job:
+
+(input) <k1, v1> -> map -> <k2, v2> ->
+combine -> <k2, v2> ->
+reduce -> <k3, v3> (output)
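
The type flow above can be sketched with a small word count. This is a plain-Java simulation under stated assumptions, not the Hadoop API: the class `WordCountFlow` and its helper methods are hypothetical names, k1/v1 are taken to be a line offset and the line text, k2/v2 a word and a partial count, and k3/v3 a word and its total count. A real job would use Hadoop's `Mapper` and `Reducer` classes with `Writable` key and value types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the MapReduce type flow; not the Hadoop API. */
final class WordCountFlow {

    /** map: <k1 = line offset, v1 = line text> -> list of <k2 = word, v2 = 1>. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Shared by combine and reduce in this sketch: sums the values of one key. */
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Groups pairs by key; the TreeMap keeps keys sorted, mimicking the sort on k2. */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Runs map -> combine (per input line) -> reduce over all lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            // combine: local aggregation of one map call's output, still <k2, v2>
            groupByKey(map(offset, line)).forEach(
                (word, ones) -> combined.add(Map.entry(word, sum(ones))));
            offset += line.length() + 1;
        }
        // reduce: aggregation across all combined pairs, producing <k3, v3>
        Map<String, Integer> result = new TreeMap<>();
        groupByKey(combined).forEach((word, counts) -> result.put(word, sum(counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hdfs=1, hello=3}
        System.out.println(run(List.of("hello hadoop", "hello hdfs hello")));
    }
}
```

The combine step is optional in a real job; it only reduces the amount of data passed to reduce, which is why its input and output types are both <k2, v2>.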
+
+== HDFS Architecture
+
+HDFS has a master/slave architecture. An HDFS cluster consists of:
+
+- single NameNode, a master server that manages the file system namespace and 
regulates access to files by clients
+- one or more DataNodes, which store data and serve read and write 
requests from the file system’s clients
+
+== HDFS Useful Features
+
+New features and improvements are regularly implemented in HDFS. The following 
is a subset of useful features in HDFS:
+
+- File permissions and authentication.
+- Rack awareness: to take a node’s physical location into account while 
scheduling tasks and allocating storage.
+- Safemode: an administrative mode for maintenance.
+- fsck: a utility to diagnose the health of the file system and to find 
missing files or blocks.
+- fetchdt: a utility to fetch DelegationToken and store it in a file on the 
local system.
+- Balancer: tool to balance the cluster when the data is unevenly distributed 
among DataNodes.
+- Upgrade and rollback: after a software upgrade, it is possible to roll back 
to the HDFS state before the upgrade in case of unexpected problems.
+
+
+== HDFS Special Nodes
+
+Special nodes in HDFS are as follows:
+
+- Secondary NameNode
+- Checkpoint node
+- Backup node
+
+== HDFS Secondary NameNode
+
+The Secondary NameNode performs periodic checkpoints of the namespace.
+It helps keep the size of the file containing the log of HDFS modifications 
within certain limits at the NameNode.
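
Why a checkpoint keeps the log small can be shown with a toy model. This is a minimal sketch under stated assumptions, not how HDFS implements it: the class `CheckpointSketch`, the "ADD"/"DEL" edit format and the set-of-paths image are all hypothetical. The idea it illustrates is that a checkpoint replays the logged modifications onto the last namespace image and then starts a fresh, empty log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of namespace checkpointing; names are hypothetical, not HDFS classes. */
final class CheckpointSketch {

    /** Last saved snapshot of the namespace (here: just a sorted set of paths). */
    private TreeSet<String> image = new TreeSet<>();

    /** Modifications recorded since the last checkpoint, e.g. "ADD /a" or "DEL /a". */
    private final List<String> editLog = new ArrayList<>();

    void create(String path) {
        editLog.add("ADD " + path);
    }

    void delete(String path) {
        editLog.add("DEL " + path);
    }

    /** Replays the edit log onto a copy of the image, then empties the log. */
    void checkpoint() {
        TreeSet<String> merged = new TreeSet<>(image);
        for (String edit : editLog) {
            String path = edit.substring(4); // skip the 4-character "ADD " / "DEL " prefix
            if (edit.startsWith("ADD ")) {
                merged.add(path);
            } else {
                merged.remove(path);
            }
        }
        image = merged;
        editLog.clear();
    }

    int editLogSize() {
        return editLog.size();
    }

    List<String> imagePaths() {
        return new ArrayList<>(image);
    }

    public static void main(String[] args) {
        CheckpointSketch ns = new CheckpointSketch();
        ns.create("/a");
        ns.create("/b");
        ns.delete("/a");
        System.out.println("edits before checkpoint: " + ns.editLogSize());
        ns.checkpoint();
        System.out.println("edits after checkpoint: " + ns.editLogSize());
        System.out.println("image: " + ns.imagePaths());
    }
}
```

Without periodic checkpoints the log in this model would grow with every modification, and rebuilding the namespace would require replaying all of it.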
+
+== HDFS Checkpoint node
+
+The Checkpoint node performs periodic checkpoints of the namespace.
+It helps minimize the size of the log stored at the NameNode containing 
changes to HDFS.
+It replaces the role previously filled by the Secondary NameNode.
+The NameNode allows multiple Checkpoint nodes simultaneously, as long as there 
are no Backup nodes registered with the system.
+
+
+== HDFS Backup node
+
+Backup node is an extension to the Checkpoint node.
+In addition to checkpointing it also receives a stream of edits from the 
NameNode.
+It maintains its own in-memory copy of the namespace, which is always in sync 
with the active NameNode namespace state.
+Only one Backup node may be registered with the NameNode at once.
+
+== HDFS Commands
+
+All HDFS commands are invoked by the bin/hdfs script.
+Running the hdfs script without any arguments prints the description for all 
commands.
+
+Usage: hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
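
The ordering in the usage line can be made concrete with a small helper that assembles an argument list. This is only an illustrative sketch: the class `HdfsCommandLine` is hypothetical, the file names `local.txt` and `/tmp/local.txt` are placeholders, and the helper only builds the command string without running it. The example assumes `--loglevel` as a shell option, `dfs` as the command, `-D property=value` as a generic option and `-put` as a command option.

```java
import java.util.ArrayList;
import java.util.List;

/** Assembles an hdfs invocation in the order given by the usage line. */
final class HdfsCommandLine {

    /** hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS] */
    static List<String> build(List<String> shellOptions, String command,
                              List<String> genericOptions, List<String> commandOptions) {
        List<String> argv = new ArrayList<>();
        argv.add("hdfs");
        argv.addAll(shellOptions);   // options for the hdfs script itself
        argv.add(command);           // the subcommand to run
        argv.addAll(genericOptions); // options common to many commands
        argv.addAll(commandOptions); // options specific to this command
        return argv;
    }

    public static void main(String[] args) {
        List<String> argv = build(
            List.of("--loglevel", "DEBUG"),
            "dfs",
            List.of("-D", "dfs.replication=2"),
            List.of("-put", "local.txt", "/tmp/local.txt")); // placeholder paths
        // prints hdfs --loglevel DEBUG dfs -D dfs.replication=2 -put local.txt /tmp/local.txt
        System.out.println(String.join(" ", argv));
    }
}
```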
+
+
+== Common HDFS Commands
 
 Review comment:
   Can we provide the same for YARN and MR as well?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
