[GitHub] [incubator-training] greatgautam commented on pull request #63: Training-27: Hadoop Training Slides

GitBox Wed, 20 May 2020 17:35:20 -0700


greatgautam commented on pull request #63:
URL: https://github.com/apache/incubator-training/pull/63#issuecomment-631809617



   Hi,
   I have updated the PR for Apache Hadoop slides. Please review and merge if
   it looks good.
   
   https://github.com/apache/incubator-training/pull/63
   
   Thanks,
   Gautam
   
   On Fri, Jan 10, 2020 at 1:54 AM Brahma Reddy Battula <
   notificati...@github.com> wrote:
   
   > *@brahmareddybattula* commented on this pull request.
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363564712>:
   >
   > > +      http://www.apache.org/licenses/LICENSE-2.0
   >
   > +
   >
   > +  Unless required by applicable law or agreed to in writing, software
   >
   > +  distributed under the License is distributed on an "AS IS" BASIS,
   >
   > +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   >
   > +  See the License for the specific language governing permissions and
   >
   > +  limitations under the License.
   >
   > +
   >
   > +////
   >
   > +:revealjs_progress: true
   >
   > +:revealjs_slidenumber: true
   >
   > +:sourcedir: ../java
   >
   > +
   >
   > +== What is Apache Hadoop?
   >
   > +
   >
   > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
   >
   >
   > I feel we could describe it as follows:
   >
   > The Apache Hadoop software library is a framework that allows for the
   > distributed processing of large data sets across clusters of computers
   > using simple programming models. It is designed to scale up from single
   > servers to thousands of machines, each offering local computation and
   > storage. Rather than rely on hardware to deliver high-availability, the
   > library itself is designed to detect and handle failures at the application
   > layer, so delivering a highly-available service on top of a cluster of
   > computers, each of which may be prone to failures.
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363566531>:
   >
   > > +  distributed under the License is distributed on an "AS IS" BASIS,
   >
   > +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   >
   > +  See the License for the specific language governing permissions and
   >
   > +  limitations under the License.
   >
   > +
   >
   > +////
   >
   > +:revealjs_progress: true
   >
   > +:revealjs_slidenumber: true
   >
   > +:sourcedir: ../java
   >
   > +
   >
   > +== What is Apache Hadoop?
   >
   > +
   >
   > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
   >
   > +
   >
   > +Two main layers:
   >
   > +
   >
   >
   > The base Apache Hadoop framework is composed of the following modules:
   >
   > Hadoop Common – contains libraries and utilities needed by other Hadoop
   > modules;
   >
   > Hadoop Distributed File System (HDFS) – a distributed file system that
   > stores data on commodity machines, providing very high aggregate bandwidth
   > across the cluster;
   >
   > Hadoop YARN – (introduced in 2012) a platform responsible for managing
   > computing resources in clusters and using them for scheduling users'
   > applications;
   >
   > Hadoop MapReduce – an implementation of the MapReduce programming model
   > for large-scale data processing.
   >
   >
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363566774>:
   >
   > > +== What is Apache Hadoop?
   >
   > +
   >
   > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
   >
   > +
   >
   > +Two main layers:
   >
   > +
   >
   > +- Processing layer (MapReduce)
   >
   > +- Storage layer (Hadoop Distributed File System)
   >
   > +
   >
   > +== MapReduce
   >
   > +
   >
   > +Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
   >
   > +
   >
   > +The MapReduce framework consists of:
   >
   > +
   >
   > +- single master ResourceManager
   >
   >
   > ResourceManager is a YARN component, not part of MapReduce (MR).
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363567463>:
   >
   > > +
   >
   > +== MapReduce
   >
   > +
   >
   > +Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
   >
   > +
   >
   > +The MapReduce framework consists of:
   >
   > +
   >
   > +- single master ResourceManager
   >
   > +- one worker NodeManager per cluster-node
   >
   > +- MRAppMaster per application
   >
   > +
   >
   > +== Hadoop Distributed File System
   >
   > +
   >
   > +Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
   >
   > +
   >
   > +- similar to other distributed file systems
   >
   >
   > I suggest you go through the following and update this section accordingly:
   >
   > https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363567669>:
   >
   > > +
   >
   > +////
   >
   > +:revealjs_progress: true
   >
   > +:revealjs_slidenumber: true
   >
   > +:sourcedir: ../java
   >
   > +
   >
   > +== What is Apache Hadoop?
   >
   > +
   >
   > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
   >
   > +
   >
   > +Two main layers:
   >
   > +
   >
   > +- Processing layer (MapReduce)
   >
   > +- Storage layer (Hadoop Distributed File System)
   >
   > +
   >
   > +== MapReduce
   >
   >
   > I suggest you go through the following and update this section accordingly:
   >
   > https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363567952>:
   >
   > > +== HDFS Useful Features
   >
   > +
   >
   > +New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
   >
   > +
   >
   > +- File permissions and authentication.
   >
   > +- Rack awareness: to take a node’s physical location into account while scheduling tasks and allocating storage.
   >
   > +- Safemode: an administrative mode for maintenance.
   >
   > +- fsck: a utility to diagnose health of the file system, to find missing files or blocks.
   >
   > +- fetchdt: a utility to fetch DelegationToken and store it in a file on the local system.
   >
   > +- Balancer: tool to balance the cluster when the data is unevenly distributed among DataNodes.
   >
   > +- Upgrade and rollback: after a software upgrade, it is possible to rollback to HDFS’ state before the upgrade in case of unexpected problems.
   >
   > +
   >
   > +
   >
   > +== HDFS Special Nodes
   >
   > +
   >
   > +Main nodes in HDFS are as follows:
   >
   >
   > Please include high availability; these nodes are rarely used now. Could
   > we try to update this content?
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363568291>:
   >
   > > +== MapReduce Inputs and Outputs
   >
   > +
   >
   > +The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
   >
   > +
   >
   > +The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
   >
   > +
   >
   > +Input and Output types of a MapReduce job:
   >
   > +
   >
   > +(input) <k1, v1> -> map -> <k2, v2> ->
   >
   > +combine -> <k2, v2> ->
   >
   > +reduce -> <k3, v3> (output)
   >
   > +
   >
   > +== HDFS Architecture
   >
   > +
   >
   > +HDFS has a master/slave architecture. An HDFS cluster consists of:
   >
   > +
   >
   >
   > It would be better to include the architecture diagram (from the link
   > below); if it's already included, please ignore this.
   >
   > https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
   > ------------------------------
   >
   > In content/Hadoop/src/main/asciidoc/index.adoc
   > <https://github.com/apache/incubator-training/pull/63#discussion_r363568719>:
   >
   > > +== HDFS Backup node
   >
   > +
   >
   > +Backup node is an extension to the Checkpoint node.
   >
   > +In addition to checkpointing it also receives a stream of edits from the NameNode.
   >
   > +It maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state.
   >
   > +Only one Backup node may be registered with the NameNode at once.
   >
   > +
   >
   > +== HDFS Commands
   >
   > +
   >
   > +All HDFS commands are invoked by the bin/hdfs script.
   >
   > +Running the hdfs script without any arguments prints the description for all commands.
   >
   > +
   >
   > +Usage: hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
   >
   > +
   >
   > +
   >
   > +== Common HDFS Commands
   >
   >
   > Could we give the same for YARN and MR as well?
   >
   >
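The <key, value> flow quoted in the "MapReduce Inputs and Outputs" slide above (map, then combine, then reduce) can be sketched in plain Python. This is only an illustration of the data flow, assuming a word-count job; it does not use the Hadoop API, and the function names (map_phase, combine, reduce_phase, run_job) are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(records):
    """(input) <k1, v1> -> map -> <k2, v2>: emit (word, 1) per word."""
    for _offset, line in records:
        for word in line.split():
            yield word, 1


def combine(pairs):
    """Local pre-aggregation on one split: <k2, v2> -> <k2, v2>."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_phase(pairs):
    """<k2, v2> -> reduce -> <k3, v3>: sum the values grouped by key."""
    # Sorting stands in for the framework's sort/shuffle step, which is
    # why key classes must be comparable.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def run_job(splits):
    """Run map and combine per input split, then a single reduce."""
    combined = []
    for split in splits:
        combined.extend(combine(map_phase(split)))
    return dict(reduce_phase(combined))


if __name__ == "__main__":
    # Two hypothetical input splits of (byte offset, line) records.
    splits = [[(0, "a b a")], [(0, "b c")]]
    print(run_job(splits))
```

The combiner runs once per split here, mirroring how it reduces the volume of intermediate pairs before they reach the reducer.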
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
