greatgautam commented on pull request #63: URL: https://github.com/apache/incubator-training/pull/63#issuecomment-631809617
Hi,

I have updated the PR for Apache Hadoop slides. Please review and merge if it looks good.
https://github.com/apache/incubator-training/pull/63

Thanks,
Gautam

On Fri, Jan 10, 2020 at 1:54 AM Brahma Reddy Battula <notificati...@github.com> wrote:

> *@brahmareddybattula* commented on this pull request.
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363564712>
> :
>
> > + http://www.apache.org/licenses/LICENSE-2.0
> > +
> > + Unless required by applicable law or agreed to in writing, software
> > + distributed under the License is distributed on an "AS IS" BASIS,
> > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + See the License for the specific language governing permissions and
> > + limitations under the License.
> > +
> > +////
> > +:revealjs_progress: true
> > +:revealjs_slidenumber: true
> > +:sourcedir: ../java
> > +
> > +== What is Apache Hadoop?
> > +
> > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
> >
> I feel we can describe it as below.
>
> The Apache Hadoop software library is a framework that allows for the
> distributed processing of large data sets across clusters of computers
> using simple programming models. It is designed to scale up from single
> servers to thousands of machines, each offering local computation and
> storage. Rather than rely on hardware to deliver high-availability, the
> library itself is designed to detect and handle failures at the application
> layer, so delivering a highly-available service on top of a cluster of
> computers, each of which may be prone to failures.
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363566531>
> :
>
> > + distributed under the License is distributed on an "AS IS" BASIS,
> > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + See the License for the specific language governing permissions and
> > + limitations under the License.
> > +
> > +////
> > +:revealjs_progress: true
> > +:revealjs_slidenumber: true
> > +:sourcedir: ../java
> > +
> > +== What is Apache Hadoop?
> > +
> > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
> > +
> > +Two main layers:
> > +
> >
> The base Apache Hadoop framework is composed of the following modules:
> >
> > Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
> > Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
> > Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;[10][11]
> > Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
>
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363566774>
> :
>
> > +== What is Apache Hadoop?
> > +
> > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
> > +
> > +Two main layers:
> > +
> > +- Processing layer (MapReduce)
> > +- Storage layer (Hadoop Distributed File System)
> > +
> > +== MapReduce
> > +
> > +Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
> > +
> > +The MapReduce framework consists of:
> > +
> > +- single master ResourceManager
> >
> This is for YARN, not for MR.
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363567463>
> :
>
> > +
> > +== MapReduce
> > +
> > +Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
> > +
> > +The MapReduce framework consists of:
> > +
> > +- single master ResourceManager
> > +- one worker NodeManager per cluster-node
> > +- MRAppMaster per application
> > +
> > +== Hadoop Distributed File System
> > +
> > +Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
> > +
> > +- similar to other distributed file systems
> >
> I suggest you go through the following and update accordingly.
>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363567669>
> :
>
> > +
> > +////
> > +:revealjs_progress: true
> > +:revealjs_slidenumber: true
> > +:sourcedir: ../java
> > +
> > +== What is Apache Hadoop?
> > +
> > +Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
> > +
> > +Two main layers:
> > +
> > +- Processing layer (MapReduce)
> > +- Storage layer (Hadoop Distributed File System)
> > +
> > +== MapReduce
> >
> I suggest you go through the following and update.
>
> https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363567952>
> :
>
> > +== HDFS Useful Features
> > +
> > +New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
> > +
> > +- File permissions and authentication.
> > +- Rack awareness: to take a node’s physical location into account while scheduling tasks and allocating storage.
> > +- Safemode: an administrative mode for maintenance.
> > +- fsck: a utility to diagnose health of the file system, to find missing files or blocks.
> > +- fetchdt: a utility to fetch DelegationToken and store it in a file on the local system.
> > +- Balancer: tool to balance the cluster when the data is unevenly distributed among DataNodes.
> > +- Upgrade and rollback: after a software upgrade, it is possible to rollback to HDFS’ state before the upgrade in case of unexpected problems.
> > +
> > +
> > +== HDFS Special Nodes
> > +
> > +Main nodes in HDFS are as follows:
> >
> Include high availability; these nodes are almost never used now. Can we
> try to update the data?
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363568291>
> :
>
> > +== MapReduce Inputs and Outputs
> > +
> > +The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
> > +
> > +The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
> > +
> > +Input and Output types of a MapReduce job:
> > +
> > +(input) <k1, v1> -> map -> <k2, v2> ->
> > +combine -> <k2, v2> ->
> > +reduce -> <k3, v3> (output)
> > +
> > +== HDFS Architecture
> > +
> > +HDFS has a master/slave architecture. An HDFS cluster consists of:
> > +
> >
> Better to include the architecture diagram (from the link below); if it's
> already included, please ignore.
>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
> ------------------------------
>
> In content/Hadoop/src/main/asciidoc/index.adoc
> <https://github.com/apache/incubator-training/pull/63#discussion_r363568719>
> :
>
> > +== HDFS Backup node
> > +
> > +Backup node is an extension to the Checkpoint node.
> > +In addition to checkpointing it also receives a stream of edits from the NameNode.
> > +It maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state.
> > +Only one Backup node may be registered with the NameNode at once.
> > +
> > +== HDFS Commands
> > +
> > +All HDFS commands are invoked by the bin/hdfs script.
> > +Running the hdfs script without any arguments prints the description for all commands.
> > +
> > +Usage: hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
> > +
> > +
> > +== Common HDFS Commands
> >
> Can we give the same for YARN and MR as well?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org