This is good information! Thanks a ton. I'll take all this into account. Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <[email protected]> wrote:

Typically the data flow is like this:

1) The client submits a job description to the JobTracker.
2) The JobTracker figures out block locations for the input file(s) by talking to the HDFS NameNode.
3) The JobTracker creates a job description file in HDFS, which will be read by the nodes to copy over the job's code etc.
4) The JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks.
5) After running, maps create intermediate output files on those slaves. These are not in HDFS; they're in some temporary storage used by MapReduce.
6) The JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.

However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc., then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too.
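(As a rough illustration of the "arbitrary code" caveat above, here is a sketch of a map task written against the old org.apache.hadoop.mapred API that opens a second HDFS file as a small lookup table, in addition to the input split the framework hands it. The table path and record layout are made up for the example.)

// Illustrative sketch only: a map task that reads a second HDFS file
// (a small lookup table) besides its own input split, as in a map-side join.
// The table path ("/data/lookup.txt") and the tab-separated layout are hypothetical.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  public void configure(JobConf job) {
    try {
      // This open() goes through the NameNode for block locations and then to
      // the DataNodes -- a second HDFS access the framework knows nothing about.
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/data/lookup.txt"))));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load lookup table", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Join each record from the big input file against the small table.
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) {
      return;                           // malformed record, skip it
    }
    String joined = lookup.get(fields[0]);
    if (joined != null) {
      out.collect(new Text(fields[0]), new Text(fields[1] + "\t" + joined));
    }
  }
}

Nothing in the job description tells the JobTracker about that extra read, which is why a security mechanism that only mediates the framework's own HDFS accesses would miss such cases.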
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <[email protected]> wrote:

A quick question here. How does a typical Hadoop job work at the system level? What are the various interactions and how does the data flow?

Amandeep


On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <[email protected]> wrote:

Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers.

I am aware of the HADOOP-4487 JIRA and the current status of the permissions mechanism. I had a look at them earlier.

Cheers
Amandeep


On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <[email protected]> wrote:

Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.


On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <[email protected]> wrote:

I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks. When clients write a file, they first obtain a new block ID and a list of nodes to write it to from the master, then contact the datanodes to write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security mechanisms built in, authenticating users based on their Unix ID and providing Unix-like file permissions. I don't know much about how these are implemented, but they would be a good place to start looking.


On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <[email protected]> wrote:

Thanks Matei,

I had gone through the architecture document online. I am currently working on a project towards security in Hadoop. I do know how the data moves around in GFS, but I wasn't sure how much of that HDFS follows and how different it is from GFS. Can you throw some light on that?

Security would also involve the MapReduce jobs following the same protocols. That's why the question about how the Hadoop framework integrates with HDFS, and how different it is from MapReduce and GFS. The GFS and MapReduce papers give good information on how those systems are designed, but there is nothing that concrete for Hadoop that I have been able to find.

Amandeep


On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <[email protected]> wrote:

Hi Amandeep,

Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc.); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hadoop applications require it) and multiple appenders (the record append operation). You can read about the HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems.

Matei
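(To make the read/write path described above concrete, here is a minimal client sketch using the Hadoop FileSystem API. The NameNode lookup and the DataNode transfers Matei describes happen inside open() and create(); the cluster address and paths are placeholders.)

// Illustrative sketch only: reading and writing HDFS files through the
// FileSystem API. The block-location lookup against the NameNode and the
// pipelined writes to DataNodes happen behind open() and create();
// client code never addresses DataNodes directly.
// The paths and the fs.default.name value are made up for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");   // hypothetical cluster
    FileSystem fs = FileSystem.get(conf);

    // Write: the client obtains a new block and a list of DataNodes from the
    // NameNode, then streams the data to them (pipelined replication).
    FSDataOutputStream out = fs.create(new Path("/user/amandeep/hello.txt"));
    out.writeBytes("hello hdfs\n");
    out.close();

    // Read: the client asks the NameNode which DataNodes hold the blocks for
    // this file, then fetches the bytes from those DataNodes.
    FSDataInputStream in = fs.open(new Path("/user/amandeep/hello.txt"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine());
    reader.close();
  }
}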
On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <[email protected]> wrote:

Hi

Is the HDFS architecture completely based on the Google File System? If it isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as between Google's version of MapReduce and GFS?

Amandeep
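(And on the coupling question: a minimal sketch of how a job gets pointed at HDFS paths and handed to the JobTracker with the old JobConf/JobClient API. The paths are placeholders, and the identity mapper/reducer just stand in for real job code.)

// Illustrative sketch only: wiring a MapReduce job to HDFS input/output
// paths and submitting it to the JobTracker (step 1 of the data flow above).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJobSketch.class);
    conf.setJobName("sketch");

    // Input and output live in HDFS here, but any Hadoop FileSystem URI
    // would work -- the coupling is through the FileSystem abstraction
    // rather than HDFS specifically.
    FileInputFormat.setInputPaths(conf, new Path("/user/amandeep/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/amandeep/output"));

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Submits the job description to the JobTracker and waits for completion;
    // the JobTracker then consults the NameNode for block locations and
    // schedules map tasks near the data.
    JobClient.runJob(conf);
  }
}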
