> I didn't understand usage of "malicuous" here,
> but any process using HDFS api should first ask NameNode where the

Rasit,

  Matei is referring to the fact that a malicious piece of code can bypass
the NameNode and connect to any DataNode directly, or probe all DataNodes
for that matter. There is no strong authentication for RPC at this layer of
HDFS, which is one of the current shortcomings that will be addressed in
Hadoop 1.0.
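
To make that concrete, here is a minimal sketch (plain Java against the
public FileSystem API; the path is a made-up example) of how any client
obtains block locations from the NameNode. Nothing at the DataNode layer
later verifies who actually comes to fetch those blocks, which is the gap
described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/someone/data.txt");   // hypothetical path
        FileStatus stat = fs.getFileStatus(p);
        // The NameNode hands back, to any caller, the hosts holding each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
          System.out.println(java.util.Arrays.toString(b.getHosts()));
        }
      }
    }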

-- amr

On Sun, Feb 15, 2009 at 11:41 PM, Rasit OZDAS <rasitoz...@gmail.com> wrote:

> "If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block."
>
> I didn't understand usage of "malicuous" here,
> but any process using HDFS api should first ask NameNode where the
> file replications are.
> Then - I assume - namenode returns the IP of best DataNode (or all IPs),
> then call to specific DataNode is made.
> Please correct me if I'm wrong.
>
> Cheers,
> Rasit
>
> 2009/2/16 Matei Zaharia <ma...@cloudera.com>:
> > In general, yeah, the scripts can access any resource they want (within
> > the permissions of the user that the task runs as). It's also possible
> > to access HDFS from scripts because HDFS provides a FUSE interface that
> > can make it look like a regular file system on the machine. (The FUSE
> > module in turn talks to the namenode as a regular HDFS client.)
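
To illustrate the FUSE point: assuming HDFS is mounted with fuse-dfs at
/mnt/hdfs (the mount point and file path here are hypothetical), a script or
task can read an HDFS file with ordinary file I/O, no HDFS client library
involved. A minimal Java sketch:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ReadThroughFuse {
      public static void main(String[] args) throws Exception {
        // Plain local-file I/O; the fuse-dfs mount forwards the calls to
        // HDFS, acting as an ordinary HDFS client underneath.
        BufferedReader in = new BufferedReader(
            new FileReader("/mnt/hdfs/user/someone/lookup.txt"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);   // use the line however the task needs
        }
        in.close();
      }
    }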
> >
> > On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >
> >> I don't know much about Hadoop streaming and have a quick question here.
> >>
> >> The snippets of code/programs that you attach to the map reduce job
> >> might want to access outside resources (like you mentioned). Now these
> >> might not need to go to the namenode, right? For example a Python
> >> script. How would it access the data? Would it ask the parent Java
> >> process (in the tasktracker) to get the data, or would it go and do
> >> stuff on its own?
> >>
> >>
> >> Amandeep Khurana
> >> Computer Science Graduate Student
> >> University of California, Santa Cruz
> >>
> >>
> >> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >>
> >> > Nope, typically the JobTracker just starts the process, and the
> >> > tasktracker talks directly to the namenode to get a pointer to the
> >> > datanode, and then directly to the datanode.
> >> >
> >> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> >
> >> > > Alright. Got it.
> >> > >
> >> > > Now, do the tasktrackers talk to the namenode and the datanode
> >> > > directly, or do they go through the jobtracker for it? So, if my
> >> > > code is such that I need to access more files from HDFS, would the
> >> > > jobtracker get involved or not?
> >> > >
> >> > >
> >> > > Amandeep Khurana
> >> > > Computer Science Graduate Student
> >> > > University of California, Santa Cruz
> >> > >
> >> > >
> >> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > >
> >> > > > Normally, HDFS files are accessed through the namenode. If there
> >> > > > was a malicious process though, then I imagine it could talk to a
> >> > > > datanode directly and request a specific block.
> >> > > >
> >> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > >
> >> > > > > Ok. Got it.
> >> > > > >
> >> > > > > Now, when my job needs to access another file, does it go to the
> >> > > > > NameNode to get the block IDs? How does the Java process know
> >> > > > > where the files are and how to access them?
> >> > > > >
> >> > > > >
> >> > > > > Amandeep Khurana
> >> > > > > Computer Science Graduate Student
> >> > > > > University of California, Santa Cruz
> >> > > > >
> >> > > > >
> >> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > > > >
> >> > > > > > I mentioned this case because even jobs written in Java can use
> >> > > > > > the HDFS API to talk to the NameNode and access the filesystem.
> >> > > > > > People often do this because their job needs to read a config
> >> > > > > > file, some small data table, etc. and use this information in
> >> > > > > > its map or reduce functions. In this case, you open the second
> >> > > > > > file separately in your mapper's init function and read
> >> > > > > > whatever you need from it. In general I wanted to point out
> >> > > > > > that you can't know which files a job will access unless you
> >> > > > > > look at its source code or monitor the calls it makes; the
> >> > > > > > input file(s) you provide in the job description are a hint to
> >> > > > > > the MapReduce framework to place your job on certain nodes, but
> >> > > > > > it's reasonable for the job to access other files as well.
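
As a concrete sketch of that pattern (using the org.apache.hadoop.mapred API
of this era; the side-file path and the tab-separated record format are
assumptions for illustration), the small table is loaded in configure(), the
mapper's init hook, and then used in map():

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SideFileJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> table = new HashMap<String, String>();

      // configure() runs once per task before map(); load the small side
      // table from HDFS here. The path is hypothetical.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          BufferedReader in = new BufferedReader(new InputStreamReader(
              fs.open(new Path("/data/small_table.txt"))));
          String line;
          while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
              table.put(kv[0], kv[1]);
            }
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("could not load side table", e);
        }
      }

      // Join each input record (key<TAB>value) against the side table.
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] kv = record.toString().split("\t", 2);
        if (kv.length == 2) {
          String extra = table.get(kv[0]);
          if (extra != null) {
            out.collect(new Text(kv[0]), new Text(kv[1] + "\t" + extra));
          }
        }
      }
    }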
> >> > > > > >
> >> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Another question that I have here - when the jobs run
> >> > > > > > > arbitrary code and access data from HDFS, do they go to the
> >> > > > > > > namenode to get the block information?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Amandeep Khurana
> >> > > > > > > Computer Science Graduate Student
> >> > > > > > > University of California, Santa Cruz
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Assuming that the job is purely in Java and does not involve
> >> > > > > > > > streaming or pipes, wouldn't the resources (files) required
> >> > > > > > > > by the job as inputs be known beforehand? So, if the map
> >> > > > > > > > task is accessing a second file, how does that make it
> >> > > > > > > > different, except that there are multiple files? The
> >> > > > > > > > JobTracker would know beforehand that multiple files would
> >> > > > > > > > be accessed. Right?
> >> > > > > > > >
> >> > > > > > > > I am slightly confused why you have mentioned this case
> >> > > > > > > > separately... Can you elaborate on it a little bit?
> >> > > > > > > >
> >> > > > > > > > Amandeep
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Amandeep Khurana
> >> > > > > > > > Computer Science Graduate Student
> >> > > > > > > > University of California, Santa Cruz
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > > > > > > >
> >> > > > > > > >> Typically the data flow is like this:
> >> > > > > > > >> 1) Client submits a job description to the JobTracker.
> >> > > > > > > >> 2) JobTracker figures out block locations for the input
> >> > > > > > > >> file(s) by talking to HDFS NameNode.
> >> > > > > > > >> 3) JobTracker creates a job description file in HDFS which
> >> > > > > > > >> will be read by the nodes to copy over the job's code etc.
> >> > > > > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers)
> >> > > > > > > >> with the appropriate data blocks.
> >> > > > > > > >> 5) After running, maps create intermediate output files on
> >> > > > > > > >> those slaves. These are not in HDFS, they're in some
> >> > > > > > > >> temporary storage used by MapReduce.
> >> > > > > > > >> 6) JobTracker starts reduces on a series of slaves, which
> >> > > > > > > >> copy over the appropriate map outputs, apply the reduce
> >> > > > > > > >> function, and write the outputs to HDFS (one output file
> >> > > > > > > >> per reducer).
> >> > > > > > > >> 7) Some logs for the job may also be put into HDFS by the
> >> > > > > > > >> JobTracker.
> >> > > > > > > >>
> >> > > > > > > >> However, there is a big caveat, which is that the map and
> >> > > > > > > >> reduce tasks run arbitrary code. It is not unusual to have
> >> > > > > > > >> a map that opens a second HDFS file to read some
> >> > > > > > > >> information (e.g. for doing a join of a small table against
> >> > > > > > > >> a big file). If you use Hadoop Streaming or Pipes to write
> >> > > > > > > >> a job in Python, Ruby, C, etc, then you are launching
> >> > > > > > > >> arbitrary processes which may also access external
> >> > > > > > > >> resources in this manner. Some people also read/write to
> >> > > > > > > >> DBs (e.g. MySQL) from their tasks. A comprehensive security
> >> > > > > > > >> solution would ideally deal with these cases too.
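
A minimal driver sketch of step 1 in the flow quoted above, using the
org.apache.hadoop.mapred API with the stock identity mapper/reducer (the
input and output paths are hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("identity-example");
        // With the default TextInputFormat, keys are byte offsets and
        // values are lines of text.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/someone/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/someone/output"));
        // This call sends the job description to the JobTracker (step 1)
        // and waits for the job to finish.
        JobClient.runJob(conf);
      }
    }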
> >> > > > > > > >>
> >> > > > > > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > A quick question here. How does a typical Hadoop job
> >> > > > > > > >> > work at the system level? What are the various
> >> > > > > > > >> > interactions and how does the data flow?
> >> > > > > > > >> >
> >> > > > > > > >> > Amandeep
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > Amandeep Khurana
> >> > > > > > > >> > Computer Science Graduate Student
> >> > > > > > > >> > University of California, Santa Cruz
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > > > >> > >
> >> > > > > > > >> > > Thanks Matei. If the basic architecture is similar to
> >> > > > > > > >> > > the Google stuff, I can safely just work on the project
> >> > > > > > > >> > > using the information from the papers.
> >> > > > > > > >> > >
> >> > > > > > > >> > > I am aware of the HADOOP-4487 JIRA and the current
> >> > > > > > > >> > > status of the permissions mechanism. I had a look at
> >> > > > > > > >> > > them earlier.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Cheers
> >> > > > > > > >> > > Amandeep
> >> > > > > > > >> > >
> >> > > > > > > >> > >
> >> > > > > > > >> > > Amandeep Khurana
> >> > > > > > > >> > > Computer Science Graduate Student
> >> > > > > > > >> > > University of California, Santa Cruz
> >> > > > > > > >> > >
> >> > > > > > > >> > >
> >> > > > > > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > > > > > > >> > >
> >> > > > > > > >> > >> Forgot to add, this JIRA details the latest security
> >> > > > > > > >> > >> features that are being worked on in Hadoop trunk:
> >> > > > > > > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487
> >> > > > > > > >> > >> This document describes the current status and
> >> > > > > > > >> > >> limitations of the permissions mechanism:
> >> > > > > > > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
> >> > > > > > > >> > >>
> >> > > > > > > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > > > > > > >> > >>
> >> > > > > > > >> > >> > I think it's safe to assume that Hadoop works like
> >> > > > > > > >> > >> > MapReduce/GFS at the level described in those
> >> > > > > > > >> > >> > papers. In particular, in HDFS, there is a master
> >> > > > > > > >> > >> > node containing metadata and a number of slave nodes
> >> > > > > > > >> > >> > (datanodes) containing blocks, as in GFS. Clients
> >> > > > > > > >> > >> > start by talking to the master to list directories,
> >> > > > > > > >> > >> > etc. When they want to read a region of some file,
> >> > > > > > > >> > >> > they tell the master the filename and offset, and
> >> > > > > > > >> > >> > they receive a list of block locations (datanodes).
> >> > > > > > > >> > >> > They then contact the individual datanodes to read
> >> > > > > > > >> > >> > the blocks. When clients write a file, they first
> >> > > > > > > >> > >> > obtain a new block ID and list of nodes to write it
> >> > > > > > > >> > >> > to from the master, then contact the datanodes to
> >> > > > > > > >> > >> > write it (actually, the datanodes pipeline the write
> >> > > > > > > >> > >> > as in GFS) and report when the write is complete.
> >> > > > > > > >> > >> > HDFS actually has some security mechanisms built in,
> >> > > > > > > >> > >> > authenticating users based on their Unix ID and
> >> > > > > > > >> > >> > providing Unix-like file permissions. I don't know
> >> > > > > > > >> > >> > much about how these are implemented, but they would
> >> > > > > > > >> > >> > be a good place to start looking.
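
For illustration, the read path described above is what a plain HDFS client
call performs under the hood; a minimal sketch with a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block locations; read() then streams
        // the blocks from the datanodes that hold them.
        FSDataInputStream in = fs.open(new Path("/user/someone/part-00000"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          // process buf[0..n) ...
        }
        in.close();
      }
    }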
> >> > > > > > > >> > >> >
> >> > > > > > > >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > > > >> > >> >
> >> > > > > > > >> > >> >> Thanks Matei
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> I had gone through the architecture document
> >> > > > > > > >> > >> >> online. I am currently working on a project towards
> >> > > > > > > >> > >> >> security in Hadoop. I do know how the data moves
> >> > > > > > > >> > >> >> around in GFS, but I wasn't sure how much of that
> >> > > > > > > >> > >> >> HDFS follows and how different it is from GFS. Can
> >> > > > > > > >> > >> >> you throw some light on that?
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> Security would also involve the MapReduce jobs
> >> > > > > > > >> > >> >> following the same protocols. That's why the
> >> > > > > > > >> > >> >> question about how the Hadoop framework integrates
> >> > > > > > > >> > >> >> with HDFS, and how different it is from MapReduce
> >> > > > > > > >> > >> >> and GFS. The GFS and MapReduce papers give good
> >> > > > > > > >> > >> >> information on how those systems are designed, but
> >> > > > > > > >> > >> >> there is nothing that concrete for Hadoop that I
> >> > > > > > > >> > >> >> have been able to find.
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> Amandeep
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> Amandeep Khurana
> >> > > > > > > >> > >> >> Computer Science Graduate Student
> >> > > > > > > >> > >> >> University of California, Santa Cruz
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <ma...@cloudera.com> wrote:
> >> > > > > > > >> > >> >>
> >> > > > > > > >> > >> >> > Hi Amandeep,
> >> > > > > > > >> > >> >> >
> >> > > > > > > >> > >> >> > Hadoop is definitely inspired by MapReduce/GFS
> >> > > > > > > >> > >> >> > and aims to provide those capabilities as an
> >> > > > > > > >> > >> >> > open-source project. HDFS is similar to GFS
> >> > > > > > > >> > >> >> > (large blocks, replication, etc); some notable
> >> > > > > > > >> > >> >> > things missing are read-write support in the
> >> > > > > > > >> > >> >> > middle of a file (unlikely to be provided because
> >> > > > > > > >> > >> >> > few Hadoop applications require it) and multiple
> >> > > > > > > >> > >> >> > appenders (the record append operation). You can
> >> > > > > > > >> > >> >> > read about HDFS architecture at
> >> > > > > > > >> > >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html
> >> > > > > > > >> > >> >> > The MapReduce part of Hadoop interacts with HDFS
> >> > > > > > > >> > >> >> > in the same way that Google's MapReduce interacts
> >> > > > > > > >> > >> >> > with GFS (shipping computation to the data),
> >> > > > > > > >> > >> >> > although Hadoop MapReduce also supports running
> >> > > > > > > >> > >> >> > over other distributed filesystems.
> >> > > > > > > >> > >> >> >
> >> > > > > > > >> > >> >> > Matei
> >> > > > > > > >> > >> >> >
> >> > > > > > > >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <ama...@gmail.com> wrote:
> >> > > > > > > >> > >> >> >
> >> > > > > > > >> > >> >> > > Hi
> >> > > > > > > >> > >> >> > >
> >> > > > > > > >> > >> >> > > Is the HDFS architecture completely based on
> >> > > > > > > >> > >> >> > > the Google File System? If it isn't, what are
> >> > > > > > > >> > >> >> > > the differences between the two?
> >> > > > > > > >> > >> >> > >
> >> > > > > > > >> > >> >> > > Secondly, is the coupling between Hadoop and
> >> > > > > > > >> > >> >> > > HDFS the same as it is between Google's version
> >> > > > > > > >> > >> >> > > of MapReduce and GFS?
> >> > > > > > > >> > >> >> > >
> >> > > > > > > >> > >> >> > > Amandeep
> >> > > > > > > >> > >> >> > >
> >> > > > > > > >> > >> >> > >
> >> > > > > > > >> > >> >> > > Amandeep Khurana
> >> > > > > > > >> > >> >> > > Computer Science Graduate Student
> >> > > > > > > >> > >> >> > > University of California, Santa Cruz
>
> --
> M. Raşit ÖZDAŞ
>
