This is good information! Thanks a ton. I'll take all this into account. Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <[email protected]> wrote:

Typically the data flow is like this:

1) The client submits a job description to the JobTracker.
2) The JobTracker figures out block locations for the input file(s) by talking to the HDFS NameNode.
3) The JobTracker creates a job description file in HDFS, which will be read by the nodes to copy over the job's code etc.
4) The JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks.
5) After running, maps create intermediate output files on those slaves. These are not in HDFS; they're in some temporary storage used by MapReduce.
6) The JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.

However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc., then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too.
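(As a rough illustration of the "arbitrary code" caveat above, here is a sketch of a map task written against the old org.apache.hadoop.mapred API that opens a second HDFS file as a small lookup table, in addition to the input split the framework hands it. The table path and record layout are made up for the example.)

// Illustrative sketch only: a map task that reads a second HDFS file
// (a small lookup table) besides its own input split, as in a map-side join.
// The table path ("/data/lookup.txt") and the tab-separated layout are hypothetical.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  public void configure(JobConf job) {
    try {
      // This open() goes through the NameNode for block locations and then to
      // the DataNodes -- a second HDFS access the framework knows nothing about.
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/data/lookup.txt"))));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load lookup table", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Join each record from the big input file against the small table.
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) {
      return;                           // malformed record, skip it
    }
    String joined = lookup.get(fields[0]);
    if (joined != null) {
      out.collect(new Text(fields[0]), new Text(fields[1] + "\t" + joined));
    }
  }
}

Nothing in the job description tells the JobTracker about that extra read, which is why a security mechanism that only mediates the framework's own HDFS accesses would miss such cases.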
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <[email protected]> wrote:

A quick question here. How does a typical Hadoop job work at the system level? What are the various interactions and how does the data flow?

Amandeep


On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <[email protected]> wrote:

Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers.

I am aware of the HADOOP-4487 JIRA and the current status of the permissions mechanism. I had a look at them earlier.

Cheers
Amandeep


On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <[email protected]> wrote:

Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.


On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <[email protected]> wrote:

I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks. When clients write a file, they first obtain a new block ID and a list of nodes to write it to from the master, then contact the datanodes to write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security mechanisms built in, authenticating users based on their Unix ID and providing Unix-like file permissions. I don't know much about how these are implemented, but they would be a good place to start looking.


On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <[email protected]> wrote:

Thanks Matei,

I had gone through the architecture document online. I am currently working on a project towards security in Hadoop. I do know how the data moves around in GFS, but I wasn't sure how much of that HDFS follows and how different it is from GFS. Can you throw some light on that?

Security would also involve the MapReduce jobs following the same protocols. That's why the question about how the Hadoop framework integrates with HDFS, and how different it is from MapReduce and GFS. The GFS and MapReduce papers give good information on how those systems are designed, but there is nothing that concrete for Hadoop that I have been able to find.

Amandeep


On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <[email protected]> wrote:

Hi Amandeep,

Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc.); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hadoop applications require it) and multiple appenders (the record append operation). You can read about the HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems.

Matei
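(To make the read/write path described above concrete, here is a minimal client sketch using the Hadoop FileSystem API. The NameNode lookup and the DataNode transfers Matei describes happen inside open() and create(); the cluster address and paths are placeholders.)

// Illustrative sketch only: reading and writing HDFS files through the
// FileSystem API. The block-location lookup against the NameNode and the
// pipelined writes to DataNodes happen behind open() and create();
// client code never addresses DataNodes directly.
// The paths and the fs.default.name value are made up for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");   // hypothetical cluster
    FileSystem fs = FileSystem.get(conf);

    // Write: the client obtains a new block and a list of DataNodes from the
    // NameNode, then streams the data to them (pipelined replication).
    FSDataOutputStream out = fs.create(new Path("/user/amandeep/hello.txt"));
    out.writeBytes("hello hdfs\n");
    out.close();

    // Read: the client asks the NameNode which DataNodes hold the blocks for
    // this file, then fetches the bytes from those DataNodes.
    FSDataInputStream in = fs.open(new Path("/user/amandeep/hello.txt"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine());
    reader.close();
  }
}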
On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <[email protected]> wrote:

Hi

Is the HDFS architecture completely based on the Google File System? If it isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as between Google's version of MapReduce and GFS?

Amandeep
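(And on the coupling question: a minimal sketch of how a job gets pointed at HDFS paths and handed to the JobTracker with the old JobConf/JobClient API. The paths are placeholders, and the identity mapper/reducer just stand in for real job code.)

// Illustrative sketch only: wiring a MapReduce job to HDFS input/output
// paths and submitting it to the JobTracker (step 1 of the data flow above).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJobSketch.class);
    conf.setJobName("sketch");

    // Input and output live in HDFS here, but any Hadoop FileSystem URI
    // would work -- the coupling is through the FileSystem abstraction
    // rather than HDFS specifically.
    FileInputFormat.setInputPaths(conf, new Path("/user/amandeep/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/amandeep/output"));

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Submits the job description to the JobTracker and waits for completion;
    // the JobTracker then consults the NameNode for block locations and
    // schedules map tasks near the data.
    JobClient.runJob(conf);
  }
}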
