As far as I know, datanodes just know the block ID, and the namenode knows which file this belongs to.
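To see that split through the client API: the namenode holds the file-to-block mapping and can report, for each block of a path, which datanodes hold replicas; the datanodes themselves only serve numbered blocks. A minimal sketch, assuming a 0.18-or-later FileSystem API; the namenode URI and path are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI; a real client usually picks this up from fs.default.name.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/amandeep/input/part-00000"); // placeholder path
        FileStatus status = fs.getFileStatus(file);
        // The namenode answers this from its metadata: block offsets, lengths,
        // and the datanode hosts holding each block's replicas.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset " + b.getOffset() + " len " + b.getLength()
              + " hosts " + java.util.Arrays.toString(b.getHosts()));
        }
      }
    }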
On Mon, Feb 16, 2009 at 4:54 PM, Amandeep Khurana <ama...@gmail.com> wrote:

Ok, thanks. Another question now: do the datanodes have any way of linking a particular block of data to a global file identifier?

Amandeep

On Sun, Feb 15, 2009 at 9:37 PM, Matei Zaharia <ma...@cloudera.com> wrote:

In general, yeah, the scripts can access any resource they want (within the permissions of the user that the task runs as). It's also possible to access HDFS from scripts because HDFS provides a FUSE interface that can make it look like a regular file system on the machine. (The FUSE module in turn talks to the namenode as a regular HDFS client.)

On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana <ama...@gmail.com> wrote:

I don't know much about Hadoop Streaming and have a quick question here. The snippets of code/programs that you attach to the map-reduce job might want to access outside resources (like you mentioned). Now these might not need to go to the namenode, right? For example, a Python script: how would it access the data? Would it ask the parent Java process (in the tasktracker) to get the data, or would it go and do stuff on its own?

On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia <ma...@cloudera.com> wrote:

Nope, typically the JobTracker just starts the process, and the tasktracker talks directly to the namenode to get a pointer to the datanode, and then directly to the datanode.

On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <ama...@gmail.com> wrote:

Alright, got it. Now, do the tasktrackers talk to the namenode and the datanodes directly, or do they go through the jobtracker? So, if my code is such that I need to access more files from HDFS, would the jobtracker get involved or not?

On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <ma...@cloudera.com> wrote:

Normally, HDFS files are accessed through the namenode. If there were a malicious process, though, then I imagine it could talk to a datanode directly and request a specific block.

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <ama...@gmail.com> wrote:

Ok, got it. Now, when my job needs to access another file, does it go to the namenode to get the block IDs? How does the Java process know where the files are and how to access them?
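For reference, "a regular HDFS client" of the kind the FUSE module wraps is just a few lines of Java: the FileSystem API asks the namenode for metadata and block locations and then streams the bytes from the datanodes behind an ordinary input stream. A rough sketch, assuming a 2009-era API; the namenode URI and path are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; normally taken from fs.default.name.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // open() consults the namenode for block locations; the stream then
        // reads the actual bytes directly from the datanodes.
        FSDataInputStream in = fs.open(new Path("/user/amandeep/conf/lookup.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
      }
    }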
> > > > > > > > > > > > > > > > > > > > > Amandeep Khurana > > > > > > > Computer Science Graduate Student > > > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia < > > ma...@cloudera.com > > > > > > > > > > wrote: > > > > > > > > > > > > > > > I mentioned this case because even jobs written in Java can > use > > > the > > > > > > HDFS > > > > > > > > API > > > > > > > > to talk to the NameNode and access the filesystem. People > often > > > do > > > > > this > > > > > > > > because their job needs to read a config file, some small > data > > > > table, > > > > > > etc > > > > > > > > and use this information in its map or reduce functions. In > > this > > > > > case, > > > > > > > you > > > > > > > > open the second file separately in your mapper's init > function > > > and > > > > > read > > > > > > > > whatever you need from it. In general I wanted to point out > > that > > > > you > > > > > > > can't > > > > > > > > know which files a job will access unless you look at its > > source > > > > code > > > > > > or > > > > > > > > monitor the calls it makes; the input file(s) you provide in > > the > > > > job > > > > > > > > description are a hint to the MapReduce framework to place > your > > > job > > > > > on > > > > > > > > certain nodes, but it's reasonable for the job to access > other > > > > files > > > > > as > > > > > > > > well. > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana < > > > > ama...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Another question that I have here - When the jobs run > > arbitrary > > > > > code > > > > > > > and > > > > > > > > > access data from the HDFS, do they go to the namenode to > get > > > the > > > > > > block > > > > > > > > > information? > > > > > > > > > > > > > > > > > > > > > > > > > > > Amandeep Khurana > > > > > > > > > Computer Science Graduate Student > > > > > > > > > University of California, Santa Cruz > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana < > > > > > ama...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Assuming that the job is purely in Java and not involving > > > > > streaming > > > > > > > or > > > > > > > > > > pipes, wouldnt the resources (files) required by the job > as > > > > > inputs > > > > > > be > > > > > > > > > known > > > > > > > > > > beforehand? So, if the map task is accessing a second > file, > > > how > > > > > > does > > > > > > > it > > > > > > > > > make > > > > > > > > > > it different except that there are multiple files. The > > > > JobTracker > > > > > > > would > > > > > > > > > know > > > > > > > > > > beforehand that multiple files would be accessed. Right? > > > > > > > > > > > > > > > > > > > > I am slightly confused why you have mentioned this case > > > > > > separately... > > > > > > > > Can > > > > > > > > > > you elaborate on it a little bit? 
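The "open the second file in your mapper's init function" pattern Matei describes above looks roughly like this with the old org.apache.hadoop.mapred API. This is a sketch only: the side-table path and the join logic are made up for illustration.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SideFileJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      // Called once per task before any map() calls. The read here is a plain
      // HDFS client read: the task asks the namenode for block locations and
      // then the datanodes for data, independently of the job's input splits.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          Path side = new Path("/user/amandeep/tables/small_table.txt"); // placeholder
          BufferedReader reader =
              new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              lookup.put(parts[0], parts[1]);
            }
          }
          reader.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load side table", e);
        }
      }

      // Joins each input record against the small table loaded above.
      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        String[] fields = value.toString().split("\t", 2);
        String joined = lookup.get(fields[0]);
        if (joined != null) {
          out.collect(new Text(fields[0]), new Text(joined));
        }
      }
    }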
On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <ma...@cloudera.com> wrote:

Typically the data flow is like this:

1) Client submits a job description to the JobTracker.
2) JobTracker figures out block locations for the input file(s) by talking to the HDFS NameNode.
3) JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code etc.
4) JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks.
5) After running, maps create intermediate output files on those slaves. These are not in HDFS; they're in some temporary storage used by MapReduce.
6) JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.

However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc., then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too.

On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:

A quick question here. How does a typical Hadoop job work at the system level? What are the various interactions and how does the data flow?
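From the client's side, step 1 of the numbered flow above is just a few lines against the old JobConf/JobClient API. A minimal sketch, using the stock identity mapper and reducer and paths from the command line; class and job names are made up:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitExample.class);
        conf.setJobName("submit-example");

        // With the default TextInputFormat, keys are byte offsets, values are lines.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Packages the job description and configuration, copies the job files
        // into HDFS, hands the job to the JobTracker, and polls until it completes.
        JobClient.runJob(conf);
      }
    }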
On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:

Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers.

I am aware of the HADOOP-4487 JIRA and the current status of the permissions mechanism. I had a look at them earlier.

Cheers,
Amandeep

On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <ma...@cloudera.com> wrote:

Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.

On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <ma...@cloudera.com> wrote:

I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks.

When clients write a file, they first obtain a new block ID and a list of nodes to write it to from the master, then contact the datanodes to write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security mechanisms built in, authenticating users based on their Unix ID and providing Unix-like file permissions. I don't know much about how these are implemented, but they would be a good place to start looking.

On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com> wrote:

Thanks Matei.

I had gone through the architecture document online. I am currently working on a project towards security in Hadoop. I do know how the data moves around in GFS but wasn't sure how much of that HDFS follows and how different it is from GFS. Can you throw some light on that?

Security would also involve the MapReduce jobs following the same protocols. That's why the question about how the Hadoop framework integrates with HDFS, and how different it is from MapReduce and GFS. The GFS and MapReduce papers give good information on how those systems are designed, but there is nothing that concrete for Hadoop that I have been able to find.
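As a rough illustration of the write path and the Unix-like permissions Matei mentions above, here is what a client-side write and a permissions check look like through the FileSystem API. The URI and paths are placeholders; note that obtaining block IDs from the namenode and pipelining replicas through the datanodes happen inside the output stream, not in user code.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path out = new Path("/user/amandeep/output/notes.txt"); // placeholder

        // create() asks the namenode to add the file to the namespace; as the
        // stream fills each block, the client is handed a block ID plus target
        // datanodes, and the datanodes pipeline the replicas among themselves.
        FSDataOutputStream stream = fs.create(out);
        stream.writeBytes("hello hdfs\n");
        stream.close();

        // The namenode keeps Unix-like ownership and permission bits in the
        // file's metadata and checks them on each request.
        FileStatus status = fs.getFileStatus(out);
        System.out.println(status.getOwner() + ":" + status.getGroup() + " "
            + status.getPermission() + " " + out);
      }
    }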
On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <ma...@cloudera.com> wrote:

Hi Amandeep,

Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc.); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hadoop applications require it) and multiple appenders (the record append operation). You can read about HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems.

Matei

On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <ama...@gmail.com> wrote:

Hi,

Is the HDFS architecture completely based on the Google File System? If it isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as between Google's version of MapReduce and GFS?

Amandeep