Hello Matei,

Which TaskTracker did you mean here?
I don't understand that. In general we have many TaskTrackers, and each of them runs on a separate DataNode. Why doesn't the JobTracker talk directly to the NameNode for a list of DataNodes and then perform the MapReduce tasks there? Any help is appreciated.

Kang

Matei Zaharia-2 wrote:
> Nope, typically the JobTracker just starts the process, and the tasktracker talks directly to the namenode to get a pointer to the datanode, and then directly to the datanode.
> On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>> Alright.. Got it.
>> Now, do the task trackers talk to the namenode and the data node directly or do they go through the job tracker for it? So, if my code is such that I need to access more files from the hdfs, would the job tracker get involved or not?
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>> On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>> Normally, HDFS files are accessed through the namenode. If there was a malicious process though, then I imagine it could talk to a datanode directly and request a specific block.
>>> On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>> Ok. Got it.
>>>> Now, when my job needs to access another file, does it go to the Namenode to get the block ids? How does the java process know where the files are and how to access them?
>>>> Amandeep Khurana
>>>> Computer Science Graduate Student
>>>> University of California, Santa Cruz
>>>> On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>>>> I mentioned this case because even jobs written in Java can use the HDFS API to talk to the NameNode and access the filesystem. People often do this because their job needs to read a config file, some small data table, etc and use this information in its map or reduce functions. In this case, you open the second file separately in your mapper's init function and read whatever you need from it. In general I wanted to point out that you can't know which files a job will access unless you look at its source code or monitor the calls it makes; the input file(s) you provide in the job description are a hint to the MapReduce framework to place your job on certain nodes, but it's reasonable for the job to access other files as well.
>>>>> On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>> Another question that I have here - When the jobs run arbitrary code and access data from the HDFS, do they go to the namenode to get the block information?
>>>>>> Amandeep Khurana
>>>>>> Computer Science Graduate Student
>>>>>> University of California, Santa Cruz
>>>>>> On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>>> Assuming that the job is purely in Java and not involving streaming or pipes, wouldnt the resources (files) required by the job as inputs be known beforehand?
>>>>>>> So, if the map task is accessing a second file, how does it make it different except that there are multiple files. The JobTracker would know beforehand that multiple files would be accessed. Right?
>>>>>>> I am slightly confused why you have mentioned this case separately... Can you elaborate on it a little bit?
>>>>>>> Amandeep
>>>>>>> Amandeep Khurana
>>>>>>> Computer Science Graduate Student
>>>>>>> University of California, Santa Cruz
>>>>>>> On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>>>>>>> Typically the data flow is like this:
>>>>>>>> 1) Client submits a job description to the JobTracker.
>>>>>>>> 2) JobTracker figures out block locations for the input file(s) by talking to HDFS NameNode.
>>>>>>>> 3) JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code etc.
>>>>>>>> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks.
>>>>>>>> 5) After running, maps create intermediate output files on those slaves. These are not in HDFS, they're in some temporary storage used by MapReduce.
>>>>>>>> 6) JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer).
>>>>>>>> 7) Some logs for the job may also be put into HDFS by the JobTracker.
>>>>>>>> However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc, then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too.
>>>>>>>> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>>>>> A quick question here. How does a typical hadoop job work at the system level? What are the various interactions and how does the data flow?
>>>>>>>>> Amandeep
>>>>>>>>> Amandeep Khurana
>>>>>>>>> Computer Science Graduate Student
>>>>>>>>> University of California, Santa Cruz
>>>>>>>>> On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>>>>>> Thanks Matei.
>>>>>>>>>> If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers.
>>>>>>>>>> I am aware of the 4487 jira and the current status of the permissions mechanism. I had a look at them earlier.
>>>>>>>>>> Cheers
>>>>>>>>>> Amandeep
>>>>>>>>>> Amandeep Khurana
>>>>>>>>>> Computer Science Graduate Student
>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>> On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>>>>>>>>>> Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism: http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
>>>>>>>>>>> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>>>>>>>>>>> I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks. When clients write a file, they first obtain a new block ID and list of nodes to write it to from the master, then contact the datanodes to write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security mechanisms built in, authenticating users based on their Unix ID and providing Unix-like file permissions. I don't know much about how these are implemented, but they would be a good place to start looking.
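For reference, here is a minimal sketch of the client-side read path described above, using the public FileSystem API. The class name and file path are made up for illustration; the NameNode lookup and the reads from the DataNodes happen inside the HDFS client library rather than in user code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration (core-site.xml / hadoop-site.xml on the classpath).
        Configuration conf = new Configuration();
        // The FileSystem client; metadata operations go to the NameNode.
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical file path, just for illustration.
        Path path = new Path("/user/kang/some-file.txt");

        // open() asks the NameNode for the block locations of the file;
        // the subsequent reads stream those blocks directly from the DataNodes.
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
      }
    }

A map or reduce task that opens a second file does exactly the same thing with the same API, so those reads also go through the NameNode and the DataNodes, not through the JobTracker.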
>>>>>>>>>>>> On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>>>>>>>>> Thanks Matie
>>>>>>>>>>>>> I had gone through the architecture document online. I am currently working on a project towards Security in Hadoop. I do know how the data moves around in the GFS but wasnt sure how much of that does HDFS follow and how different it is from GFS. Can you throw some light on that?
>>>>>>>>>>>>> Security would also involve the Map Reduce jobs following the same protocols. Thats why the question about how does the Hadoop framework integrate with the HDFS, and how different is it from Map Reduce and GFS. The GFS and Map Reduce papers give a good information on how those systems are designed but there is nothing that concrete for Hadoop that I have been able to find.
>>>>>>>>>>>>> Amandeep
>>>>>>>>>>>>> Amandeep Khurana
>>>>>>>>>>>>> Computer Science Graduate Student
>>>>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>>>>> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <ma...@cloudera.com> wrote:
>>>>>>>>>>>>>> Hi Amandeep,
>>>>>>>>>>>>>> Hadoop is definitely inspired by MapReduce/GFS and aims to provide those capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc); some notable things missing are read-write support in the middle of a file (unlikely to be provided because few Hadoop applications require it) and multiple appenders (the record append operation). You can read about HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems.
>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>> On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <ama...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>> Is the HDFS architecture completely based on the Google Filesystem? If it isnt, what are the differences between the two?
>>>>>>>>>>>>>>> Secondly, is the coupling between Hadoop and HDFS same as how it is between the Google's version of Map Reduce and GFS?
>>>>>>>>>>>>>>> Amandeep
>>>>>>>>>>>>>>> Amandeep Khurana
>>>>>>>>>>>>>>> Computer Science Graduate Student
>>>>>>>>>>>>>>> University of California, Santa Cruz

--
View this message in context: http://www.nabble.com/HDFS-architecture-based-on-GFS--tp22026711p22237160.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
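Below is a minimal sketch of the pattern Matei describes above: a map task that opens a second HDFS file in its init (configure()) method to load a small lookup table for a map-side join. It uses the old org.apache.hadoop.mapred API; the class name, the /data/lookup.txt path, and the tab-separated formats are assumptions for illustration, not code from the thread.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Small in-memory table loaded from the side file.
      private final Map<String, String> smallTable = new HashMap<String, String>();

      // configure() runs once per task before any map() calls. The side file is
      // opened through the normal HDFS client API, so this read goes to the
      // NameNode and DataNodes; the JobTracker is not involved.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          Path side = new Path("/data/lookup.txt"); // hypothetical side file
          BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2); // assumed format: key <tab> value
            if (parts.length == 2) {
              smallTable.put(parts[0], parts[1]);
            }
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load side file", e);
        }
      }

      // Joins each input record against the small table and emits the matches.
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = value.toString().split("\t", 2); // assumed format: key <tab> rest
        if (fields.length == 2) {
          String match = smallTable.get(fields[0]);
          if (match != null) {
            output.collect(new Text(fields[0]), new Text(fields[1] + "\t" + match));
          }
        }
      }
    }

The read in configure() fetches block locations from the NameNode and then pulls the blocks from the DataNodes, just like the client read sketched earlier; the JobTracker never sees it, which is why the job description alone cannot tell you every file a job will touch.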